
Performance pay for introspection
Last registered on March 29, 2021

Pre-Trial

Trial Information
General Information
Title
Performance pay for introspection
RCT ID
AEARCTR-0007425
Initial registration date
March 27, 2021
Last updated
March 29, 2021 11:00 AM EDT
Location(s)

This section is unavailable to the public.
Primary Investigator
Affiliation
Harvard University
Other Primary Investigator(s)
Additional Trial Information
Status
In development
Start date
2021-03-30
End date
2021-11-30
Secondary IDs
Abstract
We examine the effects of different incentive schemes on effort in a subjective classification task on the online labor market MTurk.
Subjective classification tasks are commonly used to gather information on tastes and attitudes and to develop training datasets for artificial intelligence. When the output of the task is subjective, designing effective incentive schemes is challenging because effort is difficult to observe. However, opinions are often given freely, suggesting that the cost of providing a subjective opinion is low (or even negative), so that incentivizing effort may be ineffective or even counterproductive if monitoring and incentives either crowd out altruistic motives or generate multi-tasking problems and potentially incentivize gaming.

This pilot broadly investigates how to incentivize respondents to perform a simple introspective task: classifying responses to an open-ended question according to “originality”. MTurk workers are asked to report both their first order beliefs (what they think) and their second order beliefs (what they think others think). Using several measures of respondent effort, we will compare the performance of workers under a range of different incentive schemes including fixed wage schemes, various forms of attention checks, and the Bayesian Truth Serum (BTS).

For each incentive scheme, we will examine the level of effort using a range of novel outcomes, including an incentivized measure of the degree of disutility associated with the task. We will use these measures to examine whether performance incentives for subjective tasks increase the level of effort, whether linking performance pay to particular sub-tasks crowds out effort on other tasks, and whether the effects of different incentive schemes are heterogeneous across people with different predispositions to exert effort when there are no performance incentives.
External Link(s)
Registration Citation
Citation
Gray-Lobe, Guthrie. 2021. "Performance pay for introspection." AEA RCT Registry. March 29. https://doi.org/10.1257/rct.7425-1.0.
Experimental Details
Interventions
Intervention(s)
We examine the effects of different incentive schemes on effort in a subjective classification task on the online labor market MTurk.
Subjective classification tasks are commonly used to gather information on tastes and attitudes and to develop training datasets for artificial intelligence. When the output of the task is subjective, designing effective incentive schemes is challenging because effort is difficult to observe. However, opinions are often given freely, suggesting that the cost of providing a subjective opinion is low (or even negative), so that incentivizing effort may be ineffective or even counterproductive if monitoring and incentives either crowd out altruistic motives or generate multi-tasking problems and potentially incentivize gaming.

This pilot broadly investigates how to incentivize respondents to perform a simple introspective task: classifying responses to an open-ended question according to “originality”. MTurk workers are asked to report both their first order beliefs (what they think) and their second order beliefs (what they think others think). Using several measures of respondent effort, we will compare the performance of workers under a range of different incentive schemes including fixed wage schemes, various forms of attention checks, and the Bayesian Truth Serum (BTS).

For each incentive scheme, we will examine the level of effort using a range of novel outcomes, including an incentivized measure of the degree of disutility associated with the task. We will use these measures to examine whether performance incentives for subjective tasks increase the level of effort, whether linking performance pay to particular sub-tasks crowds out effort on other tasks, and whether the effects of different incentive schemes are heterogeneous across people with different predispositions to exert effort when there are no performance incentives.
Intervention Start Date
2021-04-01
Intervention End Date
2021-04-27
Primary Outcomes
Primary Outcomes (end points)
Time on task
Internal consistency (share of repeated items classified similarly)
Group consistency (average absolute difference between report and mean group report)
Negative reservation base wage (exact amount of bid)
Negative payout adjusted reservation base wage (bid – an adjustment for the amount the worker was paid out)

We will define gaming behaviors for the internal consistency strategy as:
The maximum first order belief option share (e.g., out of three options, the share of the option that is used most frequently)
Negative similarity of individual responses to other workers
Negative time on task
Degenerate second order beliefs (e.g., reporting 100 percent in the second order beliefs elicitation module)
Primary Outcomes (explanation)
Time on task, internal consistency and group consistency will be measured for the full task (first and second order beliefs) and separately for first order and second order beliefs. We will refer to a generic measure of effort as e, and effort on first (second) order beliefs as e_1 (e_2).
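The consistency measures above can be sketched as follows. This is a minimal illustration assuming per-worker lists of classifications; the trial's actual variable construction may differ.

```python
# Hedged sketch of the consistency-based effort measures described above.
# Data shapes (per-worker lists of ratings) are assumptions, not the
# trial's actual pipeline.
import statistics

def internal_consistency(first_pass, second_pass):
    """Share of repeated items classified the same way on both occurrences.

    first_pass, second_pass: the worker's classifications (e.g. 1-3 on the
    three-point originality scale) for the repeated items, in matched order.
    """
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return matches / len(first_pass)

def group_consistency(worker_reports, group_mean_reports):
    """Average absolute difference between a worker's reports and the mean
    report of the group, item by item (lower = more consistent)."""
    diffs = [abs(w - g) for w, g in zip(worker_reports, group_mean_reports)]
    return statistics.mean(diffs)
```

For example, a worker who matched 3 of 4 repeated items would have an internal consistency of 0.75.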
Secondary Outcomes
Secondary Outcomes (end points)
Secondary Outcomes (explanation)
Experimental Design
Experimental Design
The data for this study come from an MTurk subjective classification task. Workers were asked to classify reports from Kenyan pupils on possible uses of a spoon in terms of the degree of “originality”. Workers were asked to report their first order beliefs about whether the use was “original” (three-point scale), and their second order beliefs about the percentage of other MTurk workers who report that the use was “original”. Each round, workers are asked to grade 100 items. Of the 100 items, 20 will repeat (10 unique items x 2 occurrences). Workers complete 2-3 100-item rounds.
Treatments vary how the task is incentivized. There are six treatment arms:
TA: Control (N=200) – fixed payment.
TB: Intrinsic motivation (N=160) – fixed payment.
TC: Internal consistency (N=160) – Respondents receive a bonus for classifying repeated items the same way.
TD: Group consistency (N=160) – Workers receive an incentive based on the similarity of their second order beliefs to others’ reported first order beliefs.
TE: Bayesian Truth Serum (BTS) (N=160) – Workers are incentivized using the BTS mechanism.
TF: Attention checks (N=160) – Workers are incentivized to pay attention by giving a pre-specified answer for items containing a pre-specified word.
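The BTS mechanism in arm TE can be illustrated with a short sketch of Prelec's (2004) scoring rule, which combines a "surprisingly common" information score with a prediction-accuracy score. This is a generic illustration under assumed data shapes; the trial's actual payout rule, smoothing constant, and alpha weight are assumptions.

```python
# Hedged sketch of Bayesian Truth Serum scoring (Prelec, Science 2004).
# The eps smoothing and alpha weight are assumptions, not the trial's
# actual implementation.
import math

def bts_scores(answers, predictions, alpha=1.0, eps=1e-9):
    """answers: chosen option index per worker (first order beliefs).
    predictions: per-worker predicted option shares, each summing to 1
    (second order beliefs). Returns one BTS score per worker."""
    n = len(answers)
    m = len(predictions[0])
    # Empirical answer frequencies x̄_k (eps avoids log(0))
    xbar = [max(sum(a == k for a in answers) / n, eps) for k in range(m)]
    # Geometric mean of predictions ȳ_k
    ybar = [math.exp(sum(math.log(max(p[k], eps)) for p in predictions) / n)
            for k in range(m)]
    scores = []
    for a, p in zip(answers, predictions):
        info = math.log(xbar[a] / ybar[a])  # "surprisingly common" term
        pred = alpha * sum(xbar[k] * math.log(max(p[k], eps) / xbar[k])
                           for k in range(m))  # prediction-accuracy term
        scores.append(info + pred)
    return scores
```

If every worker's prediction exactly matches the empirical answer shares, both terms are zero and all scores equal zero, which is the mechanism's truthful benchmark.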
Workers first complete a practice round of 10 items. They then complete round 1 with 100 items, all unincentivized.
Before round 2, workers are randomized into treatments and then complete another 100 items.
After completing the second round, workers participate in a Becker–DeGroot–Marschak (BDM) auction over participation in a third round. Workers submit the lowest base wage at which they would agree to work another round under the same incentive scheme. Workers whose bid falls below a randomized offer are required to participate in the third round at a base wage equal to the randomized offer.
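The auction step described above can be sketched as follows. The offer distribution and the tie-breaking rule at bid = offer are assumptions.

```python
# Hedged sketch of the Becker–DeGroot–Marschak step described above.
# Tie-breaking at bid == offer is an assumption.
def bdm_round3(bid, offer):
    """bid: lowest base wage the worker would accept for a third round.
    offer: the randomized wage offer. Workers whose bid is at or below
    the offer work round 3 at the offered (not the bid) wage, which is
    what makes truthful bidding optimal under BDM."""
    if bid <= offer:
        return {"works_round3": True, "wage": offer}
    return {"works_round3": False, "wage": None}
```

For example, a worker bidding $3 who draws a $5 offer works the third round at the $5 base wage, not at her $3 bid.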
The base wage will be US$6, and the total bonus will be up to US$4.
The MDE for a comparison between any two treatment arms will be 0.29 SDs (alpha = 0.05, power = 0.80).
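As a rough check on the stated MDE, a standard two-sample formula with individual randomization gives a value close to 0.29 SDs for the control-versus-treatment comparison. This sketch assumes equal variances and a two-sided test; the exact inputs of the registered calculation are not given.

```python
# Hedged check of the registered MDE using the standard two-sample
# power formula; equal variances and a two-sided test are assumptions.
from statistics import NormalDist

def mde_two_sample(n1, n2, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * (1 / n1 + 1 / n2) ** 0.5

mde = mde_two_sample(200, 160)  # control (N=200) vs one arm (N=160)
```

This yields roughly 0.30 SDs for the 200-vs-160 comparison and roughly 0.31 SDs for two 160-worker arms, in line with the registered figure of 0.29 SDs.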
Experimental Design Details
Not available
Randomization Method
By computer.
Randomization Unit
Individual
Was the treatment clustered?
No
Experiment Characteristics
Sample size: planned number of clusters
1000
Sample size: planned number of observations
1000
Sample size (or number of clusters) by treatment arms
TA: Control (N=200) – fixed payment.
TB: Intrinsic motivation (N=160) – fixed payment.
TC: Internal consistency (N=160) – Respondents receive a bonus for classifying repeated items the same way.
TD: Group consistency (N=160) – Workers receive an incentive based on the similarity of their second order beliefs to others’ reported first order beliefs.
TE: Bayesian Truth Serum (BTS) (N=160) – Workers are incentivized using the BTS mechanism.
TF: Attention checks (N=160) – Workers are incentivized to pay attention by giving a pre-specified answer for items containing a pre-specified word.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
0.29 SDs (alpha = 0.05, power = 0.80).
IRB
INSTITUTIONAL REVIEW BOARDS (IRBs)
IRB Name
Innovations for Poverty Action
IRB Approval Date
2020-06-27
IRB Approval Number
7401
Analysis Plan

There are documents in this trial unavailable to the public.