Heterogeneity of Experimental Findings: Evidence from Real-Effort Tasks
Last registered on May 10, 2019


Trial Information
General Information
Heterogeneity of Experimental Findings: Evidence from Real-Effort Tasks
Initial registration date
May 24, 2018
Last updated
May 10, 2019 4:52 PM EDT
Primary Investigator
UC Berkeley
Other Primary Investigator(s)
PI Affiliation
University of Chicago
Additional Trial Information
Start date
End date
Secondary IDs
Replication of experimental results is a focus of the recent literature. More broadly, it is important to know how experimental results vary when the context or design changes, that is, how heterogeneous the findings are. For example, how sensitive are experimental results to the demographics of the sample? How do they depend on the output measure used? How critical is the experimental protocol? We consider these questions in the context of a real-effort task. We run variants of the same experimental treatments, varying one by one the demographic group considered, the task used, the measure of effort, and whether subjects know that the task is an experiment. We examine how much the experimental results change in response to these changes, compared to a pure replication.
External Link(s)
Registration Citation
DellaVigna, Stefano and Devin Pope. 2019. "Heterogeneity of Experimental Findings: Evidence from Real-Effort Tasks." AEA RCT Registry. May 10. https://www.socialscienceregistry.org/trials/2987/history/46273
Experimental Details
The interventions are 16 different treatments that alter the motivation for effort; they are borrowed from DellaVigna and Pope (REStud 2018). The emphasis of the paper is on how the impact of the different effort motivators varies across versions of the experiment that change its design.
Intervention Start Date
Intervention End Date
Primary Outcomes
Primary Outcomes (end points)
The key outcome variable is a measure of average effort. The measure of effort differs across the four versions of the experiment. In version 1 it is the number of points scored by the subject in 10 minutes, where each point is scored by pressing ‘a’ then ‘b’. In version 2 it is the number of WWII cards coded by the subject in 10 minutes. In versions 3 and 4 it is the number of extra WWII cards coded by the participants beyond the required 40 cards. Within each version and within each of the 16 treatments that we run in each version, we compute the average of the measure of effort across the subjects in that version-treatment cell.
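As an illustration of the version-treatment cell averages described above, here is a minimal sketch in Python. The record layout `(version, treatment, effort)`, the function name `cell_means`, and the example values are hypothetical and are not taken from the study's data or code.

```python
from collections import defaultdict


def cell_means(records):
    """Average effort per (version, treatment) cell.

    `records` is an iterable of (version, treatment, effort) tuples;
    this layout is an assumption made for illustration.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of effort, count]
    for version, treatment, effort in records:
        cell = totals[(version, treatment)]
        cell[0] += effort
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up observations: a-b points in version 1, cards coded in version 2,
# extra cards coded in version 3.
example = [
    (1, 1, 1500), (1, 1, 1700),
    (2, 1, 30), (2, 1, 40),
    (3, 5, 10),
]
means = cell_means(example)
```

With four versions and 16 treatments, the full study would yield 64 such cell averages.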
Primary Outcomes (explanation)
Secondary Outcomes
Secondary Outcomes (end points)
Secondary Outcomes (explanation)
Experimental Design
Experimental Design
Subjects will choose to participate in this study by selecting it on Amazon's Mechanical Turk service. Before the participants choose to participate, they will be provided with a brief description of the study (a "typing task"), which also states a guaranteed flat payment for successful submission ($1) and a time estimate for completion. Once in the task, participants are assigned to one of four versions, and within each version they are directed to complete a task with a randomly selected incentive structure.

Version 1 is a direct replication of the task used by DellaVigna and Pope (REStud 2018): participants press two alternating keyboard buttons ('a' and 'b') within a given time period (10 minutes). There are 16 different randomly-assigned treatments with varying levels and types of incentives. For example, some participants will be given bonus payments based on points scored, others will raise money for charity, and some will receive no bonus. All participants are clearly informed of their incentive structure and bonus opportunities before playing their task. There is no deception at any point in this task.

Version 2 is similar to Version 1 except that the task consists of coding the occupation from World War II historical cards for 10 minutes. As in Version 1, the participants are randomized into one of 16 treatments with varying levels and types of incentives.

Version 3 involves coding of WWII cards, as in Version 2. However, this time all subjects are required to code 40 cards as their main task. After the coding of the 40 cards is completed, subjects are randomized into 16 different treatments, parallel to the ones used in Versions 1 and 2. The treatments consist of different incentives offered to the participants in case they agree to code additional WWII cards, up to 20 extra cards.

Version 4 is the same as version 3, except that, unlike in Versions 1-3, there is no consent form prior to participating in the tasks, so subjects do not know that the task is an experiment. (Notice that the participants are coding historical data as part of an economic history project, so this is a data coding assignment.)

Upon completion of the task, participants will be thanked for their contribution and the flat payment of $1, along with any additional money won, will be paid within the designated time period for their treatment (typically within 24 hours).

The final sample will exclude subjects that:
(1) do not complete the MTurk task within 30 minutes of starting;
(2) exit and then re-enter the task as a new subject (as these individuals might see multiple treatments);
(3) are not approved for any other reason (e.g. they did not have a valid MTurk ID);
(4) in version 1 (a-b typing) do not complete a single effort unit; there is no need for a parallel requirement for version 2 since the participants have to code a first card to start the task;
(5) in version 1 scored 4000 or more a-b points (since this would indicate cheating);
(6) in version 2 coded 120 or more cards with accuracy below 50% (since this would indicate cheating);
(7) in versions 3 and 4 completed the 40 required cards in less than 3 minutes with accuracy below 50%, or completed the 20 additional cards in less than 1.5 minutes with accuracy below 50% (since this would indicate cheating).

Restrictions (1)–(5) are exactly the same as in the DellaVigna-Pope (REStud) experiment; restrictions (6)-(7) are meant to parallel restriction (5) for the new task. We expect that restrictions (3)-(7) are likely to be binding in only a small number of cases. For example, in a pilot of 400 subjects we did not have anyone in categories (3)-(7).
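The sample restrictions above can be sketched as a single screening function. This is only a sketch under stated assumptions: the `Subject` fields and the function `exclusion_reason` are hypothetical names, accuracy is assumed to be measured separately for the required and the extra cards in versions 3 and 4, and "completed the 20 additional cards" is read as having coded all 20.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subject:
    # All field names are hypothetical; they are not from the study's code.
    version: int
    minutes_to_complete: Optional[float]  # None if the task was never finished
    reentered: bool                       # exited and re-entered as a new subject
    approved: bool                        # e.g. had a valid MTurk ID
    effort: int = 0                # a-b points (v1), cards (v2), extra cards (v3-4)
    accuracy: float = 1.0          # v2 cards, or the 40 required cards in v3-4
    required_minutes: float = 0.0  # time spent on the 40 required cards (v3-4)
    extra_minutes: float = 0.0     # time spent on the extra cards (v3-4)
    extra_accuracy: float = 1.0    # accuracy on the extra cards (v3-4)


def exclusion_reason(s: Subject) -> Optional[int]:
    """Return the number of the first restriction violated, or None if kept."""
    if s.minutes_to_complete is None or s.minutes_to_complete > 30:
        return 1
    if s.reentered:
        return 2
    if not s.approved:
        return 3
    if s.version == 1 and s.effort == 0:
        return 4
    if s.version == 1 and s.effort >= 4000:
        return 5
    if s.version == 2 and s.effort >= 120 and s.accuracy < 0.5:
        return 6
    if s.version in (3, 4):
        fast_required = s.required_minutes < 3 and s.accuracy < 0.5
        fast_extra = (s.effort == 20 and s.extra_minutes < 1.5
                      and s.extra_accuracy < 0.5)
        if fast_required or fast_extra:
            return 7
    return None
```

For example, `exclusion_reason(Subject(version=1, minutes_to_complete=12, reentered=False, approved=True, effort=4000))` returns 5, while the same subject with 3,999 points is kept.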

The average score in each of these treatments and versions will form the basis for the elicitation of expert forecasts.
Experimental Design Details
Randomization Method
The assignment of subjects into versions and treatments is determined as follows. The subjects are assigned into one of the four versions randomly, with versions 2, 3, and 4 oversampled by 15 percent. This is in anticipation of the fact that the historical task used in versions 2-4 will likely have a higher share of subjects not completing the task due to, for example, difficulty in reading cursive writing (employed in these cards). These subjects will thus not pass the sample restrictions above. In pilot data, we observed higher attrition in these versions by about 15 percent. The overweighting is designed to equate as much as possible the post-attrition sample size across the four versions. Within a version, we randomize participants into one of the 16 treatments with equal weights.
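A minimal sketch of this two-stage assignment follows. It assumes that "oversampled by 15 percent" means a relative weight of 1.15 for versions 2-4 against 1.0 for version 1; that reading, and the names `VERSION_WEIGHTS`, `assign`, and `assignment_probabilities`, are assumptions for illustration rather than the study's actual implementation.

```python
import random

# Assumed relative weights: versions 2-4 drawn 15 percent more often than version 1.
VERSION_WEIGHTS = {1: 1.0, 2: 1.15, 3: 1.15, 4: 1.15}
N_TREATMENTS = 16


def assign(rng: random.Random):
    """Draw a version with unequal weights, then a treatment with equal weights."""
    versions = list(VERSION_WEIGHTS)
    weights = list(VERSION_WEIGHTS.values())
    version = rng.choices(versions, weights=weights, k=1)[0]
    treatment = rng.randint(1, N_TREATMENTS)  # inclusive on both ends
    return version, treatment


def assignment_probabilities():
    """Implied probability of landing in each version."""
    total = sum(VERSION_WEIGHTS.values())
    return {v: w / total for v, w in VERSION_WEIGHTS.items()}


rng = random.Random(2018)  # fixed seed so the sketch is reproducible
version, treatment = assign(rng)
probs = assignment_probabilities()
```

Under these assumed weights, version 1 receives 1/4.45 (about 22.5 percent) of subjects and each of versions 2-4 receives 1.15/4.45 (about 25.8 percent).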
Randomization Unit
Individual subject.
Was the treatment clustered?
Experiment Characteristics
Sample size: planned number of clusters
Please see below. The number of clusters is the same as the number of observations.
Sample size: planned number of observations
The ideal number of subjects planned for the study is 10,000 people completing the tasks. We advertise for 10,500 subjects to take into account subjects excluded due to restrictions (2)-(7) above. (MTurk would count these subjects as successful completions.) The task will be kept open on Amazon Mechanical Turk until either (i) three weeks have passed or (ii) 10,500 subjects have completed the study, whichever comes first.
Sample size (or number of clusters) by treatment arms
At the ideal sample size of 10,000, each treatment-version combination would have approximately 150 subjects.
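The approximate cell size follows from dividing the target sample evenly across the version-treatment cells. The short calculation below is a sketch that assumes the post-attrition sample is spread equally over all cells; the variable names are illustrative.

```python
N_VERSIONS = 4
N_TREATMENTS = 16
TARGET_SUBJECTS = 10_000

cells = N_VERSIONS * N_TREATMENTS            # 64 version-treatment cells
subjects_per_cell = TARGET_SUBJECTS / cells  # 156.25 subjects per cell

print(cells, subjects_per_cell)  # prints: 64 156.25
```

An even split gives about 156 subjects per cell, in line with the "approximately 150" stated above.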
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
For version 1, we can rely on the pilot study for DellaVigna and Pope (REStud); that experiment was highly powered for nearly all the treatments, so it remains well powered with a smaller sample of about 150 subjects per treatment, compared to 500 in the previous study. Version 2 should have similar statistical power, and Versions 3 and 4 should have higher statistical power, based on model-based simulations and on a small pilot.
IRB Name
Social and Behavioral Sciences Institutional Review Board (SBS-IRB) at the University of Chicago, with UC Berkeley relying on the UChicago IRB.
IRB Approval Date
IRB Approval Number
Analysis Plan
Analysis Plan Documents

MD5: f4ca33d1f85849c3f074922bcc33dc18

SHA1: de6d476dc77352b99f67364e48a52848c27b79e8

Uploaded At: May 24, 2018


MD5: b9fcc4ff32d06173c39b931c0c71b4ad

SHA1: 4116d00b4cac2e6e2bb1a5d820f1ac84179e8bac

Uploaded At: July 23, 2018

Post Trial Information
Study Withdrawal
Is the intervention completed?
Is data collection complete?
Data Publication
Data Publication
Is public data available?
Program Files
Program Files
Reports and Papers
Preliminary Reports
Relevant Papers