Measuring Success in Education: The Role of Effort on the Test Itself

Standardized tests comparing educational achievements are an important policy tool. U.S. students often rank poorly on such assessments. We propose that this is due not only to differences in ability but also to differences in effort on the test itself. We experimentally show that offering U.S. students incentives to put forth effort improves test performance substantially. In contrast, Shanghai students, who are top performers on assessments, are not affected by incentives. Our findings suggest that rankings of countries based on low-stakes assessments reflect not only differences in ability but also differences in intrinsic motivation to perform well on the test.

External Link(s)

Registration Citation

Citation

Gneezy, Uri et al. 2018. "Measuring Success in Education: The Role of Effort on the Test Itself." AEA RCT Registry. December 13. https://doi.org/10.1257/rct.3657-1.0

Former Citation

Gneezy, Uri et al. 2018. "Measuring Success in Education: The Role of Effort on the Test Itself." AEA RCT Registry. December 13. https://www.socialscienceregistry.org/trials/3657/history/38875

Sponsors & Partners

Experimental Details

Interventions

Intervention(s)

10th grade students at two high schools in the U.S.A. and four high schools in Shanghai take a 25-question mathematics test made up of multiple-choice and free-answer questions that appeared on past editions of the Programme for International Student Assessment (PISA). Students are randomized into either treatment or control. Members of the treatment group are given a financial incentive based on their test performance. Right before taking the test, they are given an envelope with $25 in cash (or the equivalent in RMB in Shanghai) and are told that $1 will be taken away for each question answered incorrectly. The control group takes the test with no financial incentives.

The test is administered by computer and graded immediately upon completion, so payments to subjects in the treatment group are processed immediately at the conclusion of the experiment.
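The payment rule described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name `treatment_payout` and the constant names are hypothetical, and the function takes the count of incorrectly answered questions as given, because this registration does not spell out how unanswered questions are counted.

```python
ENDOWMENT_USD = 25   # cash handed to each treated student before the test
PENALTY_USD = 1      # amount taken away per question answered incorrectly
N_QUESTIONS = 25     # length of the test


def treatment_payout(n_incorrect: int) -> int:
    """Cash a treated student keeps after grading (hypothetical helper)."""
    if not 0 <= n_incorrect <= N_QUESTIONS:
        raise ValueError("n_incorrect must be between 0 and 25")
    return ENDOWMENT_USD - PENALTY_USD * n_incorrect
```

Under this rule a student with 7 incorrect answers keeps $18, and the payout can never go below $0 because the endowment equals the number of questions.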

Intervention Start Date

2016-03-25

Intervention End Date

2018-04-26

Primary Outcomes

Primary Outcomes (end points)

The study examines four outcome variables, all related to performance on the test:
Student level outcomes (N = 447 in U.S., N = 656 in Shanghai)
1. Number of questions answered correctly (out of 25), standardized by subtracting the sample mean and dividing by the sample standard deviation.

Question response level outcomes (N = 447 students x 25 questions = 11,175 observations in U.S., N = 656 students x 25 questions = 16,400 observations in Shanghai)
2. The probability that question i is attempted by student j:
a) over all 25 questions
b) over questions 1-13
c) over questions 14-25

3. The probability that student j answered question i correctly (sample of attempted questions only)
a) over all 25 questions
b) over questions 1-13
c) over questions 14-25

4. The probability that student j answered question i correctly (all questions, both attempted and not attempted)
a) over all 25 questions
b) over questions 1-13
c) over questions 14-25
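The four outcomes listed above can be illustrated with a small sketch that computes raw sample shares from a response table. It is only a sketch under stated assumptions: the data layout (one list of 25 entries per student, with `True` for correct, `False` for incorrect, and `None` for not attempted) and the function names are hypothetical, `statistics.stdev` uses the n-1 denominator (the registration says only "sample standard deviation"), and the registered analysis may estimate these probabilities with regressions rather than raw shares.

```python
from statistics import mean, stdev

ALL_QUESTIONS = range(1, 26)   # questions 1-25
FIRST_HALF = range(1, 14)      # questions 1-13
SECOND_HALF = range(14, 26)    # questions 14-25


def standardize(scores):
    """Outcome 1: subtract the sample mean, divide by the sample SD."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]


def question_outcomes(responses, questions):
    """Outcomes 2-4 as raw shares over the given question numbers.

    responses: {student_id: list of 25 entries, each True, False, or None}
    """
    cells = [answers[q - 1] for answers in responses.values() for q in questions]
    attempted = [c for c in cells if c is not None]
    n_correct = sum(c is True for c in cells)
    return {
        # Outcome 2: share of student-question pairs that were attempted
        "p_attempted": len(attempted) / len(cells),
        # Outcome 3: share correct among attempted questions only
        "p_correct_if_attempted": n_correct / len(attempted) if attempted else None,
        # Outcome 4: share correct over all questions, attempted or not
        "p_correct_all": n_correct / len(cells),
    }
```

Calling `question_outcomes(responses, FIRST_HALF)` and `question_outcomes(responses, SECOND_HALF)` gives the (b) and (c) variants of each outcome.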

Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)

Secondary Outcomes (explanation)

Experimental Design

The working paper linked to this registration provides full details of the experimental design (NBER working paper No. 24004). Key details are summarized below.

We recruited two high schools in the U.S.A. and four high schools in Shanghai to participate. In each school, student subjects take a 25-minute, 25-question mathematics test made up of multiple-choice and free-answer questions that appeared on past editions of the Programme for International Student Assessment (PISA). Students are randomized into either treatment or control. Members of the treatment group are given a financial incentive based on their test performance. Right before taking the test, they are given an envelope with $25 in cash (or the equivalent in RMB in Shanghai) and are told that $1 will be taken away for each question answered incorrectly. The students had no advance notice of the task they would be doing or of the financial incentives. The control group takes the test with no financial incentives.

The exam is administered by computer so scores are available immediately after the test is completed.

The main experiment was conducted in 2016 at two high schools in the U.S. and at three schools in Shanghai.
U.S. School 1 is a high performing private boarding school.
U.S. School 2 is a low performing public school.
At both of these schools, all 10th grade students were required to participate.

Shanghai schools 1 through 3 include one below-average performing school, one school with performance that is just above average, and one school with performance that is well above average. Two classes each of 10th grade math students at schools 1 and 2 and four classes of 10th grade math students at school 3 were randomly selected to participate.

In 2018, we reran the experiment in Shanghai at schools 2 and 3 and at a new school, school 4, whose performance is also well above average.

Logistics required different randomization procedures of students into treatment or control (described in more detail below):
U.S. School 1: students were randomized into treatment or control at the individual level.
U.S. School 2: students were randomized into treatment or control at the class level.
Shanghai schools 1-3 in 2016: students were randomized into treatment or control at the class level.
Shanghai schools 2-4 in 2018: students were randomized into treatment or control at the individual level.

Experimental Design Details

Randomization Method

The randomization is stratified by school. Logistics required different randomization procedures of students into treatment or control.

We randomized at the class level in the lower performing school (school 2) in the U.S. and in the 2016 sessions in Shanghai. We randomized at the individual level in the higher performing school in the U.S. and in the 2018 sessions in Shanghai. In the U.S., we stratified by school and re-randomized to achieve balance on the following baseline characteristics: gender, ethnicity and mathematics class level/track: low, regular, and honors. For each school's randomization, we re-randomized until the p-values of all tests of differences between Treatment and Control were above 0.4. In the 2016 Shanghai sessions, we stratified the randomization by school (baseline demographics were not available at the time of randomization). In the 2018 Shanghai sessions, we stratified the randomization by class, gender, and senior entrance exam score quartile.
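The re-randomization rule described above (redraw until every balance test has a p-value above 0.4) can be sketched as follows. This is a minimal sketch under several assumptions, not the authors' procedure: the registration does not say which balance test was used, so a normal-approximation test for a difference in means stands in for it; categorical covariates such as ethnicity or track are assumed to be coded numerically or as indicators; the function names and the `max_draws` cap are hypothetical; and the function is meant to be called once per school, which is how stratification by school enters.

```python
import math
import random
from statistics import mean, variance


def two_sided_p(x_treat, x_ctrl):
    """Two-sided p-value for a difference in means (assumed balance test)."""
    se = math.sqrt(variance(x_treat) / len(x_treat) + variance(x_ctrl) / len(x_ctrl))
    diff = mean(x_treat) - mean(x_ctrl)
    if se == 0:
        # No within-arm variation: balanced only if the arm means coincide.
        return 1.0 if diff == 0 else 0.0
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def rerandomize(units, covariates, threshold=0.4, seed=0, max_draws=10_000):
    """Redraw a half/half split until all balance p-values exceed threshold.

    units: ids of the randomization units (students or classes) in one school
    covariates: {covariate_name: {unit_id: numeric value}}
    Returns (assignment, number_of_draws_used).
    """
    rng = random.Random(seed)
    ids = list(units)
    for draw in range(1, max_draws + 1):
        rng.shuffle(ids)
        half = len(ids) // 2
        treat, ctrl = ids[:half], ids[half:]
        p_values = [
            two_sided_p([cov[i] for i in treat], [cov[i] for i in ctrl])
            for cov in covariates.values()
        ]
        if all(p > threshold for p in p_values):
            assignment = {i: "treatment" for i in treat}
            assignment.update({i: "control" for i in ctrl})
            return assignment, draw
    raise RuntimeError("no draw met the balance threshold")
```

A high threshold such as 0.4 discards any draw in which the arms look even mildly different on a baseline characteristic, at the cost of more redraws.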

Randomization Unit

The randomization is stratified by school. Within each school the level of randomization varied because of logistical constraints.
U.S. School 1: students were randomized into treatment or control at the individual level.
U.S. School 2: students were randomized into treatment or control at the class level.
At Shanghai schools 1-3 in 2016, students were randomized into treatment or control at the class level.
At Shanghai schools 2-4 in 2018, students were randomized into treatment or control at the individual level.
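The difference between class-level and individual-level randomization can be made concrete with a short sketch. The `assign` function and its data layout are hypothetical, and the sketch deliberately leaves out the stratification and balance checks described elsewhere in this registration; it shows only how the unit of randomization defines the clusters.

```python
import random


def assign(students, unit, seed=0):
    """Assign arms at the class or the individual level.

    students: list of (student_id, class_id) pairs from one school
    unit: "class" or "individual"
    Returns {student_id: (cluster_id, arm)}.
    """
    if unit == "class":
        cluster_of = {sid: ("class", cid) for sid, cid in students}
    elif unit == "individual":
        # Each student is a cluster of size 1.
        cluster_of = {sid: ("student", sid) for sid, _ in students}
    else:
        raise ValueError("unit must be 'class' or 'individual'")

    rng = random.Random(seed)
    clusters = sorted(set(cluster_of.values()))
    rng.shuffle(clusters)
    half = len(clusters) // 2
    arm_of = {
        c: ("treatment" if k < half else "control") for k, c in enumerate(clusters)
    }
    return {sid: (cluster_of[sid], arm_of[cluster_of[sid]]) for sid, _ in students}
```

With class-level assignment every student in a class shares one arm, so the class is the cluster; with individual-level assignment the number of clusters equals the number of students.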

Was the treatment clustered?

Yes

Experiment Characteristics

Sample size: planned number of clusters

U.S.: 131 clusters. In U.S. school 1, the randomization was done at the individual level so the size of each cluster is 1 for that school.
Shanghai: 384 clusters. In the 2018 sessions, the randomization was done at the individual level so the size of each cluster is 1 for those sessions.

Sample size: planned number of observations

The U.S. sample includes 447 students (227 in control and 220 in treatment). The Shanghai sample includes 656 students (333 in control and 323 in treatment).

Sample size (or number of clusters) by treatment arms

U.S.:
227 students organized into 64 clusters (12 of size n>1, 52 of size n=1) in control.
220 students organized into 67 clusters (13 of size n>1, 54 of size n=1) in treatment.

Shanghai:
333 students organized into 196 clusters (4 of size n>1, 190 of size n=1) in control
323 students organized into 188 clusters (4 of size n>1, 184 of size n=1) in treatment

Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

U.S.: For the main outcome, test score (out of 25), the control mean is 10.22 with a standard deviation of 5.63. With 131 clusters of size 3 (the actual average size is 3.41 and clusters vary widely in size), with power of 0.8 and at a significance level of 0.05, the minimum detectable effect size is 1.13. With 131 clusters of size 4 (the actual average size is 3.41 and clusters vary widely in size), with power of 0.8 and at a significance level of 0.05, the minimum detectable effect size is 0.98.

Shanghai: For the main outcome, test score (out of 25), the control mean is 20.50 with a standard deviation of 2.95. With 384 clusters of size 2 (the actual average size is 1.71; most clusters are size 1 with four clusters that are much larger), with power of 0.8 and at a significance level of 0.05, the minimum detectable effect size is 1.05. With 131 clusters of size 1 (the actual average size is 1.71; most clusters are size 1 with four clusters that are much larger), with power of 0.8 and at a significance level of 0.05, the minimum detectable effect size is 1.00.
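The average cluster sizes quoted above follow directly from the student and cluster counts reported by treatment arm. The short check below only redoes that arithmetic; the dictionary layout and function names are hypothetical, and it does not attempt to reproduce the minimum detectable effect sizes, whose remaining inputs (such as the intra-cluster correlation) are not given in this registration.

```python
# (students, clusters) per arm, as reported in this registration
ARMS = {
    "U.S.": {"control": (227, 64), "treatment": (220, 67)},
    "Shanghai": {"control": (333, 196), "treatment": (323, 188)},
}


def totals(sample):
    """Total students and total clusters across both arms."""
    students = sum(s for s, _ in ARMS[sample].values())
    clusters = sum(c for _, c in ARMS[sample].values())
    return students, clusters


def average_cluster_size(sample):
    students, clusters = totals(sample)
    return students / clusters


for name in ARMS:
    students, clusters = totals(name)
    print(f"{name}: {students} students, {clusters} clusters, "
          f"average size {average_cluster_size(name):.2f}")
```

The totals come out to 447 students in 131 clusters for the U.S. and 656 students in 384 clusters for Shanghai, giving average sizes of 3.41 and 1.71.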

Supporting Documents and Materials

IRB

Institutional Review Boards (IRBs)

IRB Name

University of Chicago IRB

IRB Approval Date

2015-05-27

IRB Approval Number

IRB15-0448

Analysis Plan

Post-Trial

Post Trial Information

Study Withdrawal

Intervention

Is the intervention completed?

Yes

Intervention Completion Date

April 26, 2018, 12:00 +00:00

Data Collection Complete

Yes

Data Collection Completion Date

April 26, 2018, 12:00 +00:00

Final Sample Size: Number of Clusters (Unit of Randomization)

U.S.: 447 students, 131 clusters
Shanghai: 656 students, 384 clusters

Was attrition correlated with treatment status?

Final Sample Size: Total Number of Observations

U.S.: 447 students, 131 clusters
Shanghai: 656 students, 384 clusters

Final Sample Size (or Number of Clusters) by Treatment Arms

U.S.:
227 students organized into 64 clusters (12 of size n>1, 52 of size n=1) in control.
220 students organized into 67 clusters (13 of size n>1, 54 of size n=1) in treatment.

Shanghai:
333 students organized into 196 clusters (4 of size n>1, 190 of size n=1) in control
323 students organized into 188 clusters (4 of size n>1, 184 of size n=1) in treatment

Data Publication

Is public data available?

Program Files

Reports, Papers & Other Materials

Relevant Paper(s)

Abstract

Tests measuring and comparing educational achievement are an important policy tool. We experimentally show that offering students extrinsic incentives to put forth effort on such achievement tests has differential effects across cultures. Offering incentives to U.S. students, who generally perform poorly on assessments, improved performance substantially. In contrast, Shanghai students, who are top performers on assessments, were not affected by incentives. Our findings suggest that in the absence of extrinsic incentives, ranking countries based on low-stakes assessments is problematic because test scores reflect differences in intrinsic motivation to perform well on the test itself, and not just differences in ability.

Citation

Uri Gneezy, John A. List, Jeffrey A. Livingston, Sally Sadoff, Xiangdong Qin, Yang Xu (2017) "Measuring Success in Education: The Role of Effort on the Test Itself." NBER Working Paper No. 24004

URL

https://www.nber.org/papers/w24004
