Accuracy and Updating: How Labels and Initial Performance Affect Teacher Grading
Last registered on August 20, 2016

Pre-Trial

Trial Information
General Information
Title
Accuracy and Updating: How Labels and Initial Performance Affect Teacher Grading
RCT ID
AEARCTR-0001498
Initial registration date
August 20, 2016
Last updated
August 20, 2016 11:03 AM EDT
Location(s)
Region
Primary Investigator
Affiliation
Stanford University
Other Primary Investigator(s)
PI Affiliation
Henan University
PI Affiliation
Stanford University
PI Affiliation
Peking University
PI Affiliation
Stanford University
Additional Trial Information
Status
Completed
Start date
2016-05-16
End date
2016-06-03
Secondary IDs
N/A
Abstract
Accurate grades help students, parents, teachers, and administrators allocate resources to improve student learning. Many life-altering institutional decisions are also based on grades, such as whether students are given opportunities to pursue high-performance tracks. Unfortunately, recent studies have shown that inaccuracies in grading occur when teachers rely on student labels (such as ethnicity or caste) to inform their grading. Studies have also shown that large inaccuracies occur when teachers rely on labels about a student’s prior grades to assign grades.

With the above in mind, we have four goals for this study. First, we seek to validate prior studies by examining how teacher assessments are affected by information about a student's prior performance. Specifically, we look at how labels about the prior performance of a student affect teacher grading.

Second and more importantly, we seek to examine how the effect of such labels changes as teachers grade additional work from a student. On the one hand, teachers may become more accurate when grading a given student’s work over time, in spite of labels. The hypothesis is that labels cause teachers to assign inaccurate grades at first, but these inaccuracies resolve as the teacher interacts with and becomes familiar with students over time. On the other hand, any inaccuracies could also compound over time, leading to cycles of cumulative (dis)advantage. This may occur if teachers assigned inaccurate grades in the past and then rely heavily on those prior grades when assigning future grades. The result may be a positive feedback cycle in which minor inaccuracies amplify or become entrenched over time. If so, inaccurate grading in practice would be a greater cause for concern.

Third, it is unclear whether the actual initial performance of a student also leads to inaccuracies in grading. Labels are social classifications imposed upon a student, whereas initial performance comes from the individual student himself or herself. For example, at the start of a school year, a teacher will grade students' first assignments. Will the teacher’s first grades create inaccuracies in how the teacher grades later assignments? Is the effect cumulative?

Fourth and finally, it is unclear how confirmatory or contradictory evidence changes inaccuracies in grading. For instance, if a student is labelled as a poor performer, but he or she displays high initial performance, how does the teacher respond?
External Link(s)
Registration Citation
Citation
Chu, James et al. 2016. "Accuracy and Updating: How Labels and Initial Performance Affect Teacher Grading." AEA RCT Registry. August 20. https://www.socialscienceregistry.org/trials/1498/history/10282
Experimental Details
Interventions
Intervention(s)
Each teacher receives the exact same set of four essays (written by actual seventh and eighth graders). The first essay (essay 1) is graded blindly. After the teacher grades the essay, he or she is told that this is a “practice” essay. The teacher then receives a set of three essays ostensibly written by a particular student at the teacher’s school over the course of a semester (i.e. three essays written at three distinct time periods).

Teachers are randomly assigned to receive one of six different essay packets in a 3x2 crosscut design. These groups are defined as follows:

A. Essay #2 includes a description of the student as having prior performance in top 25 percentile
B. Essay #2 includes a description of the student as having prior performance in bottom 25 percentile
C. No label about student

X. When graded blindly in the past, essay #2 was of high quality (graded at top 25 percentile)
Y. When graded blindly in the past, essay #2 was of low quality (graded at bottom 25 percentile)

With regard to interventions A, B, and C above, we first reminded the teacher that we had assessed a large number of students at the teacher’s school (this was true, as we surveyed the students as a part of another experiment). For treatment conditions A and B, we wrote the following description on the top of essay 2: “In the previous semester we conducted a Chinese language arts test at your school. This student performed at the top [bottom] 25th percentile (treatment A [B] in Table 1 above) in your school.” Teachers in the no prior grade label group (treatment C) did not see this information.

With regard to treatment groups X and Y above, we selected high and low quality essays from a bank of essays that were graded by approximately 200 teachers in a prior pilot study. For the purposes of this experimental study, an essay that received a grade in the bottom 25th percentile of essay grades in the pilot study was considered a “low quality” essay, and an essay that received a grade in the top 25th percentile was considered a “high quality” essay.

Finally, essays 3 and 4 were identical across groups. No additional labels were added to Essays 3 and 4.
Intervention Start Date
2016-05-16
Intervention End Date
2016-06-03
Primary Outcomes
Primary Outcomes (end points)
The key dependent variable in this analysis will be the total essay score given by a teacher for essays 2, 3, and/or 4 (or equivalently on periods 2, 3, and/or 4).

Subcomponents of this score correspond to essay content, grammar, or style. These will not be main outcomes and the analyses for these non-main outcomes will be exploratory.

An additional exploratory outcome will be the standard deviation of the grading distribution for essays 2, 3, and/or 4. By comparing this across groups, we should be able to explore how much consensus in grading there is for a given essay.

Primary Outcomes (explanation)
This measure is out of 100 points, but for ease of comparison we will standardize this measure in advance (subtracting the mean and dividing by standard deviation).
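The standardization described above can be sketched as follows. This is a minimal illustration, not the study's code: the scores are hypothetical, and the choice of reference group (here, all scores passed in) and of the population rather than sample standard deviation are assumptions the registration does not specify.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Return z-scores: (score - mean) / standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population SD; the sample SD is an equally plausible choice
    if sigma == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(s - mu) / sigma for s in scores]


# Hypothetical total essay scores on the 100-point scale (not study data).
raw_scores = [62, 70, 75, 81, 88]
z_scores = standardize(raw_scores)
```

After this transformation the scores have mean 0 and standard deviation 1, so differences between treatment groups can be read directly in SD units.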
Secondary Outcomes
Secondary Outcomes (end points)
Secondary Outcomes (explanation)
Experimental Design
Experimental Design
In each sample school, enumerators identified the Chinese language arts teachers in seventh and eighth grade (sample teachers). These teachers were asked to help grade a set of essays. However, teachers received different essay packets depending on their treatment group. Teachers were randomly assigned to one of six treatment groups:

AX. Student top 25 percentile, essay of high quality
BX. Student bottom 25 percentile, essay of high quality
CX. No label, essay of high quality

AY. Student top 25 percentile, essay of low quality
BY. Student bottom 25 percentile, essay of low quality
CY. No label, essay of low quality

Experimental Design Details
Randomization Method
Randomization done in office by a computer; randomization code available upon request.
Randomization Unit

Each teacher had an equal chance of being assigned to one of six treatment groups. In other words, randomization was simple and was not blocked (e.g. by school).
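A simple, unblocked assignment of this kind might look like the sketch below. The actual randomization code is available only on request, so everything here is illustrative: the teacher IDs, the seed, and the function name `assign_groups` are hypothetical, and drawing each teacher's group independently is one assumed reading of "simple" randomization.

```python
import random
from collections import Counter

LABEL_ARMS = ["A", "B", "C"]   # top-25 label, bottom-25 label, no label
QUALITY_ARMS = ["X", "Y"]      # high-quality essay 2, low-quality essay 2
GROUPS = [label + quality for quality in QUALITY_ARMS for label in LABEL_ARMS]


def assign_groups(teacher_ids, seed=0):
    """Give every teacher an equal, independent chance of each of the six groups."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    return {teacher: rng.choice(GROUPS) for teacher in teacher_ids}


# Hypothetical roster of 900 teachers (the planned sample size).
teacher_ids = [f"T{i:03d}" for i in range(1, 901)]
assignment = assign_groups(teacher_ids, seed=1498)
counts = Counter(assignment.values())
```

With independent draws the six arm sizes are only approximately equal. Shuffling a list containing exactly 150 copies of each group label would instead fix the arm sizes at the planned values.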
Was the treatment clustered?
No
Experiment Characteristics
Sample size: planned number of clusters
300 schools
Sample size: planned number of observations
900 teachers
Sample size (or number of clusters) by treatment arms
Group AX: 150
Group BX: 150
Group CX: 150
Group AY: 150
Group BY: 150
Group CY: 150

For description of groups, see intervention section above.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Power calculations were conducted before the beginning of the trial using Optimal Design software (Spybrook et al. 2009). Based on our previous study of inaccuracies in teacher grading in rural China, we assume R² = 0.4 (the correlation between the baseline covariates and the outcome, squared). With this assumption and for a 5% significance level (alpha = 0.05) and 80% power (1 - beta = 0.8), the “Randomized Trials” option in the Optimal Design software suggests that we will require at least 560 teachers (split between two treatment arms) to detect an effect size of 0.18 SDs and 280 teachers (split between two treatment arms) to detect an effect size of 0.26 SDs.

The above power calculations do not take into account testing multiple hypotheses (comparisons). In particular, during our exploratory analyses, we will test multiple hypotheses. Thus, for tests of average treatment effects, we will report the standard p-value for each test as well as the p-value adjusted for multiple tests (controlling the False Discovery Rate; see Anderson 2008). Tests of heterogeneous treatment effects and mechanisms will each be treated as independent, exploratory hypotheses (and will not be adjusted for multiple hypothesis testing).
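The reported effect sizes can be approximately reproduced with the textbook minimum detectable effect formula for a two-arm, individually randomized trial with equal allocation and covariate adjustment. The sketch below uses a normal approximation and is not the Optimal Design software's own computation; the function name `mde` and its defaults are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def mde(n_total, r_squared=0.4, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect in SD units.

    Assumes two equally sized arms, a two-sided test, and a normal
    approximation to the sampling distribution of the estimate.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Variance of a difference in means with n_total / 2 per arm,
    # scaled by the share of outcome variance left unexplained by covariates.
    standard_error = sqrt((1 - r_squared) * 4 / n_total)
    return multiplier * standard_error


mde_560 = mde(560)  # pooled comparison across two arms
mde_280 = mde(280)  # comparison within a smaller subsample
```

Under these assumptions the formula gives roughly 0.18 SDs for 560 teachers and 0.26 SDs for 280 teachers, in line with the figures stated above.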
IRB
INSTITUTIONAL REVIEW BOARDS (IRBs)
IRB Name
Stanford University Human Subjects Research Board
IRB Approval Date
2016-05-04
IRB Approval Number
32605
Analysis Plan
Analysis Plan Documents
Pre-Analysis Plan Accuracy and Updating in Teacher Grading 3.2.docx

MD5: c7fcf524f0b374d0b004602c33b9ed4d

SHA1: 24583a6476f8d5a08fe343bdff151a99d07e8a38

Uploaded At: August 19, 2016

Post-Trial
Post Trial Information
Study Withdrawal
Intervention
Is the intervention completed?
Yes
Intervention Completion Date
June 03, 2016, 12:00 AM +00:00
Is data collection complete?
Yes
Data Collection Completion Date
June 03, 2016, 12:00 AM +00:00
Final Sample Size: Number of Clusters (Unit of Randomization)
298 schools
Was attrition correlated with treatment status?
No
Final Sample Size: Total Number of Observations
840 teachers
Final Sample Size (or Number of Clusters) by Treatment Arms
Group AX: 139
Group BX: 139
Group CX: 136
Group AY: 141
Group BY: 140
Group CY: 145

For description of groups, see intervention section above.
Data Publication
Data Publication
Is public data available?
No

Program Files
Program Files
No
Reports and Papers
Preliminary Reports
Relevant Papers