Peer Performance Evaluations and Gender Bias

Last registered on January 19, 2024


Trial Information

General Information

Peer Performance Evaluations and Gender Bias
Initial registration date
January 13, 2024

First published
January 19, 2024, 2:02 PM EST


Primary Investigator

University of Florida

Other Primary Investigator(s)

PI Affiliation
University of Florida

Additional Trial Information

Start date
End date
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Performance evaluations are a primary determinant of hiring and promotion decisions, and peer evaluations have become increasingly widespread in the workplace. Identifying bias in these evaluations is essential but challenging for two primary reasons: the sorting of evaluators and candidates, and the lack of an objective performance measure to serve as a benchmark. We overcome these challenges in a peer performance evaluation setting in a large introductory course at a flagship public university. Peer evaluators were randomly assigned to score essays using a rubric that produces two subscores: content and writing. Evaluators were incentivized to match official grades based on the same rubric, adding a monitoring effect. We exploit the random assignment of both peer evaluators and blinded official graders across several essay assignments. This allows us to analyze whether students with female-sounding names receive lower content and/or writing subscores than students without female-sounding names. We also test whether gender bias varies with whether students are evaluated by a female or a male peer.
External Link(s)

Registration Citation

Knight, Thomas and Perihan Saygin. 2024. "Peer Performance Evaluations and Gender Bias." AEA RCT Registry. January 19.
Experimental Details


We randomly assign students to grade their classmates' essay homework. These homework assignments are graded blindly by randomly assigned official graders using the same rubric. Peer evaluators are incentivized to match the official grades to ensure a thoughtful evaluation.
Intervention Start Date
Intervention End Date

Primary Outcomes

Primary Outcomes (end points)
Assignment grades assigned by randomly assigned peers
Assignment grades assigned by randomly assigned (blind) teaching assistants
Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
We use data from a peer grading assessment that was part of the syllabus in an introductory macroeconomics course at a large, comprehensive research university in Fall 2018. Students completed four short essay assignments during the course, each of which was then evaluated by a randomly matched classmate. Each assignment was based on a clear prompt to which there were objectively correct answers. We compare the peer-assigned grades to the official grades of the same assignment submissions and explore whether these two scores systematically differ. Official grades are determined by trained TAs who are randomly assigned to specific students and grade their assignments blindly. TA assignments were re-randomized for each of the four assignments.
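As a stylized illustration of this comparison (not the registered estimation code), the peer-minus-TA score gap can be contrasted across author groups. The data values, field names, and the simple difference-in-means below are illustrative assumptions only:

```python
# Hedged sketch: compare the average peer-minus-TA score gap for
# female-sounding vs. male-sounding names. All values are synthetic.
from statistics import mean

submissions = [
    # (author_name_group, peer_score, ta_score) -- made-up values
    ("female-sounding", 78.0, 82.0),
    ("female-sounding", 85.0, 88.0),
    ("male-sounding",   80.0, 80.0),
    ("male-sounding",   90.0, 89.0),
]

def mean_gap(group):
    """Average (peer score - TA score) for one name group."""
    gaps = [peer - ta for g, peer, ta in submissions if g == group]
    return mean(gaps)

# A negative estimate would indicate peers scoring female-sounding
# names below the blind TA benchmark, relative to male-sounding names.
bias_estimate = mean_gap("female-sounding") - mean_gap("male-sounding")
print(round(bias_estimate, 2))  # -4.0 for this synthetic data
```

The actual analysis would additionally exploit the repeated random assignments across the four essays; this sketch only conveys the benchmarking logic.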

All enrolled students were required to complete four short essay assignments. These essay assignments asked students to answer specific questions about an economic graph. There was a single objectively correct answer to each question. No outside research was required, and neither subjective analysis nor students’ own opinions were solicited. These questions were the types of questions that an instructor would typically include as a free-response question on an exam in a smaller course. Students were told that their submission should be composed in “essay form”, and that while economics content would play a much larger role in determining their grade, writing quality would also play a role. They were also told that there was no minimum or maximum required length, but that a strong answer could be provided in approximately 150 words.

Assignments were submitted into an electronic course management system (Canvas). Each submission was graded blindly by a randomly assigned trained TA who assigned the official score for inclusion in the student’s final course grade calculation, and by a randomly assigned (non-blind) peer grader for evaluation. There were 5 male and 4 female TAs. All TAs were economics graduate students. Both blind TAs and non-blind peer graders evaluated the assignments using a common scoring rubric. Rubrics contained two types of questions: between seven and eleven economic content questions, each of which had an objectively correct answer, and three writing quality questions. Each question on the rubric was awarded a numeric score between zero and ten. Partial credit was available for economic content questions if, for example, a student provided the correct answer but used incorrect units of measurement. Writing questions were scored on the same zero-to-ten scale. The grading procedure generated a content subscore (defined as the percentage of possible content points earned), a writing subscore (defined as the percentage of possible writing points earned), and an overall assignment grade (defined as the percentage of total possible points earned). Peer graders had one week to complete the peer review. Official TA grades were released after one week. Neither group had access to the other group's scores.
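The rubric arithmetic described above can be sketched as follows. This is a minimal illustration; the variable names are ours, and the example question counts simply fall within the stated seven-to-eleven content and three writing ranges:

```python
# Hedged sketch of the rubric arithmetic: each question is scored 0-10,
# and each subscore is the percentage of possible points earned.
def subscore(points):
    """Percentage of possible points, with each question worth 10."""
    return 100.0 * sum(points) / (10 * len(points))

# Illustrative scores: eight content questions, three writing questions.
content_points = [10, 10, 8, 10, 5, 10, 10, 7]   # partial credit allowed
writing_points = [9, 8, 10]

content_sub = subscore(content_points)               # content subscore
writing_sub = subscore(writing_points)               # writing subscore
overall = subscore(content_points + writing_points)  # overall assignment grade

print(round(content_sub, 1), round(writing_sub, 1), round(overall, 1))
# 87.5 90.0 88.2
```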

When completing their evaluations, both TAs and peer graders had access to the same information and scoring rubrics. The only difference was that peer graders could view the submission author's first and last name and possibly a small picture of the student's face. The TAs graded blindly; they did not have access to the names or pictures of the authors. Importantly, this experiment took place in a very large class. Most students did not know each other, so the majority of peer graders did not know the gender of the author. Instead, they could infer the author's gender from the name or possibly from the picture of the student. Many students did not upload pictures into the system, and even when they did, these pictures were very small (i.e., approximately the size of a dime). Rather than using the students' actual gender, we use the gender predicted by their first names. Using a name-based gender prediction algorithm, we predict gender probabilities for all first names. To be conservative, we use the following definition: if a name is predicted to be female with more than 90% probability, we define it as female-sounding; similarly, if a name is predicted to be male with more than 90% probability, we define it as male-sounding. The remaining names are treated as ambiguous, and we assign "unknown gender."
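The conservative 90% classification rule can be expressed as a simple function. The registration does not name the prediction tool, so the probability `p_female` below is assumed to come from some name-based predictor; the example probabilities are made up:

```python
# Hedged sketch of the conservative naming rule: a name is
# female-sounding if predicted female with > 90% probability,
# male-sounding if predicted male with > 90% probability
# (i.e., p_female < 0.10), and "unknown" otherwise.
def name_gender(p_female):
    if p_female > 0.90:
        return "female-sounding"
    if p_female < 0.10:
        return "male-sounding"
    return "unknown"

# Illustrative predicted probabilities (not from the study's data).
for p in (0.97, 0.03, 0.55):
    print(name_gender(p))
# female-sounding
# male-sounding
# unknown
```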

Only TA-assigned grades were used to calculate students’ final course grades. Each assignment accounted for 2% of the author’s final course grade. Peer grades only affected the final course grade of the peer grader. Peer graders received an overall peer grading score (across all four peer reviews) that accounted for 4% of their final course grade. They did not receive any feedback on the quality of their peer reviews before the end of the course. That is, they could not adjust their approach assignment-to-assignment in response to feedback. Peer graders were told that they must complete a peer review and that it must “more or less” match the TA’s evaluations to receive credit. If the reviews were too dissimilar from the TA’s grades, they would not receive credit for completing the peer review. The clear incentive was to match the TAs’ evaluations. If students had prior beliefs about the strictness of TAs’ grades, they should have incorporated those beliefs when completing their peer reviews. These instructions are consistent with the analysis that follows in the paper, which focuses on the correlation between TA-assigned and peer review scores.

The students were not provided information about the identity or gender of the TAs, nor were they provided specific information about the TAs’ grading process. For example, students were not told whether a single TA or multiple TAs would grade their work. They were not told if graders were randomly assigned, or if the same grader or graders would evaluate all of their assignments. They were simply told that the instructor and a group of TAs would evaluate their work blindly using the same rubric as they were provided and assign official grades. The purpose of the peer review and the incentive to match TA scores was only in place to encourage students to thoughtfully interact with the correct answers, which were included on the scoring rubric, after the assignment submission deadline.
Experimental Design Details
Randomization Method
We randomize treatment assignment at the individual level. Randomization was performed by computer.
Randomization Unit
Individual homework submission
Was the treatment clustered?

Experiment Characteristics

Sample size: planned number of clusters
975 students for 4 assignments, yielding 3,900 homework submissions
Sample size: planned number of observations
3,900 homework submissions
Sample size (or number of clusters) by treatment arms
975 students for 4 assignments, yielding 3,900 homework submissions
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

Institutional Review Boards (IRBs)

IRB Name
University of Florida
IRB Approval Date
IRB Approval Number


Post Trial Information

Study Withdrawal



Is the intervention completed?
Data Collection Complete
Data Publication

Data Publication

Is public data available?

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials