Experimental Design
We use data from a peer grading assessment that was part of the syllabus in an introductory macroeconomics course at a large, comprehensive research university in Fall 2018. Students completed four short essay assignments during the course, each of which was then evaluated by a randomly matched classmate. Each assignment was based on a clear prompt to which there were objectively correct answers. We compare the peer-assigned grades to the official grades of the same assignment submissions and explore whether these two scores systematically differ. Official grades were determined by trained TAs who were randomly assigned to specific students and graded their assignments blindly. TA assignments were re-randomized for each of the four assignments.
All enrolled students were required to complete four short essay assignments. These assignments asked students to answer specific questions about an economic graph. There was a single objectively correct answer to each question. No outside research was required, and neither subjective analysis nor students’ own opinions were solicited. These were the types of questions that an instructor would typically include as free-response questions on an exam in a smaller course. Students were told that their submission should be composed in “essay form” and that, while economics content would play a much larger role in determining their grade, writing quality would also play a role. They were also told that there was no minimum or maximum required length, but that a strong answer could be provided in approximately 150 words.
Assignments were submitted through an electronic course management system (Canvas). Each submission was graded blindly by a randomly assigned trained TA, who determined the official score used in the student’s final course grade calculation, and was also evaluated by a randomly assigned (non-blind) peer grader. There were 5 male and 4 female TAs, all of whom were economics graduate students. Both the blind TAs and the non-blind peer graders evaluated the assignments using a common scoring rubric. Rubrics contained two types of questions: between seven and eleven economic content questions, each of which had an objectively correct answer, and three writing quality questions. Each question on the rubric was awarded a numeric score between zero and ten. Partial credit was available for economic content questions if, for example, a student provided the correct answer but used incorrect units of measurement. Writing questions were scored on the same zero-to-ten scale. The grading procedure generated a content subscore (the percentage of possible content points earned), a writing subscore (the percentage of possible writing points earned), and an overall assignment grade (the percentage of total possible points earned). Peer graders had one week to complete the peer review, and official TA grades were released after one week. Neither group had access to the other group's scores.
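To make the scoring arithmetic concrete, the following minimal sketch (in Python) computes the three scores from per-item rubric scores. The function name, data layout, and example values are illustrative assumptions rather than the course system's actual implementation.

    # Illustrative sketch: content subscore, writing subscore, and overall grade
    # computed from per-item rubric scores, each item scored on a 0-10 scale.
    # Names and example values below are hypothetical.
    def rubric_scores(content_items, writing_items, max_per_item=10):
        """Return (content subscore, writing subscore, overall grade) as percentages."""
        content_possible = max_per_item * len(content_items)
        writing_possible = max_per_item * len(writing_items)
        content_subscore = 100 * sum(content_items) / content_possible
        writing_subscore = 100 * sum(writing_items) / writing_possible
        overall = 100 * (sum(content_items) + sum(writing_items)) / (content_possible + writing_possible)
        return content_subscore, writing_subscore, overall

    # Example: eight content questions and three writing questions
    print(rubric_scores([10, 8, 10, 7, 10, 9, 10, 6], [9, 8, 10]))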
When completing their evaluations, both TAs and peer graders had access to the same information and scoring rubrics. The only difference was that peer graders could view the submission author's first and last name and, possibly, a small picture of the student’s face. The TAs graded blindly; they did not have access to the authors' names or pictures. Importantly, this experiment took place in a very large class, and most students did not know each other. The majority of peer graders therefore did not know the gender of the author; they could only infer it from the author's name or, possibly, from the student's picture. Many students did not upload pictures into the system, and even when they did, these pictures were very small (approximately the size of a dime). Rather than using the students' actual gender, we use the gender predicted by their names. Using genderize.io, we predict the gender probabilities for all first names. To be conservative, we use the following definition: if a name is predicted to be female with more than 90% probability, we define it as female-sounding; if it is predicted to be male with more than 90% probability, we define it as male-sounding. The remaining names are treated as ambiguous, and we assign "unknown gender."
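A minimal sketch of this classification rule is given below. It assumes the public genderize.io endpoint (https://api.genderize.io) and its documented response fields (gender, probability); the function name, threshold handling, and example names are ours, and the actual data collection may have differed (for example, batched queries or an API key).

    # Illustrative sketch of the name-based gender classification described above.
    # Queries the public genderize.io endpoint and applies the 90% probability rule.
    import requests

    def classify_name(first_name, threshold=0.90):
        resp = requests.get("https://api.genderize.io", params={"name": first_name})
        data = resp.json()  # e.g. {"name": "anna", "gender": "female", "probability": 0.98, ...}
        gender = data.get("gender")
        prob = data.get("probability") or 0.0  # probability can be missing for rare names
        if gender == "female" and prob > threshold:
            return "female-sounding"
        if gender == "male" and prob > threshold:
            return "male-sounding"
        return "unknown gender"

    print(classify_name("Anna"))    # expected: female-sounding
    print(classify_name("Jordan"))  # likely: unknown gender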
Only TA-assigned grades were used to calculate students’ final course grades. Each assignment accounted for 2% of the author’s final course grade. Peer grades affected only the final course grade of the peer grader: peer graders received an overall peer grading score (across all four peer reviews) that accounted for 4% of their final course grade. They did not receive any feedback on the quality of their peer reviews before the end of the course; that is, they could not adjust their approach from assignment to assignment in response to feedback. Peer graders were told that they must complete a peer review and that it must “more or less” match the TA’s evaluation to receive credit. If a review was too dissimilar from the TA’s grades, the peer grader would not receive credit for completing it. The clear incentive was to match the TAs’ evaluations. If students had prior beliefs about the strictness of TAs’ grading, they should have incorporated those beliefs when completing their peer reviews. These instructions are consistent with the analysis that follows in the paper, which focuses on the correlation between TA-assigned and peer review scores.
The students were not provided information about the identity or gender of the TAs, nor were they given specific information about the TAs’ grading process. For example, students were not told whether a single TA or multiple TAs would grade their work, whether graders were randomly assigned, or whether the same grader or graders would evaluate all of their assignments. They were simply told that the instructor and a group of TAs would evaluate their work blindly, using the same rubric that the students were provided, and assign official grades. The peer review, and the incentive to match TA scores, was in place only to encourage students to engage thoughtfully with the correct answers, which were included on the scoring rubric, after the assignment submission deadline.