Evaluating Evaluations

Last registered on June 24, 2024

Pre-Trial

Trial Information

General Information

Title
Evaluating Evaluations
RCT ID
AEARCTR-0013780
Initial registration date
June 06, 2024

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
June 24, 2024, 9:30 AM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

There is information in this trial unavailable to the public. Use the button below to request access.


Primary Investigator

Affiliation
University of Florida

Other Primary Investigator(s)

PI Affiliation
University of Florida
PI Affiliation
University of Florida

Additional Trial Information

Status
In development
Start date
2024-06-03
End date
2024-08-09
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
Many educational and professional outcomes depend on performance evaluations. These evaluations take the form of admission or hiring decisions, selection for an award, scholarship, or grant, or even loan applications. In most settings, applicants are evaluated by either a committee or an individual with limited time to decide whether the application is above or below the bar. While the accuracy and unbiasedness of these evaluations are crucial for an efficient allocation of talent and resources, they might be affected by irrelevant factors, such as the order in which an applicant is evaluated or the time spent on each evaluation. We explore the accuracy of sequential evaluations using a large introductory economics course at a large public university. Students are assigned to grade 20 short writing assignments (SWAs) that were submitted by students in a previous semester, using the same rubric and training that were used to grade those SWAs originally. The current students receive their 20 assignments in a randomized sequential order, with randomly assigned names on each assignment, and are instructed to grade them correctly. They earn points depending on how close their scores are to the "correct score." Each of the old SWAs was graded blindly by multiple teaching assistants, so we use the average of those scores as each SWA's correct score. This allows us to identify the effect of the composition of the SWAs on the accuracy of grading. We will also examine whether the order in which SWAs are graded, as well as the time spent on each, affects the accuracy of the evaluations.
External Link(s)

Registration Citation

Citation
Pollard, Garrison, Mark Rush and Perihan Saygin. 2024. "Evaluating Evaluations." AEA RCT Registry. June 24. https://doi.org/10.1257/rct.13780-1.0
Experimental Details

Interventions

Intervention(s)
Students are given 20 assignments to grade in a randomized sequential order, with randomly assigned names on each assignment
Intervention Start Date
2024-06-03
Intervention End Date
2024-06-07

Primary Outcomes

Primary Outcomes (end points)
Student-assigned scores at the rubric level for each question on the assignment: Each student evaluates 20 randomly assigned short writing assignments (SWAs) in a random order using the same rubric.

Student-assigned content and writing scores: The rubric divides the grade into two parts: 70% of the grade evaluates the content of the SWA and 30% the writing quality.

Timeline of completing the grading of each SWA1 for each student: We observe when students start grading each SWA and when they complete it.
Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)
These variables will allow us to understand the role of the characteristics of student evaluators.
Exam grades: Within several days after each short writing assignment's grades are posted, students will take exams consisting of 40 multiple-choice questions. This will provide an additional measure of student performance on the same subject.

Course Participation and Engagement: Learning Management System (Canvas) activity and resource utilization. The learning management system collects hourly-aggregated page and file view statistics for all students (with time stamps for the first and last view within each hour). We will use these statistics to construct several measures of course engagement. Active days are the number of days in a period on which a student viewed any page or resource on Canvas. Using time stamps, we will also measure procrastination (how long after an assignment first becomes available a student first works on it).
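The two engagement measures above can be sketched from hourly view records. This is a minimal illustration, not the registered analysis code; the record layout and student IDs are hypothetical assumptions.

```python
from datetime import datetime

# Hypothetical hourly-aggregated view records: (student_id, first_view_timestamp)
views = [
    ("s1", datetime(2024, 6, 3, 11, 5)),
    ("s1", datetime(2024, 6, 3, 14, 30)),
    ("s1", datetime(2024, 6, 5, 9, 0)),
    ("s2", datetime(2024, 6, 6, 22, 45)),
]

def active_days(records, student_id):
    """Number of distinct days on which the student viewed any page or resource."""
    return len({ts.date() for sid, ts in records if sid == student_id})

def procrastination_hours(records, student_id, assignment_open):
    """Hours between the assignment opening and the student's first view of it."""
    first = min(ts for sid, ts in records if sid == student_id)
    return (first - assignment_open).total_seconds() / 3600

open_time = datetime(2024, 6, 3, 11, 0)  # SWA1 opens at 11:00 am on June 3
print(active_days(views, "s1"))                       # → 2
print(procrastination_hours(views, "s2", open_time))  # → 83.75
```

Student "s1" viewed pages on two distinct days (June 3 and June 5); student "s2" first opened the assignment 83.75 hours after it became available.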
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
This is a field experiment conducted in a large principles of economics course at a large, comprehensive university. Students complete two types of Short Writing Assignments (SWAs). The first, SWA1, asks students to grade answers submitted by students in a previous semester. The second, SWA2, asks students to answer a set of questions.

SWA2: Students are given a prompt with several questions that they must answer. These assignments are typed and composed in essay form. There is no required minimum or maximum length, but students are advised that they need no more than 250 words. The TAs evaluate these SWAs blindly for accuracy and writing quality.

SWA1: Students are given 20 SWAs that were submitted by students from a previous semester. They grade these 20 SWAs on Qualtrics using the same rubric and training that were actually used to grade these old SWAs. The rubric has 10 components, and each is worth from 0 to 10 points for a total of 100 possible points. Students grade each of the 20 SWAs they received using this rubric.

TA-assigned grades from previous semester: For each SWA we give students to grade, we already have the total score from the previous semester's grading. These assignments were randomly assigned to a TA grader; graders were responsible only for grading and did not interact with students in any way. Grading was double-blind, such that neither student nor grader was aware of the identity of the other. TA graders were provided the same standard rubric of ten elements, each valued at ten points. To obtain more precise measures of SWA grades from the previous semester, each SWA was regraded blindly by two additional randomly assigned graders after the end of that semester. That is, there are 3 blind scores from 3 different TAs, and the average of these 3 grades serves as a benchmark. We call this the "correct" score, and students were told to match the correct score without knowing how it was calculated. They were only told that the correct scores were obtained blindly by several TAs and the instructor in the previous semester.

In SWA1, students are incentivized to do their best in grading these 20 SWAs by matching the "correct" scores. They are instructed that their grade will be calculated as follows:

We will compare the total score a student gave each of the 20 SWAs they graded to the correct total score on each SWA. If their total is within ± 3, they get 5 points for that SWA; if it is within ± 5, they get 4 points for that SWA; if it is within ± 7, they get 3 points for that SWA; if it is within ± 9, they get 2 points for that SWA; and, if it is within ± 11, they get 1 point for that SWA. Plus, they get 1 bonus point if their score exactly matches the correct score. We do this procedure for all 20 of their graded SWAs and their total number of points will be their score on SWA1.
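The tiered accuracy rule above can be expressed as a short scoring function. This is an illustrative sketch of the stated rule, not the actual grading code; function names are my own.

```python
def swa_points(student_total, correct_total):
    """Points for one graded SWA under the tiered accuracy rule:
    within ±3 → 5, ±5 → 4, ±7 → 3, ±9 → 2, ±11 → 1, else 0,
    plus 1 bonus point for an exact match."""
    diff = abs(student_total - correct_total)
    bonus = 1 if diff == 0 else 0
    for tolerance, points in [(3, 5), (5, 4), (7, 3), (9, 2), (11, 1)]:
        if diff <= tolerance:
            return points + bonus
    return 0  # more than 11 points off earns nothing

def swa1_score(student_totals, correct_totals):
    """A student's SWA1 score: sum of points over all 20 graded SWAs."""
    return sum(swa_points(s, c) for s, c in zip(student_totals, correct_totals))

print(swa_points(80, 80))  # exact match → 5 + 1 bonus = 6
print(swa_points(76, 80))  # off by 4, within ±5 → 4
print(swa_points(68, 80))  # off by 12 → 0
```

So a student who exactly matched all 20 correct scores would earn 120 points, while one who was always within ±3 but never exact would earn 100.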

On the day of assignment opening, students received instructions both on the Canvas site and in lecture on how to complete the assignment. These instructions included how to access the Qualtrics survey, an overview of the prompt and rubric, four sample submissions with writing quality scores, an in-lecture demonstration of completing the survey, and a description of how they would be evaluated. Students were invited to ask questions both in lecture and through email. The in-class demonstration and sample submissions included submissions that were excluded from the randomization pool.

The deadlines for the SWAs are as follows, and everyone is granted an automatic, no-penalty one-day extension. SWA1 will open at 11:00 am on June 3, 2024, and the deadline will be June 6 at 11:59 pm; if a student needs it, the automatic extension runs to 11:59 pm on June 7.

The process of assigning 20 SWAs:

The randomization pool of 600 previously graded SWAs was created from a pool of 845 real submissions from Spring 2024. Only properly submitted anonymous assignments were selected; regraded assignments and those with outlier scores on certain rubric elements were also excluded. Of the remaining 770 submissions, 600 were selected at random to serve as the randomization pool. Each assignment was randomly assigned to one of twenty "banks" (30 submissions each), with two versions of each submission: one with a randomly chosen male-sounding first name (with replacement) and one with a randomly chosen female-sounding name (with replacement). Last initials were also randomly assigned at the file level (with replacement).

Qualtrics chooses a "bank" at random (without replacement) to display and then chooses a file at random within that bank to present to the student. It is therefore not possible for a student to observe the same file twice. However, a student could potentially observe the same first name with different last initials for different files.
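The draw described above can be sketched as follows. This is an assumed reconstruction of the Qualtrics logic, not its actual implementation: banks are sampled without replacement, then one file is drawn within each chosen bank, so no file can repeat for a given student.

```python
import random

def assign_sequence(banks, rng=random):
    """Sketch of the assumed draw: pick banks in a random order without
    replacement, then pick one file at random within each chosen bank.
    Because every file sits in exactly one bank, no file appears twice."""
    order = rng.sample(range(len(banks)), k=len(banks))  # banks without replacement
    return [rng.choice(banks[i]) for i in order]         # one file per bank

# Hypothetical pool: 20 banks of 30 files each (600 submissions total)
banks = [[f"bank{i}_file{j}" for j in range(30)] for i in range(20)]
sequence = assign_sequence(banks, random.Random(0))
print(len(sequence))  # → 20
```

Each student's sequence contains exactly one file from each of the 20 banks, in a random bank order. Repetition of first names across files remains possible because names were assigned independently at the file level.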

In addition to student responses, Qualtrics also collects metadata including browser name and device type, time spent on each page, time stamps for each click on the rubric and file, and the order in which the questions appeared and were answered.

Name selection: We compiled a list of 300 names to use as the random names on the short writing assignments. We first collected the Spring 2024 course roster and used R's gender package to predict gender from each first name. We kept only names that were >98% or <2% probability female and appeared at least 1,000 times in the R database, and we also manually removed some names that were not obviously one gender.
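The filtering step can be sketched as below. The prediction table is hypothetical (the actual predictions come from R's gender package), and the threshold values are those stated above.

```python
# Hypothetical predictions in the shape produced by R's gender package:
# (name, proportion_female, count_in_database)
predictions = [
    ("Emily", 0.996, 25000),
    ("Michael", 0.004, 90000),
    ("Jordan", 0.47, 40000),   # ambiguous: dropped by the probability thresholds
    ("Rare", 0.99, 500),       # too infrequent: dropped by the count rule
]

def filter_names(preds, hi=0.98, lo=0.02, min_count=1000):
    """Keep names predicted >98% or <2% female that appear
    at least 1,000 times in the database."""
    return [(name, "female" if p > hi else "male")
            for name, p, count in preds
            if (p > hi or p < lo) and count >= min_count]

print(filter_names(predictions))  # → [('Emily', 'female'), ('Michael', 'male')]
```

The manual removal of residually ambiguous names is a judgment step on top of this automated filter and is not captured in code.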

Restricting the analysis: If students do not submit the SWA1 (grading of 20 SWAs from previous semester) by the end of the extension, they receive a zero for the score. Any late or incomplete submissions will not be included in our analysis.
Experimental Design Details
Not available
Randomization Method
We randomize treatment assignment at the individual level; randomization is performed by computer.
Randomization Unit
Individual participant.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
NA
Sample size: planned number of observations
20 Short Writing Assignments for each of approximately 261 students, yielding 5,220 observations (expected to change based on drops/adds or late/incomplete submissions).
Sample size (or number of clusters) by treatment arms
No differential treatment arms.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
IRB

Institutional Review Boards (IRBs)

IRB Name
University of Florida (submitted; awaiting approval)
IRB Approval Date
2024-06-10
IRB Approval Number
IRB202400871