What Drives Different Assessments of Performance on Coding Tasks?

Last registered on January 08, 2025

Pre-Trial

Trial Information

General Information

Title
What Drives Different Assessments of Performance on Coding Tasks?
RCT ID
AEARCTR-0009816
Initial registration date
December 14, 2022

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
January 03, 2023, 4:24 PM EST

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Last updated
January 08, 2025, 2:21 PM EST

Last updated is the most recent time when changes to the trial's registration were published.

Locations

Region

Primary Investigator

Affiliation
University of Michigan

Other Primary Investigator(s)

PI Affiliation
University of Toronto

Additional Trial Information

Status
In development
Start date
2022-10-01
End date
2025-12-01
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This study focuses on evaluations of performance in coding interviews, which are used to hire computer programmers. We aim to shed light on the mechanisms underlying differences in these ratings, including differences in code quality and style, and in coder effort.
External Link(s)

Registration Citation

Citation
Craig, Ashley and Clémentine Van Effenterre. 2025. "What Drives Different Assessments of Performance on Coding Tasks?" AEA RCT Registry. January 08. https://doi.org/10.1257/rct.9816-1.2
Experimental Details

Interventions

Intervention(s)
We aim to assess how software developers evaluate pieces of code written by computer programmers. Our experiment uses a large set of de-identified code blocks from an online coding platform, which span coders of different skill levels and problems of different levels of difficulty. For each code block, we will have access to objective measures of the code's performance, including sub-test results (e.g., whether it runs, whether it produces correct answers to unit tests, etc.). Using these data, we will ask evaluators to judge the quality of the code using the same Likert scales as on the platform.
Intervention (Hidden)
We aim to assess whether software developers evaluate pieces of code differently if they know the gender of the coder, controlling for the objective quality of the code. Our experiment uses a large set of de-identified code blocks written by a set of men and women on an online platform. This set spans coders of different skill levels and problems of different levels of difficulty. For each code block, we will have access to objective measures of the code's performance, including sub-test results (e.g., whether it runs, whether it produces correct answers to unit tests, etc.). Using these data, we will ask evaluators to judge the quality of the code using the same Likert scales as on the platform.

Depending on the treatment condition, the evaluator will be aware of the gender or other basic information about the programmer who wrote the code (but will never be given identifying information). Using these evaluations, we will ask: (i) whether there are perceived differences in the quality of the code written by men and women; and (ii) how those perceived differences change when the evaluator is aware of the gender of the coder. This will let us test whether there are any unobservable dimensions of performance that are correlated with gender and driving the residual gender gaps that we see in our data. We will attempt to understand any such residual gaps by examining particular dimensions of performance as laid out in our pre-analysis plan.
Intervention Start Date
2023-01-15
Intervention End Date
2023-02-15

Primary Outcomes

Primary Outcomes (end points)
Subjective ratings of code quality
Primary Outcomes (explanation)
Our primary outcome is evaluators’ subjective rating of code quality; we examine whether it differs by the gender of the coder and by treatment condition. For each block of code, respondents will be asked to rate the code on a scale from 1 to 4.

Secondary Outcomes

Secondary Outcomes (end points)
Evaluators’ prediction of the candidate’s score from the automated evaluation tool.
Secondary Outcomes (explanation)
This is a continuous variable in [0,1]. A third outcome variable is evaluators’ prediction of the candidate’s hireability score. This is measured on a Likert scale from 1 to 4, and allows us to draw a more direct link between our findings and hiring outcomes. Additionally, we will measure how much time respondents spend on each question to capture fatigue and inattention, and how this varies over the course of the study.

Experimental Design

Experimental Design
Our design relies on multiple observations per subject. Each participant will evaluate 4 code blocks.
Experimental Design Details
Let N be the number of evaluators and P the number of problems per evaluator. As mentioned before, our sample of code blocks is stratified by gender and performance, such that P/2 code blocks are written by women, among which P/4 are high-scoring code blocks according to the platform's objective scoring tool. Each evaluator i is assigned a set of P problems in a random order. We use a within-subject design. We define Rj = 0 for a blind problem j (the gender of the coder is not revealed) and Rj = 1 for a non-blind problem j (the gender of the coder is revealed). For each evaluator i, the gender of the coder will be revealed for half of the problems. To account for potential priming effects, we randomize whether the gender of the coder is revealed in the first or in the second half of the study.
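The assignment described above can be sketched as follows. This is a minimal, hypothetical illustration of the stated design (N = 400 evaluators, P = 4 blocks each, stratified by gender and performance, gender revealed for half the problems with the revealed half randomly placed first or second); all names and the block pool are assumptions, not the authors' actual code.

```python
import random

N = 400  # planned number of evaluators
# One stratified draw per evaluator: P = 4 blocks, crossing coder gender
# (P/2 by women) with the platform's objective score (P/4 high-scoring per gender).
def draw_stratified_blocks(rng):
    blocks = [{"gender": g, "score": s}
              for g in ("woman", "man")
              for s in ("high", "low")]
    rng.shuffle(blocks)  # random presentation order
    return blocks

def assign_evaluator(rng):
    blocks = draw_stratified_blocks(rng)
    # Reveal gender (Rj = 1) for exactly half of the problems; to address
    # priming, randomize whether the revealed half comes first or second.
    reveal_first = rng.random() < 0.5
    for j, block in enumerate(blocks):
        in_first_half = j < len(blocks) // 2
        block["revealed"] = (in_first_half == reveal_first)
    return blocks

rng = random.Random(0)  # randomization done by computer; seeded for reproducibility
assignments = [assign_evaluator(rng) for _ in range(N)]

n_obs = sum(len(a) for a in assignments)                          # 400 * 4 = 1600
n_revealed = sum(b["revealed"] for a in assignments for b in a)   # half of 1600 = 800
print(n_obs, n_revealed)  # prints: 1600 800
```

By construction, each evaluator sees exactly two revealed and two blind blocks, which reproduces the planned split of 800 treated and 800 control observations.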
Randomization Method
Randomization will be done by a computer.
Randomization Unit
We use a within-subject design: half of the code blocks seen by an evaluator will be in the treated condition and half in the control condition. Whether the treated half is seen first or second will be randomized, as will the order of the code blocks within each condition.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
400 evaluators.
Sample size: planned number of observations
1600 observations.
Sample size (or number of clusters) by treatment arms
800 treated code blocks, 800 non-treated.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Supporting Documents and Materials

There is information in this trial unavailable to the public; access can be requested through the Registry.
IRB

Institutional Review Boards (IRBs)

IRB Name
University of Toronto Research Oversight and Compliance Office — Human Research Ethics Program
IRB Approval Date
2022-10-06
IRB Approval Number
41662
Analysis Plan

Analysis Plan Documents

Pre-analysis plan

MD5: dcda8c268891601c2a8a769d18346b2e

SHA1: dbe57ea15edb7fc0dae1fbf4c24eb14bce7cbe8d

Uploaded At: February 17, 2023

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public; access can be requested through the Registry.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials