Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias
Last registered on June 10, 2019

Pre-Trial

Trial Information
General Information
Title
Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias
RCT ID
AEARCTR-0003574
Initial registration date
May 20, 2019
Last updated
June 10, 2019 10:29 PM EDT
Location(s)
Region
Primary Investigator
Affiliation
Columbia Business School
Other Primary Investigator(s)
PI Affiliation
Columbia Business School
Additional Trial Information
Status
Completed
Start date
2019-05-03
End date
2019-06-01
Secondary IDs
Abstract
Why does “algorithmic bias” occur? The two most frequently cited reasons are “biased programmers” and “biased training data.” We quantify the effects of these using a field experiment.
External Link(s)
Registration Citation
Citation
Cowgill, Bo and Fabrizio Dell'Acqua. 2019. "Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias." AEA RCT Registry. June 10. https://doi.org/10.1257/rct.3574-1.0.
Former Citation
Cowgill, Bo and Fabrizio Dell'Acqua. 2019. "Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias." AEA RCT Registry. June 10. https://www.socialscienceregistry.org/trials/3574/history/47902.
Experimental Details
Interventions
Intervention(s)
Participants will be assigned to one of four conditions, receiving either biased or unbiased training data. We treat the biased-data condition with no additional intervention as the control group.
Intervention Start Date
2019-05-03
Intervention End Date
2019-06-01
Primary Outcomes
Primary Outcomes (end points)
- The performance of the algorithms developed for the data science competition.
Performance will be evaluated based on accuracy of prediction (for both groups).
Primary Outcomes (explanation)
We use performance to measure and compare bias in the proposed algorithms.
Note that even if programmers are given training data with sample-selection issues, they will be evaluated on a sample without selection bias. We believe this mirrors many real-world applications: models trained on unrepresentative datasets (sometimes called “datasets of convenience”) are often deployed in practice on a larger population.
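The evaluation idea above can be sketched as follows: score every submission on the same representative test sample, then compare prediction error across demographic groups. This is a minimal illustration only. The registration does not name an accuracy metric, so mean absolute error is an assumption here, and the function name `error_by_group` and all numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def error_by_group(y_true, y_pred, groups):
    """Mean absolute error overall and within each group (assumed metric)."""
    errors = defaultdict(list)
    for truth, pred, group in zip(y_true, y_pred, groups):
        errors[group].append(abs(truth - pred))
    per_group = {g: mean(errs) for g, errs in errors.items()}
    overall = mean([e for errs in errors.values() for e in errs])
    return overall, per_group


# Toy, made-up numbers purely for illustration (not study data).
scores = [250.0, 300.0, 280.0, 310.0]
predictions = [255.0, 290.0, 260.0, 300.0]
gender = ["M", "M", "F", "F"]

overall_mae, mae_by_gender = error_by_group(scores, predictions, gender)
# A positive gap means predictions are less accurate for women than for men.
gap = mae_by_gender["F"] - mae_by_gender["M"]
```

Because every model is scored on the same unbiased test sample, a gap of this kind can be compared directly across treatment arms.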

Additionally, we will measure the effects of different incentives: two different threshold levels to pass the assignment, and two levels of extra credit students gain for the same accuracy improvement.
Secondary Outcomes
Secondary Outcomes (end points)
Secondary Outcomes (explanation)
Experimental Design
Experimental Design
We will recruit a diverse set of programmers into a data science contest. As participants register, they will receive a training dataset to utilize in predicting an outcome. The outcome will be related to an event that historically exhibited bias against a particular group.
Experimental Design Details
We will recruit a diverse set of programmers into a data science contest using UpWork. They will be asked to predict math literacy scores for a representative sample of the American and OECD-country populations. Additionally, we will ask students in a Columbia Machine Learning class to complete the same task as a homework assignment; 264 students are enrolled in the class. The course instructors consider this assignment beneficial for class learning and well integrated with the overall focus of the class.

Subjects will be randomly assigned to one of four conditions. These conditions are the same for AI practitioners and students, although we treat the two groups separately.
- In the first condition, subjects will receive unbiased training data. The training dataset for these programmers is free of sample-selection bias: it contains example outcomes for a truly representative group. This will permit programmers to use unbiased training data in their algorithms.
- In the second condition, subjects will receive a dataset similar to the one above, except that it will contain realistic sample-selection bias against women. This group will face similar incentives to develop the most accurate predictive algorithm using the training data.
- A third group will receive the same dataset as the second group, plus a reminder that the data might be biased.
- Finally, a fourth group will receive the same dataset as the second group, plus a reminder that the data might be biased and technical guidance about addressing bias in AI.

All subjects in these groups are asked to develop models and deliver predictions of the math literacy scores for a representative test group of 20,000 OECD workers. Treatment arms are randomly assigned. We will then compare prediction and performance differences between coders with and without biased data, and between coders from demographics typical of high tech (i.e., men) and other coders.
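To make the second condition concrete, here is one way sample-selection bias against women could arise in a training set drawn from a representative population. The registration does not describe the actual selection mechanism, so the rule below (`selection_rule`), its probabilities, and the simulated score distribution are all hypothetical assumptions chosen only for illustration.

```python
import random


def draw_biased_sample(population, keep_prob, rng):
    """Keep each record with a probability that depends on the record."""
    return [row for row in population if rng.random() < keep_prob(row)]


def selection_rule(row):
    # Assumed rule, not the study's: men are always kept; women are kept
    # less often, and high-scoring women least often.
    if row["gender"] == "M":
        return 1.0
    return 0.2 if row["score"] >= 300 else 0.6


def share_women(rows):
    return sum(r["gender"] == "F" for r in rows) / len(rows)


def mean_score(rows, gender):
    selected = [r["score"] for r in rows if r["gender"] == gender]
    return sum(selected) / len(selected)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Simulated "representative" population with made-up scores.
population = [
    {"gender": rng.choice("MF"), "score": rng.gauss(300, 50)}
    for _ in range(10_000)
]
training = draw_biased_sample(population, selection_rule, rng)
```

Under this assumed rule, women are under-represented in `training` and the women who remain have lower average scores than women in the full population, so a model fit naively to `training` would tend to under-predict women's scores on a representative test set.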
By measuring accuracy for different types of predictions on a common scale, made by AI practitioners and students whose input data and demographics vary, we can measure the relative contributions of biased programmers and biased data.

Insofar as the second hypothesis (“biased training data”) is important, a growing technical literature in computer science seeks to reduce algorithmic bias through technical solutions. Our experiment includes a fourth condition to test the effectiveness of technical guidance. As with our other hypotheses, we suspect that a lack of technical training may explain bias, although it may not contribute much. Programmers may not understand the new techniques, or may ignore or incorrectly implement them. Standard techniques such as simple cross-validation could go a long way, even without the new methods. In addition, companies could instead gather higher-quality training data and use standard techniques rather than new computational methods. The experiment is designed to measure the relative effectiveness of these potential solutions.

We also experimentally test the effects of different incentives. We tell students the accuracy threshold they need to reach in order to pass the assignment, and we randomly assign one of two threshold levels within all four groups. That is, within each of the four groups there are two sub-groups: some students will have a lower threshold and some a higher one. Additionally, we give students who pass the threshold some extra credit. We randomly assign one of two levels of extra credit that students gain for the same accuracy improvement. Again, these two conditions will be present in all four groups outlined above.
Randomization Method
Randomization will be done in the office by a computer.
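A computer-based assignment of the kind described could look like the sketch below, which deals shuffled participants into the four data conditions crossed with the two threshold levels and the two extra-credit levels. This is not the study's actual procedure: the arm labels, the seed, the function name `assign`, and the choice to fully cross the two incentive factors with each other are assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product

# Hypothetical labels for the four conditions and the incentive levels.
DATA_ARMS = ["unbiased", "biased", "biased+reminder", "biased+reminder+guidance"]
THRESHOLDS = ["low", "high"]
EXTRA_CREDIT = ["low", "high"]


def assign(participant_ids, seed):
    """Balanced random assignment to (data arm, threshold, extra credit) cells."""
    # The data arm cycles fastest so that the four arms stay equal in size.
    cells = [
        (arm, threshold, credit)
        for threshold, credit in product(THRESHOLDS, EXTRA_CREDIT)
        for arm in DATA_ARMS
    ]
    ids = list(participant_ids)
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    rng.shuffle(ids)
    # Deal shuffled participants round-robin; cell sizes differ by at most one.
    return {pid: cells[i % len(cells)] for i, pid in enumerate(ids)}


# 264 students are enrolled in the class; the IDs and seed are placeholders.
assignment = assign(range(264), seed=2019)
per_arm = Counter(cell[0] for cell in assignment.values())
per_cell = Counter(assignment.values())
```

With 264 participants, this deals 66 into each of the four data conditions, matching the equal division across treatments stated for the class.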
Randomization Unit
Individual programmers
Was the treatment clustered?
Yes
Experiment Characteristics
Sample size: planned number of clusters
464 programmers
Sample size: planned number of observations
464 programmers
Sample size (or number of clusters) by treatment arms
264 programmers in the ML class, equally divided between the four treatments.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Supporting Documents and Materials
There are documents in this trial unavailable to the public.
IRB
INSTITUTIONAL REVIEW BOARDS (IRBs)
IRB Name
Columbia University
IRB Approval Date
2019-04-18
IRB Approval Number
AAAS2100
Post-Trial
Post Trial Information
Study Withdrawal
Intervention
Is the intervention completed?
No
Is data collection complete?
Data Publication
Data Publication
Is public data available?
No
Program Files
Program Files
Reports and Papers
Preliminary Reports
Relevant Papers