Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias

Last registered on November 11, 2019

Pre-Trial

Trial Information

General Information

Title
Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias
RCT ID
AEARCTR-0003574
Initial registration date
May 20, 2019

First published
June 10, 2019, 10:29 PM EDT

Last updated
November 11, 2019, 10:51 AM EST

Locations

Region

Primary Investigator

Affiliation
Harvard Business School

Other Primary Investigator(s)

PI Affiliation
Columbia Business School

Additional Trial Information

Status
Ongoing
Start date
2019-05-03
End date
2020-01-20
Secondary IDs
Abstract
Why does “algorithmic bias” occur? The two most frequently cited reasons are “biased programmers” and “biased training data.” We quantify the effects of these using a field experiment.
External Link(s)

Registration Citation

Citation
Cowgill, Bo and Fabrizio Dell'Acqua. 2019. "Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias." AEA RCT Registry. November 11. https://doi.org/10.1257/rct.3574-1.1
Former Citation
Cowgill, Bo and Fabrizio Dell'Acqua. 2019. "Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias." AEA RCT Registry. November 11. https://www.socialscienceregistry.org/trials/3574/history/56727
Experimental Details

Interventions

Intervention(s)
Participants will be assigned to one of four conditions, receiving either biased or unbiased training data. We treat the biased-data condition (with no other interventions) as the control group.
Intervention Start Date
2019-05-03
Intervention End Date
2019-11-23

Primary Outcomes

Primary Outcomes (end points)
- The performance of the algorithms developed for the data science competition.
Performance will be evaluated based on accuracy of prediction (for both groups).
Primary Outcomes (explanation)
We use performance to measure and compare bias in the proposed algorithms.
Note that even if programmers are given training data with sample-selection issues, they will be evaluated using a sample without selection bias. We believe this mirrors how many real-world applications work: models trained on unrepresentative datasets (sometimes “datasets of convenience”) are often deployed in practice on a larger population.
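As a purely illustrative sketch of this evaluation logic (the registration does not specify the exact accuracy metric), a submission might be scored against the representative hold-out sample as follows:

```python
# Hypothetical scoring routine. Every submission is evaluated on the same
# representative hold-out sample, whether or not its training data contained
# sample-selection bias. The RMSE metric and the column name
# "math_literacy_score" are placeholders, not the study's actual metric.
import numpy as np
import pandas as pd

def score_submission(predictions: np.ndarray, eval_set: pd.DataFrame) -> float:
    """Root-mean-squared error on the representative evaluation sample."""
    errors = predictions - eval_set["math_literacy_score"].to_numpy()
    return float(np.sqrt(np.mean(errors ** 2)))
```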

Additionally, we will measure the effects of different incentives: two threshold levels required to pass the assignment, and two levels of extra credit that students earn for the same accuracy improvement.

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
We will recruit a diverse set of programmers into a data science contest. As participants register, they will receive a training dataset to utilize in predicting an outcome. The outcome will be related to an event that historically exhibited bias against a particular group.
Experimental Design Details
We will recruit a diverse set of programmers into a data science contest using Upwork. They will be asked to predict math literacy scores for a representative sample of the population of the United States and other OECD countries. Additionally, we will ask students in a Columbia machine learning course to complete the same task as a homework assignment; 264 students are enrolled in the class. The instructors of the course consider this assignment to be beneficial for class learning and well integrated with the overall focus of the class.

Subjects will be randomly assigned to one of four different conditions. These conditions are the same for AI practitioners and students, although we treat the two groups separately.
- In the first condition, subjects will receive unbiased training data: a training dataset free of sample-selection bias, one where we have example outcomes for a truly representative group. This allows these programmers to build their algorithms on unbiased training data.
- In the second condition, subjects receive a similar dataset, except that it contains realistic sample-selection bias against women (a sketch of one possible construction appears after this list). This second group will face similar incentives to develop the most accurate predictive algorithm using the training data.
- A third group will receive the same dataset as the second group, plus a reminder that the data might be biased.
- Finally, a fourth group will receive the same dataset as the second group, plus a reminder that the data might be biased and technical guidance on addressing bias in AI.
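
For illustration only, since the registration does not describe the exact sampling mechanism, the biased and unbiased training sets could be constructed along these lines, e.g. by under-sampling women:

```python
# Hypothetical construction of the two training datasets. The "female"
# column, sample size, and retention weights are placeholders; the
# registration does not describe the actual sampling rule.
import pandas as pd

def make_training_sets(population: pd.DataFrame, seed: int = 0):
    # Unbiased condition: a simple random sample of the population.
    unbiased = population.sample(n=5000, random_state=seed)
    # Biased condition: women are under-sampled relative to the population.
    weights = population["female"].map({1: 0.3, 0: 1.0})
    biased = population.sample(n=5000, weights=weights, random_state=seed)
    return unbiased, biased
```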

All subjects in these groups are asked to develop models and deliver predictions of math literacy scores for a representative test group of 20,000 OECD workers. Treatment arms are randomly assigned.

We will then compare prediction and performance differences between coders given biased data and those given unbiased data, and between coders from demographics typical of the high-tech industry (i.e., men) and others. By measuring the accuracy of predictions on a common scale – across AI practitioners and students whose demographics and input data vary – we can measure the relative contributions of biased programmers and biased training data.
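
One illustrative way to operationalize this comparison (the registration does not commit to a particular estimating equation) is to regress a programmer-level error measure on the data-bias treatment and a programmer-demographic indicator:

```python
# Hypothetical analysis sketch using the statsmodels formula API. Variable
# names (abs_error, biased_data, programmer_female) are placeholders for
# programmer-level outcomes and treatment/demographic indicators.
import statsmodels.formula.api as smf

def decompose_bias(df):
    """Regress a programmer-level error measure on the biased-data treatment
    and a programmer-demographic indicator."""
    model = smf.ols("abs_error ~ biased_data + programmer_female", data=df)
    return model.fit(cov_type="HC1")
```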

Insofar as the second hypothesis (“biased training data”) is important, a growing technical literature in computer science seeks to reduce algorithmic bias through technical solutions. Our experiment includes a fourth condition to test the effectiveness of such technical guidance. As with our other hypotheses, we suspect that a lack of technical training may explain bias; however, it may not contribute much. Programmers may not understand the new techniques, or may ignore or incorrectly implement them. Standard techniques such as simple cross-validation could go a long way, even without the new methods. In addition, companies could instead gather higher-quality training data and use standard techniques rather than new computational methods. The experiment is designed to measure the relative effectiveness of these potential solutions.

We also experimentally test the effects of different incentives. We communicate to students the accuracy threshold they need to reach in order to pass the assignment, and we randomly assign one of two threshold levels within all four groups. That is, within each of the four groups, there are two sub-groups: some students face a lower threshold, and some a higher one.
Additionally, we give students who pass the threshold extra credit. We randomly assign one of two levels of extra credit that students earn for the same accuracy improvement. Again, these two conditions will be present in all four groups outlined above.
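
A minimal sketch of the resulting 4 × 2 × 2 assignment, assuming a balanced split across data conditions and independent randomization of the incentive sub-conditions (implementation details are not part of this registration):

```python
# Hypothetical assignment routine: each student gets one of the four data
# conditions (balanced split), plus one of two passing thresholds and one of
# two extra-credit levels, crossed within every data condition.
import numpy as np

def assign_arms(n_students: int, seed: int = 0) -> np.ndarray:
    """Return an (n_students, 3) array: data condition (0-3),
    threshold level (0-1), extra-credit level (0-1)."""
    rng = np.random.default_rng(seed)
    data_condition = rng.permutation(np.resize(np.arange(4), n_students))
    threshold = rng.integers(0, 2, size=n_students)
    extra_credit = rng.integers(0, 2, size=n_students)
    return np.column_stack([data_condition, threshold, extra_credit])
```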
Randomization Method
The randomization will be done in office by a computer
Randomization Unit
Individual programmers
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
464 programmers
Sample size: planned number of observations
464 programmers
Sample size (or number of clusters) by treatment arms
264 programmers in the ML class, divided equally among the four treatment arms.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Supporting Documents and Materials

There is information in this trial unavailable to the public.
IRB

Institutional Review Boards (IRBs)

IRB Name
Columbia University
IRB Approval Date
2019-04-18
IRB Approval Number
AAAS2100

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials