Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias

Last registered on November 11, 2019

Pre-Trial

Trial Information

General Information

Title
Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias
RCT ID
AEARCTR-0003574
Initial registration date
May 20, 2019

First published
June 10, 2019, 10:29 PM EDT

Last updated
November 11, 2019, 10:51 AM EST

Locations

Region

Primary Investigator

Affiliation
Harvard Business School

Other Primary Investigator(s)

PI Affiliation
Columbia Business School

Additional Trial Information

Status
Ongoing
Start date
2019-05-03
End date
2020-01-20
Secondary IDs
Abstract
Why does “algorithmic bias” occur? The two most frequently cited reasons are “biased programmers” and “biased training data.” We quantify the effects of these using a field experiment.
External Link(s)

Registration Citation

Citation
Cowgill, Bo and Fabrizio Dell'Acqua. 2019. "Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias." AEA RCT Registry. November 11. https://doi.org/10.1257/rct.3574-1.1
Former Citation
Cowgill, Bo and Fabrizio Dell'Acqua. 2019. "Biased Programmers? Or Biased Training Data? A Field Experiment about Algorithmic Bias." AEA RCT Registry. November 11. https://www.socialscienceregistry.org/trials/3574/history/56727
Experimental Details

Interventions

Intervention(s)
Participants will be assigned to one of four conditions, receiving either biased or unbiased training data. We treat the biased-data condition (with no other interventions) as the control group.
Intervention Start Date
2019-05-03
Intervention End Date
2019-11-23

Primary Outcomes

Primary Outcomes (end points)
- The performance of the algorithms developed for the data science competition.
Performance will be evaluated based on accuracy of prediction (for both groups).
Primary Outcomes (explanation)
We use performance to measure and compare bias in the proposed algorithms.
Note that even if programmers are given training data with sample-selection issues, they will be evaluated using a sample without selection bias. We believe this mirrors how many real-world applications work: models trained on unrepresentative datasets (sometimes “datasets of convenience”) are often deployed in practice on a larger population.
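As a purely illustrative sketch of this evaluation logic (the registration does not specify the exact accuracy metric), a submission might be scored against the representative hold-out sample as follows:

```python
# Hypothetical scoring routine. Every submission is evaluated on the same
# representative hold-out sample, whether or not its training data contained
# sample-selection bias. The RMSE metric and the column name
# "math_literacy_score" are placeholders, not the study's actual metric.
import numpy as np
import pandas as pd

def score_submission(predictions: np.ndarray, eval_set: pd.DataFrame) -> float:
    """Root-mean-squared error on the representative evaluation sample."""
    errors = predictions - eval_set["math_literacy_score"].to_numpy()
    return float(np.sqrt(np.mean(errors ** 2)))
```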

Additionally, we will measure the effects of different incentives: two threshold levels required to pass the assignment, and two levels of extra credit that students earn for the same accuracy improvement.

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
We will recruit a diverse set of programmers into a data science contest. As participants register, they will receive a training dataset to utilize in predicting an outcome. The outcome will be related to an event that historically exhibited bias against a particular group.
Experimental Design Details
We will recruit a diverse set of programmers into a data science contest using Upwork. They will be asked to predict math literacy scores for a representative sample of the population of the United States and other OECD countries. Additionally, we will ask students in a Columbia machine learning course to complete the same task as a homework assignment; 264 students are enrolled in the class. The instructors of the course consider this assignment to be beneficial for class learning and well integrated with the overall focus of the class.

Subjects will be randomly assigned to one of four different conditions. These conditions are the same for AI practitioners and students, although we treat the two groups separately.
- In the first condition, subjects will receive unbiased training data: a training dataset free of sample-selection bias, one where we have example outcomes for a truly representative group. This allows these programmers to build their algorithms on unbiased training data.
- In the second condition, subjects receive a similar dataset, except that it contains realistic sample-selection bias against women (a sketch of one possible construction appears after this list). This second group will face similar incentives to develop the most accurate predictive algorithm using the training data.
- A third group will receive the same dataset as the second group, plus a reminder that the data might be biased.
- Finally, a fourth group will receive the same dataset as the second group, plus a reminder that the data might be biased and technical guidance on addressing bias in AI.
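
For illustration only, since the registration does not describe the exact sampling mechanism, the biased and unbiased training sets could be constructed along these lines, e.g. by under-sampling women:

```python
# Hypothetical construction of the two training datasets. The "female"
# column, sample size, and retention weights are placeholders; the
# registration does not describe the actual sampling rule.
import pandas as pd

def make_training_sets(population: pd.DataFrame, seed: int = 0):
    # Unbiased condition: a simple random sample of the population.
    unbiased = population.sample(n=5000, random_state=seed)
    # Biased condition: women are under-sampled relative to the population.
    weights = population["female"].map({1: 0.3, 0: 1.0})
    biased = population.sample(n=5000, weights=weights, random_state=seed)
    return unbiased, biased
```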

All subjects in these groups are asked to develop models and deliver predictions of math literacy scores for a representative test group of 20,000 OECD workers. Treatment arms are randomly assigned.

We will then compare prediction and performance differences between coders given biased data and those given unbiased data, and between coders from demographics typical of the high-tech industry (i.e., men) and others. By measuring the accuracy of predictions on a common scale – across AI practitioners and students whose demographics and input data vary – we can measure the relative contributions of biased programmers and biased training data.
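
One illustrative way to operationalize this comparison (the registration does not commit to a particular estimating equation) is to regress a programmer-level error measure on the data-bias treatment and a programmer-demographic indicator:

```python
# Hypothetical analysis sketch using the statsmodels formula API. Variable
# names (abs_error, biased_data, programmer_female) are placeholders for
# programmer-level outcomes and treatment/demographic indicators.
import statsmodels.formula.api as smf

def decompose_bias(df):
    """Regress a programmer-level error measure on the biased-data treatment
    and a programmer-demographic indicator."""
    model = smf.ols("abs_error ~ biased_data + programmer_female", data=df)
    return model.fit(cov_type="HC1")
```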

Insofar as the second hypothesis (“biased training data”) is important, a growing technical literature in computer science seeks to reduce algorithmic bias through technical solutions. Our experiment includes a fourth condition to test the effectiveness of such technical guidance. As with our other hypotheses, we suspect that a lack of technical training may explain bias; however, it may not contribute much. Programmers may not understand the new techniques, or may ignore or incorrectly implement them. Standard techniques such as simple cross-validation could go a long way, even without the new methods. In addition, companies could instead gather higher-quality training data and use standard techniques rather than new computational methods. The experiment is designed to measure the relative effectiveness of these potential solutions.

We also experimentally test the effects of different incentives. We communicate to students the accuracy threshold they need to reach in order to pass the assignment, and we randomly assign one of two threshold levels within all four groups. That is, within each of the four groups, there are two sub-groups: some students face a lower threshold, and some a higher one.
Additionally, we give students who pass the threshold extra credit. We randomly assign one of two levels of extra credit that students earn for the same accuracy improvement. Again, these two conditions will be present in all four groups outlined above.
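
A minimal sketch of the resulting 4 × 2 × 2 assignment, assuming a balanced split across data conditions and independent randomization of the incentive sub-conditions (implementation details are not part of this registration):

```python
# Hypothetical assignment routine: each student gets one of the four data
# conditions (balanced split), plus one of two passing thresholds and one of
# two extra-credit levels, crossed within every data condition.
import numpy as np

def assign_arms(n_students: int, seed: int = 0) -> np.ndarray:
    """Return an (n_students, 3) array: data condition (0-3),
    threshold level (0-1), extra-credit level (0-1)."""
    rng = np.random.default_rng(seed)
    data_condition = rng.permutation(np.resize(np.arange(4), n_students))
    threshold = rng.integers(0, 2, size=n_students)
    extra_credit = rng.integers(0, 2, size=n_students)
    return np.column_stack([data_condition, threshold, extra_credit])
```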
Randomization Method
The randomization will be done in office by a computer
Randomization Unit
Individual programmers
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
464 programmers
Sample size: planned number of observations
464 programmers
Sample size (or number of clusters) by treatment arms
264 programmers in the ML class, divided equally among the four treatment arms.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Supporting Documents and Materials

There is information in this trial unavailable to the public.
IRB

Institutional Review Boards (IRBs)

IRB Name
Columbia University
IRB Approval Date
2019-04-18
IRB Approval Number
AAAS2100

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials