Experimental Design Details
We will recruit a diverse set of programmers into a data science contest on UpWork. They will be asked to predict math literacy scores for a representative sample of the American and broader OECD population. Additionally, we will ask students of a Columbia machine learning course to complete the same task as a homework assignment; 264 students are enrolled in the class. The instructors of the course consider this assignment to be beneficial for class learning and well integrated with the overall focus of the class.
Subjects will be randomly assigned to one of four different conditions. These conditions are the same for AI practitioners and students, although we treat the two groups separately.
- In the first condition, subjects will receive unbiased training data: a training dataset free of sample-selection bias, containing example outcomes for a truly representative group. This will permit these programmers to train their algorithms on unbiased data.
- In the second condition, subjects will receive a similar dataset, but one that contains realistic sample-selection bias against women. This second group will face the same incentives to develop the most accurate predictive algorithm using the training data.
- A third group will receive the same dataset as the second group, along with a reminder that the data might be biased.
- Finally, a fourth group will receive the same dataset as the second group, plus a reminder that the data might be biased and technical guidance on addressing bias in AI.
All subjects in these groups are asked to develop models and deliver predictions of math literacy scores for a representative test group of 20,000 OECD workers. Treatment arms are randomly assigned.
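To make the contrast between the unbiased and biased conditions concrete, here is a minimal sketch of how sample-selection bias against women could be induced in a training sample. The variable names, sample sizes, and selection probabilities are all hypothetical illustrations, not our actual data-generating process:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population: a gender flag and a math-literacy score.
n = 10_000
pop = pd.DataFrame({
    "female": rng.integers(0, 2, n).astype(bool),
    "score": rng.normal(270, 40, n),
})

# Condition 1: an unbiased training sample (simple random draw).
unbiased = pop.sample(n=2_000, random_state=0)

# Conditions 2-4: sample-selection bias against women, e.g. women are
# only half as likely as men to enter the training sample.
keep_prob = np.where(pop["female"], 0.25, 0.50)
biased = pop[rng.random(n) < keep_prob].sample(n=2_000, random_state=0)
```

Under this kind of selection, women are under-represented in the biased training sample even though the test population remains representative, which is exactly the mismatch the second, third, and fourth conditions confront subjects with.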
We will then compare predictions and performance between coders who received biased training data and those who did not, and between coders from demographics typical of the high-tech sector (i.e., men) and those from other demographics. By measuring the accuracy of different types of predictions on a common scale, produced by AI practitioners and students whose input data and demographics vary, we can measure the relative contributions of biased programmers and biased data.
Insofar as the second hypothesis (“biased training data”) is important, a growing technical literature in computer science seeks to reduce algorithmic bias through technical solutions. Our experiment includes a fourth condition to test the effectiveness of technical guidance. As with our other hypotheses, we suspect that a lack of technical training may explain bias; however, it may not contribute much. Programmers may not understand the new techniques, or may ignore or incorrectly implement them. Standard techniques such as simple cross-validation could go a long way, even without the new methods. In addition, companies could alternatively gather higher-quality training data and use standard techniques rather than new computational methods. The experiment is designed to measure the relative effectiveness of these potential solutions.
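As an illustration of the “standard techniques” mentioned above, the following sketch shows simple k-fold cross-validation on synthetic data using scikit-learn (which we assume participants have available; the data and model here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Hypothetical training data: five predictors and a continuous score.
X = rng.normal(size=(500, 5))
y = X @ np.array([10.0, -5.0, 3.0, 0.0, 0.0]) + rng.normal(0, 5, 500)

# Five-fold cross-validation: average out-of-sample R^2 across folds
# gives an honest estimate of predictive accuracy, flagging models
# that merely overfit the (possibly biased) training sample.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(scores.mean())
```

Cross-validation guards against overfitting, but note that it cannot by itself correct sample-selection bias: if the training sample is unrepresentative, cross-validated accuracy on that sample can still be a misleading guide to performance on a representative test group.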
We also experimentally test the effects of different incentives. We communicate to students the accuracy threshold they need to reach in order to pass the assignment, and we randomly assign one of two threshold levels within all four groups. That is, within each of the four groups there are two sub-groups: some students face a lower threshold, and some a higher one.
Additionally, we give students who pass the threshold extra credit. We randomly assign one of two levels of extra credit that students gain for the same accuracy improvement. Again, these two conditions will be present in all four groups outlined above.
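The incentive randomization described above can be sketched as a balanced factorial assignment within each of the four data conditions. The roster structure and level labels here are hypothetical illustrations of the procedure:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical roster: 264 students already assigned to one of the
# four data conditions (1-4), 66 per condition.
students = pd.DataFrame({
    "student_id": range(264),
    "condition": np.resize([1, 2, 3, 4], 264),
})

# Within each condition, randomize the two incentive factors so that
# every condition contains students at both threshold levels and both
# extra-credit levels.
for factor in ["threshold", "extra_credit"]:
    students[factor] = ""
    for _, idx in students.groupby("condition").groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        half = len(shuffled) // 2
        students.loc[shuffled[:half], factor] = "low"
        students.loc[shuffled[half:], factor] = "high"
```

Randomizing each factor separately within condition keeps the two incentive treatments balanced across the four data conditions, so incentive effects can be estimated independently of the data-bias treatments.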