A Field Experiment on Group Bias in Machine Learning Classification

Last registered on October 24, 2018

Pre-Trial

Trial Information

General Information

Title
A Field Experiment on Group Bias in Machine Learning Classification
RCT ID
AEARCTR-0003425
Initial registration date
October 15, 2018


First published
October 24, 2018, 4:13 PM EDT


Locations

Region

Primary Investigator

Affiliation
Georgia Institute of Technology

Other Primary Investigator(s)

PI Affiliation
Georgia Institute of Technology
PI Affiliation
Georgia Institute of Technology

Additional Trial Information

Status
In development
Start date
2018-11-01
End date
2019-03-31
Secondary IDs
Abstract
Algorithms with hidden biases are increasingly used in business and government to make important decisions in society. For example, the historical data used to train machine learning (ML) algorithms are known to unintentionally discriminate against minority groups (e.g., by gender or race), depending on the nature of the observational data. Researchers typically have little to no control over the sample characteristics of the training data and often rely on after-the-fact technical solutions. Field experiments can be an effective alternative for rigorously testing and potentially mitigating the effects of algorithmic bias in a number of applications. In this study, we implement a randomized controlled trial (RCT) to detect and mitigate possible bias in the context of electric vehicle (EV) usage. Expert and non-expert participants will complete a classification task using reviews from a popular electric vehicle mobile app. The data collected will be used to train a classification algorithm based on neural network language models. In addition, policy preferences will be collected during the intervention to determine whether the review classification task influences participant beliefs about electric vehicle infrastructure.
External Link(s)

Registration Citation

Citation
Asensio, Omar Isaac, Sooji Ha and Daniel Marchetto. 2018. "A Field Experiment on Group Bias in Machine Learning Classification." AEA RCT Registry. October 24. https://doi.org/10.1257/rct.3425-1.0
Former Citation
Asensio, Omar Isaac, Sooji Ha and Daniel Marchetto. 2018. "A Field Experiment on Group Bias in Machine Learning Classification." AEA RCT Registry. October 24. https://www.socialscienceregistry.org/trials/3425/history/36196
Experimental Details

Interventions

Intervention(s)
Intervention Start Date
2018-11-01
Intervention End Date
2019-03-31

Primary Outcomes

Primary Outcomes (end points)
The key outcomes include: (i) topic classifications of the content of electric vehicle charging station reviews and (ii) policy preferences on support for EV programs.
Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
We will deploy a survey questionnaire to build a training dataset for machine learning classification tasks. Human raters will read short texts related to electric vehicle experiences, classify each review as either positive or negative, and identify one or more topic categories that describe the text from a list of predefined choices. All reviews that include potentially sensitive information, such as cell phone numbers or email addresses, have been redacted using the placeholders [phone number] or [email address]. Each participant will classify a total of 20 randomly selected reviews from a set of 127,257 publicly available electric vehicle charging station reviews, as provided by a popular EV charging station locator app. After categorizing the reviews, participants will be asked to provide brief demographic information. Pre- and post-task policy preferences will be collected for analysis. No personally identifiable information will be collected or shared.
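As an illustration of the redaction step, a minimal sketch in Python is shown below; the regular expressions and the redact helper are hypothetical and simply demonstrate how phone numbers and email addresses could be replaced with the bracketed placeholders described above.

```python
import re

# Hypothetical redaction patterns; the actual preprocessing pipeline may differ.
PHONE_PATTERN = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def redact(review_text: str) -> str:
    """Replace phone numbers and email addresses with bracketed placeholders."""
    text = PHONE_PATTERN.sub("[phone number]", review_text)
    text = EMAIL_PATTERN.sub("[email address]", text)
    return text

print(redact("Great station! Call me at 404-555-0100 or write jdoe@example.com"))
# -> "Great station! Call me at [phone number] or write [email address]"
```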
Experimental Design Details
Randomization Method
The randomization algorithm will hold out approximately 2,000 randomly selected reviews per campaign, without replacement, in order to calculate an inter-rater reliability score. The remaining reviews will be randomly sampled from a universe of approximately 125,000 reviews using a computer-generated randomization procedure.
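A minimal sketch of this split, assuming simple uniform sampling without replacement (the review identifiers and seed below are illustrative, not part of the registered protocol):

```python
import random

random.seed(3425)  # illustrative seed for reproducibility

review_ids = list(range(127_257))                          # all station reviews
holdout_ids = random.sample(review_ids, 2_000)             # reliability holdout, drawn without replacement
remaining_ids = list(set(review_ids) - set(holdout_ids))   # ~125,257 reviews available for sampling
```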
Randomization Unit
We are randomizing reviews at the individual level. For the inter-rater reliability holdout sample, we are randomizing over a cluster of 3 individuals per review in order to leverage the wisdom of crowds and to break ties in case of disagreement.
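A short sketch of how three independent ratings of a holdout review could be resolved by majority vote (the label values are illustrative; for a binary sentiment label, three raters always yield a strict majority):

```python
from collections import Counter

def majority_label(ratings: list[str]) -> str:
    """Return the most common label among the three raters for one review."""
    (label, _), = Counter(ratings).most_common(1)
    return label

print(majority_label(["positive", "negative", "positive"]))  # -> "positive"
```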
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
We have 127,257 station reviews. After setting aside the holdout sample, we will randomly draw from the remaining 125,257 reviews to form 1,000 clusters of 15 reviews each. Each respondent will be assigned 20 reviews: 15 from one of these clusters and 5 from the holdout sample for scientific reliability measures.
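Continuing the sketch above, the planned cluster formation and per-respondent assignment could look as follows (all identifiers and the seed are hypothetical):

```python
import random

random.seed(0)  # illustrative seed
remaining_ids = list(range(125_257))          # reviews left after removing the holdout
holdout_ids = list(range(125_257, 127_257))   # 2,000 reliability holdout reviews

# 1,000 clusters of 15 reviews drawn from the non-holdout pool.
cluster_reviews = random.sample(remaining_ids, 1_000 * 15)
clusters = [cluster_reviews[i * 15:(i + 1) * 15] for i in range(1_000)]

def assignment_for(respondent_index: int) -> list[int]:
    """20 reviews per respondent: 15 from one cluster plus 5 holdout reviews."""
    return clusters[respondent_index] + random.sample(holdout_ids, 5)

print(len(assignment_for(0)))  # -> 20
```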
Sample size: planned number of observations
We will have 1,000 U.S. general-population participants labeling approximately 17,000 unique reviews (15 x 1,000 + 2,000 holdout) to build the training set, which exceeds our minimum 10% sampling target (12,725 reviews) for the training set. We will then repeat the sampling procedure with 1,000 subject matter experts for approximately 17,000 additional observations, for a total of approximately 34,000 observations in the final training set for classification.
Sample size (or number of clusters) by treatment arms
1,000 participants will be drawn from the U.S. general population and 1,000 participants will be drawn from a crowd of subject matter experts, for a total of 2,000 participants in the full study.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Based on the power calculations proposed by Kohavi et al. (2007), the minimum detectable difference in our outcome evaluation criterion (OEC) is 0.632 Likert units, given a sample size of 1,000, a number of variants r assumed to be of equal size, and a standard deviation of 0.25. This power calculation is for a desired confidence level of 95% and a desired power of 90%. We are therefore well above the threshold with our sample of 2,000 incentivized participants.
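For reference, the generic two-arm minimum detectable effect formula (a standard textbook expression, shown here only as a point of comparison and not necessarily the exact rule of thumb used in Kohavi et al. 2007) is:

```latex
% Minimum detectable effect for a two-arm comparison with equal-size arms,
% where n is the per-arm sample size and \sigma the standard deviation of the outcome.
\[
  \mathrm{MDE} \;=\; \left(z_{1-\alpha/2} + z_{1-\beta}\right)\,\sigma\,\sqrt{\frac{2}{n}},
  \qquad z_{0.975} \approx 1.96,\quad z_{0.90} \approx 1.28 .
\]
```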
IRB

Institutional Review Boards (IRBs)

IRB Name
Online Human Rater Survey for Machine Learning in Electric Vehicle Networks
IRB Approval Date
2018-08-31
IRB Approval Number
Protocol H18250

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial that is unavailable to the public.


Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials