On the Fairness of Machine-Assisted Human Decisions

Last registered on November 18, 2022

Pre-Trial

Trial Information

General Information

Title
On the Fairness of Machine-Assisted Human Decisions
RCT ID
AEARCTR-0010416
Initial registration date
November 14, 2022


First published
November 18, 2022, 12:07 PM EST


Locations

Primary Investigator

Affiliation
Stanford Graduate School of Business

Other Primary Investigator(s)

PI Affiliation
Columbia Law School
PI Affiliation
Stanford Graduate School of Business

Additional Trial Information

Status
In development
Start date
2022-11-16
End date
2022-12-14
Secondary IDs
Prior work
This trial is based on or builds upon one or more prior RCTs.
Abstract
We are studying how the structure of information presented to human decision-makers affects their performance and bias in a prediction task.

Specifically, study subjects predict the performance on math tasks of multiple test-takers from a previous experiment. For each test-taker, the study subjects see (i) test-taker characteristics (age, gender, education) as well as (ii) assistance in the form of an average of the performance of other test-takers with similar characteristics.

The main intervention is variation in the information by which the performance of previous test-takers is averaged. In the treatment conditions, averages are formed separately by gender. In the control conditions, averages are taken jointly across genders.

The main outcome of interest is the average difference in assessments of women relative to men. We hypothesize that supplying averages that vary by gender increases predictions of women’s performance relative to predictions of men’s performance.

Study subjects are US adults recruited through Prolific. We will report summary statistics of age, gender, and education characteristics with our data analysis.

Conditions are assigned randomly at the subject level. There are three treatment conditions and three corresponding control conditions. The total target sample size is 1250.
External Link(s)

Registration Citation

Citation
Gillis, Talia, Bryce McLaughlin and Jann Spiess. 2022. "On the Fairness of Machine-Assisted Human Decisions." AEA RCT Registry. November 18. https://doi.org/10.1257/rct.10416-1.0
Experimental Details

Interventions

Intervention(s)
We change the information available to decision-makers in a skills-evaluation task by adjusting an assistive tool that supplies them with conditional averages.
Intervention Start Date
2022-11-16
Intervention End Date
2022-12-14

Primary Outcomes

Primary Outcomes (end points)
The average evaluations of female test-takers and the average evaluations of male test-takers across the six experimental conditions
Primary Outcomes (explanation)
As primary outcomes of interest, we will report the average evaluations of female test-takers and the average evaluations of male test-takers across the six experimental conditions. As our main test, we will compare the difference between the average evaluation of female test-takers and the average evaluation of male test-takers across all treatment conditions with the same difference across all control conditions. Our null hypothesis is that this difference is no larger in the treatment conditions than in the control conditions. Our (one-sided) alternative hypothesis is that the intervention leads to women being evaluated relatively better. As an additional test, we will check whether the average difference in evaluations of female test-takers relative to male test-takers is lower than zero (null hypothesis: at least zero) and lower than the true average difference in performance (null hypothesis: at least the same).

Secondary Outcomes

Secondary Outcomes (end points)
The differences in average evaluations of female vs male test-takers separately by early (first four test-takers shown) and late evaluations (last four test-takers).

The differences in average evaluations of female vs male test-takers separately by female and male subjects who evaluate them.
Secondary Outcomes (explanation)
Our secondary outcomes of interest are the differences in average evaluations of female vs male test-takers separately by early (first four test-takers shown) and late (last four test-takers shown) evaluations. Here we plan to test whether the effect of the intervention is the same during early and late evaluations.

A tertiary outcome of interest is the difference in average evaluations of female vs male test-takers separately by female and male subjects who evaluate them. Here we plan to test whether at baseline (control conditions 0−, 1−, 2−) men evaluate women relatively lower (null hypothesis: the difference in average evaluations of female vs male test-takers is not higher for female evaluators than for male evaluators).

Experimental Design

Experimental Design
Study subjects will look at a series of up to 12 randomly ordered profiles of test-takers from a previous experiment who took a six-question math test, and will estimate each test-taker’s performance on the test. (Profile data is from Cecilia Ridgeway of Stanford University and Tamar Kricheli-Katz of Tel Aviv University for their upcoming work “Behavioral responses to the changing world of gender.” We received the data directly from the authors; it is de-identified and is not publicly available.) Participants are rewarded for the accuracy of their predictions, with a penalty proportional to the squared deviation from the true score, so that reporting their best guess (posterior expectation) is theoretically optimal. For each test-taker, study subjects are shown the test-taker’s gender g and two additional covariates, x1 (age) and x2 (whether the test-taker holds a four-year college degree).
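The incentive property claimed above — that a quadratic penalty makes truthful reporting of one's best guess optimal — can be illustrated with a short sketch. This is not part of the study's materials; the belief distribution below is hypothetical.

```python
# Sketch (not the study's materials): under a penalty proportional to the
# squared deviation, the expected loss E[(s - p)^2] is minimized by
# reporting p = E[s], the posterior mean -- so reporting one's best guess
# is theoretically optimal.

# Hypothetical posterior belief over a test-taker's score on the
# six-question test:
scores = [0, 1, 2, 3, 4, 5, 6]
probs = [0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05]

def expected_loss(prediction):
    """Expected squared deviation between the true score and the report."""
    return sum(p * (s - prediction) ** 2 for s, p in zip(scores, probs))

posterior_mean = sum(s * p for s, p in zip(scores, probs))  # 3.0 here

# The posterior mean beats every other report on a fine grid:
grid = [g / 100 for g in range(601)]
assert all(expected_loss(posterior_mean) <= expected_loss(g) + 1e-12
           for g in grid)
```

Because expected squared loss decomposes as the variance of the belief plus the squared distance of the report from the posterior mean, no other report can do better.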

All study subjects also receive baseline information about the composition of test-takers and overall average performance. Subjects are randomized into one of six conditions, which vary by which additional summary information (based on a training sample of test-takers that are not shown to subjects in the experiment) they receive for each of the test-takers they review:

Condition 0−: Participants receive average score of all training test-takers.

Condition 0+: Participants receive the average score of test-takers in the training sample who share g with the profile they are evaluating.

Condition 1−: Participants receive the average score of test-takers in the training sample who share the same age (x1) range with the profile they are evaluating.

Condition 1+: Participants receive the average score of test-takers in the training sample who share the same age (x1) range and g with the profile they are evaluating.

Condition 2−: Participants receive the average score of test-takers in the training sample who share the same age (x1) range and x2 with the profile they are evaluating.

Condition 2+: Participants receive the average score of test-takers in the training sample who share the same age (x1) range and x2 and g with the profile they are evaluating.
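The six conditions differ only in which covariates the displayed average conditions on. The sketch below illustrates this structure with a hypothetical training sample; the record layout and function names are our assumptions, not the study's implementation.

```python
# Sketch (assumed structure, not the study's code): forming the summary
# average each condition displays, from a training sample of test-takers.
from statistics import mean

# Hypothetical training records: (gender, age_range, degree, score)
training = [
    ("F", "18-29", True, 4), ("M", "18-29", False, 3),
    ("F", "30-44", False, 2), ("M", "30-44", True, 5),
    ("F", "18-29", False, 3), ("M", "18-29", True, 4),
]

# Which covariates each condition conditions on (g = gender,
# x1 = age range, x2 = four-year college degree):
CONDITIONS = {
    "0-": (),                 # grand mean of all training test-takers
    "0+": ("g",),             # by gender
    "1-": ("x1",),            # by age range
    "1+": ("x1", "g"),        # by age range and gender
    "2-": ("x1", "x2"),       # by age range and degree
    "2+": ("x1", "x2", "g"),  # by age range, degree, and gender
}

FIELDS = {"g": 0, "x1": 1, "x2": 2}

def assistance(condition, profile):
    """Average training score among test-takers who match `profile`
    on the covariates used by `condition`."""
    keys = CONDITIONS[condition]
    matches = [r[3] for r in training
               if all(r[FIELDS[k]] == profile[FIELDS[k]] for k in keys)]
    return mean(matches)
```

For example, under condition 0− every profile receives the grand mean, while under condition 0+ a female profile receives the average score of female training test-takers only.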

After submitting their evaluations, participants will answer questions about their confidence in their own prediction abilities as well as about the extent to which they adjusted their predictions in response to the prediction provided (their trust in that signal). Participants will also report their own gender g and covariates x1 and x2 to allow us to check for in-group biases. Each condition is also divided into two sets of profiles (the same across conditions); one set is a gender flip of the other, so that we obtain balanced groups when comparing the evaluations of women and men. Target sample sizes (subjects) are 375 each in conditions 1+ and 1−, and 125 in each of the other conditions.
Experimental Design Details
Randomization Method
Randomization is at the subject level and is handled by Qualtrics according to predefined ratios. Three times as many observations are assigned to our ‘main’ treatment and control as to our ‘secondary’ treatments and controls, which replicate the ‘main’ trial while varying the contextual information the algorithm uses. Question order is also randomized within each treatment.

Evaluation profiles were selected randomly offline via a Python notebook.
Randomization Unit
Treatment is assigned on the subject level. All statistical tests are clustered at the subject level.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
1,250 Subjects
Sample size: planned number of observations
1,250 Subjects
Sample size (or number of clusters) by treatment arms
125 subjects in condition 0-

125 subjects in condition 0+

375 subjects in condition 1-

375 subjects in condition 1+

125 subjects in condition 2-

125 subjects in condition 2+
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Our pilot data suggest that, to achieve 5% size and 90% power to detect an effect of 6 percentage points when testing our primary null hypothesis that adding gender information does not increase the relative evaluation of women, we need 750 observations in total in condition 1 (groups 1+ and 1−) if we were to use that condition alone. We plan to use an additional 500 observations (125 each in conditions 0+, 0−, 2+, 2−) to increase power and to observe how the treatment effect varies as the amount of additional contextual information in the assistant both increases and decreases. This leads to a total sample size of 1,250 subjects.
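The 750-observation figure is consistent with the standard two-sample power formula. The sketch below reproduces that calculation; the outcome standard deviation is our assumption for illustration, not a number reported in this registration.

```python
# Sketch of the standard two-sample power calculation behind a
# ~375-per-arm sample size (the pilot standard deviation used here is
# an assumption, not a figure from the registration).
from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.90):
    """Subjects per arm for a one-sided two-sample z-test to detect a
    difference in means of `delta`, with outcome standard deviation `sd`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # one-sided 5% size
    z_beta = z.inv_cdf(power)        # 90% power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# With an assumed outcome sd of about 0.28 on a 0-1 scale, detecting a
# 6-percentage-point effect requires roughly 375 subjects per arm,
# i.e. about 750 in total across groups 1+ and 1-:
n = n_per_arm(delta=0.06, sd=0.28)
```

The additional 500 observations in the secondary arms do not enter this calculation directly; they serve the exploratory comparisons across conditions.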
IRB

Institutional Review Boards (IRBs)

IRB Name
Stanford University Institutional Review Board
IRB Approval Date
2022-10-10
IRB Approval Number
66199
IRB Name
Columbia University Institutional Review Board
IRB Approval Date
2022-10-07
IRB Approval Number
AAAU3601

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials