"Addressing the Mismeasure of Women: an Experimental analysis of Biased Proxies and Biased Beliefs"

Last registered on July 28, 2025

Pre-Trial

Trial Information

General Information

Title
"Addressing the Mismeasure of Women: an Experimental analysis of Biased Proxies and Biased Beliefs"
RCT ID
AEARCTR-0014564
Initial registration date
July 27, 2025

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
July 28, 2025, 9:40 AM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

Region

Primary Investigator

Affiliation
London School of Economics and Political Science

Other Primary Investigator(s)

Additional Trial Information

Status
In development
Start date
2025-07-28
End date
2025-10-15
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
How can we de-bias estimates of highly skilled women’s ability in hiring and promotion decisions? What interventions are effective at improving people’s ability to evaluate the relative ability of women and men for positions of power? To investigate these questions, I use an online experiment on Indian MTurk users to identify how they make incentivized hiring decisions when presented with limited information, including a biased but informative proxy variable for ability whose distribution differs by gender. I look at how people learn across repeated decisions, and how participants’ previous exposure to women in positions of power or in social interactions influences their ability to learn about cross-group differences. For my central interventions, I analyse the extent to which information and active debiasing methods work to address these issues. I further identify whether the anchoring-and-adjustment heuristic impedes learning and whether active debiasing serves to overcome it. In addition, I compare how learning differs when people are aware of the job candidate's gender versus under gender-blind hiring practices. This experiment aims to improve our understanding of the extent to which people fail to understand differential signal mappings across sub-groups, the impact of treatments to enhance that understanding, whether people can learn through repeated exposure, and whether differential signal neglect is a persistent phenomenon even with incentives to learn.

I theorize that people exhibit a "differential signal neglect" more broadly — they fail to recognize that the same observable score predicts different outcomes for different groups. Even when given information about bias patterns or adjustment tools, decision-makers anchor on mean differences and insufficiently adjust for the full complexity of group-specific signal-to-outcome mappings. This learning barrier parallels systematic biases in econometric practice where researchers control for group means while ignoring higher-order distributional differences. Relatedly, widespread research methods in various strands of literature in economic, political, and statistical sciences incorrectly incorporate actual differences (such as in cases of differential selection) into the adjustment and fail to provide unbiased estimates.
I argue that existing methods to identify the impact of policies on ability apply incorrect inference procedures, which lead to inconsistent estimates of women's ability relative to men. Incomplete or inaccurate adjustment procedures are especially concerning because we most underestimate the impact of gender-based affirmative action policies precisely when they are most effective. Failure to adjust for distributional disparity distorts policy analyses further because many applied economic inference approaches similarly fail to account for selection. Together, these misestimation tendencies lead us to systematically underestimate the benefits of quotas and affirmative action policies by incorrectly incorporating the average difference in ability due to selection into the gender dummy.
It seems plausible that decision-makers in applied economic settings would display a similar bias. Failure to learn about the disparity in the relationship between actual productivity and interview scores is especially likely, as these decision-makers (relative to researchers) have less exposure to distributions and have less salient information shocks. This bias may influence regulatory and fiscal policy decisions, including gender quota implementations and evaluations, affirmative action plans, and assessment of educational initiatives, as well as firm hiring, university-level admittance decisions, and court conviction decisions.
External Link(s)

Registration Citation

Citation
Ward-Griffin, Peter. 2025. "Addressing the Mismeasure of Women: An Experimental Analysis of Biased Proxies and Biased Beliefs." AEA RCT Registry. July 28. https://doi.org/10.1257/rct.14564-1.0
Sponsors & Partners

There is information in this trial unavailable to the public.
Experimental Details

Interventions

Intervention(s)
MTurk workers complete an incentivized simulated hiring task with limited information, including a relevant but biased proxy variable for candidate ability. This scenario is common in real-life hiring decisions and in voting behaviour. Participants are randomly treated with different levels of additional information about the bias over time and receive feedback based on their previous decisions. They also complete a detailed survey, including opinions about fairness in hiring and previous job experience. The “main treatments” fall into two categories: information provision and active debiasing. Two levels of these treatments and five other treatments are used to disentangle important effects. This study aims to evaluate how people learn and make decisions in limited-information environments with informative but biased signals of latent ability whose mappings differ across sub-groups. I will publish more details about the interventions upon trial completion; these are pre-registered in the hidden section below.
Intervention (Hidden)
I have designed a diverse set of treatments to identify the effects of information provision or more active debiasing practices on efficiency, learning, and equity. In addition, the set of treatments are designed to identify the difference in participant learning and decision-making quality when gender is known (versus hidden).
To understand the interventions, it is necessary to understand the central task of the study: the comparison task.
After seeing some initial information (10 simulated draws of male hires and 10 simulated draws of female hires illustrating the mapping between the observable (interview score) and the latent variable (productivity)), participants begin the main task.
Each participant completes 40 pairwise decisions in eight rounds of five. In each round, they are presented with a table of five job vacancies, each with two candidates (one man and one woman) and an associated interview score for each, and they must choose between the two. After each round of five, they receive feedback on the five candidates they hired.
In the baseline control, participants are simply presented with interview scores from Interviewers A and B (gender is unknown) and receive no further information on which to base their decisions.
The “main treatments” consist of two categories: information provision and active debiasing. These are connected to the main task in which participants make a binary choice between a male and a female candidate for a particular hiring decision.
The information provision treatment consists of either mean or linear information. In the mean information condition, participants are provided with information about the population mean discrepancy between interview score and productivity for men and women. This is supplemented with a table identifying the disparity between the interview score and productivity when the interview score is 50.
The higher level of treatment intensity consists of imperfect linear information provision. In this condition, participants receive information about the mapping between interview score and productivity for men and women through information shocks covering interview scores of 25, 50, and 75, along with the discrepancy between interview score and productivity at each level. The measurement disparity information is presented in text before the table of five candidates used for binary decision-making.
Active debiasing consists of providing interview scores that have been adjusted to incorporate the population-level discrepancy information described above. That is, it is a treatment layered on top of the information above: adjusted interview scores are displayed immediately next to the raw interview scores to better illustrate the disparity and reduce the barrier to using the information.
Information is provided once to most treated participants, representing a standard treatment frequency for learning about biases. In contrast, the debiasing treatment remains in place for each subsequent decision. To better disentangle the effect of repeated information from debiasing, some participants are randomly assigned to a repeated-information sub-group (they see the information described above every five decisions).
Knowledge of applicant gender is randomized: participants either believe they are simply comparing candidates interviewed at colleges A and B, or are made aware that each college is gender-specific (i.e., male-only or female-only). This is used to test whether gender knowledge distorts learning (e.g., by matching with existing biases) or improves the salience of information distortions, allowing for better learning.
The design incorporates progressive information revelation: participants may receive additional information regarding discrepancies between interview scores and productivity across interviewers, or debiasing tools, as they progress through the task. Participants advance only to conditions with more detail or stronger debiasing methods and never regress. A JavaScript variable sequence of categories governs this transition (OriginCat: Original Category, SecondCat: Second Category, ...), with initial probability weights given by a weighted JavaScript random number generator and subsequent uniform re-distribution to (weakly) higher categories whenever a random variable exceeds the specified threshold (a sketch of this logic appears after the list below). Transition probabilities are:
After decisions 1-10: 65% remain in current condition, 35% may advance
After decisions 11-20: 65% remain in current condition, 35% may advance
After decisions 21-30: 40% remain in current condition, 60% may advance
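The following is a minimal JavaScript sketch of one plausible reading of this transition logic, assuming six ordered categories per gender-knowledge arm and the stay probabilities listed above. It is illustrative only; the function and variable names are hypothetical and this is not the study's actual Qualtrics code.

```javascript
// Minimal sketch of progressive category advancement, assuming six ordered
// categories per gender-knowledge arm (1 = least information/debiasing, 6 = most).
function advanceCategory(currentCat, maxCat, stayProb) {
  // With probability stayProb, the participant stays in the current condition.
  if (currentCat === maxCat || Math.random() < stayProb) {
    return currentCat;
  }
  // Otherwise, re-distribute uniformly over the strictly higher categories
  // (one plausible reading of "uniform re-distribution to (weakly) higher categories").
  const higher = [];
  for (let c = currentCat + 1; c <= maxCat; c++) higher.push(c);
  return higher[Math.floor(Math.random() * higher.length)];
}

// Example: apply the registered stay probabilities after decisions 10, 20, and 30.
let cat = 2; // hypothetical starting category
for (const stayProb of [0.65, 0.65, 0.40]) {
  cat = advanceCategory(cat, 6, stayProb);
}
```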
50% of participants complete a pre-task consisting of guessing the productivity of candidates based on the interview score. These scores are drawn from the same distribution as one of the genders shown in the comparison task. This aims to identify learning speed within interviewers as opposed to cross-interviewer differences. Any information treatment assigned for the first 10 candidates in the comparison task is also included here. Some participants (50% of those without adjustment in their first 10 choices) receive feedback in a group of five (similar to the comparison task) for their first five candidates, to identify differential learning speed in singleton versus five-at-a-time information sets. However, most are assigned to individual feedback settings to better identify learning in the guessing task.
-----------
The details of this task (the exact text of the task) are available by contacting the author (they did not fit in this text box)
----------------------------------------------------------------
Other treatment conditions of importance include (cross-randomized):
50% of participants will receive an interviewer-demand-effect shock stating the importance of gender to the researcher. This is intended to bound the effects of the bias by identifying the extent to which people respond differently to attitude questions and differentially select men (vs. women) in the gender-known versus gender-unknown condition when the interviewer demand effect is made salient.
Similarly, 50% of participants will complete a survey about gender-related attitudes and justice ahead of the main hiring task (versus after it) to identify the effects of study participation on stated attitudes, detect bias, and better evaluate how these attitudes shape learning.
50% of participants will complete the 4-question Raven’s Matrices task before doing the main comparison task. This is used to better isolate the relation between ability in the pattern recognition task and selection quality.
For better external validity, the following randomizations are added (a stylized sketch of the resulting candidate-generation process follows the list):
Participants have randomized productivity correlation coefficients between candidates within job postings (distributed from 0.3 to 0.7) to identify differential learning. These correlations cover a wide range of plausible real-world ability correlations between applicants.
Participants receive a signal-to-noise ratio (loosely defined, the actual mathematical expression in the paper is more complicated) of 0.4-0.6. Some interviewers are effective at evaluating ability, while others are ineffective, and this range covers much of the estimated range of plausible quality detection among experienced interviewers.
Participants have a randomized mean gender bias parameter for men of 2-9 (i.e., interviewers overestimate male productivity on average by 2-9 points, with greater overestimation for high-performance men and less for low-performance men).
Participants have a randomized mean gender bias parameter for women of 2-8 (i.e., interviewers underestimate female productivity on average by 2-8 points, with greater underestimation for high-performance women and less for low-performance women).
The mean job-level productivity is randomized (50-60), and the standard deviation is randomized.
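As a rough illustration only, the stylized sketch below shows how a candidate pair could be generated under parameters of the kind described above (mean productivity, productivity SD, gender-specific bias, and interviewer noise). The function names, the performance-scaling factor, and the specific functional form are assumptions for exposition; the within-job correlation between candidates is omitted for brevity, and this is not the study's actual data-generating code.

```javascript
// Stylized sketch (hypothetical names and functional form) of candidate generation.
function generateCandidate(meanProd, sdProd, bias, noiseSd, isMan) {
  // True (latent) productivity.
  const productivity = meanProd + sdProd * gaussian();
  // Bias grows with performance: larger overestimation for high-performing men,
  // larger underestimation for high-performing women (direction per the text;
  // the 0.3 scaling factor is purely illustrative).
  const performanceScale = (productivity - meanProd) / sdProd;
  const genderShift = isMan
    ? bias * (1 + 0.3 * performanceScale)
    : -bias * (1 + 0.3 * performanceScale);
  const interviewScore = productivity + genderShift + noiseSd * gaussian();
  return { productivity, interviewScore };
}

// Standard normal draw via Box-Muller.
function gaussian() {
  const u = 1 - Math.random(), v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Example: mean productivity 55, SD 10, male bias 5, female bias 4, noise SD 8.
const man = generateCandidate(55, 10, 5, 8, true);
const woman = generateCandidate(55, 10, 4, 8, false);
```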
Intervention Start Date
2025-07-28
Intervention End Date
2025-08-08

Primary Outcomes

Primary Outcomes (end points)
There are three families of primary outcomes, although the third set is of distinctly lower importance. The (a) tasks are of the highest importance. The (b) tasks are still within the primary outcome family but are somewhat less central (though still of higher importance than the outcomes listed under secondary outcomes). In cases where (b) uses data different from (a) within a family (e.g., the comparison vs guessing tasks), the family-wise adjustment will be done separately on sets a and b.
Family 1: Efficiency
1a) For each of the 40 decisions in the main comparison task:
1) Did the participant correctly choose the candidate with higher productivity? (“CorrectDecision”)
2) What was the productivity of the chosen candidate? (“SelectedProductivity”)
3) Did the participant choose what a Bayesian (or a similar optimal decision rule) would have chosen? (“BayesianMatch”)
1b) For each of the 10 productivity guesses in the pre-task:
1) How far was the guess from the actual productivity? (“GuessDiff”)
2) How far was the guess from the optimal guess based on prior knowledge? (“BayesianGuessDiff”)
Family 2: Equity
Note: 2a)-1 is the primary outcome of focus for this family and will be considered separately before being evaluated jointly with outcomes 2 and 3. For each of the 40 decisions in the main comparison task:
2a)
1) Did the participant choose a man or a woman to hire? (“SelectedMan”)
2) Did this differ from the optimal gender to hire in a particular comparison? (Separated into “SelectedManCorrect” and “SelectedWomanCorrect”)
3) Did the participant’s choice of man or woman differ from a Bayesian choice? (“SelectedManBayesian”, “SelectedWomanBayesian”)
2b) For each of the 10 productivity guesses in the pre-task:
Did participants assigned to women systematically underestimate their ability? (WomenScoreDiffSigned)
Did participants assigned to men systematically overestimate their ability? (MenScoreDiffSigned)
Family 3: Attitudes and non-task Performance
All “outcomes” here have a dual purpose as inputs into the above regressions.
3a)
How well did the participant do on a 4-item Raven’s Matrices subtest? (Raven-4-Score)
How sexist were participant responses on the Old-fashioned Sexism scale?
How sexist were participant responses on a combined set of other sexism responses?
3b)
Participant responses to Internal-External Locus of Control Scale
Two Just World scale items and associated measures
Primary Outcomes (explanation)
The CorrectDecision variable is a binary indicator of whether the selected candidate’s productivity is higher than that of the unobserved other candidate.
Selected Productivity is simply the productivity score (on a scale from 0-100) of the chosen candidate in a particular pair.
BayesianMatch indicates whether the participant’s response matches what a Bayesian would have done with the prior information available
GuessDiff represents the absolute difference in guessed score (abs(TrueProd-GuessedProd)), where TrueProd is the true productivity of a candidate in the guessing task, and GuessedProd is the participant’s guess of productivity after seeing the interview score.
BayesianGuessDiff represents the difference in guessed score relative to optimal Bayesian learning (abs(BayesProd-GuessedProd)), where BayesProd is simply the optimal Bayesian productivity guess.
SelectedMan is a binary indicator for the comparison task, indicating whether the participant selected a man in a particular pair.
SelectedManCorrect is a binary indicator equal to 1 if the participant selected a man when the man had higher productivity.
SelectedWomanCorrect is a binary indicator equal to 1 if the participant selected a woman when the woman had higher productivity.
SelectedManBayesian is a binary indicator equal to 1 if the participant selected a man when a Bayesian would have believed the man had higher productivity.
SelectedWomanBayesian is a binary indicator equal to 1 if the participant selected a woman when a Bayesian would have believed the woman had higher productivity.
WomenScoreDiffSigned is equal to the actual productivity of a woman in the guessing task minus the guessed value.
MenScoreDiffSigned is equal to the actual productivity of a man in the guessing task minus the guessed value.
Raven-4-Score is equal to the number (0-4) of times a participant correctly guessed the missing piece in Raven’s Matrices.
Old-fashioned Sexism is a re-scaling of the existing 7-point scale to a five-point scale and consists of the following question variables: WomenNoSmart, WomenBossComfort, EncBoysVsGirlsAth, WomenLogic, and CallMotherSick. Each is scored from 1 (Strongly Disagree) to 5 (Strongly Agree).
Here is the phrasing used:
WomenNoSmart: “Women are generally not as smart as men.”
WomenBossComfort: “Having a woman as a boss does not make me uncomfortable.”
EncBoysVsGirlsAth: “It is more important to encourage boys than to encourage girls to participate in athletics.”
WomenLogic: “Women are just as capable of thinking logically as men.”
CallMotherSick: “When both parents are employed, and if their child gets sick at school, the school should call the mother rather than the father.”
OtherSexism is also set to a five-point scale and consists of the following 9 items: BoyPreferred, BoysOutperfSchool, WomObeyHusb, WomenNoPolitics, WomenTakeCareFam, MenBreadwinner, ManFinalSay, WomenEqualLeadership, and WomenEqualOpp.


BoyPreferred: “If one could have only one child, having a boy over a girl is preferable.”
BoysOutperfSchool: “Boys outperform girls in schools.”
WomObeyHusb: “A woman should obey her husband in all matters.”
WomenNoPolitics: “Women should not be involved in politics as much as men.”
WomenTakeCareFam: “A woman’s role is to take care of her home and family.”
MenBreadwinner: “Men should be the ones who bring money home for the family, not women.”
ManFinalSay: “A man should have the final word about decisions in the home.”
WomenEqualLeadership: “Women are as capable leaders as men.”
WomenEqualOpp: “Women have equal opportunities in India.”
The Internal-External Locus of Control short scale is a four-item scale.
The four questions used are given by the variable names OwnBoss, WorkSuccess, OthersDetermine, and FateInterfere, each scored on a five-point scale (with scale realignment using NotOthersDetermine = 6 - OthersDetermine and NoFateInterfere = 6 - FateInterfere).
As above, InternalLocus = (OwnBoss + WorkSuccess + NotOthersDetermine + NoFateInterfere)/4. A scoring sketch follows the item wordings below.
OwnBoss: “I am my own boss.”
WorkSuccess: “If I work hard, I will succeed.”
OthersDetermine: “Whether at work or in my private life, what I do is mainly determined by others.”
FateInterfere: “Fate often gets in the way of my plans.”
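As a concrete illustration, the short sketch below scores InternalLocus exactly as described above, reverse-coding OthersDetermine and FateInterfere and averaging the four items; the function name itself is illustrative.

```javascript
// Minimal sketch of the InternalLocus scoring described above (1-5 Likert items,
// with OthersDetermine and FateInterfere reverse-coded as 6 - response).
function internalLocus(ownBoss, workSuccess, othersDetermine, fateInterfere) {
  const notOthersDetermine = 6 - othersDetermine;
  const noFateInterfere = 6 - fateInterfere;
  return (ownBoss + workSuccess + notOthersDetermine + noFateInterfere) / 4;
}

// Example: responses of 4, 5, 2, 3 give (4 + 5 + 4 + 3) / 4 = 4.
console.log(internalLocus(4, 5, 2, 3)); // 4
```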

Secondary Outcomes

Secondary Outcomes (end points)
As a secondary analysis, I will take each individual-task-level outcome and evaluate the clustered units as a single outcome to provide an alternative approach to assessing learning over time. I will also examine whether variables intended as pure controls (e.g., exposure to women) are shaped by treatment conditions. If they are, failing to account for these shifts could bias the primary endpoints discussed above. Specific secondary outcomes of interest include evaluating the first five or 10 comparison decisions as a unit and evaluating simple treatment effects without incorporating learning over time.
Other secondary outcomes of interest include the speed at which participants complete the Raven’s matrices tasks (especially if getting them right).

Outside of the secondary outcomes, there are planned supplemental analyses of the primary outcomes listed above. While not explicitly defined as an outcome, in many cases a central estimand of interest is learning (or learning speed). This is identified by evaluating the extent to which productivity, equity, and decision-making quality improve (or fail to improve) across position numbers and how this speed compares with a Bayesian benchmark. In addition, many analyses will focus on how treatment shifts productivity and equity levels versus slopes.
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
Participants (Indian MTurk workers aged 25-65) complete a 40-question binary choice task designed to simulate comparison-based hiring tasks using interview scores.
For this experiment, I am focusing on numerical interview scores. They are a very widely used score in employment, education, and training settings, and they have two critical features:
1) They are an informative signal of ability
2) They often have significant differences in the sub-group interview score distributions, sometimes even when performance is equal
The experiment evaluates the effect of information, salience, and other treatments on participants’ decision-making efficiency and equity. All randomization is conducted using JavaScript and Qualtrics’ internal randomization measures. Participants receive individual choice-level feedback about their decision quality every five decisions, and learning is evaluated.
Participants experience a series of cross-randomized information shocks that may influence how they make decisions. This experiment is designed to randomize the salience of information and sub-group category knowledge, as well as population- or sample-level information shocks. The impact of the interaction of the shocks with each other and vital participant characteristics is expected to shape how people make decisions, the proportion of correct choices they make, and may affect critical equity-related outcomes.
Separately, treatment conditions' impact and order effects are tested to ensure unbiased estimation of learning and participant outcomes.
Half of the participants will also complete a 10-question productivity guessing task. Participants assigned an initial information treatment or gender knowledge are also assigned the same treatments for the guessing task.
Participants:
N = 1,600 MTurk workers from India
Age range: 25-65 years
Main Task Structure:
Receive information about the past 10 hires from each group
If in (subgroup knowledge) condition: know they are seeing 10 past hires of (subgroup of interest) versus (second subgroup of interest)
Otherwise, they simply know they are comparing candidates from college A or college B at a university.
Complete 40 Sequential Choices across 8 rounds:
Each round: 5 simultaneous comparisons with binary decisions
Choose the higher-productivity candidate from each pair
After each 5-decision round: See chosen candidates' true productivities
Participants receive no feedback on unchosen candidates.
More details will be released at the study’s completion.
Experimental Design Details
One of the most critical roles of econometrics is to help economists uncover estimates of latent (unobserved) outcomes of interest. However, even econometricians systematically misestimate people's ability across sub-groups and misevaluate the best estimate of a latent variable (such as productivity) given an observable input (such as a hiring score). This experiment evaluates whether people ignore critical aspects of sub-group differences in signal-to-latent variable mapping distributions (i.e., the best prediction of a given latent variable given an observable). This experiment is designed to supplement a theoretical contribution to econometrics based on observable disparity with experimental testing of a population parallel. Subgroups often have substantially different mappings from latent variables to observables. However, commonly used methods to predict latent variables using observables often apply inaccurate corrections. Some of the clearest examples come from strands of literature on evaluating ability. We can apply more advanced econometric methods to address much of this bias and improve economists’ ability to assess policies. However, there is little reason to believe that the failure to adjust for subgroup distributional differences properly is solely a statistical issue affecting researchers. Indeed, real-world decision problems appear to have similar challenges.
Predicting productivity or ability using potentially biased proxies is a frequent real-world decision problem. Numerical interview scores are informative and are essential to admission and hiring decisions. They are used pervasively across the world, in schools, universities, medical positions, financial institutions, and various other industries. However, they often systematically differ across subgroups (such as gender, age, or race).
I theorize that people exhibit a "differential signal neglect" more broadly — they fail to recognize that the same observable score predicts different outcomes for different groups. Even when given information about bias patterns or adjustment tools, decision-makers anchor on mean differences and insufficiently adjust for the full complexity of group-specific signal-to-outcome mappings. This parallels systematic biases in econometric practice where researchers control for group means while ignoring higher-order distributional differences. Relatedly, widespread research methods in various strands of literature in economic, political, and statistical sciences incorrectly incorporate actual differences (such as in cases of differential selection) into the adjustment and fail to provide unbiased estimates.
In this experiment, I test the ability of people to use simulated interview scores to evaluate expected productivity across different subgroups (genders) in an incentivized repeated decision problem. In the main task, participants complete 40 binary decisions. In each decision, they hire either a man or a woman after observing the interview scores assigned by a pair of biased interviewers. After every five decisions, they receive information about their selections. Some participants know they are comparing across genders, while others simply believe they are hiring across unspecified colleges.
I evaluate how well information corrections about mean or linear differences between the mapping between interview scores and productivity improve decisions. I compare them to more active score de-biasing (mean or linear). I compare people’s baseline learning rate with a Bayesian model and evaluate how effectively treatments improve people’s decision-making ability and impact worker productivity and equity. I also assess the role of first performing a learning-by-guessing task where participants predict productivity based on interview scores to compare within-gender and across-gender learning behaviour and assess how treatment conditions shape within-gender learning. Through this experiment, I aim to understand how people evaluate others in crucial decision problems based on their prior experience and limited information consisting of biased but relevant proxies. I aim to understand the extent to which we can improve human decision-making in both efficiency and equity terms using very simple interventions, and the extent to which human decision-makers continue to neglect important outcome-relevant history.
This experiment has four components for participants: 1) an attention check (simply used to detect bots, people not paying attention, etc), 2) a productivity estimation task that 50% of participants complete, 3) a hiring decision task (the main experimental task) and 4) a survey and a mental ability estimation task.
For this experiment, I am focusing on numerical interview scores. Despite their importance and prevalence, it is plausible that decision-makers often fail to account for how these scores differentially map to actual performance across demographic sub-groups. This experiment tests whether people can learn and adjust for these differential mappings or whether they remain anchored on the proxy distribution.
Given the observed disparity in cross-group hiring outcomes, I focus on comparisons of men and women. Their cross-group ability comparisons are also frequently misestimated in economics using other proxies, such as Mincer residuals, even with gender dummies.
I use interview scores with gender bias, and with higher gender bias at high productivity levels, to evaluate whether participants can effectively disentangle sub-group differences in observable to latent mapping distributions.
A primary goal of this experiment is to evaluate learning rather than simply assessing baseline ability to predict gender bias. As such, participants will each complete 40 binary decision tasks deciding whether to recommend hiring the male or female applicant based on their interview score. They receive feedback about their performance every five decisions (which directly affects their payment).
Key questions include: How does demonstrated learning compare with optimal or Bayesian learning? To what extent does failure to learn (relative to a Bayesian) in incentivized tasks impact correct decisions among binary pairs? What is the impact of learning speed discrepancies on the productivity of hired workers? How do biased signals affect selecting women (vs men)? Then, relatedly, what can we do to improve decision-making or learning? How effective is providing information about disparities in the mapping between observable and latent variables? What about more active forms of de-biasing (incorporating the provided information about disparities into the scores displayed to participants)? How does the effectiveness of information or debiasing differ depending on the timing of the intervention?
Participants:
N = 1,600 MTurk workers from India
Age range: 25-65 years
Rationale: India's labor market features numerical assessment systems for interview scores (with substantial influence on career trajectories) and significant gender disparities, making this population particularly relevant. Interviewers and managers are rarely hired before the age of 25, so this is the most appropriate age band.
Each participant completes 40 comparisons in the main comparison task, and half complete 10 productivity-guessing questions before the main comparison task.
Specifically, this experiment consists of 9 binary treatments. I focus on four main treatments categorized into two treatment types (information or active debiasing), one condition type essential for better understanding the differential impact of other treatments (gender knowledge), one condition type to better improve our modeling of learning (the guessing pre-task) and three condition types that are primarily used for supplementing our theoretical understanding and evaluating salience effects.
The main treatments consist of information provision and active debiasing, each having two treatment levels (excluding controls). These are described elsewhere.
Gender knowledge is randomized.
The design incorporates progressive information revelation, described above.
50% of candidates complete a pre-task consisting of guessing the productivity of candidates based on the interview score. These scores are drawn from the same distribution as one of the genders shown in the comparison task. This aims to identify learning speed within interviewers as opposed to cross-interviewer differences.
50% of participants will receive an interviewer-demand-effect shock stating the importance of gender to the researcher. This is intended to be used to bound the effects of the bias.
Similarly, 50% of participants will complete a survey about gender-related attitudes and justice ahead of the survey (versus after) to identify the effects of study participation on stated attitudes to detect bias and better evaluate how these attitudes shape learning.
50% of participants will complete the 4-question Raven’s Matrices task before doing the main comparison task.
For each main outcome, I will also identify the optimal Bayesian decisions given a particular candidate’s past score history and the associated optimal treatment effect. The comparison of the intercepts, slopes, and quadratic terms of the optimal (Bayesian) decision path with participants’ displayed learning is of particular importance.
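The registration does not spell out the exact Bayesian model used as the benchmark. A common choice in settings like this is a normal prior on productivity combined with normally distributed, biased interview-score noise with known parameters; the sketch below illustrates that kind of benchmark. The function names, the known-variance assumption, and the sign conventions for the bias terms are assumptions for exposition, not the author's specification.

```javascript
// Illustrative normal-normal Bayesian benchmark. "bias" is the average amount the
// interviewer adds to (positive) or subtracts from (negative) true productivity
// when producing the score.
function posteriorProductivity(score, priorMean, priorVar, bias, noiseVar) {
  // De-bias the observed score, then combine with the prior by precision weighting.
  const signal = score - bias;
  const w = (1 / noiseVar) / (1 / noiseVar + 1 / priorVar);
  return w * signal + (1 - w) * priorMean;
}

// A "Bayesian" hire picks the candidate with the higher posterior mean.
function bayesianChoosesMan(manScore, womanScore, priorMean, priorVar, manBias, womanBias, noiseVar) {
  const manPost = posteriorProductivity(manScore, priorMean, priorVar, manBias, noiseVar);
  const womanPost = posteriorProductivity(womanScore, priorMean, priorVar, -womanBias, noiseVar);
  return manPost > womanPost;
}

// Example: prior mean 55 with variance 100, noise variance 64, male bias +5, female bias 4.
console.log(bayesianChoosesMan(70, 68, 55, 100, 5, 4, 64)); // false in this example
```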


While other randomization components are implemented, they are primarily used for either external validity (threshold analyses, linear and quadratic analyses based on the impact of gender bias size, candidate variance, noise, and position covariance) or for checking position effects (such as MenLeft).
Randomization Method
The primary form of randomization uses JavaScript running on the Qualtrics platform for participants from MTurk. Variants of Math.random() are used.
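For illustration, a weighted category draw of the kind described here can be implemented with Math.random() as below. The weights follow the per-arm initial allocation reported under "Primary Treatment Assignment" (240, 200, 120, 120, 80, and 40 participants per arm out of 1,600, i.e., 6/40 through 1/40), but the function itself is a generic sketch rather than the study's actual code.

```javascript
// Generic sketch of weighted category assignment with Math.random().
function drawWeighted(categories, weights) {
  const total = weights.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < categories.length; i++) {
    if (r < weights[i]) return categories[i];
    r -= weights[i];
  }
  return categories[categories.length - 1]; // guard against floating-point edge cases
}

// Example: initial condition draw within one gender-knowledge arm.
const originCat = drawWeighted([1, 2, 3, 4, 5, 6], [6, 5, 3, 3, 2, 1]);
```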
Randomization Unit
Randomization is done in a hierarchical structure across and within participants. First, group-level treatments are assigned and can change every 10 decisions. Second, participant-level variables are randomized. Third, individual task-level elements are randomized.
More precisely, participants are first assigned to person-level treatment groups. These include:
Gender knowledge, which is consistent across all comparison tasks and the productivity task
Whether they complete the guessing task
The characteristics of the distribution they will be drawing from, including population-level gender bias (separated into overestimation of men, underestimation of women, and differential selection), productivity variance, covariance of candidate performance within job types, informativeness of interview scores, and candidate average productivity
Whether men are on the left or right (tied to whether they are under interviewer A)
Whether participants receive repeated information or receive information only once, conditional on receiving information
Whether they guess the productivity of candidates interviewed by interviewer A or B in the guessing task
Whether they receive a social-desirability shock
Whether they complete the matrices before or after the main task
Whether they complete the gender-salience survey before or after the main task
Other treatments are randomized at the 10-question cluster level, including the main treatments: information about mean discrepancies between interview scores and productivity, (incomplete) information about the slope between interview scores and productivity, and whether participants are provided with adjusted scores that incorporate the information provided to them.
Feedback is provided at the five-question level, with each participant receiving information about each of their last five hires at the end of each set of five decisions.
At the individual decision level, each participant is influenced by the randomly generated individual interview score for men and women, and aims to evaluate the productivity of each to decide which selection would generate more money for them.
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
There are 1,600 people with 40 choices each, giving a total of 64,000 decisions in 6,400 10-decision treatment clusters (and 12,800 five-decision clusters). Analyses will focus on a mixed-effects model with a three-level hierarchy (person > 10-decision block > individual decision).
Sample size: planned number of observations
64,000 decisions
Sample size (or number of clusters) by treatment arms
For this study, I recruit N = 1,600 participants from Amazon Mechanical Turk workers located in India. Each participant completes 40 binary hiring decisions between paired candidates (one man and one woman), yielding 64,000 total observations with clustering at the individual level. Based on the pilot data, the number of effective observations for decision-level outcomes will range from 16,196 to 54,000, depending on the ICC of a particular outcome. This reflects a design effect of up to 3.95.
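The effective sample sizes and the design effect quoted above are consistent with the standard cluster adjustment DEFF = 1 + (m - 1) * ICC with m = 40 decisions per participant. The sketch below shows that arithmetic; the ICC value in the example is back-implied from the reported design effect rather than taken directly from the pilot data.

```javascript
// Sketch of the standard design-effect adjustment implied by the numbers above:
// DEFF = 1 + (m - 1) * ICC, effective N = total decisions / DEFF.
function effectiveN(totalObs, clusterSize, icc) {
  const deff = 1 + (clusterSize - 1) * icc;
  return totalObs / deff;
}

// With 64,000 decisions and m = 40, a design effect of about 3.95 corresponds to an
// ICC of roughly (3.95 - 1) / 39 ≈ 0.076, giving about 16,200 effective decisions.
console.log(effectiveN(64000, 40, 0.0756)); // ≈ 16,209
```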
Primary Treatment Assignment
Participants are randomly assigned to one of the 12 experimental conditions through weighted randomization. Each category has allocation probabilities ranging from 1/40 to 6/40. These affect the information and debiasing that participants receive for their initial 10 comparison decisions. Information treatments affect the information that participants receive for their first 10 guessing tasks as well. Participants can move between categories to higher information or adjustment levels.
Initial Treatment Distribution:
Conditions Without Gender Information (Categories 1-6):
Condition 1 (No bias information, No population debiasing): 240 participants
Of these, 60 receive sample debiasing, repeating (and summarizing) the information from the past 10 hires
Condition 2 (Mean bias information, No debiasing): 200 participants
Condition 3 (Linear prediction information, No debiasing): 120 participants
Condition 4 (Mean bias information, Mean-based debiasing): 120 participants
Condition 5 (Linear prediction information, Mean-based debiasing): 80 participants
Condition 6 (Linear prediction information, Linear debiasing): 40 participants
Conditions With Gender Information (Categories 7-12):
Condition 7 (No bias information, No debiasing): 240 participants
Of these, 60 receive sample debiasing, repeating the information from the past 10 hires
Condition 8 (Mean bias information, No debiasing): 200 participants
Condition 9 (Linear prediction information, No debiasing): 120 participants
Condition 10 (Mean bias information, Mean-based debiasing): 120 participants
Condition 11 (Linear prediction information, Mean-based debiasing): 80 participants
Condition 12 (Linear prediction information, Linear debiasing): 40 participants
Dynamic Information Provision
The design incorporates progressive information revelation, where participants may receive additional information regarding discrepancies between interview scores and productivity across interviewers or debiasing tools as they progress through the task. Participants can only advance to conditions with more information or stronger debiasing methods, never regress. This transition is given by a JavaScript variable sequence of categories (OriginCat - Original Category, SecondCat- Second Category…) with initial probability weight given by a weighted JavaScript random number generator and then subsequent uniform re-distribution to (weakly) higher categories based on a random variable exceeding the specified threshold. Transition probabilities are:

After decisions 1-10: 65% remain in current condition, 35% may advance
After decisions 11-20: 65% remain in current condition, 35% may advance
After decisions 21-30: 40% remain in current condition, 60% may advance
Expected Distribution Over Time:
Condition                              Initial (N)   After 10 decisions   After 20 decisions   After 30 decisions
No information/No debiasing            480           340                  240                  120
Mean information/No debiasing          400           316                  248                  152
Linear information/No debiasing        240           240                  226                  190
Mean information/Mean debiasing        240           240                  226                  190
Linear information/Mean debiasing      160           244                  286                  346
Linear information/Linear debiasing    80            220                  270                  500

Additional Experimental Manipulations
1. Gender Salience Priming
Half of the participants (800) will complete the gender-attitude survey questions before the main hiring task to increase the salience of gender considerations. The remaining 800 participants complete these same questions after the hiring task, serving as the control group for demand effects.
At the beginning of the survey, half of the participants (800) will receive an interviewer-demand-inducing prompt to increase gender salience in the task. If they are assigned to the interviewer demand effect condition, it always comes first after completing the consent form.
These are cross-randomized, and we are interested in the interaction (which has condition sizes of approximately 400 participants).
2. Productivity Estimation Training Task
Participants are randomly assigned (800 each) to either:
With training: Complete 10 individual productivity estimation exercises before the main task
Without training: Proceed directly to the main hiring task
Among those receiving the training task, the interviewer assignment is randomized:
800 total participants evaluate candidates prior to the comparison task
400 participants evaluate candidates interviewed by Interviewer A
400 participants evaluate candidates interviewed by Interviewer B
Within each of these, participants also receive information and adjustment levels equal to their initial allocation in the comparison task above
All participants (1600) will receive information about the outcome of candidates in the preliminary productivity guessing task, even if they did not complete it
3. Feedback timing (within guessing task)
For participants in the no-adjustment condition who receive the training task, the task is randomized into two starting conditions:
Approximately 280 participants begin with a joint hiring task of five candidates before receiving feedback to test the impact of multiple delayed signals versus updating after individual hires. Simultaneously, this allows for testing whether bias is reduced or increased in simultaneous hiring tasks. Participants then select individuals one at a time and receive individual feedback from the 6th to 10th candidates.
Approximately 280 participants receive individual feedback from the beginning within the no-adjustment condition
All participants (approximately 240) in the adjustment conditions who are assigned to do the guessing task complete the individual feedback version of the task. This brings the total to 520.
4. Display Position Randomization (800 participants each)
To control for display position effects and identify whether candidate positions cause differential learning:
In the left column, 800 participants see male candidates (labeled position A, interviewer A, etc.)
In the left column, 800 participants see female candidates (labeled position A, interviewer A, etc.)
This treatment condition is purely intended as a control or counterbalance.
5. Cognitive Assessment Timing (800 participants each)
The Raven's Progressive Matrices assessment is administered either:
Before the main task for 800 participants
After the main task for 800 participants
Participants receive the matrices either 1) just before completing the guessing task (if assigned), 2) just before completing the main task (if not assigned the guessing task), or 3) at the end of the survey. Groups 1 and 2 are combined for analysis (800 people) versus group 3 (800 people).
6. Population vs sample information shock
This is the same category discussed as a subcategory of 1 and 7 in the main treatment randomization shown above. The main purpose of this experiment is to mimic a real-world scenario where external information is used to intervene using broader population-level data. However, to partially separate the impact of improved information and simply information salience, 60 participants in each of the gender known and gender unknown conditions receive a salience shock regarding information already visible to them through the past 10 hires.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
For this section, I focus on the three most important outcomes. I use the conditional MDE but also list the marginal estimated MDE for completeness. These estimates are based on a pilot dataset of 85 people with 30 decisions each, accounting for outcome-specific intraclass correlation. They represent the threshold for a single treatment to be found significant within an outcome family when a treatment is randomized with equal weights. However, a step-down procedure will be used in the actual data based on a modification of List et al. (2023).
CorrectDecision is a binary outcome (either 0 or 1); the MDE is expressed as a shift in the probability of a correct decision. Based on the pilot data, the estimated effective sample size after accounting for ICC is 36,121 decisions. The MDE (conditional) is 0.0101, equivalent to a one percentage point increase in the probability of a correct decision. The pilot mean (the proportion of times participants correctly selected the candidate) is 0.568 (or 0.49 in the untreated condition), so the MDE represents an approximately 2.1% increase in correct decisions. The MDE (marginal) is 0.0123. The standard deviation of correct choices is 0.48 (48 percentage points) for both the marginal and conditional estimates.
SelectedProductivity is a numerical outcome from 0 to 100 equal to the productivity of the selected candidate in a single pairwise comparison. Based on the pilot data, the effective sample size after accounting for ICC is 23,879 decisions. The MDE (conditional) is 0.1534 productivity units (a 0.15-point increase in productivity). The mean of the intercept is -2 (made possible by residualizing on the mean hiring score displayed), but the marginal mean is 51, which gives a more interpretable benchmark: the MDE is a 0.297% increase in productivity. The MDE (marginal) is 0.4173. The standard deviation of SelectedProductivity is 5.99 (conditional) and 24.56 (marginal).
SelectedMan is a binary outcome equal to 1 if the participant selected a man in a particular pairwise comparison; the MDE is expressed as a shift in the probability of selecting a man. Based on the pilot data, the effective sample size after accounting for ICC is 16,196. The MDE (conditional) is 0.0137 (a 1.4 percentage point increase in the probability of selecting a man). The mean of the intercept is 0.431 (versus an overall mean of 0.54), so the MDE corresponds to a 3.18% increase in selecting a man relative to the intercept and a 2.54% increase relative to the overall mean. The MDE (marginal) is 0.0149. The standard deviation of SelectedMan is 0.4789 (marginal) and 0.4390 (conditional).
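For reference, the sketch below applies the standard two-arm MDE formula, MDE = (z_{1-α/2} + z_{1-β}) * SD * sqrt(1/n1 + 1/n2), at 80% power and a 5% two-sided test to the effective sample sizes reported above. This is a generic illustration only: the registered MDEs come from the outcome-specific, pilot-based conditional procedure described in this section and therefore differ from the numbers this formula produces.

```javascript
// Generic two-arm MDE illustration (80% power, 5% two-sided test).
function mde(sd, n1, n2, zAlpha = 1.96, zBeta = 0.84) {
  return (zAlpha + zBeta) * sd * Math.sqrt(1 / n1 + 1 / n2);
}

// Example: a binary outcome with SD 0.48 and an effective sample of 36,121
// decisions split evenly across two arms.
console.log(mde(0.48, 36121 / 2, 36121 / 2)); // ≈ 0.014
```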
IRB

Institutional Review Boards (IRBs)

IRB Name
The LSE Research Ethics Committee
IRB Approval Date
2024-08-21
IRB Approval Number
241485
Analysis Plan

There is information in this trial unavailable to the public.

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials