Experimental Design Details
One of the most critical roles of econometrics is to help economists estimate latent (unobserved) outcomes of interest. However, even econometricians systematically misestimate people's ability across sub-groups and mischaracterize the best estimate of a latent variable (such as productivity) given an observable input (such as a hiring score). This experiment evaluates whether people ignore critical aspects of sub-group differences in the mapping from observable signals to latent variables (i.e., in the best prediction of a latent variable given an observable). It is designed to supplement a theoretical contribution to econometrics on observable disparities with an experimental test of a population parallel. Sub-groups often have substantially different mappings from latent variables to observables, yet commonly used methods for predicting latent variables from observables often apply inaccurate corrections. Some of the clearest examples come from the literature on evaluating ability. More advanced econometric methods can address much of this bias and improve economists' ability to assess policies. However, there is little reason to believe that the failure to adjust properly for sub-group distributional differences is solely a statistical issue affecting researchers; real-world decision problems appear to pose similar challenges.
Predicting productivity or ability from potentially biased proxies is a common real-world decision problem. Numerical interview scores are informative and central to admission and hiring decisions. Their use is pervasive across the world: in schools, universities, medical positions, financial institutions, and many other industries. However, these scores often differ systematically across sub-groups (such as gender, age, or race).
I theorize that people exhibit a broader "differential signal neglect": they fail to recognize that the same observable score predicts different outcomes for different groups. Even when given information about bias patterns or adjustment tools, decision-makers anchor on mean differences and insufficiently adjust for the full complexity of group-specific signal-to-outcome mappings. This parallels systematic biases in econometric practice, where researchers control for group means while ignoring higher-order distributional differences. Relatedly, widespread research methods across the economic, political, and statistical sciences incorrectly fold genuine differences (such as those arising from differential selection) into the adjustment and so fail to provide unbiased estimates.
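To make the mean-anchoring point concrete, the following sketch uses a stylized, purely hypothetical data-generating process (not the study's estimates): adjusting only for a group mean difference leaves systematic, score-dependent prediction error whenever the score-to-productivity slope also differs by group.

```python
# Minimal sketch under a hypothetical data-generating process (not the study's
# estimates): a mean-only correction leaves score-dependent prediction error
# when the score-to-productivity slope differs by group.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
group = rng.integers(0, 2, n)
productivity = rng.normal(50, 10, n)
# Hypothetical mappings: group 1's scores have a different intercept and slope.
score = np.where(group == 0,
                 productivity + rng.normal(0, 5, n),
                 10 + 1.2 * productivity + rng.normal(0, 5, n))

# "Mean correction": regress productivity on score plus a group dummy.
X_dummy = np.column_stack([np.ones(n), score, group])
b_dummy, *_ = np.linalg.lstsq(X_dummy, productivity, rcond=None)
resid_dummy = productivity - X_dummy @ b_dummy

# Group-specific mapping: also let the slope differ by group.
X_full = np.column_stack([np.ones(n), score, group, score * group])
b_full, *_ = np.linalg.lstsq(X_full, productivity, rcond=None)
resid_full = productivity - X_full @ b_full

# Within each group, the dummy-only residuals stay correlated with the score;
# the interacted model removes that systematic, score-dependent error.
for g in (0, 1):
    m = group == g
    print(f"group {g}: corr(dummy resid, score) = {np.corrcoef(resid_dummy[m], score[m])[0, 1]:+.2f}, "
          f"corr(full resid, score) = {np.corrcoef(resid_full[m], score[m])[0, 1]:+.2f}")
```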
In this experiment, I test the ability of people to use simulated interview scores to evaluate expected productivity across different subgroups (genders) in an incentivized repeated decision problem. In the main task, participants complete 40 binary decisions. In each decision, they hire either a man or a woman after observing the interview scores assigned by a pair of biased interviewers. After every five decisions, they receive information about their selections. Some participants know they are comparing across genders, while others simply believe they are hiring across unspecified colleges.
I evaluate how well informational corrections about mean or linear differences in the mapping between interview scores and productivity improve decisions, and I compare them to more active de-biasing of the displayed scores (mean or linear). I compare people's baseline learning rate with a Bayesian model and evaluate how effectively the treatments improve decision-making and how they affect hired-worker productivity and equity. I also assess the role of first completing a learning-by-guessing task, in which participants predict productivity from interview scores, to compare within-gender and across-gender learning behaviour and to assess how treatment conditions shape within-gender learning. Through this experiment, I aim to understand how people evaluate others in crucial decision problems based on prior experience and limited information consisting of biased but relevant proxies. I also aim to understand the extent to which very simple interventions can improve human decision-making in both efficiency and equity terms, and the extent to which human decision-makers continue to neglect important outcome-relevant history.
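The exact corrections are specified in the treatment descriptions; purely as an illustration of the two forms of active de-biasing named above, a mean correction shifts displayed scores by a constant, while a linear correction re-maps them with an intercept and slope (all numbers below are hypothetical).

```python
# Illustrative only: the exact treatment corrections are described elsewhere.
# This shows the generic form of a mean versus a linear de-biasing of the
# scores displayed to participants; all numbers are hypothetical.
import numpy as np

def mean_debias(scores: np.ndarray, group_gap: float) -> np.ndarray:
    """Subtract a constant estimated mean gap from the advantaged group's scores."""
    return scores - group_gap

def linear_debias(scores: np.ndarray, intercept: float, slope: float) -> np.ndarray:
    """Apply an affine correction, allowing larger adjustments at higher scores."""
    return intercept + slope * scores

raw = np.array([55.0, 65.0, 75.0, 85.0])
print(mean_debias(raw, group_gap=4.0))                 # uniform downward shift
print(linear_debias(raw, intercept=6.0, slope=0.85))   # correction grows with the score
```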
This experiment has four components for participants: 1) an attention check (used simply to detect bots and inattentive participants), 2) a productivity-estimation task that 50% of participants complete, 3) a hiring-decision task (the main experimental task), and 4) a survey and a mental-ability estimation task.
For this experiment, I am focusing on numerical interview scores. Despite their importance and prevalence, it is plausible that decision-makers often fail to account for how these scores differentially map to actual performance across demographic sub-groups. This experiment tests whether people can learn and adjust for these differential mappings or whether they remain anchored on the proxy distribution.
Given the observed disparity in cross-group hiring outcomes, I focus on comparisons between men and women. Cross-group ability comparisons between men and women are also frequently misestimated in economics using other proxies, such as Mincer residuals, even when gender dummies are included.
I use interview scores that are gender-biased, with larger bias at high productivity levels, to evaluate whether participants can effectively disentangle sub-group differences in the mapping from observables to latent variables.
A primary goal of this experiment is to evaluate learning rather than simply assessing baseline ability to predict gender bias. As such, each participant completes 40 binary decision tasks, deciding whether to recommend hiring the male or the female applicant based on their interview scores. Participants receive feedback on their performance every five decisions, and their performance directly affects their payment.
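The following minimal simulation sketches the structure of the main task under hypothetical parameters (the true score distributions and bias function are not those used in the experiment): 40 binary rounds with gender-biased scores, a naive decision rule that hires the higher displayed score, and feedback summarised every five decisions.

```python
# A minimal simulation of the task structure under hypothetical parameters:
# 40 binary rounds, gender-biased scores, and feedback every 5 decisions.
# The naive rule below simply hires the candidate with the higher displayed score.
import numpy as np

rng = np.random.default_rng(2)
ROUNDS, FEEDBACK_EVERY = 40, 5
hired_productivity = []

for t in range(1, ROUNDS + 1):
    prod = rng.normal(50, 10, size=2)                  # [man, woman] true productivity
    bias = np.array([2 + 0.3 * (prod[0] - 50), 0.0])   # hypothetical: men's scores inflated, more so at the top
    score = prod + bias + rng.normal(0, 5, size=2)

    choice = int(np.argmax(score))                     # naive rule: higher displayed score wins
    hired_productivity.append(prod[choice])

    if t % FEEDBACK_EVERY == 0:                        # feedback block, as in the experiment
        block = hired_productivity[-FEEDBACK_EVERY:]
        print(f"decisions {t-4}-{t}: mean hired productivity = {np.mean(block):.1f}")
```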
Key questions include: How does demonstrated learning compare with optimal or Bayesian learning? To what extent does failure to learn (relative to a Bayesian) in incentivized tasks affect correct decisions among binary pairs? What is the impact of discrepancies in learning speed on the productivity of hired workers? How do biased signals affect the selection of women (versus men)? Relatedly, what can we do to improve decision-making or learning? How effective is providing information about disparities in the mapping between observable and latent variables? What about more active forms of de-biasing (incorporating the provided information about disparities into the scores displayed to participants)? How does the effectiveness of information or de-biasing differ depending on the timing of the intervention?
Participants:
N = 1,600 MTurk workers from India
Age range: 25-65 years
Rationale: India's labor market features numerical assessment systems for interview scores (with substantial influence on career trajectories) and significant gender disparities, making this population particularly relevant. Interviewers and managers are rarely hired before the age of 25, so this is the most appropriate age band.
Each participant completes 40 comparisons in the main comparison task, and half complete 10 productivity-guessing questions before the main comparison task.
Specifically, this experiment consists of nine binary treatments. I focus on four main treatments categorized into two treatment types (information or active de-biasing), one condition type essential for understanding the differential impact of the other treatments (gender knowledge), one condition type that improves our modeling of learning (the guessing pre-task), and three condition types used primarily to supplement our theoretical understanding and evaluate salience effects.
The main treatments consist of information provision and active debiasing, each having two treatment levels (excluding controls). These are described elsewhere.
Gender knowledge is randomized.
The design incorporates progressive information revelation, described above.
50% of participants complete a pre-task consisting of guessing the productivity of candidates based on their interview scores. These scores are drawn from the same distribution as the scores of one of the genders shown in the comparison task. This aims to identify learning speed within interviewers, as opposed to cross-interviewer differences.
50% of participants will receive an experimenter-demand shock stating the importance of gender to the researcher. This is intended to bound the effects of such bias.
Similarly, 50% of participants will complete a survey about gender-related attitudes and justice before the main task (versus after), to identify the effects of study participation on stated attitudes, detect bias, and better evaluate how these attitudes shape learning.
50% of participants will complete the 4-question Raven’s Matrices task before doing the main comparison task.
For each main outcome, I will also identify the optimal Bayesian decisions given a participant's past score history and the corresponding optimal treatment effect. Of particular importance is the comparison of the Bayesian benchmark's decision intercepts, slopes, and quadratic estimates with participants' displayed learning.
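As a hedged illustration of one way such a Bayesian benchmark could be constructed (the exact benchmark is not reproduced here; the priors, noise variance, and linear functional form below are assumptions), the learner keeps a conjugate posterior over a group-specific score-to-productivity mapping, updates it after each feedback observation, and hires the candidate with the higher posterior-mean prediction.

```python
# A sketch of one possible Bayesian benchmark learner. It keeps a conjugate
# normal posterior over a group-specific (intercept, slope) mapping from
# interview score to productivity, updates after each feedback observation,
# and hires the candidate with the higher posterior-mean prediction.
# The priors, noise variance, and linear functional form are assumptions.
import numpy as np

class GroupMapping:
    """Bayesian linear regression of productivity on score with known noise sd."""

    def __init__(self, noise_sd: float = 5.0):
        self.noise_var = noise_sd ** 2
        self.prec = np.eye(2) / 100.0    # prior precision on (intercept, slope)
        self.b = np.zeros(2)             # accumulates X'y / noise_var (prior mean is zero)

    def update(self, score: float, productivity: float) -> None:
        x = np.array([1.0, score])
        self.prec += np.outer(x, x) / self.noise_var
        self.b += x * productivity / self.noise_var

    def predict(self, score: float) -> float:
        beta = np.linalg.solve(self.prec, self.b)   # posterior-mean coefficients
        return float(np.array([1.0, score]) @ beta)

# Usage: one mapping per gender; hire whichever has the higher predicted productivity.
men, women = GroupMapping(), GroupMapping()
men.update(score=72.0, productivity=63.0)      # hypothetical feedback observations
women.update(score=68.0, productivity=66.0)
print("hire the woman" if women.predict(70.0) > men.predict(70.0) else "hire the man")
```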
Other randomization components are implemented, but they are used primarily either for external validity (threshold analyses, and linear and quadratic analyses of the impact of gender-bias size, candidate variance, noise, and position covariance) or for checking position effects (such as MenLeft).