Experimental Design Details
In this study, we will provide expert subjects, called evaluators, with information about applicants collected in Avery et al. (2023). They will be told, truthfully, that their decisions will help us decide to whom to offer the position.
Evaluators will be recruited to perform a freelance hiring activity. As this is a natural field experiment, they will not be told that they are in an experiment. Each evaluator will be randomized into one of two treatments: gender-known or gender-unknown. Each evaluator will be provided with information about the job and the recruitment process. Then, evaluators will be shown a series of three pairs of applicants. Two of these pairs will be mixed-gender, while in one pair both applicants will be male. For each pair, their task will be to choose who should be considered for the position. For each applicant, they will be provided with the following information: the applicant's years of experience, their education (whether they hold at least a university degree), where they learned coding, which coding languages they know, their answers to four interview questions, and three evaluations provided by other evaluators as part of Avery et al. (2023). In addition to the above information, evaluators in the gender-known treatment will also be provided with the first name and last initial of each applicant. The ordering of the applicants on the screen will be determined randomly. The exact same pairs will be shown in the gender-known and gender-unknown treatments.
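As an illustration of the presentation logic, the following Python sketch shows one way the treatment assignment and on-screen ordering could work; all function and field names are hypothetical and are not taken from the study's materials.

```python
import random

# Hypothetical sketch of the assignment and display logic described above.
TREATMENTS = ["gender_known", "gender_unknown"]

def assign_treatment(rng: random.Random) -> str:
    """Randomize an evaluator into one of the two treatments."""
    return rng.choice(TREATMENTS)

def present_pair(pair, treatment, rng: random.Random):
    """Return the two applicant profiles in random on-screen order,
    adding first name and last initial only in the gender-known arm."""
    shown = []
    for applicant in rng.sample(pair, k=2):  # random left/right order
        profile = {
            "experience_years": applicant["experience_years"],
            "university_degree": applicant["university_degree"],
            "learned_coding_at": applicant["learned_coding_at"],
            "languages": applicant["languages"],
            "interview_answers": applicant["interview_answers"],
            "evaluations": applicant["evaluations"],  # three prior scores
        }
        if treatment == "gender_known":
            profile["name"] = f"{applicant['first_name']} {applicant['last_name'][0]}."
        shown.append(profile)
    return shown
```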
In a given pair, the evaluation scores shown for the applicants, referred to as the set of evaluations, may vary across evaluators. For example, in evaluation set 1, Applicant 1 could have scores of 10, 20, and 30, while Applicant 2 could have scores of 20, 80, and 100. In evaluation set 2, the scores could differ: Applicant 1 could have 80, 20, and 30, and Applicant 2 could have 70, 80, and 50. This design enables us to examine how evaluators rate the same applicant when properties of the evaluations, such as their variance, differ. All evaluations shown will be real evaluations given to that applicant; because each applicant received at least three evaluations, we can vary the evaluation scores shown to each evaluator.
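To make the example concrete, a short Python snippet computing the mean and variance of the two illustrative evaluation sets above (the numbers are those from the text, not real data):

```python
from statistics import mean, pvariance

# The two illustrative evaluation sets from the text above.
set_1 = {"Applicant 1": [10, 20, 30], "Applicant 2": [20, 80, 100]}
set_2 = {"Applicant 1": [80, 20, 30], "Applicant 2": [70, 80, 50]}

for label, ev_set in [("Set 1", set_1), ("Set 2", set_2)]:
    for applicant, scores in ev_set.items():
        print(f"{label}, {applicant}: mean={mean(scores):.1f}, variance={pvariance(scores):.1f}")
```

Note, for instance, that Applicant 2's mean is the same (66.7) in both sets while the variance differs sharply (1155.6 versus 155.6); this is the kind of variation the design exploits.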
For each pair of applicants, evaluators will choose which applicant they recommend to be hired. Then, they will be presented with the applicant they chose from each of the three pairs and asked which of those three they would recommend most. While these data will not be used in the primary analysis, this final question serves to justify presenting three pairs rather than a single pair.
To select the pairs of applicants, we use the following procedure (illustrative code sketches of the filtering and evaluation-set steps follow the list):
• Starting with the sample of applicants from Avery et al. (2023), we first drop individuals with names that are not typically Western-sounding, those with gender-neutral names, and those with fewer than three evaluations.
• We also drop applicants whose answers to the interview questions were unusually short or unusually long relative to the average.
• We then focus on the sample of applicants whose average evaluation score places them in the top part of the distribution, as this is the sample from which the selected applicant is likely to come.
• We then create two groups of pairs: exact-match pairs, where the CVs of the two applicants within the pair are identical or very similar; and trade-off pairs, where the CVs of the two applicants within the pair are close in quality but differ in areas of relative strength (e.g., one applicant might have higher education, while the other has more years of experience).
• For each type of pair, one pair is male-male and the remainder are mixed-gender pairs.
• For each pair, we create three sets of evaluation scores. We aim to select sets with a low mean difference between the two applicants in a pair while, where possible, varying the variance of the scores across sets. We also aim to keep the correlation between the mean and the variance of the scores low.
• In total, we select 10 pairs and three evaluation sets per pair.
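For concreteness, here is a hedged Python sketch of the sample-restriction steps, assuming a pandas DataFrame of the Avery et al. (2023) applicants with illustrative column names; the two-standard-deviation length cutoff and the median cutoff for the score distribution are assumptions, not registered thresholds.

```python
import pandas as pd

def restrict_sample(df: pd.DataFrame, top_quantile: float = 0.5) -> pd.DataFrame:
    """Apply the drops described in the list above; column names are illustrative."""
    out = df[df["name_typically_western"] & ~df["name_gender_neutral"]]
    out = out[out["n_evaluations"] >= 3]
    # Drop answers whose total length is an outlier relative to the average
    # (two standard deviations is an assumed cutoff).
    length = out["answer_length"]
    out = out[(length - length.mean()).abs() <= 2 * length.std()]
    # Keep applicants in the top part of the mean-evaluation distribution
    # (the quantile used here is an assumption).
    return out[out["mean_evaluation"] >= out["mean_evaluation"].quantile(top_quantile)]
```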
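Similarly, a minimal sketch of the evaluation-set step: enumerate three-score draws from each applicant's real evaluations, keep draws with a small within-pair mean difference, and spread the chosen sets across low, middle, and high variance. The mean-gap threshold and the spacing rule are illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, pvariance

def admissible_sets(scores_a, scores_b, k=3, max_mean_gap=10.0):
    """All k-score draws per applicant whose within-pair mean gap is small
    (the 10-point gap is an assumed threshold)."""
    return [
        (a, b)
        for a in combinations(scores_a, k)
        for b in combinations(scores_b, k)
        if abs(mean(a) - mean(b)) <= max_mean_gap
    ]

def spread_by_variance(candidates, n=3):
    """Pick n admissible draws spanning low to high dispersion, measured as
    the variance of the six combined scores, so evaluators see the same
    pair under different score dispersion."""
    ranked = sorted(candidates, key=lambda ab: pvariance(ab[0] + ab[1]))
    idx = [0, len(ranked) // 2, len(ranked) - 1][:n]
    return [ranked[i] for i in idx]
```

Applied to each of the 10 pairs, this yields the three evaluation sets per pair; the chosen sets would then be checked for a low correlation between mean and variance, as described above.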
So as not to disadvantage the applicants who were not chosen to appear in the pairs, we will use a system similar to that of Kessler et al. (2019): the decisions made by our evaluators over the sample pairs will be used to identify the applicants in the full sample whom the evaluators would have chosen. We will then invite the applicants who are predicted to be selected for a further interview and, from there, to be hired.
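As a rough illustration of this step (not Kessler et al.'s actual estimator), one could fit a simple discrete-choice model on the evaluators' within-pair decisions and score the full applicant sample with the fitted preference weights; the feature construction below is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_choice_model(X_left, X_right, chose_left):
    """Fit P(left applicant chosen) on within-pair attribute differences.
    X_left, X_right: (n_decisions, n_features) arrays; chose_left: 0/1."""
    model = LogisticRegression()
    model.fit(X_left - X_right, chose_left)
    return model

def preference_scores(model, X_full):
    """Score every applicant in the full sample; applicants with the
    highest scores are the ones evaluators are predicted to select."""
    return np.asarray(X_full) @ model.coef_.ravel()
```

Applicants with the highest predicted scores would then be the ones invited for the further interview described above.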
Our sample will be drawn from MTurk, Prolific, and, if viable, Upwork. We plan to collect 45% of our sample from MTurk, 45% from Prolific, and 10% from Upwork, although the latter depends on our ability to recruit viable evaluators.
References
Kessler, Judd B., Corinne Low, and Colin D. Sullivan. "Incentivized resume rating: Eliciting employer preferences without deception." American Economic Review 109, no. 11 (2019): 3713-3744.