Understanding How Information about AI affects Response to AI in the Recruitment Process

Last registered on June 24, 2024

Pre-Trial

Trial Information

General Information

Title
Understanding How Information about AI affects Response to AI in the Recruitment Process
RCT ID
AEARCTR-0013832
Initial registration date
June 18, 2024

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
June 24, 2024, 2:09 PM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

Region

Primary Investigator

Affiliation
Monash University

Other Primary Investigator(s)

PI Affiliation
Monash University

Additional Trial Information

Status
In development
Start date
2024-06-19
End date
2024-07-19
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
In this project, we study how providing information about AI to candidates and evaluators for a real job affects their application and evaluation behavior.
External Link(s)

Registration Citation

Citation
Avery, Mallory and Andreas Leibbrandt. 2024. "Understanding How Information about AI affects Response to AI in the Recruitment Process." AEA RCT Registry. June 24. https://doi.org/10.1257/rct.13832-1.0
Experimental Details

Interventions

Intervention(s)
In this project, we study how providing information about AI to candidates and evaluators for a real job affects their application and evaluation behavior.
Intervention (Hidden)
Digital innovations and advances in AI have produced a range of job applicant assessment tools. Many of these technologies aim to help organizations improve their ability to find the right person for the right job, faster and cheaper than before. However, there is also great concern about the use of these tools and the impact they will have on diversity and bias in recruitment. While no major policies have been put into place to regulate the use of AI in recruitment, such policies are being discussed. A major component of these policies, particularly in “high-risk” contexts like recruitment, is transparency, so that the job applicants and recruiters who interact with AI know it is happening and are aware of the potential for these AI systems to be biased.

To understand the potential impacts of communicating the use of AI, and the potential for bias in AI, to applicants and recruiters, we design a novel set of experiments. In the first experiment, we collect applicant data by posting a job advertisement for a real web designer job. All candidates must answer a series of interview-style questions. We randomly vary the information provided to the applicant about the use of AI for evaluating their interview answers. Specifically, we have four treatments:

i) AI-Explained
We inform applicants that AI will be used to evaluate their applications, and we provide a brief, non-technical explanation as to what AI is and how it works.

ii) AI-Bias
We provide the information in AI-Explained, plus a brief disclaimer about the potential for bias in AI.

iii) AI-Debiased
We provide the information in AI-Bias, plus a brief true explanation that the AI provider has taken steps to debias their algorithm.

iv) Human-Oversight
We provide the information in AI-Bias, plus a brief true explanation that there will be human oversight of the AI and that humans will make the final decisions.

All applicants, regardless of treatment, will be evaluated by both humans and AI.

We will also conduct additional treatments on the side of the evaluator to better understand mechanisms and potential solutions. We will decide on these only after the treatments described above are conducted. These will be added in Appendix 1 after the first stage of the experiment.

Intervention Start Date
2024-06-19
Intervention End Date
2024-07-19

Primary Outcomes

Primary Outcomes (end points)
We collect the following primary outcomes:

- Proportion who start the candidate assessment: This is defined as the number of candidates who start the candidate assessment divided by the number who receive the candidate assessment. A candidate is considered to have received the assessment if they are sent an email with details on the assessment.

- Proportion who complete the candidate assessment: This is defined as the number of candidates who complete the candidate assessment divided by the number who receive the candidate assessment. A candidate is considered to have received the assessment if they are sent an email with details on the assessment. A candidate is considered to have completed the assessment if they answer all questions.

- Overall assessment evaluation score: This is made up of two parts. (1) AI score: the AI will evaluate all candidates and score them on a scale from 0 to 100 (where 0 is low and 100 is high). (2) Human score: our human evaluators will score all candidates on a scale from 0 to 100.

Note: Any metrics calculated by the AI algorithm, such as the candidate's overall score, are set by an external AI provider and are not influenced by the researchers. The algorithm will not change throughout the project.
Primary Outcomes (explanation)
In all of the following, we will focus primarily on the minority-majority gap (mmg), which we define as the gap in starting, completion, or score between a minority and a majority group. These groups can be defined by gender (women and non-binary people are the minority; men are the majority), by race (underrepresented minorities are the minority; white and Asian people are the majority), or by both (anyone in a gender or race minority category is counted as minority; the remainder are majority). We anticipate that the mmg will favor majorities in all of our measures.
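
As an illustrative formalization (a sketch only, not part of the registration; the sign convention is our assumption), write $\bar{Y}_{t,g}$ for the mean outcome (starting, completion, or evaluation score) of group $g$ under treatment $t$. Then

\[
\text{mmg}_t = \bar{Y}_{t,\text{majority}} - \bar{Y}_{t,\text{minority}},
\]

so a positive value indicates that the outcome favors the majority group, and each comparison below amounts to a contrast of the form $\text{mmg}_t - \text{mmg}_{t'}$ between two treatments.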

Below we outline the primary comparisons and what conclusions will be drawn from them:

1) AI-Explained vs. AI-Bias: by comparing these two groups, we will evaluate the impact that a disclaimer about the possible existence of bias in AI has on the mmg.

The mmg may be higher in AI-Bias due to greater concern from minorities about the presence of bias in the AI system; on the other hand, the mmg may be lower if the presence of a disclaimer makes minority candidates believe that transparency indicates something positive about the AI provider or employer, such as a goal to reduce disparities.

2) AI-Bias vs. AI-Debiased: by comparing these two groups, we will evaluate the impact on the mmg of providing a statement asserting that the AI has been debiased, given that applicants have been told that AI can be biased.

3) AI-Bias vs. Human-Oversight: by comparing these two groups, we will evaluate the impact on the mmg of providing a statement asserting human oversight, given that applicants have been told that AI can be biased.

For both 2 and 3 above, providing this additional information may lead to a smaller mmg by assuaging minorities’ concerns about bias in the AI. However, this information may be ineffective, or even viewed negatively, if it is perceived as a band-aid rather than a genuine intervention to curb disparities between majority and minority groups.

4) AI-Debiased vs. Human-Oversight: by comparing these two groups, we will evaluate the relative efficacy of providing statements of debiased AI vs. statements of human oversight in reducing the mmg.

Using this comparison, we will understand the relative efficacy of providing information about efforts to debias the AI versus information about human oversight of final decision making. These are two proposed ways to assuage the concerns generated by required disclaimers about potential bias in AI. Assertions of human oversight may be more effective if people believe that humans make better final decisions or if they are uncomfortable with AI making important decisions like hiring.

On the other hand, minorities may be concerned about bias against themselves among human evaluators, which would make claims of human oversight less effective.

5) AI-Explained vs. AI-Debiased: by comparing these two groups, we will evaluate the efficacy of providing statements of debiased AI relative to the benchmark of no explanation of bias for the mmg.

6) AI-Explained vs. Human-Oversight: by comparing these two groups, we will evaluate the efficacy of providing statements of human oversight relative to the benchmark of no explanation of bias for the mmg.

For both 5 and 6, we will assess whether the impact of informing applicants about the potential for bias in AI can be mitigated by information about efforts to undo that bias (in AI-Debiased) or by information about human oversight in the decision-making process (in Human-Oversight).

Secondary Outcomes

Secondary Outcomes (end points)
To understand possible mechanisms, we collect the following secondary outcome variables:

- Time to complete the questions: the candidate's completion time is recorded.

- Number of words: the number of words written in the assessment.
Secondary Outcomes (explanation)
See above

Experimental Design

Experimental Design
Our design aims to measure the impact of providing information about AI and AI-generated bias on diversity in recruitment outcomes.
Experimental Design Details
The design consists of two stages.

In stage 1, we will post two job ads for real temporary web designer positions: one across the United States and one across Australia. To apply, applicants must submit their CV and fill out a short survey (e.g., years of programming experience, demographics). Applicants must reside in the US or Australia, depending on the position. After applications close, we will invite all applicants to take part in an assessment. The assessment involves responding to 5 text-based questions. The questions are mainly “behavioural interview questions”: they generally ask candidates to provide an example from their personal experience of when they have demonstrated a particular work-based trait (or behaviour). Candidates are expected to write between 50 and 150 words per question. Prior to taking part in the assessment, candidates are randomly assigned to one of our treatments (described above).

In stage 2, our human evaluators will evaluate the assessments of all candidates. We will use Qualtrics or another data collection company to recruit the human evaluators. Our sample of human evaluators consists of 1) people who work as web designers and who are responsible for hiring, and 2) people who work in recruitment (including recruiting web developers). Each evaluator will be shown the responses to the assessment questions of several candidates (this number will be updated prior to the commencement of stage 2; see Appendix 1). The evaluators are given a brief description of the context and their task. For each candidate, they are shown the responses to the assessment questions and information taken from the candidate's CV (education, first name, etc.). Evaluators must then rate each candidate on a scale from 0 to 100 (where 100 is high). To incentivise evaluators, they are told (correctly) that their evaluation scores will be used when deciding whom to hire.

After evaluating the candidates, all evaluators complete a short survey. The survey collects additional information related to the research (e.g., whether they think women or ethnic minority groups would perform worse on these kinds of assessments, job experience, demographics, etc.). This survey will be used to help understand why differences between AI and human evaluators may exist.

Both the AI and the human evaluators will evaluate all candidates. Any metrics calculated by the AI algorithm, such as the candidate's overall score, are set by an external AI provider and are not influenced by the researchers. The algorithm will not change throughout the project.

We will also conduct additional treatments on the side of the evaluator to help understand behaviour and possibly to test solutions. However, these will be added in Appendix 1 after the first stage of the experiment.
Randomization Method
Randomization will be carried out by a computer.
Randomization Unit
The randomization unit will be the individual for all treatments.
Was the treatment clustered?
No
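
As a minimal illustration of this assignment step (a sketch only, not the study's actual code; the dataframe and the column names country, gender, and racial_minority are placeholders), computer randomization with the cross-stratification described under Experiment Characteristics could look like the following:

    import numpy as np
    import pandas as pd

    # Treatment labels as defined in the Interventions section.
    TREATMENTS = ["AI-Explained", "AI-Bias", "AI-Debiased", "Human-Oversight"]

    def assign_treatments(applicants: pd.DataFrame, seed: int = 2024) -> pd.DataFrame:
        """Assign individuals to the four arms in roughly equal (25%) shares
        within each country x gender x racial-minority stratum."""
        rng = np.random.default_rng(seed)
        out = applicants.copy()
        out["treatment"] = ""
        for _, stratum in out.groupby(["country", "gender", "racial_minority"]):
            # Shuffle the stratum, then cycle through the arms so the split
            # within each stratum is as even as possible.
            shuffled = rng.permutation(stratum.index.to_numpy())
            for i, row_label in enumerate(shuffled):
                out.loc[row_label, "treatment"] = TREATMENTS[i % len(TREATMENTS)]
        return out

Assignment would occur after the application stage and before candidates receive the assessment, consistent with the design described above.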

Experiment Characteristics

Sample size: planned number of clusters
The number of clusters will equal the number of legitimate candidates, since randomization is at the individual level. Based on previous experience, we expect to have 1200 legitimate job applicants, but this will likely vary depending on a number of factors. A legitimate candidate is someone who resides in the US or Australia and completes the initial application form. We plan to assign 25% of the applications to each treatment. We will cross-stratify by country, gender, and racial minority status.

For stage 2, we plan to recruit enough human evaluators so that each candidate is assessed by several human evaluators. We will decide on the exact number of potential human evaluators based on the number of candidates in stage 1. The updated number can be found in Appendix 1.
Sample size: planned number of observations
For the applicant sample (stage 1), the number of observations is the same as the number of clusters. We hope to have up to 1200 observations. For stage 2, the AI will evaluate all candidates. The candidate pool will also be evaluated by human evaluators. Each candidate will be evaluated by several human evaluators. The number of human evaluators and the number of candidates assessed by each evaluator will be added to the pre-analysis plan before the start of stage 2 (see Appendix 1).
Sample size (or number of clusters) by treatment arms
See above. We plan to assign 25% of the sample to each treatment. Further, we will cross-stratify by gender, racial minority status, and country.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Proportion who start and proportion who complete: using a significance level of 0.05, power of 0.8, and a value of 0.6 for the proportion who start/complete in the absence of AI, we can detect a minimum effect size of 0.058. We will add the other MDEs before the start of stage 2, once we have confirmed the stage 2 sample size.
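
For reference, this kind of minimum-detectable-effect calculation can be sketched as follows, assuming a two-sided two-sample comparison of proportions (an illustration only; the per-group sample size below is a placeholder, since the group sizes behind the registered figure of 0.058 depend on which comparison is being powered):

    import numpy as np
    from statsmodels.stats.power import NormalIndPower

    baseline = 0.60       # proportion who start/complete in the absence of AI (from above)
    n_per_group = 600     # placeholder group size, not the registered assumption
    alpha, power = 0.05, 0.80

    # Solve for the detectable standardized effect size (Cohen's h), then
    # convert it back to a difference in proportions via the arcsine transform.
    h = NormalIndPower().solve_power(nobs1=n_per_group, alpha=alpha, power=power,
                                     ratio=1.0, alternative="two-sided")
    p_detectable = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2.0) ** 2
    print(f"minimum detectable difference in proportions: {p_detectable - baseline:.3f}")
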
IRB

Institutional Review Boards (IRBs)

IRB Name
Monash University Ethics Committee
IRB Approval Date
2018-09-18
IRB Approval Number
N/A

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials