Recruitment Screening with Human and Artificial Intelligence

Last registered on January 30, 2024

Pre-Trial

Trial Information

General Information

Title
Recruitment Screening with Human and Artificial Intelligence
RCT ID
AEARCTR-0011651
Initial registration date
July 04, 2023

First published
July 10, 2023, 9:23 PM EDT

Last updated
January 30, 2024, 12:52 PM EST

Locations

Region

Primary Investigator

Affiliation
University of Zurich

Other Primary Investigator(s)

PI Affiliation
University of Zurich
PI Affiliation
University of Zurich

Additional Trial Information

Status
Completed
Start date
2023-07-05
End date
2023-08-30
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This study aims to investigate the potential role of artificial intelligence (AI), in the form of generative large language models, in assisting with the screening of labor in the recruitment process.

We will conduct a randomized controlled trial (RCT) while assisting a non-profit organization in recruiting talent – teachers – within the education sector in a sub-Saharan country. The experiment involves human evaluators who assess incoming applications for the NGO's fellowship program, with or without AI assistance. The human evaluators are tasked with screening applications and deciding who continues to the next, in-depth interview phase, based on a set of criteria deemed desirable by the organization. We give the same task to the generative AI using prompts. We will explore the disagreement between human and AI evaluations of applications, the behavioral responses of human evaluators under AI assistance in terms of effort and application grading, and the quality of the candidates who are passed on to the interview phase.

Applications will be randomized into four treatment groups: Human-only (control); Human with AI score assistance; Human with AI score and rationale assistance; and AI-only. The treatment group determines whether the evaluator receives AI assistance in scoring an application and which score (AI-generated or evaluator-determined) is used to select applicants. In addition to the behavioral responses of human evaluators when assisted, we will compare the efficiency of AI-only and AI-assisted recruitment to human-only screening, focusing on the quality of selected applicants, screening costs, and speed.
External Link(s)

Registration Citation

Citation
Awuah, Kobbina, Ursa Krenk and David Yanagizawa-Drott. 2024. "Recruitment Screening with Human and Artificial Intelligence." AEA RCT Registry. January 30. https://doi.org/10.1257/rct.11651-1.1
Sponsors & Partners

There is information in this trial unavailable to the public.
Experimental Details

Interventions

Intervention(s)
The intervention consists of providing evaluators – the people who assess incoming applications – with AI assistance. AI assistance can take two forms: providing them with the scores that the AI would have predicted for a given application, or providing them with those scores together with an AI-generated explanation for the choice. In the AI-only condition, applications are assessed without any human grading or decisions.
Intervention Start Date
2023-07-05
Intervention End Date
2023-08-30

Primary Outcomes

Primary Outcomes (end points)
Using data from the first three conditions, we can estimate the behavioral responses by human evaluators under AI assistance, in terms of:
Time spent grading the applications (effort)
Grading decisions (a dummy for whether the AI and human evaluators’ final scores are the same; a continuous measure of disagreement in scores)

For application outcomes, we will examine the quality of the top-k selected candidates in each randomized group, as measured in the NGO’s in-depth interviews during the assessment center phase.
Primary Outcomes (explanation)
Time spent grading the applications (effort)
We will use the time spent assessing an application as a proxy for the effort level. This will be measured via the Qualtrics platform, where all the grading takes place. We will compare this value across the different groups. Evaluators could choose to spend more time on the AI-assisted applications to have greater confidence in their score, or they could choose to just skim the application because they trust the AI. Note that the evaluators are informed of whether they will receive AI assistance for a given application.
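As a rough illustration of the planned comparison, the Stata sketch below regresses grading time on treatment indicators, with standard errors clustered at the application level (the randomization unit). The variable names (grade_seconds, treatment, app_id) are placeholders, not the final dataset's names.

* Illustrative comparison of grading time (effort proxy) across treatment arms.
* grade_seconds = Qualtrics timing per question; treatment = assigned arm; app_id = application.
regress grade_seconds i.treatment, vce(cluster app_id)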

Grading decisions
For any application question (six in total), we can calculate the disagreement in scores between any human evaluator score and the AI score. In the conditions with AI assistance, we can observe human evaluator scores both before and after receiving the AI feedback. The final score is defined as the score after AI feedback, if any. The pre-revision score is the score by the human evaluator before receiving the AI feedback, but after knowing whether such feedback will be given. The final score is the one that will be used for calculations of disagreement in the main outcome. Auxiliary calculations will be conducted to see if the conditions also affect the pre-revision score, which may occur if there are effort/anticipation effects.
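A minimal sketch of how the disagreement measures could be constructed in Stata, assuming one row per application-question pair and placeholder variable names (ai_score, human_final_score, human_prerevision_score); the absolute difference is used here as one possible continuous disagreement measure.

* Dummy for exact agreement between the AI score and the evaluator's final score.
gen byte agree_final = (human_final_score == ai_score)
* Continuous disagreement, for the final score and (as an auxiliary check) the pre-revision score.
gen disagree_final = abs(human_final_score - ai_score)
gen disagree_pre   = abs(human_prerevision_score - ai_score)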

Application outcomes
We will determine the quality of the top-k selected candidates in each group who are passed on to the in-depth interview assessment centers. The NGO sets the criteria on which candidates are evaluated, and we will use its scores. Since we do not know how many applications there will be in total, we also do not know how many applications there will be in each condition, or how many will be passed on to the interviews. Therefore, we cannot specify the exact number k in advance, but we will make sure to estimate treatment differences across the same number of top-k applications.
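As an illustration, the top-k comparison could proceed along the following lines in Stata, with k fixed ex post at the same value for all arms. The variables screen_score, ac_score, and treatment are placeholder names, and k = 20 simply mirrors the value used in the power calculations below.

* Rank applications within each arm by the score used for selection, keep the top k.
local k = 20
gsort treatment -screen_score
by treatment: gen rank = _n
keep if rank <= `k'
* Compare mean assessment-center quality of the top-k candidates across arms (human-only as base).
regress ac_score i.treatment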

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
All grading will take place on the Qualtrics platform. We randomize applications into four treatment groups: Human-only decisions (control); Human decisions with AI score assistance; Human decisions with AI score and rationale assistance; and AI-only decisions. Thus, the treatment group determines two things: (1) whether the evaluator will receive AI assistance in scoring an application and (2) which score (AI-generated or evaluator-determined) will be used to select applicants. At the beginning of each application, the human evaluators will be informed whether, and what type of, AI assistance will be given. If assistance is given, it will be provided after the human evaluator enters an initial score: once the initial score is entered, the information from the AI assistance will be revealed and the human evaluator will be asked for a final score.
Experimental Design Details
The experimental platform will be hosted on Qualtrics, where evaluators will grade applications using their respective treatment group's form. The application data will be uploaded and randomized by the University of Zurich team to determine treatment assignment. Evaluation results will be collected by the University and shared with Lead For Ghana for further selection processes.
An important aspect of the experiment is to maintain confidentiality in terms of treatment groupings and scores. Lead For Ghana will not be informed of which applications were assigned to which treatment group, and evaluators will only know whether or not they receive AI assistance.
In summary, this experimental design aims to investigate the impact of AI assistance on the evaluation process for Lead For Ghana applications. By examining how AI-generated scores and rationales affect human evaluators' decisions, the study will provide insights into the potential benefits and limitations of incorporating AI systems into decision-making processes.
Randomization Method
Randomization done in office by a computer.
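A minimal sketch of the computer randomization at the application level, assuming a hypothetical dataset with one row per application; the file name and seed are illustrative only.

* Randomize applications into the four arms in (approximately) equal proportions.
use applications.dta, clear
set seed 20230705
gen double u = runiform()
sort u
gen byte arm = mod(_n - 1, 4) + 1
label define armlbl 1 "Human-only" 2 "AI score" 3 "AI score + rationale" 4 "AI-only"
label values arm armlbl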
Randomization Unit
Application.
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
We do not know in advance how many applications there will be. The hope is that at least 400 applications will be received.
Sample size: planned number of observations
For each application, we will observe grading on 6 questions. We do not know in advance how many applications there will be. The hope is that at least 400 applications will be received, in which case the total number of observations across all four conditions will be 2,400 application-question scores. In the subset of human-only and human-with-AI-assistance conditions, this would amount to 1,800 application-question scores. For application outcomes measured in the assessment center, we do not know the number of observations in advance, as it depends on how many applicants receive a passing grade.
Sample size (or number of clusters) by treatment arms
We do not know in advance how many applications there will be. The hope is that at least 400 applications will be received, in which case we will have roughly 100 applications in each of the four conditions.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
A few important factors make it very difficult to perform meaningful ex ante power calculations. First, we do not know in advance how many applications there will be. Second, we do not have any historical data on key outcomes such as time spent on an application, or disagreement scores in the presence of an AI system. Therefore, variances and intra-cluster correlations are not known, and it is very difficult to form a reasonable prior on them. Third, we do not know how many applicants will be passed on to the assessment center. Together, these factors mean that power calculations are arguably not very meaningful. That said, below we provide some highly suggestive calculations based on various assumptions.

TIME SPENT ON EACH QUESTION
We simulate experimental data following normal distributions parameterized as follows:
human: mu = 120, sd = 40
ai-score: mu = 100, sd = 50
ai-rationale: mu = 90, sd = 20
We assume 60% intra-cluster correlation, but this is not based on any real data.

POWER AND STATA CODE
For 300 total applications:
human vs. ai-score: power = 0.912
power twomeans 120 100, cluster m1(6) m2(6) k1(75) k2(75) sd1(40) sd2(50) rho(0.6)
human vs. ai-rationale: power = 1.000
power twomeans 120 90, cluster m1(6) m2(6) k1(75) k2(75) sd1(40) sd2(20) rho(0.6)

For 600 total applications:
human vs. ai-score: power = 0.9968
power twomeans 120 100, cluster m1(6) m2(6) k1(150) k2(150) sd1(40) sd2(50) rho(0.6)
human vs. ai-rationale: power = 1.000
power twomeans 120 90, cluster m1(6) m2(6) k1(75) k2(75) sd1(40) sd2(20) rho(0.6)

PROBABILITY OF REVISION
Assuming a baseline revision probability of 0.1 (AI score only) and 0.2 (AI rationale), with 30% intra-cluster correlation:
300 total applications: power = 0.7592
power twoproportions 0.1 0.2, cluster m1(6) m2(6) k1(75) k2(75) rho(0.3)
600 total applications: power = 0.9653
power twoproportions 0.1 0.2, cluster m1(6) m2(6) k1(150) k2(150) rho(0.3)
Assuming a baseline revision probability of 0.4 (AI score only) and 0.6 (AI rationale):
300 total applications: power = 0.969

AC SCORE CORRELATION
Data from last year (N=137) has a mean of 3.38 and a standard deviation of 0.42. We assume that means can vary and that the standard deviation is constant across groups. We look at the mean AC score for the top k=20 candidates in each group.
MDE: 0.38
power twomeans 3.38 3.76, sd(0.42) n(40)
Power assuming an effect of 0.5: 0.956
power twomeans 3.38 3.88, sd(0.42) n(40)

The comparison will happen at the following levels:
a) Any AI assistance vs. Human-only
b) AI-score-only vs. Human-only
c) AI-rationale vs. Human-only
d) AI-only vs. Human-only
Given that we assume a constant standard deviation and the baseline (human-only) remains the same, the power calculations apply to all levels.
IRB

Institutional Review Boards (IRBs)

IRB Name
University of Zurich, Human Subjects Committee of the Faculty of Economics, Business Administration, and Information Technology
IRB Approval Date
2023-06-12
IRB Approval Number
2023-046

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Is public data available?
No

Program Files

Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials