
Does large language model technology really equalize low- and high-skill workers?

Last registered on September 03, 2025

Pre-Trial

Trial Information

General Information

Title
Does large language model technology really equalize low- and high-skill workers?
RCT ID
AEARCTR-0016607
Initial registration date
August 28, 2025


First published
September 03, 2025, 8:45 AM EDT


Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
Universidad Torcuato Di Tella

Other Primary Investigator(s)

PI Affiliation
Universidad de San Andrés, University of Nottingham, CEDLAS
PI Affiliation
University of Maryland
PI Affiliation
Universidad Torcuato Di Tella

Additional Trial Information

Status
In development
Start date
2025-09-01
End date
2026-06-30
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This study investigates the impact of large language model (LLM)-based chatbots on the labor market, focusing on their potential to reduce performance gaps between low- and high-skilled workers. Our experiment recruits participants with diverse skill levels, defined by educational attainment, and evaluates their performance in a task. The treatment group will complete this task with the assistance of an LLM-based chatbot, while the control group will complete it without any assistance. We will examine the differences between high- and low-skilled individuals in terms of how LLM assistance affects performance in the task.
External Link(s)

Registration Citation

Citation
Cruces, Guillermo et al. 2025. "Does large language model technology really equalize low- and high-skill workers?." AEA RCT Registry. September 03. https://doi.org/10.1257/rct.16607-1.0
Sponsors & Partners

There is information in this trial unavailable to the public.
Experimental Details

Interventions

Intervention(s)
Intervention (Hidden)
In this project, we will implement a randomized online experiment in Argentina to evaluate the impact of large language model (LLM)-based assistance on task performance and its potential to reduce skill-based disparities in the labor market. We will recruit low- and high-skilled subjects through a survey company. Low-skilled individuals are defined as those who have a high school diploma and have completed less than half of a postsecondary program, while high-skilled individuals are those who have completed more than half of a postsecondary program.

Participants will be randomly assigned to one of two groups: a treatment group that has access to an LLM-based chatbot and a control group that does not. Participants will first complete an incentivized task based on a realistic business scenario. In this task, participants receive an email from a hypothetical manager/business owner describing a problem and asking for help understanding its causes and identifying a solution. To respond, participants must review multiple sources of information, diagnose the root causes of the issue (answering three questions posed by the manager/owner), and propose a course of action. Their response should be written in the form of a professional email. We designed three different versions of this task; subjects will be randomly assigned to one of the three options. After completing this task, all participants will answer a series of incentivized follow-up questions related to the main task, but without access to the LLM-based chatbot, regardless of their initial group assignment. The first follow-up question asks respondents to briefly summarize what they identified as the root cause of the problem in the main task. The following two are multiple choice and assess, respectively, their interpretation of the information provided in a figure in the main task and their recall of basic factual details from the task.

We will monitor compliance with the prohibition on using LLMs in the control group by: (1) recording whether subjects exit the survey module, and (2) periodically taking screenshots of respondents' answers. We will then analyze these responses for any sudden changes that may indicate the respondent has pasted a complete response from an LLM-based chatbot.

Intervention Start Date
2025-09-01
Intervention End Date
2026-03-31

Primary Outcomes

Primary Outcomes (end points)
Performance on and time taken to complete the main task, and performance on the follow-up questions.
Primary Outcomes (explanation)
(1) Performance on the main task: we will measure performance on a 0-10 scale, evaluated along two dimensions: (a) content, and (b) writing. The main outcome will be an overall score, calculated as a weighted average (2/3 content, 1/3 writing). We will also analyze each dimension separately. All scores will be standardized relative to the control group.

We will use an LLM for grading. To ensure reliability, we will manually grade a random subset of responses (10%) and compare these scores with those produced by the LLM. The grading scheme for both dimensions is as follows:

*Content (0-10 points):
-Diagnosis (6 points): Up to 2 points per question for each of the 3 questions posed by the business manager/owner.
-Solution (4 points): 1 point for addressing the root cause of the problem, up to 2 points for making a concrete and specific proposal, and 1 point for realism. If the solution does not address the root cause, the respondent scores 0 for the solution subsection.

*Writing (0-10 points):
-Spelling and grammar (2 points)
-Clarity and legibility (3 points)
-Organization (4 points)
-Tone and register (1 point)
Responses that are less than 200 characters long and receive 0 points in content will also receive 0 points in writing.

An additional outcome variable measuring performance in the main task is a binary indicator equal to 1 if the respondent correctly identifies the root cause of the problem, defined as obtaining at least 1 point in the third diagnostic question posed by the business manager.

(2) Time to complete the main task: We will record the time taken to complete the task (in minutes), with values top-coded at 20 minutes.

(3) Performance on the follow-up questions: We will construct a weighted average of two components: (i) the score on the open-ended follow-up question, graded on a 0-2 scale, and (ii) binary indicators for whether the respondent correctly answered each of the two multiple-choice questions. The main outcome will be an overall score, calculated as a weighted average (1/2 open-ended question, 1/4 each multiple-choice question), standardized relative to the control group. We will also analyze each component separately.
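As a concrete illustration, the aggregation rules above can be sketched as follows. This is a minimal sketch under our own naming; the function names and helper structure are assumptions, not part of the registration.

```python
# Sketch of the score aggregation described above (names are ours):
# main-task score = 2/3 content + 1/3 writing; follow-up score =
# 1/2 open-ended (rescaled 0-2 -> 0-1) + 1/4 each multiple-choice item;
# scores standardized relative to the control group.
from statistics import mean, stdev

def overall_main_score(content: float, writing: float) -> float:
    """Weighted average on the 0-10 scale (2/3 content, 1/3 writing)."""
    return (2 / 3) * content + (1 / 3) * writing

def overall_followup_score(open_ended: float, mc1: int, mc2: int) -> float:
    """Weighted average: 1/2 open-ended (0-2 scale), 1/4 each MC indicator."""
    return 0.5 * (open_ended / 2) + 0.25 * mc1 + 0.25 * mc2

def standardize(scores, control_scores):
    """Standardize scores relative to the control-group mean and SD."""
    mu, sigma = mean(control_scores), stdev(control_scores)
    return [(s - mu) / sigma for s in scores]
```

Standardizing against the control group (rather than the full sample) keeps the treatment group's scores interpretable as effects in control-group standard deviations.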

Secondary Outcomes

Secondary Outcomes (end points)
Perceptions of task difficulty
Secondary Outcomes (explanation)
Measured using a Likert-scale question asking respondents to rate the difficulty of the main task, as well as sentiment analysis of responses to an open-ended question about task difficulty.

Experimental Design

Experimental Design
We will conduct an online experiment in which respondents will be presented with an incentivized task based on a realistic business scenario. Participants in the treatment group will have access to an LLM-based chatbot, whereas participants in the control group will not. After completing the main task, all participants, regardless of their assigned group, will answer a series of follow-up questions related to this task, without access to the LLM-based chatbot.
Experimental Design Details
The primary goal of this experiment is to evaluate how access to large language model (LLM)-based assistance impacts task performance and whether it reduces performance gaps between participants with high and low educational attainment. Data will be collected through an online survey conducted on Qualtrics. Participants will be recruited through a survey company in Argentina that maintains panels of individuals who have consented to participate in research. Participants must be between 25 and 45 years old and must complete the survey on a computer within one hour of starting. They will be paid the standard participation rate.

We will seek to obtain a sample balanced by gender, age group (below and above age 35), and educational attainment (high school diploma plus less than half of a postsecondary program vs. more than half of a postsecondary program).

Participants will be randomly assigned to one of two groups:
-Control group: Does not have access to an LLM-based chatbot.
-Treatment group: Has access to an LLM-based chatbot for the main task.

Participants will first complete some sociodemographic questions. After this, respondents from the treatment group will be given a brief tutorial on how to use the LLM-based chatbot. Participants will then be presented with a task based on a realistic business scenario (see description in Intervention section above). There are three different versions of this task; subjects will be randomly assigned to one of the three options. After completing the main task, all participants, regardless of their assigned group, will answer a series of follow-up questions related to this task, without access to the LLM-based chatbot.

Participants will receive an incentive of up to 25,000 Argentine pesos for their performance on the main task and follow-up questions. Respondents will be rewarded for their overall performance relative to individuals of the same skill level (low vs. high) and treatment group (control vs. treatment).

Our main research question is whether access to the LLM-based chatbot narrows the performance gap between high- and low-skill workers. We will therefore compare the treatment effect of providing the AI tool across the two skill groups by regressing our outcome variables on a dummy for whether the individual is in the low-skill group, a treatment dummy, and the interaction between the two (our coefficient of interest). To improve precision, we will control for task fixed effects, age, gender, labor market status, educational attainment, and education/work experience related to the task.
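The specification described above can be written out as follows (the notation is ours, a sketch of the registered design rather than the authors' exact equation):

```latex
Y_i = \beta_0
    + \beta_1 \, \mathrm{LowSkill}_i
    + \beta_2 \, \mathrm{Treat}_i
    + \beta_3 \, (\mathrm{LowSkill}_i \times \mathrm{Treat}_i)
    + X_i'\gamma + \tau_{t(i)} + \varepsilon_i
```

where $\tau_{t(i)}$ are fixed effects for the task version assigned to individual $i$, $X_i$ collects the demographic and experience controls, and $\beta_3$ is the coefficient of interest: a negative $\beta_3$ on a standardized performance outcome would indicate that the chatbot narrows the skill gap less than fully, while $\beta_3 > 0$ would indicate that low-skill workers gain more from the tool than high-skill workers.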

While statistical power will be limited (power calculations were conducted to detect an effect on the performance gap with and without AI between skill groups), we will also conduct an exploratory analysis of differences in treatment effects on the gap by gender and age. Finally, we will also estimate the average effect of access to the LLM, pooling workers of both skill levels.

If we observe differential attrition, we will compute bounds on treatment effects using bound-tightening methods.
Randomization Method
After providing some demographic characteristics, respondents will be randomly assigned to the treatment and control groups. The randomization will occur at the individual level and will be implemented by the survey software (Qualtrics).
Randomization Unit
Randomization is conducted at the individual level.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
Treatment is not clustered
Sample size: planned number of observations
We will aim for a sample size of N=600. However, there are logistical challenges and costs associated with gathering the necessary sample; one of the main constraints is the requirement that participants complete the task on a computer. If data collection proceeds more smoothly than expected, we will recruit a larger sample.
Sample size (or number of clusters) by treatment arms
Half of the participants will be assigned to the treatment group (n=300), and the other half will be assigned to the control group (n=300).
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Based on a pilot we conducted, we expect a difference in the overall main-task score of around 0.7 SD between skilled and unskilled individuals in the control group, and treatment effects of 0.8 SD for low-skilled workers and 0.3 SD for high-skilled workers. These are approximate effect sizes, as the pilot was conducted on a limited sample. Using these pilot effect sizes as benchmarks, we conducted power calculations assuming a statistical significance level of 0.05. These calculations indicated that we would need a minimum of approximately 600 participants (300 high-skilled and 300 low-skilled) to have 80% power for detecting the effect sizes for both skill groups as well as the difference in effect sizes across groups.
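A simplified normal-approximation version of this kind of power calculation can be sketched as below. This is our own back-of-the-envelope check, not the authors' actual calculation, and the function names and the choice of a two-sided 5% test on standardized outcomes are assumptions.

```python
# Sketch (our notation) of a normal-approximation power calculation for a
# two-sample comparison of a standardized mean difference, and for the
# difference in treatment effects across two skill groups (the interaction).
from math import sqrt
from statistics import NormalDist

_ND = NormalDist()

def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a two-sided test of a standardized difference d,
    with n_per_group observations per arm and unit outcome SD."""
    z_crit = _ND.inv_cdf(1 - alpha / 2)
    z_effect = d / sqrt(2 / n_per_group)  # SE of a mean difference
    return _ND.cdf(z_effect - z_crit)

def approx_power_interaction(d_diff: float, n_per_cell: int,
                             alpha: float = 0.05) -> float:
    """Power for the difference-in-effects across two skill groups,
    with n_per_cell observations in each of the four cells."""
    z_crit = _ND.inv_cdf(1 - alpha / 2)
    z_effect = d_diff / sqrt(4 / n_per_cell)  # SE of a diff-in-diff
    return _ND.cdf(z_effect - z_crit)
```

With 300 participants per skill group split evenly across arms (150 per cell), `approx_power(0.3, 150)` and `approx_power_interaction(0.8 - 0.3, 150)` give rough analogues of the within-group and gap-difference power figures referenced above; the normal approximation slightly overstates power relative to an exact t-test.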
IRB

Institutional Review Boards (IRBs)

IRB Name
Nottingham School of Economics Research Ethics Committee
IRB Approval Date
2025-02-27
IRB Approval Number
N/A
IRB Name
CEDLAS Research Ethics Committee (CEDLAS-REC)
IRB Approval Date
2025-02-27
IRB Approval Number
N/A

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials