
Does large language model technology really equalize low- and high-skill workers?

Last registered on September 03, 2025

Pre-Trial

Trial Information

General Information

Title
Does large language model technology really equalize low- and high-skill workers?
RCT ID
AEARCTR-0016607
Initial registration date
August 28, 2025


First published
September 03, 2025, 8:45 AM EDT


Locations

There is information in this trial that is unavailable to the public.

Primary Investigator

Affiliation
Universidad Torcuato Di Tella

Other Primary Investigator(s)

PI Affiliation
Universidad de San Andrés, University of Nottingham, CEDLAS
PI Affiliation
University of Maryland
PI Affiliation
Universidad Torcuato Di Tella

Additional Trial Information

Status
In development
Start date
2025-09-01
End date
2026-06-30
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This study investigates the impact of large language model (LLM)-based chatbots on the labor market, focusing on their potential to reduce performance gaps between low- and high-skilled workers. Our experiment recruits participants with diverse skill levels, defined by educational attainment, and evaluates their performance on an incentivized task based on a realistic business scenario. The treatment group will complete this task with the assistance of an LLM-based chatbot, while the control group will complete it without any assistance. We will examine whether the effect of LLM assistance on task performance differs between high- and low-skilled individuals.
External Link(s)

Registration Citation

Citation
Cruces, Guillermo et al. 2025. "Does large language model technology really equalize low- and high-skill workers?" AEA RCT Registry. September 03. https://doi.org/10.1257/rct.16607-1.0
Sponsors & Partners

There is information in this trial that is unavailable to the public.
Experimental Details

Interventions

Intervention(s)
Intervention Start Date
2025-09-01
Intervention End Date
2026-03-31

Primary Outcomes

Primary Outcomes (end points)
Performance on the main task, time taken to complete it, and performance on the follow-up questions.
Primary Outcomes (explanation)
(1) Performance on the main task: we will measure performance on a 0-10 scale, evaluated along two dimensions: (a) content, and (b) writing. The main outcome will be an overall score, calculated as a weighted average (2/3 content, 1/3 writing). We will also analyze each dimension separately. All scores will be standardized relative to the control group.
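As a concrete illustration of this aggregation, here is a minimal sketch with illustrative variable names (ours, not code from the study):

    import numpy as np

    def overall_score(content, writing):
        # Overall score: weighted average of the two 0-10 dimensions,
        # with weight 2/3 on content and 1/3 on writing.
        return (2.0 / 3.0) * content + (1.0 / 3.0) * writing

    def standardize_to_control(scores, is_control):
        # Standardize relative to the control group: subtract the
        # control-group mean and divide by its standard deviation.
        mu = scores[is_control].mean()
        sigma = scores[is_control].std()
        return (scores - mu) / sigma

    # Illustrative usage on made-up numbers:
    scores = overall_score(np.array([6.0, 8.0, 4.0, 7.0]),
                           np.array([5.0, 9.0, 6.0, 8.0]))
    z = standardize_to_control(scores, np.array([True, True, False, False]))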

We will use an LLM for grading. To ensure reliability, we will manually grade a random subset of responses (10%) and compare these scores with those produced by the LLM. The grading scheme for both dimensions is as follows:

*Content (0-10 points):
- Diagnosis (6 points): Up to 2 points per question for each of the 3 questions posed by the business manager/owner.
- Solution (4 points): 1 point for addressing the root cause of the problem, up to 2 points for making a concrete and specific proposal, and 1 point for realism. If the solution does not address the root cause, the respondent scores 0 for the solution subsection.

*Writing (0-10 points):
- Spelling and grammar (2 points)
- Clarity and legibility (3 points)
- Organization (4 points)
- Tone and register (1 point)

Responses that are less than 200 characters long and receive 0 points in content will also receive 0 points in writing.
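Two of these rules are conditional (the solution subsection is zeroed when the root cause is missed, and very short zero-content responses get no writing credit), so a small sketch may make them concrete. This is our illustration, not the study's grading code, and all names are hypothetical:

    def solution_points(addresses_root_cause, proposal_pts, realism_pts):
        # Solution (max 4): 1 point for the root cause, up to 2 for a
        # concrete and specific proposal, 1 for realism. Missing the root
        # cause zeroes the whole subsection.
        if not addresses_root_cause:
            return 0
        return 1 + proposal_pts + realism_pts

    def final_writing_points(raw_writing_pts, response_text, content_pts):
        # Responses under 200 characters that score 0 on content also
        # receive 0 on writing.
        if len(response_text) < 200 and content_pts == 0:
            return 0
        return raw_writing_pts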

An additional outcome variable measuring performance on the main task is a binary indicator equal to 1 if the respondent correctly identifies the root cause of the problem, defined as obtaining at least 1 point on the third diagnostic question posed by the business manager.

(2) Time to complete the main task: We will record the time taken to complete the task (in minutes), with values top-coded at 20 minutes.

(3) Performance on the follow-up questions: We will construct a weighted average of (i) the score on the open-ended follow-up question, graded on a 0-2 scale, and (ii) binary indicators for whether the respondent correctly answered each of the two multiple-choice questions. The main outcome will be an overall score, calculated as a weighted average (1/2 open-ended question, 1/4 each multiple-choice question), standardized relative to the control group. We will also analyze each component separately.
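As with the main-task score, a minimal sketch of this aggregation (illustrative names; the registration does not say whether the 0-2 open-ended score is rescaled before averaging, so this sketch maps it to 0-1 so that all components share a scale):

    def followup_score(open_ended_0_2, mc1_correct, mc2_correct):
        # Weights: 1/2 open-ended question, 1/4 each multiple-choice
        # question. mc1_correct and mc2_correct are 0/1 indicators.
        return (0.5 * (open_ended_0_2 / 2.0)
                + 0.25 * mc1_correct
                + 0.25 * mc2_correct)

Standardization relative to the control group would then proceed as in the main-task sketch above.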

Secondary Outcomes

Secondary Outcomes (end points)
Perceptions of task difficulty
Secondary Outcomes (explanation)
Measured using a Likert-scale question asking respondents to rate the difficulty of the main task, and through sentiment analysis of responses to an open-ended question about task difficulty.
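The registration does not name a sentiment-analysis tool. Purely as an illustration, and assuming an off-the-shelf multilingual model from Hugging Face (our choice, not the authors'), the open-ended responses could be scored like this:

    from transformers import pipeline

    # Hypothetical model choice: a multilingual sentiment classifier that
    # returns a 1-5 star label, which can be mapped to a numeric score.
    classifier = pipeline(
        "sentiment-analysis",
        model="nlptown/bert-base-multilingual-uncased-sentiment",
    )

    label = classifier("La tarea me pareció muy difícil.")[0]["label"]
    stars = int(label.split()[0])  # e.g. "2 stars" -> 2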

Experimental Design

Experimental Design
We will conduct an online experiment in which respondents will be presented with an incentivized task based on a realistic business scenario. Participants in the treatment group will have access to an LLM-based chatbot, whereas participants in the control group will not. After completing the main task, all participants, regardless of their assigned group, will answer a series of follow-up questions related to this task, without access to the LLM-based chatbot.
Experimental Design Details
Not available
Randomization Method
After providing basic demographic information, respondents will be randomly assigned to the treatment or control group. Randomization will occur at the individual level and will be implemented by the survey software (Qualtrics).
Randomization Unit
Randomization is conducted at the individual level.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
Treatment is not clustered
Sample size: planned number of observations
We will aim for a sample size of N=600. However, there are logistical challenges and costs associated with gathering the necessary sample; one of the main constraints is that the task must be completed on a computer. If recruitment goes more smoothly than expected, we will collect a larger sample.
Sample size (or number of clusters) by treatment arms
Half of the participants will be assigned to the treatment group (n=300), and the other half will be assigned to the control group (n=300).
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Based on a pilot we conducted, we expect a difference in the overall main-task score of around 0.7 SD between high- and low-skilled individuals in the control group, and treatment effects of around 0.8 SD for low-skilled workers and 0.3 SD for high-skilled workers. These are approximate effect sizes, as the pilot was conducted on a limited sample. Using these pilot effect sizes as benchmarks, we conducted power calculations assuming a statistical significance level of 0.05. These calculations indicated that we would need a minimum of approximately 600 participants (300 high-skilled and 300 low-skilled) to have 80% power to detect the treatment effects for both skill groups as well as the difference in effect sizes across groups.
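The registration does not spell out the exact procedure behind these calculations. As one plausible reading (a sketch, not the authors' code), the per-group requirements can be checked with a standard two-sample t-test power analysis in statsmodels:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    # Pilot benchmarks from the registration: 0.8 SD (low-skilled) and
    # 0.3 SD (high-skilled); two-sided test, alpha = 0.05, power = 0.80.
    for group, d in [("low-skilled", 0.8), ("high-skilled", 0.3)]:
        n_per_arm = analysis.solve_power(
            effect_size=d, alpha=0.05, power=0.80,
            ratio=1.0, alternative="two-sided",
        )
        print(f"{group}: about {n_per_arm:.0f} participants per arm")

A test of the difference in effects across skill groups (the treatment-by-skill interaction) would require its own calculation over the full two-by-two design.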
IRB

Institutional Review Boards (IRBs)

IRB Name
Nottingham School of Economics Research Ethics Committee
IRB Approval Date
2025-02-27
IRB Approval Number
N/A
IRB Name
CEDLAS Research Ethics Committee (CEDLAS-REC)
IRB Approval Date
2025-02-27
IRB Approval Number
N/A