
Does large language model technology really equalize low- and high-skill workers?

Last registered on September 03, 2025

Pre-Trial

Trial Information

General Information

Title
Does large language model technology really equalize low- and high-skill workers?
RCT ID
AEARCTR-0016607
Initial registration date
August 28, 2025


First published
September 03, 2025, 8:45 AM EDT


Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
Universidad Torcuato Di Tella

Other Primary Investigator(s)

PI Affiliation
Universidad de San Andrés, University of Nottingham, CEDLAS
PI Affiliation
University of Maryland
PI Affiliation
Universidad Torcuato Di Tella

Additional Trial Information

Status
In development
Start date
2025-09-01
End date
2026-06-30
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This study investigates the impact of large language model (LLM)-based chatbots on the labor market, focusing on their potential to reduce performance gaps between low- and high-skilled workers. Our experiment recruits participants with diverse skill levels, defined by educational attainment, and evaluates their performance in a task. The treatment group will complete this task with the assistance of an LLM-based chatbot, while the control group will complete it without any assistance. We will examine the differences between high- and low-skilled individuals in terms of how LLM assistance affects performance in the task.
External Link(s)

Registration Citation

Citation
Cruces, Guillermo et al. 2025. "Does large language model technology really equalize low- and high-skill workers?." AEA RCT Registry. September 03. https://doi.org/10.1257/rct.16607-1.0
Sponsors & Partners

There is information in this trial unavailable to the public.
Experimental Details

Interventions

Intervention(s)
Intervention (Hidden)
In this project, we will implement a randomized online experiment in Argentina to evaluate the impact of large language model (LLM)-based assistance on task performance and its potential to reduce skill-based disparities in the labor market. We will recruit low- and high-skilled subjects through a survey company. Low-skilled individuals are defined as those who have a high school diploma and have completed less than half of a postsecondary program, while high-skilled individuals are those who have completed more than half of a postsecondary program.

Participants will be randomly assigned to one of two groups: a treatment group that has access to an LLM-based chatbot and a control group that does not. Participants will first complete an incentivized task based on a realistic business scenario. In this task, participants receive an email from a hypothetical manager/business owner describing a problem and asking for help understanding its causes and identifying a solution. To respond, participants must review multiple sources of information, diagnose the root causes of the issue (answering three questions posed by the manager/owner), and propose a course of action. Their response should be written in the form of a professional email. We designed three different versions of this task; subjects will be randomly assigned to one of the three options. After completing this task, all participants will answer a series of incentivized follow-up questions related to the main task, but without access to the LLM-based chatbot, regardless of their initial group assignment. The first follow-up question asks respondents to briefly summarize what they identified as the root cause of the problem in the main task. The following two are multiple choice and assess, respectively, their interpretation of the information provided in a figure in the main task and their recall of basic factual details from the task.

We will monitor compliance with the prohibition on using LLMs in the control group by: (1) recording whether subjects exit the survey module, and (2) periodically taking screenshots of respondents' answers. We will then analyze these responses for any sudden changes that may indicate the respondent has pasted a complete response from an LLM-based chatbot.

Intervention Start Date
2025-09-01
Intervention End Date
2026-03-31

Primary Outcomes

Primary Outcomes (end points)
Performance on and time taken to complete the main task, and performance on the follow-up questions.
Primary Outcomes (explanation)
(1) Performance on the main task: we will measure performance on a 0-10 scale, evaluated along two dimensions: (a) content, and (b) writing. The main outcome will be an overall score, calculated as a weighted average (2/3 content, 1/3 writing). We will also analyze each dimension separately. All scores will be standardized relative to the control group.

We will use an LLM for grading. To ensure reliability, we will manually grade a random subset of responses (10%) and compare these scores with those produced by the LLM. The grading scheme for both dimensions is as follows:

*Content (0-10 points):
-Diagnosis (6 points): Up to 2 points per question for each of the 3 questions posed by the business manager/owner.
-Solution (4 points): 1 point for addressing the root cause of the problem, up to 2 points for making a concrete and specific proposal, and 1 point for realism. If the solution does not address the root cause, the respondent scores 0 for the solution subsection.

*Writing (0-10 points):
-Spelling and grammar (2 points)
-Clarity and legibility (3 points)
-Organization (4 points)
-Tone and register (1 point)
Responses that are less than 200 characters long and receive 0 points in content will also receive 0 points in writing.

An additional outcome variable measuring performance in the main task is a binary indicator equal to 1 if the respondent correctly identifies the root cause of the problem, defined as obtaining at least 1 point in the third diagnostic question posed by the business manager.

(2) Time to complete the main task: We will record the time taken to complete the task (in minutes), with values top-coded at 20 minutes.

(3) Performance on the follow-up questions: We will construct a weighted average of two components: (i) the score on the open-ended follow-up question, graded on a 0-2 scale, and (ii) binary indicators for whether the respondent correctly answered each of the two multiple-choice questions. The main outcome will be an overall score, calculated as a weighted average (1/2 open-ended question, 1/4 each multiple-choice question), standardized relative to the control group. We will also analyze each component separately.
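As a concrete illustration, the aggregation rules above can be sketched as follows. This is a minimal sketch under our own naming; the function names and helper structure are assumptions, not part of the registration.

```python
# Sketch of the score aggregation described above (names are ours):
# main-task score = 2/3 content + 1/3 writing; follow-up score =
# 1/2 open-ended (rescaled 0-2 -> 0-1) + 1/4 each multiple-choice item;
# scores standardized relative to the control group.
from statistics import mean, stdev

def overall_main_score(content: float, writing: float) -> float:
    """Weighted average on the 0-10 scale (2/3 content, 1/3 writing)."""
    return (2 / 3) * content + (1 / 3) * writing

def overall_followup_score(open_ended: float, mc1: int, mc2: int) -> float:
    """Weighted average: 1/2 open-ended (0-2 scale), 1/4 each MC indicator."""
    return 0.5 * (open_ended / 2) + 0.25 * mc1 + 0.25 * mc2

def standardize(scores, control_scores):
    """Standardize scores relative to the control-group mean and SD."""
    mu, sigma = mean(control_scores), stdev(control_scores)
    return [(s - mu) / sigma for s in scores]
```

Standardizing against the control group (rather than the full sample) keeps the treatment group's scores interpretable as effects in control-group standard deviations.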

Secondary Outcomes

Secondary Outcomes (end points)
Perceptions of task difficulty
Secondary Outcomes (explanation)
Measured using a Likert-scale question asking respondents to rate the difficulty of the main task, as well as sentiment analysis of responses to an open-ended question about task difficulty.

Experimental Design

Experimental Design
We will conduct an online experiment in which respondents will be presented with an incentivized task based on a realistic business scenario. Participants in the treatment group will have access to an LLM-based chatbot, whereas participants in the control group will not. After completing the main task, all participants, regardless of their assigned group, will answer a series of follow-up questions related to this task, without access to the LLM-based chatbot.
Experimental Design Details
The primary goal of this experiment is to evaluate how access to large language model (LLM)-based assistance impacts task performance and whether it reduces performance gaps between participants with high and low educational attainment. Data will be collected through an online survey conducted on Qualtrics. Participants will be recruited through a survey company in Argentina that maintains panels of individuals who have consented to participate in research. Participants must be between 25 and 45 years old and must complete the survey on a computer within one hour of starting. They will be paid the standard participation rate.

We will seek to obtain a sample balanced by gender, age group (below and above age 35), and educational attainment (high school diploma plus less than half of a postsecondary program vs. more than half of a postsecondary program).

Participants will be randomly assigned to one of two groups:
-Control group: Does not have access to an LLM-based chatbot.
-Treatment group: Has access to an LLM-based chatbot for the main task.

Participants will first complete some sociodemographic questions. After this, respondents from the treatment group will be given a brief tutorial on how to use the LLM-based chatbot. Participants will then be presented with a task based on a realistic business scenario (see description in Intervention section above). There are three different versions of this task; subjects will be randomly assigned to one of the three options. After completing the main task, all participants, regardless of their assigned group, will answer a series of follow-up questions related to this task, without access to the LLM-based chatbot.

Participants will receive an incentive of up to 25,000 Argentine pesos for their performance on the main task and follow-up questions. Respondents will be rewarded for their overall performance relative to individuals of the same skill level (low vs. high) and treatment group (control vs. treatment).

Our main research question is whether access to the LLM-based chatbot narrows the performance gap between high- and low-skill workers. We will therefore compare the treatment effect of providing the AI tool across the two skill groups by regressing our outcome variables on a dummy for whether the individual is in the low-skill group, a treatment dummy, and the interaction between the two (our coefficient of interest). To improve precision, we will control for task fixed effects, age, gender, labor market status, educational attainment, and education/work experience related to the task.
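The specification described above can be written out as follows (the notation is ours, a sketch of the registered design rather than the authors' exact equation):

```latex
Y_i = \beta_0
    + \beta_1 \, \mathrm{LowSkill}_i
    + \beta_2 \, \mathrm{Treat}_i
    + \beta_3 \, (\mathrm{LowSkill}_i \times \mathrm{Treat}_i)
    + X_i'\gamma + \tau_{t(i)} + \varepsilon_i
```

where $\tau_{t(i)}$ are fixed effects for the task version assigned to individual $i$, $X_i$ collects the demographic and experience controls, and $\beta_3$ is the coefficient of interest: a negative $\beta_3$ on a standardized performance outcome would indicate that the chatbot narrows the skill gap less than fully, while $\beta_3 > 0$ would indicate that low-skill workers gain more from the tool than high-skill workers.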

While statistical power will be limited (power calculations were conducted to detect an effect on the performance gap with and without AI between skill groups), we will also conduct an exploratory analysis of differences in treatment effects on the gap by gender and age. Finally, we will also estimate the average effect of access to the LLM, pooling workers of both skill levels.

If we observe differential attrition, we will compute bounds on treatment effects using bound-tightening methods.
Randomization Method
After providing some demographic characteristics, respondents will be randomly assigned to the treatment and control groups. The randomization will occur at the individual level and will be implemented by the survey software (Qualtrics).
Randomization Unit
Randomization is conducted at the individual level.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
Treatment is not clustered
Sample size: planned number of observations
We will aim for a sample size of N=600. However, there are logistical challenges and costs associated with gathering the necessary sample; one of the main constraints is the requirement that participants complete the task on a computer. If data collection proceeds more smoothly than expected, we will recruit a larger sample.
Sample size (or number of clusters) by treatment arms
Half of the participants will be assigned to the treatment group (n=300), and the other half will be assigned to the control group (n=300).
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Based on a pilot we conducted, we expect a difference in the overall main-task score of around 0.7 SD between skilled and unskilled individuals in the control group, and treatment effects of 0.8 SD for low-skilled workers and 0.3 SD for high-skilled workers. These are approximate effect sizes, as the pilot was conducted on a limited sample. Using these pilot effect sizes as benchmarks, we conducted power calculations assuming a statistical significance level of 0.05. These calculations indicated that we would need a minimum of approximately 600 participants (300 high-skilled and 300 low-skilled) to have 80% power for detecting the effect sizes for both skill groups as well as the difference in effect sizes across groups.
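A simplified normal-approximation version of this kind of power calculation can be sketched as below. This is our own back-of-the-envelope check, not the authors' actual calculation, and the function names and the choice of a two-sided 5% test on standardized outcomes are assumptions.

```python
# Sketch (our notation) of a normal-approximation power calculation for a
# two-sample comparison of a standardized mean difference, and for the
# difference in treatment effects across two skill groups (the interaction).
from math import sqrt
from statistics import NormalDist

_ND = NormalDist()

def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a two-sided test of a standardized difference d,
    with n_per_group observations per arm and unit outcome SD."""
    z_crit = _ND.inv_cdf(1 - alpha / 2)
    z_effect = d / sqrt(2 / n_per_group)  # SE of a mean difference
    return _ND.cdf(z_effect - z_crit)

def approx_power_interaction(d_diff: float, n_per_cell: int,
                             alpha: float = 0.05) -> float:
    """Power for the difference-in-effects across two skill groups,
    with n_per_cell observations in each of the four cells."""
    z_crit = _ND.inv_cdf(1 - alpha / 2)
    z_effect = d_diff / sqrt(4 / n_per_cell)  # SE of a diff-in-diff
    return _ND.cdf(z_effect - z_crit)
```

With 300 participants per skill group split evenly across arms (150 per cell), `approx_power(0.3, 150)` and `approx_power_interaction(0.8 - 0.3, 150)` give rough analogues of the within-group and gap-difference power figures referenced above; the normal approximation slightly overstates power relative to an exact t-test.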
IRB

Institutional Review Boards (IRBs)

IRB Name
Nottingham School of Economics Research Ethics Committee
IRB Approval Date
2025-02-27
IRB Approval Number
N/A
IRB Name
CEDLAS Research Ethics Committee (CEDLAS-REC)
IRB Approval Date
2025-02-27
IRB Approval Number
N/A

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials