Experimental Design Details
We will recruit participants through Prolific, aiming for N = 600; we may increase N if additional grant funding becomes available.
Participants will complete a screening survey that elicits their occupation, and people in eligible occupations will be asked whether they consent to be invited to a one-hour follow-up survey involving writing tasks and signing up for an online account. The screening survey will apply pre-screeners restricting the sample to people with college degrees who are employed full-time in a set of “Industry Roles” that encompass the occupations we are interested in. If we struggle to reach N = 600, we will relax the college-degree requirement as well as the Industry pre-screeners (to capture people in our occupations of interest who did not answer the Industry pre-screener questions when signing up for Prolific).
The occupations we are interested in are marketers, PR professionals, copywriters, freelance writers, grant/proposal writers, technical writers, policy analysts, management consultants, business strategists, HR professionals, managers (office jobs), and data analysts. Managers appear to be the most abundant occupation, so we will recruit all other occupations at the maximum feasible rate while intentionally moderating the pace of manager recruitment, aiming for about 30% managers in the final sample. However, if data collection for the other occupations slows considerably, we will resume recruiting managers to fill out our sample size.
Participants who consent will be invited to the follow-up survey via custom invite lists on Prolific, where they will again be asked to consent to participate.
Our survey requires treatment-group respondents to sign up for ChatGPT, and ChatGPT signups are frequently unavailable. We will aim to keep main survey collection active only during periods when ChatGPT is reliably up (roughly 5-11pm EST). If we nonetheless collect responses while ChatGPT is down, we will drop all responses (treatment and control) collected during the down period.
The main survey proceeds as follows:
Respondents are cross-randomized into one of three incentive conditions that are applied to their writing tasks:
Linear incentives (40% of sample): each task is graded on a 1-7 scale, and the respondent earns $1 per point on each task. The task is expected to take 20-30 minutes.
Convex incentives (40% of sample): same as above, except the respondent receives a $3 bonus payment for getting a grade of 6 or 7. The task is expected to take 20-30 minutes.
Exact time (20% of sample): same as linear incentives, except the respondent is told they will be required to spend exactly 15 minutes on the task, their activity will be tracked, and if they are not on-task for the full 15 minutes their bonus payment will be withheld.
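As an illustrative sketch, the per-task bonus under the three conditions can be written as follows (the function name, condition labels, and on-task flag are our own conventions, not finalized implementation details):

```python
def bonus_payment(grade, condition, on_task_ok=True):
    """Bonus in dollars for one task, given its 1-7 grade and the
    incentive condition ('linear', 'convex', or 'exact_time')."""
    if condition == "exact_time" and not on_task_ok:
        return 0.0                # bonus withheld if not on-task
    pay = float(grade)            # $1 per grade point in all conditions
    if condition == "convex" and grade >= 6:
        pay += 3.0                # extra bonus for a grade of 6 or 7
    return pay
```

For example, a grade of 7 pays $7 under linear incentives and $10 under convex incentives.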
We operationalize the check for on-task activity by recording a snapshot of the respondent’s essay each minute, along with the number of words the respondent has added or modified during that minute.
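A minimal sketch of how the per-minute word-change counts could be turned into a withhold-the-bonus decision (the idle-minute threshold is our assumption; the exact rule is not yet finalized):

```python
def flag_off_task(word_changes, required_minutes=15, max_idle_minutes=2):
    """Decide whether to withhold the bonus in the exact-time condition.

    word_changes: list of ints, words added or modified in each one-minute
    snapshot interval. Flags the respondent if fewer than `required_minutes`
    of activity were recorded, or if too many minutes show zero changes.
    """
    if len(word_changes) < required_minutes:
        return True  # left the task before the required 15 minutes
    idle = sum(1 for n in word_changes[:required_minutes] if n == 0)
    return idle > max_idle_minutes
```

A respondent with steady activity across all 15 minutes would not be flagged; one with long idle stretches or an early exit would be.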
Respondents are told the survey will involve two occupation-specific writing tasks under the incentive scheme they are randomized into, as well as signing up for an online account, and are asked whether they consent to proceed.
Respondents who consent are shown a more comprehensive set of instructions.
Respondents answer two comprehension questions about the instructions, and respondents who fail twice to get both questions correct are screened out.
We elicit demographics:
Salary, hours worked, job tenure
Which of a list of websites/software tools they are aware of and have used before (including ChatGPT)
Respondents are shown task instructions again, then shown their first task prompt and asked to complete it. Each occupation is associated with two task prompts, and the order in which the respondents receive the prompts is randomized.
Respondents answer a few questions about the first task:
How they split their time between brainstorming, writing a rough draft, and editing
How realistic they find the task, whether they have done a similar task before, and how frequently
Job satisfaction and self-efficacy
Respondents are randomized into treatment and control.
The treatment group is required to sign up for ChatGPT and upload three screenshots of them entering prompts we’ve asked them to enter into ChatGPT.
The control group is required to sign up for Overleaf, and upload three screenshots of them compiling documents with sentences we’ve asked for.
Respondents complete their second writing task. Treatment group respondents are told they are allowed to use ChatGPT if they find it helpful.
Respondents are asked follow-up questions:
How they split their time
Job satisfaction and self-efficacy, as well as concerns about automation
All respondents are asked whether they used ChatGPT on the second task
If they answer yes, they are given a set of questions about how they used it
All respondents are asked what software they used on the first task
End of survey
We are considering two grading methods:
Hiring people in the relevant occupations on Upwork to grade the tasks. We can budget for roughly 2 Upwork grades per task if we choose this option.
Using people on Prolific in the relevant occupations to grade the tasks. We can budget for roughly 4-5 Prolific grades per task if we choose this option.
We may add additional screening for selecting graders to ensure that they have expertise in the task area, such as asking about job tenure or whether they have done certain types of tasks before.
Once we have collected some essays, we will pilot both grading methods. We will check two things:
How long graders on each platform spend per essay (we are aiming for ~5 minutes per essay ideally).
How correlated the graders are with each other.
Based on pilot evidence, we will choose a grading method to use for the full set of tasks.
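The two pilot checks above can be sketched as follows (the data layout, grader → {essay: grade}, and the minimum-overlap rule are our assumptions for illustration):

```python
from itertools import combinations
from statistics import mean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length grade lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (stdev(xs) * stdev(ys) * (len(xs) - 1))

def pilot_summary(times_seconds, grades):
    """times_seconds: grading time for each grader-essay pair.
    grades: dict mapping grader -> {essay_id: grade}.
    Returns (mean minutes per essay, mean pairwise grader correlation),
    computing each correlation over essays both graders scored."""
    corrs = []
    for g1, g2 in combinations(grades, 2):
        common = sorted(set(grades[g1]) & set(grades[g2]))
        if len(common) >= 3:  # need some overlap for a meaningful r
            corrs.append(pearson([grades[g1][e] for e in common],
                                 [grades[g2][e] for e in common]))
    avg_corr = mean(corrs) if corrs else float("nan")
    return mean(times_seconds) / 60, avg_corr
```

We would then compare mean minutes per essay against the ~5-minute target and compare mean inter-grader correlation across the two platforms.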
If we use the Prolific method (4-5 graders per essay), we may apply Empirical Bayes corrections to individual graders’ grades, shrinking them toward the mean grade for that essay.
Depending on how successful we are at getting graders to assign a roughly uniform distribution of grades, we may also decide to standardize grades at the grader level.
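A minimal sketch of both adjustments, assuming grades are stored as grader → {essay: grade} mappings; here the shrinkage weight is a fixed placeholder, whereas the full Empirical Bayes version would estimate it from the grader-noise-to-essay-signal variance ratio:

```python
from statistics import mean, pstdev

def shrink_toward_essay_mean(grades, weight=0.7):
    """Shrink each grade toward the mean grade for that essay.
    `weight` is the grader's own weight (placeholder; an EB estimate
    would replace it)."""
    by_essay = {}
    for d in grades.values():
        for e, v in d.items():
            by_essay.setdefault(e, []).append(v)
    essay_mean = {e: mean(vs) for e, vs in by_essay.items()}
    return {g: {e: weight * v + (1 - weight) * essay_mean[e]
                for e, v in d.items()}
            for g, d in grades.items()}

def standardize_by_grader(grades):
    """Z-score each grader's grades, removing grader-level location
    and scale differences (e.g., uniformly harsh or lenient graders)."""
    out = {}
    for g, d in grades.items():
        m, s = mean(d.values()), pstdev(d.values())
        out[g] = {e: (v - m) / s if s > 0 else 0.0 for e, v in d.items()}
    return out
```

For example, if one grader gives essay 1 a 7 and another gives it a 3, shrinkage with weight 0.7 pulls those grades to 6.4 and 3.6 respectively.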
We are planning a 14- or 30-day follow-up survey with experimental participants, asking whether they are using ChatGPT in their real job (and potentially other questions, depending on main-experiment results; TBD).