The impact of generative AI on productivity in knowledge work

Last registered on June 22, 2026

Pre-Trial

Trial Information

General Information

Title

The impact of generative AI on productivity in knowledge work

RCT ID

AEARCTR-0018931

Initial registration date

June 15, 2026

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published

June 22, 2026, 6:45 AM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

Country

Belgium

Region

Primary Investigator

Name

Louis Lippens

Affiliation

Ghent University

Other Primary Investigator(s)

PI Name

Bram Van der Linden

PI Affiliation

Ghent University

Additional Trial Information

Status

In development

Start date

2026-06-15

End date

2026-09-21

Keywords

Behavior, Firms & Productivity, Lab, Labor

Additional Keywords

Labor productivity, Generative AI, Work-related tasks

JEL code(s)

J24, O33, C93, D83

Secondary IDs

Prior work

This trial does not extend or rely on any prior RCTs.

Abstract

This study examines whether access to generative AI improves labor productivity, defined here as performance on short work-related tasks. Participants are randomly assigned at the individual level to either a treatment group with access to an integrated in-survey ChatGPT interface or a control group without access to generative AI or other external tools. All participants complete the same four productivity tasks designed to capture different dimensions of knowledge work: writing, information synthesis, creativity and data interpretation. For each task, completion time is recorded, and output quality is scored using pre-specified task-specific criteria; completion time is reverse-coded so that higher values indicate better performance. These task-level measures are standardized and averaged to produce a task-level productivity score. The primary outcome is the participant-level average of the four task-level productivity scores, i.e., the global productivity index.

The main analysis estimates the effect of treatment assignment on the global productivity index using OLS regression of the outcome on a treatment indicator, with the two-sided hypothesis that AI access affects productivity. Secondary analysis examines whether prior AI experience moderates this effect via an interaction between said experience and the treatment assignment. An additional secondary analysis examines performance on a separate memory task administered at the end of the experiment, after treatment group participants no longer have access to the integrated AI interface, to assess whether earlier AI use affects later recall. Because this task captures memory rather than contemporaneous task productivity, it is analyzed separately from the primary four-task global productivity index.

External Link(s)

Registration Citation

Citation

Lippens, Louis and Bram Van der Linden. 2026. "The impact of generative AI on productivity in knowledge work." AEA RCT Registry. June 22. https://doi.org/10.1257/rct.18931-1.0

Sponsors & Partners

There is information in this trial unavailable to the public.

Experimental Details

Interventions

Intervention(s)

Intervention Start Date

2026-06-15

Intervention End Date

2026-07-20

Primary Outcomes

Primary Outcomes (end points)

The primary outcome is a participant-level global productivity index. For each of the four core tasks, a task-specific productivity score is computed as the average of a task-specific standardized quality score and a reverse-coded standardized completion time, with higher values indicating better performance. The global productivity index is then calculated as the average of these four task-specific productivity scores. Task completion time and output quality are used separately as outcomes in secondary analysis.

Primary Outcomes (explanation)

For each participant i and task j, a quality score Q_ij is assigned based on predefined task-specific scoring criteria and completion time T_ij is automatically recorded in Qualtrics. Quality scores are based on task-specific scoring rubrics defined before analysis; where human scoring is used, the same rubric will be applied uniformly across treatment arms. Both measures are standardized within each task. Because lower completion time indicates greater efficiency, the time score is reverse-coded before standardization. The task-specific productivity score is then defined as the average of the standardized quality score and the reversed standardized time score. The primary endpoint is the participant-level average of these four task-specific productivity scores. Any alternative weighting of quality and time, if reported, will be treated as exploratory sensitivity analysis rather than as part of the primary outcome definition.
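The construction of the primary endpoint can be illustrated with a minimal Python sketch. It assumes that standardization pools both arms within each task and uses the population standard deviation; the registration does not specify either choice, and the function names and example values are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and SD 1 (population SD; an assumption)."""
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]


def global_productivity_index(quality, time):
    """Return one index value per participant.

    quality[j][i] holds Q_ij and time[j][i] holds T_ij for task j and
    participant i. Standardization is done within each task.
    """
    n_participants = len(quality[0])
    task_scores = []
    for q_j, t_j in zip(quality, time):
        z_q = zscores(q_j)
        # Reverse-code time so that faster completion yields a higher score.
        z_t = zscores([-t for t in t_j])
        # Task-level productivity: equal-weight average of quality and time.
        task_scores.append([(a + b) / 2 for a, b in zip(z_q, z_t)])
    # Participant-level average across the task-level scores.
    return [mean(task[i] for task in task_scores) for i in range(n_participants)]


# Hypothetical data: 4 tasks x 3 participants (quality points, seconds).
quality = [[8, 6, 4], [7, 7, 5], [9, 5, 6], [6, 8, 4]]
time = [[120, 200, 260], [150, 140, 300], [90, 210, 180], [200, 160, 240]]
index = global_productivity_index(quality, time)
```

Because every component is a z-score, the index averages to zero across participants by construction; only differences between participants or arms are interpretable.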

Secondary Outcomes

Secondary Outcomes (end points)

Secondary outcomes are: (i) the task-level productivity scores for each of the four core tasks; (ii) standardized completion time and standardized quality analyzed separately, both overall and per task; (iii) the moderation of the treatment effect by self-reported prior AI experience, tested via a treatment × experience interaction (experience mean-centered); and (iv) performance on the end-of-experiment recall task, comparing how much of their own earlier creativity-task output participants can reproduce once the AI interface is removed. Exploratory analyses include prompt-level behavior in the treatment arm (e.g., prompt length and number of prompts), augmentation-versus-substitution patterns, and the diversity of creative outputs across participants within each arm.
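The moderation test in (iii) can be sketched as an OLS regression with a mean-centered interaction term. The sketch below solves the normal equations in plain Python so that it stays self-contained; the function names are hypothetical, the coding of the experience measure is not specified in the registration, and inference (standard errors) is omitted.

```python
from statistics import mean


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def moderation_fit(y, d, experience):
    """OLS of y on an intercept, treatment d (0/1), mean-centered
    experience, and their interaction.

    Returns [intercept, treatment, experience, interaction].
    """
    e_bar = mean(experience)
    rows = [[1.0, di, ei - e_bar, di * (ei - e_bar)]
            for di, ei in zip(d, experience)]
    k = 4
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return solve(xtx, xty)
```

With experience mean-centered, the treatment coefficient is the estimated effect at average experience, and the interaction coefficient is the change in that effect per unit of experience.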

Secondary Outcomes (explanation)

Experimental Design

This study uses a between-subjects randomized experiment with two conditions. Participants are randomly assigned in Qualtrics to either a treatment group or a control group. The study is administered online and includes adults aged 18 or older who are transitioning to the labor market or in early career stages, including higher-education students nearing graduation, job seekers, and individuals who have recently started working. Participants in the treatment group are allowed to use generative AI and are given in-survey access to OpenAI’s o4-mini model during the experimental tasks via an integrated ChatGPT interface. Participants in the control group are not allowed to use generative AI or other external tools.

All participants complete the same four core productivity tasks under otherwise similar conditions, measuring four dimensions of knowledge work: writing, information synthesis, creativity, and data interpretation. These four tasks constitute the primary global productivity index. In addition, all participants complete a separate fifth task: an unannounced recall (memory) task administered at the very end of the experiment, after treatment-group participants no longer have access to the integrated AI interface. Because this task captures memory rather than contemporaneous task productivity, it is analyzed separately and is not part of the primary four-task index.

To minimize order and fatigue effects, the creativity task is always presented first, the recall task is always shown last, and the remaining three core tasks are presented in a randomized order. The creativity task is fixed in the first position because it is linked to the later recall measure; keeping it fixed maintains a comparable retention interval between initial idea generation and the recall task across participants, whereas fully randomizing it would introduce avoidable variation in that delay and could bias treatment-control comparisons on the memory outcome. This design allows the study to examine whether earlier AI use affects later recall of self-generated ideas, consistent with theories of cognitive offloading and with evidence that expectations of external information availability can reduce later recall.
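The ordering rule can be expressed as a short Python sketch. In the study the ordering is handled by Qualtrics; the task labels and function name below are illustrative only.

```python
import random

# The three core tasks whose order is randomized (labels are illustrative).
RANDOMIZED_TASKS = ["writing", "information synthesis", "data interpretation"]


def task_order(rng):
    """Return one participant's task sequence.

    Creativity is fixed first and recall is fixed last; the remaining
    three core tasks are shuffled in between.
    """
    middle = RANDOMIZED_TASKS[:]
    rng.shuffle(middle)
    return ["creativity"] + middle + ["recall"]


order = task_order(random.Random())
```

Fixing both endpoints means every participant has exactly three tasks between idea generation and recall, although the elapsed time still varies with individual task speed.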

The primary analysis is an intention-to-treat (ITT) comparison: it includes all participants who complete the study and provide outcome data, analyzed according to their randomized assignment, with no exclusions based on post-randomization behavior. Participants are considered non-analyzable only for reasons unrelated to treatment, such as failure to complete the survey, technical issues, or duplicate participation. Indicators of likely prohibited tool use in the control group are defined ex ante from Qualtrics embedded paradata (window-focus / tab-switch events and copy/paste behavior on task pages); because these indicators are measured after randomization, any analysis that excludes participants on this basis is reported as a secondary sensitivity analysis rather than as the primary specification. All such exclusions are reported transparently.
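As a minimal illustration of the ITT estimate, the Python sketch below regresses the outcome on an intercept and a 0/1 assignment indicator; with a single binary regressor, the OLS slope equals the difference in arm means. The function name and data are hypothetical, and the choice of standard errors is not specified in the registration, so inference is omitted.

```python
from statistics import mean


def ols_treatment_effect(y, d):
    """OLS of outcome y on an intercept and a 0/1 assignment indicator d.

    Returns (intercept, slope). The intercept is the control-arm mean and
    the slope is the treatment-minus-control difference in means.
    """
    d_bar, y_bar = mean(d), mean(y)
    sxy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    sxx = sum((di - d_bar) ** 2 for di in d)
    slope = sxy / sxx
    intercept = y_bar - slope * d_bar
    return intercept, slope


# Hypothetical index values, analyzed by randomized assignment (ITT).
y = [0.1, -0.2, 0.4, 0.5, 0.3]
d = [0, 0, 1, 1, 1]
intercept, effect = ols_treatment_effect(y, d)
```

Participants are grouped by the arm they were assigned to, not by whether they actually used the AI interface, which is what makes the estimate intention-to-treat.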

The global productivity index is the single confirmatory outcome. The task-level and component (time, quality) analyses, the moderation by prior AI experience, and the recall analysis are secondary; tests within each of these families are either corrected for multiple hypotheses or reported as explicitly exploratory.
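The registration does not name the multiple-testing correction. As one possible illustration, the Python sketch below applies the Holm step-down procedure to the p-values of a single family; the choice of Holm and the function name are assumptions, not the registered method.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so
    on; a running maximum keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values for the four task-level tests.
adjusted = holm_adjust([0.01, 0.04, 0.03, 0.20])
```

A hypothesis in the family is rejected at level α when its adjusted p-value is at most α.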

Experimental Design Details

Not available

Randomization Method

Randomization is automatically carried out in the Qualtrics Survey Flow using simple individual-level random assignment with a 0.55 probability of assignment to the control group and a 0.45 probability of assignment to the treatment group. This modest over-allocation to the control group is intended to offset anticipated analyzable-sample losses arising from likely prohibited AI use among some control participants, and is calibrated to the power calculation so that the analyzable sample after such exclusions remains adequately powered (see Power Calculation). The treatment allocation process is hidden from participants. Participants subsequently receive the instructions corresponding to their assigned condition. Indicators of likely prohibited tool use in the control group will be defined ex ante using Qualtrics embedded paradata, including changes in browser focus and copying behavior on task pages; any exclusions based on these indicators will be reported transparently.
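Simple (Bernoulli) assignment with these probabilities can be sketched in Python as follows. This is an illustration of the allocation rule, not the Qualtrics randomizer itself; the function name and seed argument are hypothetical.

```python
import random
from math import sqrt


def assign(n, p_control=0.55, seed=None):
    """Independently assign each of n participants to an arm."""
    rng = random.Random(seed)
    return ["control" if rng.random() < p_control else "treatment"
            for _ in range(n)]


n = 307
arms = assign(n)
n_control = arms.count("control")

# Under simple randomization the control-arm size is binomial.
expected_control = n * 0.55          # about 168.85
sd_control = sqrt(n * 0.55 * 0.45)   # about 8.7
```

Because each participant is assigned independently, realized arm sizes are random: the control arm is expected to hold about 169 of 307 participants, with a standard deviation of roughly 9.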

Randomization Unit

Individual participant

Was the treatment clustered?

Experiment Characteristics

Sample size: planned number of clusters

307 participants

Sample size: planned number of observations

Treatment group: 138 participants; Control group: 169 participants

Sample size (or number of clusters) by treatment arms

The study is powered for the primary outcome, the global productivity index, using a two-sided two-sample test at α=0.05 and target power 1-β=0.80, consistent with the non-directional primary hypothesis that access to generative AI affects productivity. Randomization uses unequal allocation, with 0.45 assigned to treatment and 0.55 assigned to control. The minimum detectable effect size is computed as d_MDE = (z_(1-α/2) + z_(1-β)) × √(1/n_T + 1/n_C), with z_(1-α/2)=1.96 and z_(1-β)=0.84.

For the full assigned sample of N=307 participants (approximately n_T=138 in the treatment group and n_C=169 in the control group), this yields a minimum detectable standardized effect of approximately d=0.32. The over-allocation to the control group is intended to offset the integrity-based exclusions anticipated in the sensitivity analyses. If those exclusions reduce the analyzable control arm toward parity with the treatment arm (roughly n_T=n_C=138), corresponding to approximately N=276 in total, the minimum detectable effect rises modestly to approximately d=0.34. The study is therefore powered to detect standardized effects in the region of d≈0.32–0.34 at 80% power, depending on the number of exclusions.
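The two figures can be reproduced from the stated formula with a short Python sketch. It uses exact normal quantiles in place of the rounded 1.96 and 0.84, which does not change the results at two decimals; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde(n_t, n_c, alpha=0.05, power=0.80):
    """Minimum detectable standardized effect for a two-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = NormalDist().inv_cdf(power)           # about 0.84
    return (z_alpha + z_beta) * sqrt(1 / n_t + 1 / n_c)


print(round(mde(138, 169), 2))  # full assigned sample: 0.32
print(round(mde(138, 138), 2))  # control arm reduced to parity: 0.34
```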

A minimum detectable effect in this range remains conservative relative to existing experimental evidence on generative AI in comparable knowledge-work tasks. Noy and Zhang (2023) report that ChatGPT access reduced completion time by about 0.8 standard deviations and increased output quality by about 0.4 standard deviations in mid-level professional writing tasks. These benchmarks are not on the same scale as the global productivity index used here, which averages standardized quality and reverse-coded time; the realized composite effect depends on the time–quality correlation and would be attenuated under any speed–quality tradeoff, so the d-range above is treated as a conservative floor rather than a prediction. Dell'Acqua et al. (2023) find that consultants using AI completed 12.2% more tasks, completed tasks 25.1% more quickly, and produced output rated more than 40% higher in quality on tasks within the AI frontier. Brynjolfsson, Li, and Raymond (2025) report a 15% increase in the number of issues resolved per hour by customer support agents. Where these effects can be expressed in standardized terms, as in Noy and Zhang (2023), they exceed d=0.34. Yet, those figures are component-level (separate time and quality effects) rather than composite, so they bound the present endpoint from above only loosely. Powering for an effect in the range of d≈0.32–0.34 provides a margin against smaller effects arising from the shorter and more heterogeneous task battery used here, partial non-use of the AI tool within the treatment arm, and additional noise in completion-time measurement in an unsupervised online setting. Recruitment continues until the planned number of valid completed observations is obtained.

Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

Supporting Documents and Materials

IRB

Institutional Review Boards (IRBs)

IRB Name

Ghent University Faculty of Economics and Business Administration Ethics Committee

IRB Approval Date

2026-06-03

IRB Approval Number

UG-EB 2026-AZ

Analysis Plan