The supply and demand of AI sycophancy

Last registered on May 06, 2026

Pre-Trial

Trial Information

General Information

Title
The supply and demand of AI sycophancy
RCT ID
AEARCTR-0018435
Initial registration date
May 01, 2026


First published
May 06, 2026, 10:58 AM EDT


Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
Bocconi University

Other Primary Investigator(s)

PI Affiliation
Stanford University
PI Affiliation
Sciences Po

Additional Trial Information

Status
In development
Start date
2026-05-01
End date
2026-05-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
We experimentally measure user preferences for sycophancy in large language models. A pre-registered sample of U.S. adults recruited via Prolific answers 40 multiple-choice questions and, on 30 of them, sees an AI-generated response, reports an incentivized belief that the AI is correct, and rates the response on a five-star scale incentivized via Incentivized Resume Rating. The design is a 2×2×2 between-subjects experiment crossing question domain (objective English usage and math vs. political and social misperceptions), the timing of when correct answers are revealed (before vs. after the rating), and the AI's agreement rate when the respondent is wrong (high vs. low). Within respondent, the AI's correctness and agreement with the respondent's answer are randomized at the question level, exogenously populating all four correctness-by-agreement cells. From post-revelation ratings we recover respondent-level preferences over AI answers; the pre-revelation ratings correspond to how raters annotate AI answers in practice. We combine these preferences with structural estimates of LLM sycophancy across leading models to quantify the welfare consequences of sycophancy and the scope for personalization to amplify or attenuate it.
External Link(s)

Registration Citation

Citation
Caprini, Giulia, Samuel Goldberg and Rafael Jimenez Duran. 2026. "The supply and demand of AI sycophancy." AEA RCT Registry. May 06. https://doi.org/10.1257/rct.18435-1.0
Experimental Details

Interventions

Intervention(s)
Each respondent answers 30 multiple-choice questions and, after each answer, sees an example of an AI-generated response (hereafter: AI's answer) that either agrees or disagrees with the respondent's answer and is either correct or incorrect.

Respondents are asked to rate the AI's answers.

We introduce incentives to induce respondents to 1) answer the multiple-choice questions as accurately as possible, 2) reveal their posterior beliefs about the AI's answer, and 3) rate the AI's answer based on the type of answers that they would like an AI assistant to give them (following an Incentivized Resume Rating procedure).
Intervention Start Date
2026-05-01
Intervention End Date
2026-05-05

Primary Outcomes

Primary Outcomes (end points)
Our primary outcome is the respondent's rating of the AI's answer, on a scale from 0 to 5 stars in half-star increments.

We combine these ratings with structural estimates of LLM sycophancy (estimated in a separate experiment) across leading models to quantify the welfare effects of sycophancy and how personalization (i.e., adding respondents' demographics and their history of interacting with the AI) affects the supply of sycophancy.
Primary Outcomes (explanation)
We will report estimates with average ratings across individuals and questions in each of the correctness × agreement cells (pooling across databases and also separately for each database).
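The unadjusted cell means described above can be sketched with a minimal stdlib computation. The records below are hypothetical placeholder data, not survey responses:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (respondent, AI correctness, AI agreement, star rating).
# Illustrative values only; actual ratings come from the survey.
records = [
    ("r1", "wrong", "agree", 4.5),
    ("r1", "wrong", "disagree", 1.0),
    ("r2", "correct", "agree", 5.0),
    ("r2", "wrong", "agree", 3.5),
    ("r2", "correct", "disagree", 2.0),
]

# Group ratings into the four correctness-by-agreement cells.
cells = defaultdict(list)
for _, correctness, agreement, rating in records:
    cells[(correctness, agreement)].append(rating)

# Average rating per cell, pooling across individuals and questions.
cell_means = {cell: mean(vals) for cell, vals in cells.items()}
```

The fixed-effects specifications and clustered standard errors described below would replace these raw means in the actual analysis.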

However, due to the rich variation in our data, to increase power we will also report different specifications adding fixed effects (respondent FE, question FE, order quintile FE, strata FE, etc.). Our baseline specification will include question FE + respondent FE (with the Wrong, Disagree cell as base category). Standard errors will be clustered at the individual level in all specifications.

We will also report the distribution of ratings in each of the cells across individuals.

In all these analyses we will separate pre- vs. post-truth ratings.

Secondary Outcomes

Secondary Outcomes (end points)
1) Willingness to Pay to access an AI that best fits the ratings that users give
2) Calibration error, measured as the absolute value of the difference between user's expected performance in 10 final questions and their actual performance. We will also report separately each of the components: beliefs and performance.
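The calibration-error measure in (2) is an absolute difference; a minimal sketch follows, where the function name is illustrative:

```python
def calibration_error(expected_correct: float, actual_correct: float) -> float:
    """Absolute gap between a respondent's expected and actual number of
    correct answers on the 10 final questions. The two components (beliefs
    and performance) would also be reported separately."""
    return abs(expected_correct - actual_correct)
```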
Secondary Outcomes (explanation)
To increase power, we will control for respondents' performance in the previous 30 questions (we do not expect this to be affected by the AI's answers). We will also control for their prior beliefs about their performance. Note: for these secondary outcomes, we will pool across pre- vs. post-truth ratings to increase power, and report estimates separately by dataset.

Experimental Design

Experimental Design
Three between-subjects manipulations vary: (i) the question domain — objective English usage and math drawn from publicly available SAT-style items, vs. political and social misperception questions about U.S. demographic groups; (ii) whether the correct answer is revealed to the respondent before or after they rate the AI response; and (iii) the AI's probability of agreeing with the respondent when the respondent's answer is wrong (high regime: 0.45; low regime: 0.25). When the respondent's answer is correct, the AI agrees with probability 0.75 in both regimes. This way, we vary agreement while holding the AI's accuracy approximately constant.

Within respondent, AI correctness and agreement with the respondent are randomized at the question level conditional on the respondent's answer, populating correctness-by-agreement cells exogenously. AI responses are generated on Qualtrics. When the AI does not agree with a wrong respondent, the residual probability mass is split between the correct option and a random alternative wrong option (high regime: 0.44 correct, 0.11 other wrong; low regime: 0.47 correct, 0.28 other wrong). When the AI does not agree with a correct respondent, it draws uniformly from the wrong options.
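The question-level randomization implied by these probabilities can be sketched as follows. This is an illustrative stdlib re-implementation, not the actual Qualtrics logic; the function and dictionary names are hypothetical:

```python
import random

# Conditional draw probabilities from the design text.
# When the respondent is WRONG, the tuple gives
# (agree with wrong answer, give correct option, give other wrong option).
REGIMES = {
    "high": {"wrong": (0.45, 0.44, 0.11), "correct_agree": 0.75},
    "low":  {"wrong": (0.25, 0.47, 0.28), "correct_agree": 0.75},
}

def draw_ai_answer(respondent_correct: bool, regime: str, rng=random):
    """Draw one AI answer type given the respondent's correctness and regime."""
    p = REGIMES[regime]
    if respondent_correct:
        # AI agrees (hence is correct) with prob 0.75 in both regimes;
        # otherwise it draws uniformly from the wrong options.
        return "agree" if rng.random() < p["correct_agree"] else "wrong"
    agree, correct, other = p["wrong"]
    u = rng.random()
    if u < agree:
        return "agree"        # AI endorses the respondent's wrong answer
    if u < agree + correct:
        return "correct"      # AI gives the correct option
    return "other_wrong"      # AI gives a different wrong option

# The three branches exhaust the probability mass in each regime.
for p in REGIMES.values():
    assert abs(sum(p["wrong"]) - 1.0) < 1e-9
```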


AI responses combine a multiple-choice answer with a wording style (respondents are debriefed at the end of the survey that responses are pre-generated). Each AI response is paired with one of ten pre-written explanatory templates containing an AI-generated, question-specific generic justification, so that wording does not covary systematically with correctness or agreement.



To summarize:
- Between subjects: database (political misperceptions / SAT) × truth revelation (pre/post rating) × AI agreement level (high/low)
- Within subjects: question ordering, the AI's suggested answers, and wording.
Experimental Design Details
Not available
Randomization Method
Randomization is implemented by Qualtrics' built-in randomizer with even-presentation enforcement.

Arm assignment is stratified hierarchically: domain/database, then revelation timing within domain, then agreement regime within revelation timing.
Randomization Unit
We have between-subject variation at the respondent level and within-subject variation that randomizes the ordering of questions and the AI's answers across questions.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
800 individual respondents recruited from Prolific (U.S. adults).
Sample size: planned number of observations
800 attentive respondents, defined as respondents who pass all three attention checks (about accuracy bonuses, how rating affects the AI's answers, and how ratings are used). We screen out those who fail these checks. At the question level this yields 24,000 observations with AI ratings (800 × 30). We also select the Prolific option to "Automatically reject exceptionally fast submissions" and balance respondents by sex.
Sample size (or number of clusters) by treatment arms
100 attentive respondents per arm × 8 arms.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
For ratings in the post-truth arm, we are powered to detect a 0.14 SD difference between the rating in the (wrong, agree) cell and the rating in the (wrong, disagree) cell (per database or combined across databases). This is well below the smallest effect size we saw in the pilot (0.62 SD). We are also powered for other comparisons of interest, such as the difference between the rating in the (correct, agree) cell and the rating in the (correct, disagree) cell, or [(correct, agree) - (correct, disagree)] - [(wrong, agree) - (wrong, disagree)]. For our secondary outcomes, with N=800 attentive respondents and pre-registered controls, we are powered to detect effects of roughly 0.16–0.25 per database.
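For intuition on the mechanics, the standard two-sided, two-sample MDE formula can be sketched as below. This is a generic back-of-envelope calculation, not the registry's actual power computation: using respondent counts alone it gives a much larger MDE than the registered 0.14 SD, which additionally exploits the ~30 question-level ratings per respondent, fixed effects, and clustering:

```python
from math import sqrt
from statistics import NormalDist

def mde_sd_units(n_per_group: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect in SD units for a two-sided comparison of
    means with equal group sizes: (z_{1-alpha/2} + z_{power}) * sqrt(2/n)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sqrt(2 / n_per_group)

# With 100 respondents per arm and one observation each, the naive MDE is
# roughly 0.40 SD; the gap to the registered 0.14 SD reflects the extra
# power from repeated question-level observations within respondent.
```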
IRB

Institutional Review Boards (IRBs)

IRB Name
Stanford Institutional Review Board
IRB Approval Date
2026-04-30
IRB Approval Number
IRB 85668