Human learning about AI performance

Last registered on December 06, 2023


Trial Information

General Information

Human learning about AI performance
Initial registration date
November 28, 2023


First published
December 06, 2023, 8:13 AM EST




Primary Investigator

Harvard University

Other Primary Investigator(s)

Additional Trial Information

In development
Start date
End date
Secondary IDs
Prior work
This trial is based on or builds upon one or more prior RCTs.
We study human perceptions of AI performance in mathematics and what they learn from observed AI performance. We compare prior beliefs and updating patterns between humans and AI, and test predictions of a theoretical model of performance anthropomorphism.
External Link(s)

Registration Citation

Raux, Raphael. 2023. "Human learning about AI performance." AEA RCT Registry. December 06.
Experimental Details


We develop a survey experiment in which Prolific subjects are presented with multiple-choice math questions (taken from various standardized tests). They are incentivized both to solve the questions and to predict the performance (likelihood of answering correctly) of someone else: either a human or an AI (ChatGPT). Beliefs are incentivized using the binarized scoring rule.
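The binarized scoring rule can be sketched as follows. This is an illustrative implementation, not the survey's actual payment code; the example probability of 0.7 is an arbitrary assumption.

```python
# Sketch of the binarized scoring rule (Hossain and Okui, 2013): the
# subject reports a belief b in [0, 1]; once the outcome y in {0, 1} is
# realized, they win a fixed prize with probability 1 - (b - y)^2.

def win_probability(belief: float, outcome: int) -> float:
    """Chance of winning the prize under the binarized scoring rule."""
    return 1.0 - (belief - outcome) ** 2

def expected_win_prob(belief: float, true_p: float) -> float:
    """Expected winning chance when the true success probability is true_p."""
    return true_p * win_probability(belief, 1) + (1.0 - true_p) * win_probability(belief, 0)

# Truthfulness check: over a grid of possible reports, reporting the true
# probability (here 0.7, a made-up example) maximizes the winning chance,
# regardless of the subject's risk attitudes.
grid = [i / 100 for i in range(101)]
best_report = max(grid, key=lambda b: expected_win_prob(b, 0.7))
```

Because the prize is binary, maximizing the winning probability is incentive-compatible even for risk-averse subjects, which is the rule's main advantage over a direct quadratic payment.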
Intervention Start Date
Intervention End Date

Primary Outcomes

Primary Outcomes (end points)
(Incentivized) beliefs about performance: both prior beliefs and posterior beliefs following a one-shot success/mistake on a revealed question. We are interested in comparing the shape of prior and posterior beliefs about humans versus AI. Our model makes two main predictions: (i) beliefs about both human and AI performance are decreasing in the (human) difficulty of the task considered; (ii) the posterior belief about both human and AI performance on a "hard" task is lower after revealing a mistake on an "easy" task than after revealing a mistake on a (different) "hard" task; conversely, the posterior belief on a "hard" task is higher after revealing a success on a (different) "hard" task than after revealing a success on an "easy" task. We call prediction (ii) "cross inference".
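The cross inference prediction can be illustrated with a minimal common-ability model; this is an assumption for illustration (the registered theoretical model may differ), with all parameter values made up. A single latent ability drives performance on all tasks, so a mistake on an easy task is stronger evidence of low ability than a mistake on a hard task.

```python
import math

# Illustrative common-ability model: P(correct | ability a, difficulty d)
# = logistic(a - d). Grid prior over ability; one observed outcome is used
# to update, then the posterior belief on a hard task is computed.

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

ABILITIES = [i / 10 for i in range(-30, 31)]          # latent-ability grid
PRIOR = [1.0 / len(ABILITIES)] * len(ABILITIES)       # uniform prior
D_EASY, D_HARD = -2.0, 2.0                            # assumed difficulties

def posterior_hard_belief(observed_d: float, success: bool) -> float:
    """Posterior P(correct on a hard task) after one revealed outcome."""
    weights = []
    for a, p in zip(ABILITIES, PRIOR):
        pc = logistic(a - observed_d)
        weights.append(p * (pc if success else 1.0 - pc))
    z = sum(weights)
    post = [w / z for w in weights]
    return sum(q * logistic(a - D_HARD) for a, q in zip(ABILITIES, post))

# Cross inference: a mistake on an EASY task lowers the belief on a hard
# task more than a mistake on a different HARD task does, and a success
# on a HARD task raises it more than a success on an EASY task does.
after_easy_fail = posterior_hard_belief(D_EASY, success=False)
after_hard_fail = posterior_hard_belief(D_HARD, success=False)
after_hard_win = posterior_hard_belief(D_HARD, success=True)
after_easy_win = posterior_hard_belief(D_EASY, success=True)
```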
Primary Outcomes (explanation)
Belief in performance is defined as the "% chance [ChatGPT / a randomly-selected human who took the test] answered the question correctly". We constructed an index of difficulty by administering a math test composed of standardized test questions of various grade levels, as described in the registered trial. Task difficulty is then defined as the share of humans who answered the question incorrectly. All questions are multiple-choice with 4 or 5 choices, so random guessing bounds difficulty from above at roughly 75-80%.
"Easy" and "hard" tasks are defined as the extremes of the difficulty gradient (first and last deciles, defined using performance data from 314 humans; additional performance data will be collected in the present survey).

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
The survey is composed of 3 parts. The first part provides instructions and familiarizes participants with the agent whose performance they will be asked to predict: humans (Prolific subjects who took the test) or ChatGPT.
The second part presents randomly selected questions (around 10) and asks participants to solve them and to predict the performance of humans/ChatGPT.
The last part tests the cross inference prediction: elicit prior beliefs, reveal performance on one question, and elicit posterior belief (4 possible conditions varying revealed performance).
The survey will end with basic demographic questions, as well as a question asking about prior AI familiarity.
Experimental Design Details
For the cross inference part: a random "hard" question is presented and the prior belief is elicited. Then performance on a random question is revealed (randomly varying failure/success and the easy/hard nature of the question), and the posterior belief on the same hard question is then elicited.
For now, we leave aside the symmetric part of the cross inference prediction, which compares posterior beliefs on easy questions: the posterior following an easy success is lower than following a hard success, and the posterior is higher following a hard mistake than following an easy mistake. The reason for this is twofold: (i) it increases power by reducing the number of arms; (ii) prior beliefs about AI performance on easy questions are already close to the maximum (>90% on average), so this ceiling effect mechanically reduces the chance of detecting a significant effect.
Randomization Method
Randomization is done at the Qualtrics block level, both between the main treatments (human/AI) and across the cross inference blocks (4 blocks). Questions shown in the second survey part are randomly sampled from the full pool of questions; we ensure that each subject sees randomly sampled questions spanning the full gradient of difficulty.
For cross inference, we create pools of easy/hard questions on which ChatGPT failed/succeeded (performance is for model GPT-3.5, August 3 version). From these pools, a hard question is randomly drawn to elicit the prior and posterior, and another question is drawn from one of the 4 other pools (easy fail, hard fail, easy success, hard success) to reveal performance. The random draw is implemented through embedded data and conditional question display.
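The pool-based draw can be sketched as follows. The pool contents and question names are placeholders, and the actual implementation uses Qualtrics embedded data and conditional display rather than Python.

```python
import random

# Placeholder pools: hard questions for prior/posterior elicitation, plus
# the 4 reveal pools crossing easy/hard with ChatGPT failure/success.
pools = {
    "hard_prior":   ["h1", "h2", "h3"],
    "easy_fail":    ["ef1", "ef2"],
    "hard_fail":    ["hf1", "hf2"],
    "easy_success": ["es1", "es2"],
    "hard_success": ["hs1", "hs2"],
}

def assign_cross_inference(rng=random):
    """Draw the target hard question and the revealed-performance question."""
    target = rng.choice(pools["hard_prior"])
    condition = rng.choice(["easy_fail", "hard_fail",
                            "easy_success", "hard_success"])
    revealed = rng.choice(pools[condition])
    return {"target": target, "condition": condition, "revealed": revealed}
```

Each respondent would receive one such assignment, with the prior elicited on `target`, the outcome of `revealed` shown, and the posterior elicited on the same `target`.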
Randomization Unit
Individual survey respondents for the main human/AI treatment. Then, within each group, subjects are randomly assigned to one of the 4 possible cross inference conditions.
Was the treatment clustered?

Experiment Characteristics

Sample size: planned number of clusters
2 main groups of subjects (human/AI condition), each divided into 4 sub-groups (cross inference condition).
Sample size: planned number of observations
Around 1000-1100 total subjects, excluding those failing attention checks or comprehension questions (slightly more subjects will be recruited to end up with this number).
Sample size (or number of clusters) by treatment arms
Around 800 subjects for the AI treatment, divided into 4 cross inference groups of around 200 each. For the human treatment, around 250-300 subjects, divided into 4 groups of 62-75.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

Institutional Review Boards (IRBs)

IRB Name
Committee on the use of human subjects - Harvard University
IRB Approval Date
IRB Approval Number


Post Trial Information

Study Withdrawal

There is information in this trial that is unavailable to the public.


Is the intervention completed?
Data Collection Complete
Data Publication

Data Publication

Is public data available?

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials