Human assessment of chatbot misunderstandings

Last registered on August 24, 2024

Pre-Trial

Trial Information

General Information

Title
Human assessment of chatbot misunderstandings
RCT ID
AEARCTR-0014039
Initial registration date
July 20, 2024

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
July 29, 2024, 4:31 PM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Last updated
August 24, 2024, 5:22 PM EDT

Last updated is the most recent time when changes to the trial's registration were published.

Locations

Region

Primary Investigator

Affiliation
Harvard University

Other Primary Investigator(s)

Additional Trial Information

Status
In development
Start date
2024-07-20
End date
2024-08-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
We will study the reactions that potential users of an AI chatbot have when exposed to various misunderstandings it made in the context of parenting conversations. We will first acquire a measure of how "reasonable", from a human perspective, such misunderstandings are deemed to be. Then, we will study how future user engagement with the chatbot is affected by this reasonableness, by varying the types of conversations potential users are exposed to.
External Link(s)

Registration Citation

Citation
Raux, Raphael. 2024. "Human assessment of chatbot misunderstandings." AEA RCT Registry. August 24. https://doi.org/10.1257/rct.14039-1.1
Experimental Details

Interventions

Intervention(s)
The intervention involves two steps, with two separate surveys. The first is an "initial labeling" step, in which a first sample of subjects rates the quality of randomly drawn real user-AI conversations, establishing a measure of the "reasonableness" of AI misunderstandings. The conversations (defined as 1 user query and 1 AI answer) are arranged in "pairs", where in each pair the user queries have the same "intent" (i.e., ask for the same thing) and the AI misunderstands the query in both cases. These misunderstandings vary in how "reasonable" they are from a human perspective: we will explicitly look for conversation pairs with a large "reasonableness gap" between the two AI answers. Among the candidate conversation pairs we rate, we will focus on the subset for which the average gap is largest and the variance of reasonableness is lowest. Among those, we select a subset of pairs that do not present significant differences in "usefulness" between the two parts of the pair. Usefulness is elicited using a procedure highly similar to the one used for reasonableness. We establish a final list of 5 pairs of low-quality conversations (along with 3 high-quality ones) to use in the second step: these pairs present both a large delta in reasonableness and a small, insignificant delta in usefulness (based on medians and averages, rated by current or expecting parents).
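For illustration, a minimal sketch of this pair-selection logic in Python; the data file, column names, candidate count, and the significance threshold for the usefulness comparison are all assumptions, not part of the design described above.

```python
import pandas as pd
from scipy import stats

# Hypothetical long-format ratings from the initial labeling survey:
# one row per (rater, conversation), with columns
# pair_id, side ("A"/"B"), reasonableness, usefulness.
ratings = pd.read_csv("labeling_ratings.csv")

# Mean and variance of reasonableness per conversation.
by_conv = ratings.groupby(["pair_id", "side"])["reasonableness"].agg(["mean", "var"])

# Reasonableness gap and average rating variance per pair.
summary = by_conv["mean"].unstack("side")
summary["gap"] = (summary["A"] - summary["B"]).abs()
summary["var"] = by_conv["var"].groupby("pair_id").mean()

# Focus on pairs with the largest gap and the lowest variance (assumed cutoff: top 10).
candidates = summary.sort_values(["gap", "var"], ascending=[False, True]).head(10)

# Keep only pairs whose two sides do not differ significantly in usefulness.
kept = []
for pair in candidates.index:
    a = ratings.query("pair_id == @pair and side == 'A'")["usefulness"]
    b = ratings.query("pair_id == @pair and side == 'B'")["usefulness"]
    _, p = stats.mannwhitneyu(a, b)
    if p > 0.10:  # threshold is an assumption
        kept.append(pair)

final_pairs = kept[:5]  # final shortlist of 5 low-quality pairs
```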

The second step will have another subject sample observe conversations among the shortlist we establish in the first step. We will experimentally vary the "reasonableness" of misunderstanding: one arm will see "high-reasonableness" misunderstandings, while the other will see "low-reasonableness" misunderstandings. Subjects will then be given the opportunity to engage with the AI chatbot via a link, and we will measure engagement as the main outcome. To increase the baseline willingness to engage with the chatbot, we will include in both treatments a number of high-quality conversations (where the AI understood the question properly and provided a useful answer), which we will hold constant across treatments. For each conversation, we will elicit beliefs in chatbot performance (incentivized) and trust in the chatbot, which will also be main outcome variables.
Intervention Start Date
2024-07-31
Intervention End Date
2024-08-30

Primary Outcomes

Primary Outcomes (end points)
User engagement with the chatbot, taking the form of willingness to use the chatbot. We offer the choice to get a link to the chatbot or to parenting articles, provided at the end of the survey.
The main prediction is that user engagement (the share choosing the chatbot) is higher in the "high reasonableness" treatment than in the "low reasonableness" treatment. We interpret this as subjects inferring that the chatbot's ability to perform its task (provide useful information) is lower when the misunderstanding is less reasonable from a human perspective.
The analysis will also look at incentivized beliefs in chatbot performance (in %) and trust in the chatbot (1-7 scale); we predict higher measures of both in the "high reasonableness" treatment.
Primary Outcomes (explanation)
Engagement is defined as binary: does the subject choose the chatbot or the articles.
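As an illustration of how this binary outcome could be compared across arms, a minimal sketch using a two-proportion z-test; the data file and variable names are hypothetical.

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical per-subject data: treatment ("high"/"low") and chose_chatbot (0/1).
df = pd.read_csv("engagement_experiment.csv")

counts = df.groupby("treatment")["chose_chatbot"].agg(["sum", "count"])
stat, pval = proportions_ztest(
    count=[counts.loc["high", "sum"], counts.loc["low", "sum"]],
    nobs=[counts.loc["high", "count"], counts.loc["low", "count"]],
    alternative="larger",  # prediction: higher engagement under high reasonableness
)
print(f"z = {stat:.2f}, one-sided p = {pval:.3f}")
```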

Secondary Outcomes

Secondary Outcomes (end points)
Conditional on obtaining a snapshot of chatbot conversations from the website, parentdata.org, we may study the persistence of user engagement a few weeks after the experiment is concluded. Conditional on engaging, we will look at the number of questions and their type/wording, as well as the answers they generate. Regardless of the number of questions asked, evidence of more "cautious" behavior by subjects (e.g., a higher likelihood of asking "test questions" or low-stakes questions) will be interpreted in line with the main hypothesis: that lower reasonableness of the conversations seen decreases beliefs in the chatbot's ability more (compared to high reasonableness) and induces more cautious behavior from subjects. If the conversation rate (subjects who actually use the chatbot at all after choosing the link) is particularly low, we may recruit additional subjects as a follow-up.


Secondary Outcomes (explanation)

Experimental Design

Experimental Design
The main treatment varies the reasonableness of conversations. All subjects will see a few low-quality conversations (i.e., with unhelpful answers) between users and the AI, chosen among the suitable pairs. Subjects in the "high" treatment will see the conversation that was rated as more reasonable (relatively speaking), while subjects in the "low" treatment will see the one rated as less reasonable. This is the only treatment difference. A sub-arm may be included in which we vary the relative share of high-quality (useful answers) and low-quality (unhelpful answers) conversations: the corresponding prediction is that the likelihood of user engagement with the AI increases with the number of high-quality conversations seen.


Initial labeling: We start by screening for parents of young children and currently expecting parents, in order for our sample to be similar to the chatbot's user base. After a consent form, subjects see experimental instructions, which describe the task: they will see a number of real conversations in which answers were deemed unhelpful because the AI misunderstood the question. They will have to assess how reasonable the various misunderstandings are from a human perspective. We will make clear and explicit that we are not asking how useful/helpful the answers are (they are all unhelpful), but rather how reasonable the misunderstanding is. We then run a similar design to elicit the "usefulness" of answers and select only the pairs whose two conversations are rated as equally useful (in our case, not very useful).

Experiment: We start by screening for parents of young children and currently expecting parents, in order for our sample to be similar to the chatbot's user base. After a consent form, subjects see experimental instructions, which describe the task: they will see a number of real user-AI conversations and will have the opportunity to engage with the AI at the end. As explained above, the main treatment will vary the reasonableness of the AI answers in the low-quality conversations. Subjects draw 2 low-quality conversations among a final list of 5.
Sub-arms may vary the relative share of high- and low-quality conversations. The main outcomes are elicited after the conversations: incentivized beliefs in performance and trust, as well as requesting a link to the chatbot or to parenting articles.
The survey will end with usual demographic questions.
Experimental Design Details
We will also elicit prior familiarity with Dewey, the chatbot used, and perform the main analysis on a sample that excludes subjects reporting strong familiarity, since our desired sample is made of potential users, not current ones.
We elicit familiarity with AI systems in general: we expect that highly familiar subjects may be less subject to our mechanism of "performance anthropomorphism", so results will be weaker for them.
We will include a simple comprehension question and an attention check as measures of data quality; the analysis will be performed both with and without subjects who fail them.
Randomization Method
Randomization is done through the Qualtrics survey: both built-in randomization (at the block level) and custom JavaScript (draws without replacement) are used to vary the types of conversations subjects see.
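The draw logic is implemented in Qualtrics JavaScript; as an illustration only, the sketch below reproduces the same draw-without-replacement idea in Python, with hypothetical identifiers for the shortlisted conversations.

```python
import random

# Hypothetical IDs for the shortlist from the initial labeling step:
# 5 low-quality conversation pairs and 3 high-quality conversations.
LOW_QUALITY_PAIRS = ["pair_1", "pair_2", "pair_3", "pair_4", "pair_5"]
HIGH_QUALITY = ["good_1", "good_2", "good_3"]

def draw_conversations(arm: str, n_low: int = 2) -> list[str]:
    """Draw low-quality pairs without replacement, then pick the arm-specific side."""
    pairs = random.sample(LOW_QUALITY_PAIRS, k=n_low)       # without replacement
    side = "high" if arm == "high_reasonableness" else "low"
    low_quality = [f"{p}_{side}" for p in pairs]            # more vs. less reasonable answer
    return low_quality + HIGH_QUALITY                       # high-quality held constant

print(draw_conversations("high_reasonableness"))
```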
Randomization Unit
Individual survey participant.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
Initial labeling: around 30-40 elicitations per conversation for reasonableness, with an initial list of 80 conversations (40 pairs). Each subject rates 20 conversations, thus around 200 subjects. Then, around 20 elicitations per conversation for usefulness (1-5 scale), around 100 subjects.
Engagement experiment: total of around 900-1000 clusters.
Sample size: planned number of observations
Initial labeling: total of around 300 subjects, in two separate samples for reasonableness and usefulness.
Engagement experiment: total of around 900-1000 subjects.
Sample size (or number of clusters) by treatment arms
Initial labeling: total of around 300 subjects, in two separate samples for reasonableness and usefulness.
Experiment: around 450-500 subjects per arm.
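For reference, a minimal sketch of how a minimum detectable effect for the binary engagement outcome could be computed at these arm sizes; the baseline engagement share (50%), power (80%), and significance level (5%, two-sided) are assumptions, not registered values.

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumptions (not from the registration): baseline share 0.50, 475 subjects per arm.
baseline = 0.50
n_per_arm = 475

# Smallest detectable effect size (Cohen's h) for a two-sided test at 80% power.
h_min = NormalIndPower().solve_power(nobs1=n_per_arm, alpha=0.05, power=0.80,
                                     ratio=1.0, alternative="two-sided")

# Translate Cohen's h back into a detectable treatment-arm share via a grid search.
shares = np.linspace(baseline, 1.0, 5001)
h_grid = proportion_effectsize(shares, baseline)
mde_share = shares[np.searchsorted(h_grid, h_min)]
print(f"Minimum detectable effect: h = {h_min:.3f}, "
      f"treatment share ≈ {mde_share:.3f} vs. baseline {baseline:.2f}")
```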
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
IRB

Institutional Review Boards (IRBs)

IRB Name
Committee on the Use of Human Subjects - Harvard University
IRB Approval Date
2024-07-17
IRB Approval Number
IRB23-0588 (update)

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial that is unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials