
Fields Changed

Registration

Field: Last Published
Before: July 29, 2024 04:31 PM
After: August 24, 2024 05:22 PM

Field: Intervention (Public)
Before: The intervention involves two steps, with two separate surveys. First, an "initial labeling" survey has a first sample of subjects rate the quality of randomly drawn real user-AI conversations and establishes a measure of the "reasonableness" of AI misunderstandings. The conversations (defined as 1 user query and 1 AI answer) are arranged in "pairs", where in each pair the user queries have the same "intent" (i.e., ask for the same thing) and the AI misunderstands the query in both cases. These misunderstandings vary in how "reasonable" they are from a human perspective: we will explicitly look to obtain conversation pairs for which there is a large "reasonableness gap" in AI answers. Among the candidate conversation pairs we rate, we will focus on the subset for which the average gap is largest and the variance of reasonableness is lowest. As a verification of the conversation selection process, we may also elicit a coarser measure of usefulness and select pairs which are rated as not useful. The second step will have another subject sample observe conversations from the shortlist we establish in the first step. We will experimentally vary the "reasonableness" of misunderstanding: one arm will see "high-reasonableness" misunderstandings, while the other will see "low-reasonableness" misunderstandings. Subjects will then be given the opportunity to engage with the AI chatbot, either directly in the survey or via a link, and we will measure engagement as the main outcome. To increase the baseline willingness to engage with the chatbot, we will include in both treatments a number of high-quality conversations (where the AI understood the question properly and provided a useful answer), which we will hold constant across treatments.
After: The intervention involves two steps, with two separate surveys. First, an "initial labeling" survey has a first sample of subjects rate the quality of randomly drawn real user-AI conversations and establishes a measure of the "reasonableness" of AI misunderstandings. The conversations (defined as 1 user query and 1 AI answer) are arranged in "pairs", where in each pair the user queries have the same "intent" (i.e., ask for the same thing) and the AI misunderstands the query in both cases. These misunderstandings vary in how "reasonable" they are from a human perspective: we will explicitly look to obtain conversation pairs for which there is a large "reasonableness gap" in AI answers. Among the candidate conversation pairs we rate, we will focus on the subset for which the average gap is largest and the variance of reasonableness is lowest. Among those, we select a subset of pairs which do not present significant differences in "usefulness" between the two parts of the pair. Usefulness is elicited using a measure highly similar to the one used for reasonableness. We establish a final list of 5 pairs of low-quality conversations (along with 3 high-quality ones) that we use in the second step: they present both a large delta in reasonableness and a small, insignificant delta in usefulness (based on medians and averages, rated by current or expecting parents). The second step will have another subject sample observe conversations from the shortlist we establish in the first step. We will experimentally vary the "reasonableness" of misunderstanding: one arm will see "high-reasonableness" misunderstandings, while the other will see "low-reasonableness" misunderstandings. Subjects will then be given the opportunity to engage with the AI chatbot via a link, and we will measure engagement as the main outcome. To increase the baseline willingness to engage with the chatbot, we will include in both treatments a number of high-quality conversations (where the AI understood the question properly and provided a useful answer), which we will hold constant across treatments. For each conversation, we will elicit beliefs in chatbot performance (incentivized) and trust in the chatbot, which will also be main outcome variables.

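The selection rule in the new text (largest average reasonableness gap, low rating variance, no significant usefulness difference, final list of 5 pairs) could be implemented along these lines. This is a minimal sketch only: the file name, the column names (`pair_id`, `member`, `reasonableness`, `usefulness`) and the 0.10 significance cutoff are illustrative assumptions, not taken from the registration.

```python
# Sketch of the conversation-pair selection rule (illustrative, not the registered code).
# Assumes one rating per row, with columns: pair_id, member ("high"/"low" candidate),
# reasonableness, usefulness.
import pandas as pd
from scipy import stats

ratings = pd.read_csv("labeling_ratings.csv")  # hypothetical file name

rows = []
for pair_id, pair in ratings.groupby("pair_id"):
    hi = pair[pair["member"] == "high"]
    lo = pair[pair["member"] == "low"]
    rows.append({
        "pair_id": pair_id,
        # large average gap in reasonableness between the two members of the pair
        "gap": hi["reasonableness"].mean() - lo["reasonableness"].mean(),
        # low dispersion of reasonableness ratings within the pair
        "var": pair["reasonableness"].var(),
        # usefulness should NOT differ significantly between the two members
        "p_useful": stats.mannwhitneyu(hi["usefulness"], lo["usefulness"]).pvalue,
    })

summary = pd.DataFrame(rows)
shortlist = (summary[summary["p_useful"] > 0.10]        # insignificant usefulness delta
             .sort_values(["gap", "var"], ascending=[False, True])
             .head(5))                                   # final list of 5 pairs
print(shortlist)
```
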
Field: Primary Outcomes (End Points)
Before: User engagement with the chatbot, taking the form of willingness to ask questions to the chatbot. Depending on technical and logistical constraints, these questions might be collected within the survey itself, or a link to the chatbot may be provided to subjects for them to engage with it in a naturalistic setting. The main prediction is that user engagement is higher in the "high reasonableness" treatment than in the "low reasonableness" treatment, which we interpret as subjects' inference that the chatbot's ability to perform its task is lower when the misunderstanding is less reasonable from a human perspective.
After: User engagement with the chatbot, taking the form of willingness to use the chatbot. We offer the choice of a link to the chatbot or to parenting articles, provided at the end of the survey. The main prediction is that user engagement (share choosing the chatbot) is higher in the "high reasonableness" treatment than in the "low reasonableness" treatment, which we interpret as subjects' inference that the chatbot's ability to perform its task (provide useful information) is lower when the misunderstanding is less reasonable from a human perspective. The analysis will also look at incentivized beliefs in chatbot performance (in %) and trust in the chatbot (1-7 scale); we predict higher measures of both in the "high reasonableness" treatment.

Field: Primary Outcomes (Explanation)
Before: Engagement will first be defined as binary: does the subject choose the option to ask any number of questions to the chatbot.
After: Engagement is defined as binary: does the subject choose the chatbot or the articles.

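A minimal sketch of how the primary comparisons could be computed, assuming a hypothetical subject-level file with columns `treatment` ("high"/"low"), `chose_chatbot` (0/1), `belief_pct`, and `trust_1_7`; these names, the two-proportion z-test, and the Welch t-tests are assumptions for illustration, not the registered analysis plan.

```python
# Illustrative analysis of the primary outcomes (not the registered analysis code).
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("experiment_subjects.csv")   # hypothetical file name
high = df[df["treatment"] == "high"]
low = df[df["treatment"] == "low"]

# Engagement: share choosing the chatbot link over the parenting-articles link.
counts = [high["chose_chatbot"].sum(), low["chose_chatbot"].sum()]
nobs = [len(high), len(low)]
z, p = proportions_ztest(counts, nobs)
print(f"engagement share: high={counts[0] / nobs[0]:.2f}, "
      f"low={counts[1] / nobs[1]:.2f}, z={z:.2f}, p={p:.3f}")

# Beliefs in chatbot performance (%) and trust (1-7): simple mean comparisons.
for col in ["belief_pct", "trust_1_7"]:
    t, p = stats.ttest_ind(high[col], low[col], equal_var=False)
    print(f"{col}: high={high[col].mean():.2f}, low={low[col].mean():.2f}, p={p:.3f}")
```
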
Field: Experimental Design (Public)
Before: The main treatment varies the reasonableness of conversations. All subjects will see a few low-quality conversations (useless answers) between users and AI, chosen among the suitable pairs. Subjects in the "high" treatment will see the conversation that was rated as more reasonable (relatively speaking), while subjects in the "low" treatment will see the one rated as less reasonable. This is the only treatment difference. A sub-arm may be included where we vary the relative share of high-quality (useful answers) and low-quality (useless answers) conversations: the corresponding prediction is that the likelihood of user engagement with AI increases with the number of high-quality conversations seen.

Initial labeling: We start by screening for parents of young children and currently expecting parents, in order for our sample to be similar to the chatbot's user base. After a consent form, subjects see experimental instructions, which describe the task: they will see a number of real conversations in which answers were deemed unhelpful because the AI misunderstood the question. They will have to assess how reasonable the various misunderstandings are from a human perspective. We will make clear and explicit that we are not asking how useful/helpful the answers are (they are all unhelpful) but rather how reasonable the misunderstanding is. As a control, we might also elicit a coarser measure of perceived usefulness of answers. The survey will end with usual demographic questions.

Experiment: We start by screening for parents of young children and currently expecting parents, in order for our sample to be similar to the chatbot's user base. After a consent form, subjects see experimental instructions, which describe the task: they will see a number of real user-AI conversations and will have the opportunity to engage with the AI at the end. As explained above, the main treatment will vary the reasonableness of the AI answers in the low-quality conversations. Sub-arms will vary the relative share of high- and low-quality conversations. The survey will end with usual demographic questions.

After: The main treatment varies the reasonableness of conversations. All subjects will see a few low-quality conversations (not useful answers) between users and AI, chosen among the suitable pairs. Subjects in the "high" treatment will see the conversation that was rated as more reasonable (relatively speaking), while subjects in the "low" treatment will see the one rated as less reasonable. This is the only treatment difference. A sub-arm may be included where we vary the relative share of high-quality (useful answers) and low-quality (not so useful answers) conversations: the corresponding prediction is that the likelihood of user engagement with AI increases with the number of high-quality conversations seen.

Initial labeling: We start by screening for parents of young children and currently expecting parents, in order for our sample to be similar to the chatbot's user base. After a consent form, subjects see experimental instructions, which describe the task: they will see a number of real conversations in which answers were deemed unhelpful because the AI misunderstood the question. They will have to assess how reasonable the various misunderstandings are from a human perspective. We will make clear and explicit that we are not asking how useful/helpful the answers are (they are all unhelpful) but rather how reasonable the misunderstanding is. We then run a similar design to elicit the "usefulness" of answers and select only the pairs that are rated as equally useful (not very useful, in our case).

Experiment: We start by screening for parents of young children and currently expecting parents, in order for our sample to be similar to the chatbot's user base. After a consent form, subjects see experimental instructions, which describe the task: they will see a number of real user-AI conversations and will have the opportunity to engage with the AI at the end. As explained above, the main treatment will vary the reasonableness of the AI answers in the low-quality conversations. Subjects draw 2 low-quality conversations from a final list of 5. Sub-arms may vary the relative share of high- and low-quality conversations. The main outcomes are elicited after the conversations: incentivized beliefs in performance and trust, as well as requesting a link to the chatbot or to parenting articles. The survey will end with usual demographic questions.

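A minimal sketch of the drawing and treatment logic in the new design (2 of 5 low-quality pairs drawn per subject, the arm determining which member of each pair is shown, 3 high-quality conversations held constant); the identifiers are placeholders, not the actual stimuli, and the sub-arms varying the share of high- and low-quality conversations are omitted.

```python
# Illustrative assignment logic: which conversations a subject sees.
import random

LOW_QUALITY_PAIRS = ["pair_1", "pair_2", "pair_3", "pair_4", "pair_5"]  # placeholders
HIGH_QUALITY = ["good_1", "good_2", "good_3"]  # shown in both arms, held constant

def draw_conversations(arm: str, rng: random.Random) -> list[str]:
    """Draw 2 of the 5 low-quality pairs; the arm decides which member is shown."""
    drawn_pairs = rng.sample(LOW_QUALITY_PAIRS, k=2)
    low_quality = [f"{pair}_{arm}" for pair in drawn_pairs]  # e.g. "pair_3_high"
    shown = HIGH_QUALITY + low_quality
    rng.shuffle(shown)  # randomize display order
    return shown

rng = random.Random(42)
arm = rng.choice(["high", "low"])  # between-subject treatment assignment
print(arm, draw_conversations(arm, rng))
```
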
Field: Planned Number of Clusters
Before: Initial labeling: around 20 elicitations per candidate conversation, with an initial list of 80 conversations (40 pairs). Each subject will rate between 10 and 20 conversations, thus between 80 and 160 subjects. Engagement experiment: between 50 and 150 subjects per treatment arm, to be refined based on the observed delta in reasonableness between high and low conversations.
After: Initial labeling: around 30-40 elicitations per conversation for reasonableness, with an initial list of 80 conversations (40 pairs). Each subject rates 20 conversations, thus around 200 subjects. Then, around 20 elicitations per conversation for usefulness (1-5 scale), around 100 subjects. Engagement experiment: total of around 900-1000 clusters.

Field: Planned Number of Observations
Before: Initial labeling: between 80 and 160 subjects. Experiment: between 200 and 600 subjects (to be refined as said above).
After: Initial labeling: total of around 300 subjects, in two separate samples for reasonableness and usefulness. Experiment: total of around 900-1000 subjects.

Field: Sample size (or number of clusters) by treatment arms
Before: Initial labeling: between 80 and 160 subjects. Experiment: between 200 and 600 subjects (to be refined as said above).
After: Initial labeling: total of around 300 subjects, in two separate samples for reasonableness and usefulness. Experiment: around 450-500 subjects per arm.

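As a rough illustration of what roughly 450-500 subjects per arm can detect on the binary engagement outcome, a back-of-the-envelope power sketch follows; the 50% baseline share, the two-sided 5% test, and 80% power are conventional assumptions for illustration, not registered figures.

```python
# Back-of-the-envelope minimum detectable effect for a two-proportion comparison.
import math

def mde_two_proportions(n_per_arm: int, p_base: float) -> float:
    """Approximate MDE (in proportion points) via Cohen's arcsine effect size,
    for a two-sided 5% test at 80% power."""
    z_alpha, z_beta = 1.96, 0.84                    # approximate critical values
    h = (z_alpha + z_beta) * math.sqrt(2 / n_per_arm)
    phi_base = 2 * math.asin(math.sqrt(p_base))
    p_alt = math.sin((phi_base - h) / 2) ** 2
    return p_base - p_alt

for n in (450, 500):
    print(n, round(mde_two_proportions(n, 0.50), 3))
```
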
Field: Secondary Outcomes (End Points)
Before: As a secondary analysis, we will look, conditional on engaging, at the number of questions and their type/wording, as well as the answers they generate. Regardless of the number of questions asked, evidence of more "cautious" behavior by subjects (e.g., higher likelihood of asking "test questions" or low-stakes questions) will be interpreted in line with the main hypothesis: that lower reasonableness of the conversations seen will decrease belief in the chatbot's ability more (compared to high reasonableness) and induce more cautious behavior from subjects. Conditional on obtaining a snapshot of chatbot conversations from the website, parentdata.org, we may study the persistence of user engagement a few weeks after the experiment is concluded.
After: We will look, conditional on engaging, at the number of questions and their type/wording, as well as the answers they generate. Regardless of the number of questions asked, evidence of more "cautious" behavior by subjects (e.g., higher likelihood of asking "test questions" or low-stakes questions) will be interpreted in line with the main hypothesis: that lower reasonableness of the conversations seen will decrease belief in the chatbot's ability more (compared to high reasonableness) and induce more cautious behavior from subjects. If the conversation rate (subjects who actually use the chatbot at all when choosing the link) is particularly low, we may recruit additional subjects as a follow-up.

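A minimal sketch of the secondary comparison of question counts conditional on engaging, assuming a hypothetical subject-level file with columns `treatment`, `chose_chatbot`, and `n_questions`; the names are illustrative, and the type/wording analysis of questions is not sketched here.

```python
# Illustrative secondary analysis: question counts conditional on engaging.
import pandas as pd
from scipy import stats

df = pd.read_csv("experiment_subjects.csv")   # hypothetical file name
engaged = df[df["chose_chatbot"] == 1]        # condition on choosing the chatbot link

high = engaged[engaged["treatment"] == "high"]["n_questions"]
low = engaged[engaged["treatment"] == "low"]["n_questions"]
u, p = stats.mannwhitneyu(high, low, alternative="two-sided")
print(f"median questions: high={high.median()}, low={low.median()}, p={p:.3f}")
```
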