Evaluating the impact of AI-powered reasoning in deliberative processes

Last registered on February 28, 2025

Pre-Trial

Trial Information

General Information

Title
Evaluating the impact of AI-powered reasoning in deliberative processes
RCT ID
AEARCTR-0015461
Initial registration date
February 26, 2025

Initial registration date is when the trial was registered; it corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
February 28, 2025, 10:40 AM EST

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
Stanford University

Other Primary Investigator(s)

PI Affiliation
University of California, Berkeley
PI Affiliation
Massachusetts Institute of Technology
PI Affiliation
Stanford University
PI Affiliation
Stanford University
PI Affiliation
ETH Zurich
PI Affiliation
Massachusetts Institute of Technology

Additional Trial Information

Status
In development
Start date
2025-02-01
End date
2025-12-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This pre-analysis plan describes an experimental study evaluating whether AI-powered reasoning interventions and monetary incentives can enhance argument quality in an online policy debate on mental illness and U.S. gun regulation. Participants will be randomly assigned to either a reasoning (treatment) condition or a control condition, with further cross-randomization of direct or lottery-based financial rewards for thoughtful contributions. The primary outcomes include measures of argument thoughtfulness and compromise. In addition, secondary analyses will examine potential mechanisms, such as cognitive reflection, emotional state, diversity of perspectives, and engagement. This study seeks to identify practical, scalable methods for improving deliberative discourse in digital contexts.
External Link(s)

Registration Citation

Citation
Braley, Alia et al. 2025. "Evaluating the impact of AI-powered reasoning in deliberative processes." AEA RCT Registry. February 28. https://doi.org/10.1257/rct.15461-1.0
Experimental Details

Interventions

Intervention(s)
We test two main interventions for improving argument quality in an online policy debate on mental illness and firearm regulation in the United States:

1) AI-powered Reasoning: Participants are assigned to one of five conditions: prompts from a large language model (LLM) designed to foster more reflective, logically coherent arguments (Socratic dialogue LLM); solo guided reflection (Reflective paragraph); LLM-powered emotional regulation (Emotional-regulation LLM); LLM-based grammar correction (Grammar-correction LLM); or a waiting control.

2) Monetary Incentives: Participants are offered a direct monetary bonus (for top-quartile arguments), a lottery-based bonus (for top-quartile arguments), or no incentive.
Intervention Start Date
2025-02-06
Intervention End Date
2025-02-28

Primary Outcomes

Primary Outcomes (end points)
1. Argument thoughtfulness score: A composite measure (0–6 scale) reflecting logical structure, clarity, and depth of each participant’s main comment.

2. Position shifting or compromise: The extent to which participants move from an extreme stance toward a middle-ground position on the policy.
Primary Outcomes (explanation)
1. The Argument Thoughtfulness Score reflects three dimensions (clarity, coherence, and depth), each rated independently by human coders and an LLM, with a composite index combining both.

2. Position shifting is computed as the absolute change in stance on a -5 to +5 slider (e.g., from strongly oppose to moderately oppose). Additionally, we consider endorsement of seeded comments and a policy proposal, as well as willingness to donate to a non-profit organization.
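For illustration only, a minimal sketch of how the stance-change measures could be computed from the -5 to +5 slider. The absolute shift follows directly from the description above; the binary "compromise" coding (starting at an extreme stance and moving toward the midpoint) is our illustrative assumption, not the registered definition.

```python
def position_shift(pre: float, post: float) -> float:
    """Absolute change in stance on the -5 (strongly oppose) to +5 (strongly support) slider."""
    return abs(post - pre)

def compromised(pre: float, post: float, extreme_cutoff: float = 4.0) -> bool:
    """Illustrative binary coding (assumption): the participant started at an extreme
    stance (|pre| >= extreme_cutoff) and ended closer to the midpoint."""
    return abs(pre) >= extreme_cutoff and abs(post) < abs(pre)

# Example: moving from strongly oppose (-5) to moderately oppose (-2)
print(position_shift(-5, -2))  # 3
print(compromised(-5, -2))     # True
```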

Secondary Outcomes

Secondary Outcomes (end points)
1. Cognitive Reflection: Changes in CRT scores (pre-post).
2. Emotional State: Pre-post changes on a two-dimensional affect scale.
3. Perspective Diversity: Type-token ratio, semantic-similarity variance, and a vote-weighted average.
4. Engagement Indices: Number of comment edits, comment length, voting behavior, willingness to donate, and time spent.
Secondary Outcomes (explanation)
1. Cognitive Reflection is measured via two standard CRT questions, taking the difference between baseline and endline scores.
2. Emotional State is captured using self-reported activation and pleasure scales (0–10).
3. Perspective Diversity in textual output is assessed via NLP metrics such as type-token ratio (TTR) and semantic embedding variance (see the sketch after this list). We also estimate a vote-weighted average that combines the participant's self-reported position with their agreement on seeded comments.
4. Engagement includes votes on others’ comments, revisions made, willingness to donate (binary), and time spent.
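A minimal sketch of the two text-diversity metrics named above. The word-level tokenization and the use of an external sentence-embedding model are our assumptions for illustration; the registered analysis may use different tooling.

```python
import re
import numpy as np

def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens (simple word tokenization assumed)."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def embedding_variance(embeddings: np.ndarray) -> float:
    """Dispersion of comment embeddings: mean squared distance to the centroid.
    `embeddings` is an (n_comments, dim) array from any sentence-embedding model."""
    centroid = embeddings.mean(axis=0)
    return float(((embeddings - centroid) ** 2).sum(axis=1).mean())

comments = [
    "Background checks should screen for a documented history of violence.",
    "Access to treatment matters more than purchase restrictions.",
]
print(type_token_ratio(" ".join(comments)))
# embedding_variance would be applied to model-encoded vectors,
# e.g. produced by a sentence-transformers model (assumption).
```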

Experimental Design

Experimental Design
We conduct an online experiment with participants recruited to debate whether people with mental illnesses should be allowed to purchase firearms. We randomize at the individual level along two dimensions:
(i) reasoning interventions (Socratic dialogue LLM vs. Reflective paragraph vs. Emotional-regulation LLM vs. Grammar-correction LLM vs. waiting control) and
(ii) monetary incentives (direct bonus vs. lottery vs. none).

The primary objective is to determine which intervention(s) most effectively increase argument thoughtfulness and willingness to compromise. We measure outcomes post-treatment via rubric-based coding of each participant’s main argument, changes in stance, and secondary indicators like voting and donation behavior.

Final analyses will compare each treatment arm with pooled or disaggregated controls, depending on tests of equivalence among control conditions.

Details in Pre-Analysis Plan attached.
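For illustration only (the registered specification is in the attached pre-analysis plan, which is not public), a minimal sketch of the kind of treatment-control comparison described above. Column and condition names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per participant with columns 'thoughtfulness' (0-6),
# 'reasoning' (five conditions), and 'incentive' (three conditions).
def fit_main_model(df: pd.DataFrame):
    """OLS of the thoughtfulness score on treatment indicators, with the
    waiting-control / no-incentive cell as the reference category."""
    model = smf.ols(
        "thoughtfulness ~ C(reasoning, Treatment('waiting_control'))"
        " + C(incentive, Treatment('none'))",
        data=df,
    )
    return model.fit(cov_type="HC2")  # heteroskedasticity-robust standard errors
```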
Experimental Design Details
Not available
Randomization Method
Pseudo-random generator within the survey/intervention platform (deliberation.io). Each participant has equal probability of assignment to each cell in the 5×3 factorial design.
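For illustration, a minimal sketch of equal-probability assignment across the 15 cells of the 5×3 design; the actual assignment is handled inside the deliberation.io platform, so the labels and seed below are assumptions.

```python
import itertools
import random

REASONING = ["socratic_llm", "reflective_paragraph", "emotion_llm",
             "grammar_llm", "waiting_control"]
INCENTIVE = ["direct_bonus", "lottery_bonus", "none"]
CELLS = list(itertools.product(REASONING, INCENTIVE))  # 15 cells (5 x 3)

def assign(rng: random.Random) -> tuple[str, str]:
    """Equal-probability draw over the 15 factorial cells."""
    return rng.choice(CELLS)

rng = random.Random(2025)
assignments = [assign(rng) for _ in range(1500)]
```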
Randomization Unit
Individual.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
N/A.
Sample size: planned number of observations
1,500 when analyzing individual-level outcomes; up to 18,000 (1,500 × 12) when analyzing individual vote-level outcomes.
Sample size (or number of clusters) by treatment arms
1,500 total: 300 per reasoning condition; 500 per monetary-incentive condition.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
1. Argument Thoughtfulness score: Pilot data (non-registered pilot, September 2024) suggest an SD ≈ 2.28 on the 0–6 scale. With n = 1,500, we can detect a mean difference of roughly 0.3–0.55 points (on the 0–6 scale) with 80% power.
2. Position Shift (Compromise): Pilot data (non-registered pilot, September 2024) suggest an SD ≈ 0.458 on a binary measure. With n = 1,500, we can detect a mean difference of roughly 6–10 percentage points with 80% power.
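A minimal sketch of a standard two-sample minimum-detectable-effect calculation consistent with the figures above (two-sided alpha = 0.05, 80% power); the per-arm sample sizes used in the examples are our assumptions based on the allocation described above.

```python
from scipy.stats import norm

def mde(sd: float, n1: int, n2: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable difference in means for a two-sample comparison."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sd * (1 / n1 + 1 / n2) ** 0.5

# Thoughtfulness (SD ~ 2.28): one reasoning arm (300) vs. another arm, or vs. pooled remaining arms
print(round(mde(2.28, 300, 300), 2))   # ~0.52 points
print(round(mde(2.28, 300, 1200), 2))  # ~0.41 points (pooled comparison, assumption)
# Compromise (binary, SD ~ 0.458): incentive arms of 500 each
print(round(mde(0.458, 500, 500), 3))  # ~0.081, i.e. about 8 percentage points
```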
IRB

Institutional Review Boards (IRBs)

IRB Name
Stanford University
IRB Approval Date
2024-05-31
IRB Approval Number
75616
IRB Name
Massachusetts Institute of Technology
IRB Approval Date
2024-04-08
IRB Approval Number
5798
Analysis Plan

There is information in this trial unavailable to the public.