Evaluating the impact of AI-powered reasoning in deliberative processes

Last registered on August 01, 2025

Pre-Trial

Trial Information

General Information

Title
Evaluating the impact of AI-powered reasoning in deliberative processes
RCT ID
AEARCTR-0015461
Initial registration date
February 26, 2025

First published
February 28, 2025, 10:40 AM EST

Last updated
August 01, 2025, 4:41 PM EDT

Locations

Region

Primary Investigator

Affiliation
Stanford University

Other Primary Investigator(s)

PI Affiliation
University of California, Berkeley
PI Affiliation
Massachusetts Institute of Technology
PI Affiliation
Stanford University
PI Affiliation
Stanford University
PI Affiliation
ETH Zurich
PI Affiliation
Massachusetts Institute of Technology

Additional Trial Information

Status
In development
Start date
2025-02-01
End date
2025-12-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This pre-analysis plan describes an experimental study evaluating whether AI-powered reasoning interventions and monetary incentives can enhance argument quality in an online policy debate on mental illness and U.S. gun regulation. Participants will be randomly assigned to either a reasoning (treatment) condition or a control condition, with further cross-randomization of direct or lottery-based financial rewards for thoughtful contributions. The primary outcomes include measures of argument thoughtfulness and compromise. In addition, secondary analyses will examine potential mechanisms, such as cognitive reflection, emotional state, diversity of perspectives, and engagement. This study seeks to identify practical, scalable methods for improving deliberative discourse in digital contexts.
External Link(s)

Registration Citation

Citation
Braley, Alia et al. 2025. "Evaluating the impact of AI-powered reasoning in deliberative processes." AEA RCT Registry. August 01. https://doi.org/10.1257/rct.15461-2.0
Experimental Details

Interventions

Intervention(s)
We test two main interventions for improving argument quality in an online policy debate on mental illness and firearm regulation in the United States:

1) AI-powered Reasoning: Participants are randomly assigned to one of five reasoning conditions: a conversation with a large language model (LLM) designed to foster more reflective, logically coherent arguments (Socratic Dialogue); solo guided reflection (Reflective Paragraph); LLM-powered emotional regulation (Emotion-Regulation LLM); LLM-powered grammar correction (Grammar-Correction LLM); or a waiting control.

2) Monetary Incentives: Participants are cross-randomized to receive a direct monetary bonus (for top-quartile arguments), a lottery-based bonus (for top-quartile arguments), or no incentive.
Intervention (Hidden)
Reasoning conditions:
1. Socratic Dialogue (LLM-facilitated): Participants engage in a conversation with an LLM that uses the Socratic method. The LLM primarily asks open-ended, reflective questions about the participant’s argument or stance, prompting them to think critically and examine their reasoning. The LLM does not provide direct answers or opinions on the topic but encourages deeper reflection through probing questions (an illustrative prompt sketch follows this list).

2. Reflective Paragraph (No LLM): In this condition, there is no AI assistance. Participants are instructed to write one paragraph of 4 to 5 sentences supporting their perspective. That is, the treatment encourages participants to engage in self-reflection without any external prompts. This serves as a baseline for the effect of self-guided reflection.

3. Emotion-Regulation (LLM-facilitated): Participants interact with an LLM that focuses on the participant’s emotional reactions to the topic. The LLM prompts the user to identify and reflect on their feelings about the issue (e.g., frustration, enthusiasm, anger) and guides them through basic emotion-regulation strategies (such as reappraising the situation or taking a neutral perspective). This condition tests whether acknowledging and managing emotions influences subsequent reasoning, and thus whether emotional regulation drives any effect of interacting with the LLM. The LLM’s responses center on the participant’s emotional state rather than the content of their argument.

4. Grammar-Correcting (LLM-facilitated): In this placebo condition, participants receive assistance from an LLM that strictly provides stylistic improvements to their writing. For example, the LLM might correct grammar, fix spelling, or suggest clearer phrasing, but it does not offer any feedback on the argument’s content, logic, or depth. This condition controls for the effects of interacting with an LLM and of improved language clarity, without the benefits of deeper reflection or content-oriented guidance.

5. Waiting Control: Participants in this condition do nothing related to the task for 2 minutes, approximately the average time needed to complete the Reflective Paragraph or Socratic Dialogue treatments; they are told the page is loading and asked to wait. This controls for the mere passage of time and any potential rest or “incubation” effect, without additional writing or feedback.
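
For concreteness, below is a hypothetical sketch (in Python, for consistency with the other examples in this registration) of a system prompt that could implement the Socratic Dialogue condition. The wording is illustrative only; the study’s actual prompts are in the attached pre-analysis plan and are not reproduced here.

```python
# Hypothetical system prompt for the Socratic Dialogue condition. This is an
# illustration of the behavior described above, not the study's actual prompt.
SOCRATIC_SYSTEM_PROMPT = """\
You are a Socratic facilitator. The user will share an argument about whether
people with mental illnesses should be allowed to purchase firearms.
- Ask one open-ended, reflective question at a time about their reasoning.
- Never state your own opinion or give a direct answer on the topic.
- Probe assumptions, evidence, and counterarguments so the user critically
  examines their own argument.
"""
```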


Monetary incentives:
1. Direct monetary incentive: Participants are given the opportunity to earn an additional $1 USD (i.e., a 40% increase in their guaranteed base payment) if their argument ranks within the top 25% in thoughtfulness as rated by the research team.

2. Lottery: Participants are given the opportunity to enter a lottery for a $10 USD bonus (i.e., a 400% increase in their guaranteed base payment). Participants whose argument ranks within the top 25% in thoughtfulness, as rated by the research team, are entered into the lottery, which has a 10% chance of winning (i.e., an expected value of $1 USD).

3. No monetary incentive: Participants receive no additional monetary incentives beyond the base payment. No further instructions are included.

Details in Pre-Analysis Plan attached.
Intervention Start Date
2025-02-06
Intervention End Date
2025-02-28

Primary Outcomes

Primary Outcomes (end points)
1. Argument thoughtfulness score: A composite measure (0–6 scale) reflecting logical structure, clarity, and depth of each participant’s main comment.

2. Position shifting or compromise: The extent to which participants move from an extreme stance toward a middle-ground position on the policy.
Primary Outcomes (explanation)
1. The Argument Thoughtfulness Score reflects three dimensions (clarity, coherence, and depth), each rated by independent coders (human, LLM, and an index of both).

2. Position shifting is computed as the absolute change in stance on a -5 to +5 slider (e.g., from strongly oppose to moderately oppose). Additionally, we consider endorsement of seeded comments and a policy proposal, as well as willingness to donate to a non-profit organization.
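
To make the outcome definitions concrete, here is a minimal sketch of both primary outcomes, assuming each thoughtfulness dimension is rated 0–2 so the composite spans the stated 0–6 scale; the per-dimension scale and function names are illustrative assumptions, not taken from the pre-analysis plan.

```python
# Illustrative computation of the two primary outcomes. The 0-2 per-dimension
# rating is an assumption chosen so the composite spans the stated 0-6 scale.

def thoughtfulness_score(clarity: int, coherence: int, depth: int) -> int:
    """Composite 0-6 score from three coder-rated dimensions (assumed 0-2 each)."""
    for rating in (clarity, coherence, depth):
        if not 0 <= rating <= 2:
            raise ValueError("each dimension must be rated on a 0-2 scale")
    return clarity + coherence + depth

def position_shift(stance_pre: float, stance_post: float) -> float:
    """Absolute change in stance on the -5..+5 slider."""
    return abs(stance_post - stance_pre)

# Example: moving from "strongly oppose" (-5) to "moderately oppose" (-2)
# yields a shift of 3 points.
assert position_shift(-5, -2) == 3
```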

Secondary Outcomes

Secondary Outcomes (end points)
1. Cognitive Reflection: Changes in CRT scores (pre-post).
2. Emotional State: Pre-post changes on a 2-dimension affect scale.
3. Perspective Diversity: Type-token ratio, semantic similarity variance, voting weighted average.
4. Engagement Indices: Number of comment revisions, comment length, voting behavior, willingness to donate, and time spent.
Secondary Outcomes (explanation)
1. Cognitive Reflection is measured via two standard CRT questions, taking the difference between baseline and endline.
2. Emotional State is captured using self-reported activation and pleasure scales (0–10).
3. Perspective Diversity in textual output is assessed via NLP metrics such as TTR (type-token ratio) and semantic embedding variance (a sketch of these text metrics follows this list). We also estimate a voting-weighted average that combines the participant's self-reported position with their agreement on seeded comments.
4. Engagement includes votes on others’ comments, revisions made, willingness to donate (binary), and time spent.
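
As a sketch of the text-based diversity metrics, assuming naive whitespace tokenization and externally computed sentence embeddings (the registration does not specify a tokenizer or embedding model):

```python
# Illustrative perspective-diversity metrics; the tokenizer and embedding
# source are assumptions, since the registration does not specify them.
import numpy as np

def type_token_ratio(text: str) -> float:
    """TTR: unique tokens / total tokens, with naive whitespace tokenization."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def semantic_similarity_variance(embeddings: np.ndarray) -> float:
    """Variance of pairwise cosine similarities among comment embeddings.

    `embeddings` is an (n_comments, dim) array from any sentence-embedding
    model; higher variance suggests a more heterogeneous set of perspectives.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep each unordered pair once (strict upper triangle).
    pairwise = sims[np.triu_indices(len(sims), k=1)]
    return float(np.var(pairwise))
```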

Experimental Design

Experimental Design
We conduct an online experiment with participants recruited to debate whether people with mental illnesses should be allowed to purchase firearms. We randomize at the individual level along two dimensions:
(i) reasoning interventions (Socratic dialogue LLM vs. Reflective paragraph vs. Emotional-regulation LLM vs. Grammar-correction LLM vs. waiting control) and
(ii) monetary incentives (direct bonus vs. lottery vs. none).

The primary objective is to determine which intervention(s) most effectively increase argument thoughtfulness and willingness to compromise. We measure outcomes post-treatment via rubric-based coding of each participant’s main argument, changes in stance, and secondary indicators like voting and donation behavior.

Final analyses will compare each treatment arm with pooled or disaggregated controls, depending on tests of equivalence among control conditions.
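
A minimal sketch of that pooling logic, assuming hypothetical arm labels and a simple OLS specification with robust standard errors; the registered estimator, covariates, and test thresholds are in the attached plan.

```python
# Illustrative analysis sketch, not the registered specification: pool the
# control arms only if an F-test cannot distinguish their mean outcomes.
import pandas as pd
import statsmodels.formula.api as smf

def compare_arms(df: pd.DataFrame) -> pd.Series:
    """`df` has columns `score` (0-6 thoughtfulness) and `arm` (hypothetical
    labels: 'socratic', 'reflective', 'emotion', 'grammar', 'waiting')."""
    controls = ["grammar", "waiting"]
    eq = smf.ols("score ~ C(arm)", data=df[df["arm"].isin(controls)]).fit()
    if eq.f_pvalue > 0.05:  # controls look equivalent: pool them
        df = df.assign(arm=df["arm"].replace({c: "control" for c in controls}))
        ref = "control"
    else:                   # keep controls disaggregated; benchmark on waiting
        ref = "waiting"
    model = smf.ols(f"score ~ C(arm, Treatment('{ref}'))", data=df).fit(
        cov_type="HC2"  # heteroskedasticity-robust standard errors
    )
    return model.params  # arm contrasts against the control benchmark
```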

Details in Pre-Analysis Plan attached.
Experimental Design Details
Details in Pre-Analysis Plan attached.
Randomization Method
Pseudo-random generator within the survey/intervention platform (deliberation.io). Each participant has equal probability of assignment to each cell in the 5×3 factorial design.
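
A minimal sketch of such an assignment scheme, with illustrative cell labels (the platform's actual generator is internal to deliberation.io and may differ):

```python
# Equal-probability assignment to the 15 cells of the 5 x 3 factorial design.
import itertools
import random

REASONING = ["socratic", "reflective", "emotion", "grammar", "waiting"]
INCENTIVE = ["direct", "lottery", "none"]
CELLS = list(itertools.product(REASONING, INCENTIVE))  # 5 x 3 = 15 cells

def assign(participant_id: str, seed: str = "2025") -> tuple[str, str]:
    """Deterministic per-participant draw, so assignment is reproducible."""
    rng = random.Random(f"{seed}:{participant_id}")
    return rng.choice(CELLS)

print(assign("p_001"))  # e.g. ('emotion', 'lottery')
```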
Randomization Unit
Individual.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
N/A.
Sample size: planned number of observations
Initial intervention: 1,500 when analyzing individual-level outcomes; up to 18,000 (1,500 × 12) when analyzing individual vote-level outcomes. Medium-term effects and cross-topic intervention: 4,000 when analyzing individual-level outcomes; up to 36,000 (4,000 × 9) when analyzing individual vote-level outcomes.
Sample size (or number of clusters) by treatment arms
Initial intervention (n = 1,500): 300 for each reasoning condition; 500 for each monetary incentive condition.
Medium-term and cross-topic intervention (n = 4,000): 2,000 for each reasoning condition.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
1. Argument Thoughtfulness score: Pilot data (non-registered pilot, September 2024) suggest an SD ≈ 2.28 on the 0–6 scale. With n = 1,500, we can detect a mean difference of roughly 0.3–0.55 points (on the 6-point scale) with 80% power (see the sketch below).
2. Position Shift (Compromise): Data from the same pilot suggest an SD ≈ 0.458 on a binary measure. With n = 1,500, we can detect a difference of roughly 6–10 percentage points with 80% power.
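
These figures can be reproduced with the standard two-sample normal approximation (two-sided alpha = 0.05, 80% power); the arm sizes below follow the registered allocation, while the pooled-control comparison is an assumption about how arms might be contrasted.

```python
# Back-of-envelope MDE check under the usual two-sample approximation.
from scipy.stats import norm

def mde(sd: float, n1: int, n2: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable difference in means between two arms."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~2.80 at the defaults
    return z * sd * (1 / n1 + 1 / n2) ** 0.5

# Thoughtfulness (SD ~ 2.28, 0-6 scale): one 300-person reasoning arm against
# another arm, or against pooled controls, spans the registered 0.3-0.55 range.
print(round(mde(2.28, 300, 300), 2))    # ~0.52
print(round(mde(2.28, 300, 1200), 2))   # ~0.41

# Compromise (binary, SD ~ 0.458): roughly the registered 6-10 pp range.
print(round(mde(0.458, 500, 500), 3))   # ~0.081 (incentive arms)
print(round(mde(0.458, 300, 300), 3))   # ~0.105 (reasoning arms)
```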
IRB

Institutional Review Boards (IRBs)

IRB Name
Stanford University
IRB Approval Date
2024-05-31
IRB Approval Number
75616
IRB Name
Massachusetts Institute of Technology
IRB Approval Date
2024-04-08
IRB Approval Number
5798
Analysis Plan

There is information in this trial unavailable to the public.

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials