Evaluating the impact of AI-powered reasoning in deliberative processes

Last registered on August 01, 2025

Pre-Trial

Trial Information

General Information

Title
Evaluating the impact of AI-powered reasoning in deliberative processes
RCT ID
AEARCTR-0015461
Initial registration date
February 26, 2025

First published
February 28, 2025, 10:40 AM EST

Last updated
August 01, 2025, 4:41 PM EDT

Locations

Region

Primary Investigator

Affiliation
Stanford University

Other Primary Investigator(s)

PI Affiliation
University of California, Berkeley
PI Affiliation
Massachusetts Institute of Technology
PI Affiliation
Stanford University
PI Affiliation
Stanford University
PI Affiliation
ETH Zurich
PI Affiliation
Massachusetts Institute of Technology

Additional Trial Information

Status
In development
Start date
2025-02-01
End date
2025-12-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This pre-analysis plan describes an experimental study evaluating whether AI-powered reasoning interventions and monetary incentives can enhance argument quality in an online policy debate on mental illness and U.S. gun regulation. Participants will be randomly assigned to either a reasoning (treatment) condition or a control condition, with further cross-randomization of direct or lottery-based financial rewards for thoughtful contributions. The primary outcomes include measures of argument thoughtfulness and compromise. In addition, secondary analyses will examine potential mechanisms, such as cognitive reflection, emotional state, diversity of perspectives, and engagement. This study seeks to identify practical, scalable methods for improving deliberative discourse in digital contexts.
External Link(s)

Registration Citation

Citation
Braley, Alia et al. 2025. "Evaluating the impact of AI-powered reasoning in deliberative processes." AEA RCT Registry. August 01. https://doi.org/10.1257/rct.15461-2.0
Experimental Details

Interventions

Intervention(s)
We test two main interventions for improving argument quality in an online policy debate on mental illness and firearm regulation in the United States:

1) AI-powered Reasoning: Participants are randomly assigned to one of five reasoning conditions: a conversation with a large language model (LLM) designed to foster more reflective, logically coherent arguments (Socratic Dialogue); solo guided reflection (Reflective Paragraph); LLM-powered emotional regulation (Emotion-Regulation LLM); LLM-powered grammar correction (Grammar-Correction LLM); or a waiting control.

2) Monetary Incentives: Participants are cross-randomized to receive a direct monetary bonus (for top-quartile arguments), a lottery-based bonus (for top-quartile arguments), or no incentive.
Intervention (Hidden)
Reasoning conditions:
1. Socratic Dialogue (LLM-facilitated): Participants engage in a conversation with an LLM that uses the Socratic method. The LLM primarily asks open-ended, reflective questions about the participant’s argument or stance, prompting them to think critically and examine their reasoning. The LLM does not provide direct answers or opinions on the topic but encourages deeper reflection through probing questions (an illustrative prompt sketch follows this list).

2. Reflective Paragraph (No LLM): In this condition, there is no AI assistance. Participants are instructed to write one paragraph of 4 to 5 sentences supporting their perspective. That is, the treatment encourages participants to engage in self-reflection without any external prompts. This serves as a baseline for the effect of self-guided reflection.

3. Emotion-Regulation (LLM-facilitated): Participants interact with an LLM that focuses on the participant’s emotional reactions to the topic. The LLM prompts the user to identify and reflect on their feelings about the issue (e.g., frustration, enthusiasm, anger) and guides them through basic emotion-regulation strategies (such as reappraising the situation or taking a neutral perspective). This condition tests whether acknowledging and managing emotions influences subsequent reasoning, and thus whether emotional regulation drives any effect of interacting with the LLM. The LLM’s responses center on the participant’s emotional state rather than the content of their argument.

4. Grammar-Correcting (LLM-facilitated): In this placebo condition, participants receive assistance from an LLM that strictly provides stylistic improvements to their writing. For example, the LLM might correct grammar, fix spelling, or suggest clearer phrasing, but it does not offer any feedback on the argument’s content, logic, or depth. This condition controls for the effects of interacting with an LLM and of improved language clarity, without the benefits of deeper reflection or content-oriented guidance.

5. Waiting Control: Participants in this condition do nothing related to the task for 2 minutes, approximately the average time needed to complete the Reflective Paragraph or Socratic Dialogue treatments; they are told the page is loading and asked to wait. This controls for the mere passage of time and any potential rest or “incubation” effect, without additional writing or feedback.
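
For concreteness, below is a hypothetical sketch (in Python, for consistency with the other examples in this registration) of a system prompt that could implement the Socratic Dialogue condition. The wording is illustrative only; the study’s actual prompts are in the attached pre-analysis plan and are not reproduced here.

```python
# Hypothetical system prompt for the Socratic Dialogue condition. This is an
# illustration of the behavior described above, not the study's actual prompt.
SOCRATIC_SYSTEM_PROMPT = """\
You are a Socratic facilitator. The user will share an argument about whether
people with mental illnesses should be allowed to purchase firearms.
- Ask one open-ended, reflective question at a time about their reasoning.
- Never state your own opinion or give a direct answer on the topic.
- Probe assumptions, evidence, and counterarguments so the user critically
  examines their own argument.
"""
```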


Monetary incentives:
1. Direct monetary incentive: Participants are given the opportunity to earn an additional $1 USD (i.e., a 40% increase in their guaranteed base payment) if their argument ranks within the top 25% in thoughtfulness as rated by the research team.

2. Lottery: Participants are given the opportunity to enter a lottery for a $10 USD bonus (i.e., a 400% increase in their guaranteed base payment). Participants whose argument ranks within the top 25% in thoughtfulness, as rated by the research team, are entered into the lottery, which has a 10% chance of winning (i.e., an expected value of $1 USD).

3. No monetary incentive: Participants receive no additional monetary incentives beyond the base payment. No further instructions are included.

Details in Pre-Analysis Plan attached.
Intervention Start Date
2025-02-06
Intervention End Date
2025-02-28

Primary Outcomes

Primary Outcomes (end points)
1. Argument thoughtfulness score: A composite measure (0–6 scale) reflecting logical structure, clarity, and depth of each participant’s main comment.

2. Position shifting or compromise: The extent to which participants move from an extreme stance toward a middle-ground position on the policy.
Primary Outcomes (explanation)
1. The Argument Thoughtfulness Score reflects three dimensions (clarity, coherence, and depth), each rated by independent coders (human, LLM, and an index of both).

2. Position shifting is computed as the absolute change in stance on a -5 to +5 slider (e.g., from strongly oppose to moderately oppose). Additionally, we consider endorsement of seeded comments and a policy proposal, as well as willingness to donate to a non-profit organization.
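
To make the outcome definitions concrete, here is a minimal sketch of both primary outcomes, assuming each thoughtfulness dimension is rated 0–2 so the composite spans the stated 0–6 scale; the per-dimension scale and function names are illustrative assumptions, not taken from the pre-analysis plan.

```python
# Illustrative computation of the two primary outcomes. The 0-2 per-dimension
# rating is an assumption chosen so the composite spans the stated 0-6 scale.

def thoughtfulness_score(clarity: int, coherence: int, depth: int) -> int:
    """Composite 0-6 score from three coder-rated dimensions (assumed 0-2 each)."""
    for rating in (clarity, coherence, depth):
        if not 0 <= rating <= 2:
            raise ValueError("each dimension must be rated on a 0-2 scale")
    return clarity + coherence + depth

def position_shift(stance_pre: float, stance_post: float) -> float:
    """Absolute change in stance on the -5..+5 slider."""
    return abs(stance_post - stance_pre)

# Example: moving from "strongly oppose" (-5) to "moderately oppose" (-2)
# yields a shift of 3 points.
assert position_shift(-5, -2) == 3
```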

Secondary Outcomes

Secondary Outcomes (end points)
1. Cognitive Reflection: Changes in CRT scores (pre-post).
2. Emotional State: Pre-post changes on a 2-dimension affect scale.
3. Perspective Diversity: Type-token ratio, semantic similarity variance, voting weighted average.
4. Engagement Indices: Number of comment revisions, comment length, voting behavior, willingness to donate, and time spent.
Secondary Outcomes (explanation)
1. Cognitive Reflection is measured via two standard CRT questions, taking the difference between baseline and endline.
2. Emotional State is captured using self-reported activation and pleasure scales (0–10).
3. Perspective Diversity in textual output is assessed via NLP metrics such as TTR (type-token ratio) and semantic embedding variance (a sketch of these text metrics follows this list). We also estimate a voting-weighted average that combines the participant's self-reported position with their agreement on seeded comments.
4. Engagement includes votes on others’ comments, revisions made, willingness to donate (binary), and time spent.
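
As a sketch of the text-based diversity metrics, assuming naive whitespace tokenization and externally computed sentence embeddings (the registration does not specify a tokenizer or embedding model):

```python
# Illustrative perspective-diversity metrics; the tokenizer and embedding
# source are assumptions, since the registration does not specify them.
import numpy as np

def type_token_ratio(text: str) -> float:
    """TTR: unique tokens / total tokens, with naive whitespace tokenization."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def semantic_similarity_variance(embeddings: np.ndarray) -> float:
    """Variance of pairwise cosine similarities among comment embeddings.

    `embeddings` is an (n_comments, dim) array from any sentence-embedding
    model; higher variance suggests a more heterogeneous set of perspectives.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep each unordered pair once (strict upper triangle).
    pairwise = sims[np.triu_indices(len(sims), k=1)]
    return float(np.var(pairwise))
```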

Experimental Design

Experimental Design
We conduct an online experiment with participants recruited to debate whether people with mental illnesses should be allowed to purchase firearms. We randomize at the individual level along two dimensions:
(i) reasoning interventions (Socratic dialogue LLM vs. Reflective paragraph vs. Emotional-regulation LLM vs. Grammar-correction LLM vs. waiting control) and
(ii) monetary incentives (direct bonus vs. lottery vs. none).

The primary objective is to determine which intervention(s) most effectively increase argument thoughtfulness and willingness to compromise. We measure outcomes post-treatment via rubric-based coding of each participant’s main argument, changes in stance, and secondary indicators like voting and donation behavior.

Final analyses will compare each treatment arm with pooled or disaggregated controls, depending on tests of equivalence among control conditions.
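
A minimal sketch of that pooling logic, assuming hypothetical arm labels and a simple OLS specification with robust standard errors; the registered estimator, covariates, and test thresholds are in the attached plan.

```python
# Illustrative analysis sketch, not the registered specification: pool the
# control arms only if an F-test cannot distinguish their mean outcomes.
import pandas as pd
import statsmodels.formula.api as smf

def compare_arms(df: pd.DataFrame) -> pd.Series:
    """`df` has columns `score` (0-6 thoughtfulness) and `arm` (hypothetical
    labels: 'socratic', 'reflective', 'emotion', 'grammar', 'waiting')."""
    controls = ["grammar", "waiting"]
    eq = smf.ols("score ~ C(arm)", data=df[df["arm"].isin(controls)]).fit()
    if eq.f_pvalue > 0.05:  # controls look equivalent: pool them
        df = df.assign(arm=df["arm"].replace({c: "control" for c in controls}))
        ref = "control"
    else:                   # keep controls disaggregated; benchmark on waiting
        ref = "waiting"
    model = smf.ols(f"score ~ C(arm, Treatment('{ref}'))", data=df).fit(
        cov_type="HC2"  # heteroskedasticity-robust standard errors
    )
    return model.params  # arm contrasts against the control benchmark
```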

Details in Pre-Analysis Plan attached.
Experimental Design Details
Details in Pre-Analysis Plan attached.
Randomization Method
Pseudo-random generator within the survey/intervention platform (deliberation.io). Each participant has equal probability of assignment to each cell in the 5×3 factorial design.
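
A minimal sketch of such an assignment scheme, with illustrative cell labels (the platform's actual generator is internal to deliberation.io and may differ):

```python
# Equal-probability assignment to the 15 cells of the 5 x 3 factorial design.
import itertools
import random

REASONING = ["socratic", "reflective", "emotion", "grammar", "waiting"]
INCENTIVE = ["direct", "lottery", "none"]
CELLS = list(itertools.product(REASONING, INCENTIVE))  # 5 x 3 = 15 cells

def assign(participant_id: str, seed: str = "2025") -> tuple[str, str]:
    """Deterministic per-participant draw, so assignment is reproducible."""
    rng = random.Random(f"{seed}:{participant_id}")
    return rng.choice(CELLS)

print(assign("p_001"))  # e.g. ('emotion', 'lottery')
```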
Randomization Unit
Individual.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
N/A.
Sample size: planned number of observations
Initial intervention: 1,500 when analyzing individual-level outcomes; up to 18,000 (1,500 × 12) when analyzing individual vote-level outcomes. Medium-term effects and cross-topic intervention: 4,000 when analyzing individual-level outcomes; up to 36,000 (4,000 × 9) when analyzing individual vote-level outcomes.
Sample size (or number of clusters) by treatment arms
Initial intervention (n = 1,500): 300 for each reasoning condition; 500 for each monetary incentive condition.
Medium-term and cross-topic intervention (n = 4,000): 2,000 for each reasoning condition.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
1. Argument Thoughtfulness score: Pilot data (non-registered pilot, September 2024) suggest an SD ≈ 2.28 on the 0–6 scale. With n = 1,500, we can detect a mean difference of roughly 0.3–0.55 points (on the 6-point scale) with 80% power (see the sketch below).
2. Position Shift (Compromise): Data from the same pilot suggest an SD ≈ 0.458 on a binary measure. With n = 1,500, we can detect a difference of roughly 6–10 percentage points with 80% power.
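
These figures can be reproduced with the standard two-sample normal approximation (two-sided alpha = 0.05, 80% power); the arm sizes below follow the registered allocation, while the pooled-control comparison is an assumption about how arms might be contrasted.

```python
# Back-of-envelope MDE check under the usual two-sample approximation.
from scipy.stats import norm

def mde(sd: float, n1: int, n2: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable difference in means between two arms."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~2.80 at the defaults
    return z * sd * (1 / n1 + 1 / n2) ** 0.5

# Thoughtfulness (SD ~ 2.28, 0-6 scale): one 300-person reasoning arm against
# another arm, or against pooled controls, spans the registered 0.3-0.55 range.
print(round(mde(2.28, 300, 300), 2))    # ~0.52
print(round(mde(2.28, 300, 1200), 2))   # ~0.41

# Compromise (binary, SD ~ 0.458): roughly the registered 6-10 pp range.
print(round(mde(0.458, 500, 500), 3))   # ~0.081 (incentive arms)
print(round(mde(0.458, 300, 300), 3))   # ~0.105 (reasoning arms)
```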
IRB

Institutional Review Boards (IRBs)

IRB Name
Stanford University
IRB Approval Date
2024-05-31
IRB Approval Number
75616
IRB Name
Massachusetts Institute of Technology
IRB Approval Date
2024-04-08
IRB Approval Number
5798
Analysis Plan

There is information in this trial unavailable to the public.

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials