Synthetic Experiments for Decision-Making Under Uncertainty: Validating AI Agent-Based Research Methods Against Human Experimental Data

Last registered on December 27, 2025

Pre-Trial

Trial Information

General Information

Title
Synthetic Experiments for Decision-Making Under Uncertainty: Validating AI Agent-Based Research Methods Against Human Experimental Data
RCT ID
AEARCTR-0017171
Initial registration date
November 03, 2025

First published
November 10, 2025, 9:09 AM EST

Last updated
December 27, 2025, 10:08 AM EST

Locations

Not available to the public; access may be requested through the Registry.

Primary Investigator

Affiliation
Bocconi University

Other Primary Investigator(s)

PI Affiliation
Bocconi University
PI Affiliation
Bocconi University
PI Affiliation
Bocconi University
PI Affiliation
Bocconi University

Additional Trial Information

Status
In development
Start date
2025-11-04
End date
2026-03-31
Secondary IDs
Prior work
This trial is based on or builds upon one or more prior RCTs.
Abstract
Can synthetic experiments with AI agents serve as rigorous research tools for generating insights, testing hypotheses, and informing managerial decision-making under uncertainty? This pre-analysis plan specifies a comprehensive validation study comparing 16 persona generation methods across a 2×2×4 factorial design: two base sampling algorithms, crossed with the presence or absence of an archetype clustering step and with four constraint configurations. Each of the 16 persona generation methods generates a sample of 800 synthetic college undergraduates who serve as experimental subjects. Each of the 16 samples undergoes the identical 2×2 factorial experiment (Causal Reasoning Training × ChatGPT Access) previously conducted with real human participants, with participants randomized across the four experimental conditions (200 per cell). We ask three research questions: (1) Which persona generation methods better approximate humans as experimental subjects? (2) How do the three design dimensions (base algorithm, archetypes, constraints) affect approximation quality? (3) Can process differences in manipulation checks and compliance explain divergent outcomes? By benchmarking against gold-standard human experimental data, we develop methods for building high-fidelity synthetic agents and define a scalable protocol for conducting synthetic experiments in management research. This registration ensures transparency in methodology specification and analytical decisions prior to synthetic data generation.
External Link(s)

Registration Citation

Citation
Camuffo, Arnaldo et al. 2025. "Synthetic Experiments for Decision-Making Under Uncertainty: Validating AI Agent-Based Research Methods Against Human Experimental Data." AEA RCT Registry. December 27. https://doi.org/10.1257/rct.17171-2.0
Experimental Details

Interventions

Intervention(s)
The synthetic experiments replicate the experimental design of a randomized controlled trial previously conducted with human subjects (pre-registered separately). The experiment with human subjects tested whether causal reasoning training (Intervention 1, delivered via a causal reasoning game) and ChatGPT access (Intervention 2) produce complementary effects on strategic decision-making performance.
Intervention Start Date
2025-11-04
Intervention End Date
2026-03-31

Primary Outcomes

Primary Outcomes (end points)
Performance score = (Awareness + Usage) / 2
Primary Outcomes (explanation)
Performance within the experiment is measured using a two-attribute rubric. Awareness captures the quality of strategies ensuring target audiences notice and recall merchandise. Usage captures strategies driving actual purchasing behavior. Each attribute is scored on a 1-5 scale, and the overall performance score averages the two attributes. The same rubric is applied to both human and synthetic responses under blind scoring protocols. Distance to Expert Solution provides a rubric-independent performance measure: we calculate the semantic similarity between each response and a gold-standard expert solution using embedding-based cosine distance.
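As a minimal sketch of the two performance measures (illustrative only, not the registered scoring pipeline; the embedding vectors are assumed to be precomputed with an unspecified model):

import numpy as np

def performance_score(awareness: float, usage: float) -> float:
    # Average of the two 1-5 rubric attributes
    return (awareness + usage) / 2

def distance_to_expert(response_vec: np.ndarray, expert_vec: np.ndarray) -> float:
    # Embedding-based cosine distance to the gold-standard expert solution
    cos_sim = response_vec @ expert_vec / (
        np.linalg.norm(response_vec) * np.linalg.norm(expert_vec))
    return 1.0 - cos_sim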

For each synthetic experiment $j \in \{1, \ldots, 16\}$, we estimate the same regression model as the human benchmark:
\begin{equation}
Y_i^{(j)} = \beta_0^{(j)} + \beta_1^{(j)} \text{Causal}_i + \beta_2^{(j)} \text{ChatGPT}_i + \beta_3^{(j)} (\text{Causal}_i \times \text{ChatGPT}_i) + \epsilon_i^{(j)}
\end{equation}
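A minimal estimation sketch for one synthetic sample, assuming a pandas DataFrame with 0/1 treatment indicators and an outcome column (the column names and toy data below are illustrative, not the registered variables):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data for illustration only; the real analyses use the generated samples.
rng = np.random.default_rng(0)
df_synthetic = pd.DataFrame({
    "causal": rng.integers(0, 2, 800),
    "chatgpt": rng.integers(0, 2, 800),
})
df_synthetic["performance"] = (3.0 + 0.3 * df_synthetic["causal"]
                               + rng.normal(0, 1, 800))

# "causal * chatgpt" expands to both main effects plus their interaction,
# recovering beta_0 through beta_3 of the registered specification.
model_j = smf.ols("performance ~ causal * chatgpt", data=df_synthetic).fit()
print(model_j.params)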
We will use the Chow test to assess structural equality and parameter stability between each synthetic experiment and the human benchmark. The Chow test evaluates whether the coefficients from the synthetic experiments are statistically different from those in the human experiment, testing the null hypothesis:
\begin{equation}
H_0: \beta_k^{(j)} = \beta_k^{\text{human}} \text{ for all } k \in \{0, 1, 2, 3\}
\end{equation}
Methods for which the test fails to reject the null hypothesis at conventional significance levels will be considered high-fidelity approximations of human experimental behavior.
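A hedged implementation sketch of the SSR-based Chow F-test, assuming homoskedastic errors across the two samples (formula and variable names are illustrative):

import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def chow_test(df_human, df_synth, formula="performance ~ causal * chatgpt"):
    k = 4  # coefficients under test: intercept, two main effects, interaction
    pooled = smf.ols(formula, data=pd.concat([df_human, df_synth])).fit()
    ssr_u = (smf.ols(formula, data=df_human).fit().ssr
             + smf.ols(formula, data=df_synth).fit().ssr)
    df_denom = len(df_human) + len(df_synth) - 2 * k
    f_stat = ((pooled.ssr - ssr_u) / k) / (ssr_u / df_denom)
    return f_stat, stats.f.sf(f_stat, k, df_denom)  # F statistic and p-value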
To assess whether synthetic responses span similar ranges and distributions as human responses, we will calculate the Jensen-Shannon divergence (JSD) between synthetic and human outcome distributions:
\begin{equation}
\text{JSD}^{(j)} = \text{JSD}\left(P_{\text{synthetic}}^{(j)} \| P_{\text{human}}\right)
\end{equation}
Lower JSD values indicate better distributional alignment. This metric captures whether synthetic agents produce the full spectrum of human-like responses rather than converging to modal patterns.
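A minimal computation sketch, assuming outcomes are discretized on a shared grid before the comparison (the bin count is an illustrative choice):

import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_to_human(y_synth, y_human, bins=20):
    edges = np.histogram_bin_edges(np.concatenate([y_synth, y_human]), bins=bins)
    p, _ = np.histogram(y_synth, bins=edges)
    q, _ = np.histogram(y_human, bins=edges)
    # scipy returns the Jensen-Shannon *distance* (the square root of the
    # divergence), so squaring recovers the JSD itself
    return jensenshannon(p, q, base=2) ** 2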

Secondary Outcomes

Secondary Outcomes (end points)
Causal reasoning score
Secondary Outcomes (explanation)
Causal Reasoning Score assesses three attributes on 1-5 scales; the overall score averages the three attributes. This measure tests whether synthetic agents exhibit realistic causal reasoning patterns that match human responses. A secondary measure is the self-reported causal reasoning score collected from both human and synthetic participants. Process Measures include response length and linguistic complexity, ChatGPT usage intensity (prompt count and conversation depth for ChatGPT-access conditions), and solution originality (deviation from modal responses).

Experimental Design

Experimental Design
The synthetic experiments replicate the experimental design of a randomized controlled trial previously conducted with human subjects (pre-registered separately). The experiment with human subjects tested whether causal reasoning training and ChatGPT access produce complementary effects on strategic decision-making performance, using a 2×2 factorial design with 800 participants. Using the results from the experiment with human subjects as ground truth, these 16 synthetic experiments test the same hypotheses and attempt to replicate the same results. The overarching goal is to test the extent to which synthetic agents approximate the responses of human subjects and therefore produce similar experimental outcomes.

Key design principle: the experimental design and execution are identical across all 17 experiments (16 synthetic + 1 human). The only difference is the participant pool: human subjects versus synthetic personas generated through different methods. The sketch below enumerates the 16 persona-generation configurations.
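For concreteness, a minimal enumeration of the 2×2×4 design as the Cartesian product of its three dimensions (labels are placeholders; the registered method names are not public):

from itertools import product

base_algorithms = ["base_A", "base_B"]   # two base sampling algorithms
archetypes = [False, True]               # without / with the archetype clustering step
constraints = ["C1", "C2", "C3", "C4"]   # four constraint configurations

methods = list(product(base_algorithms, archetypes, constraints))
assert len(methods) == 16  # one configuration per synthetic experiment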
Experimental Design Details
Not available
Randomization Method
Personas are randomly assigned to the four experimental conditions (n = 200 per cell), with stratification on key demographic variables to ensure balance. Randomization seeds are logged for reproducibility; a minimal assignment sketch follows below.
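A minimal sketch of seeded, stratified assignment, assuming a persona table with a demographic stratum column (the seed value and column names are placeholders):

import numpy as np
import pandas as pd

SEED = 12345  # placeholder; the actual seeds are logged at run time
CONDITIONS = ["NonCausal_NoGPT", "NonCausal_GPT", "Causal_NoGPT", "Causal_GPT"]

def assign_conditions(personas: pd.DataFrame, stratum_col: str = "stratum") -> pd.Series:
    rng = np.random.default_rng(SEED)
    assignment = pd.Series(index=personas.index, dtype=object)
    for _, idx in personas.groupby(stratum_col).groups.items():
        order = list(idx)
        rng.shuffle(order)
        # Cycle through the conditions within each stratum to balance cell sizes
        assignment.loc[order] = [CONDITIONS[i % 4] for i in range(len(order))]
    return assignment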
Randomization Unit
Synthetic persona
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
0 clusters
Sample size: planned number of observations
800 synthetic personas per synthetic experiment (16 experiments in total)
Sample size (or number of clusters) by treatment arms
200 synthetic personas for treatment arm Non-Causal, No ChatGPT.
200 synthetic personas for treatment arm Non-Causal, ChatGPT.
200 synthetic personas for treatment arm Causal, No ChatGPT.
200 synthetic personas for treatment arm Causal, ChatGPT.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Each synthetic experiment uses N = 800 (200 per cell), matching the human sample, so the power calculations from the human RCT apply. Verbatim from the human RCT: "With N = 800 (200 participants per cell), we have more than adequate power for all planned analyses. For the main effects assuming Cohen’s d = 0.3 (typical for educational interventions), the power exceeds 97%, using two-way ANOVA with α = 0.05. Pairwise comparisons between conditions maintain over 90% power for detecting small-to-medium effects (Cohen’s d = 0.3)." A minimal re-check of the main-effect figure appears below.
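As an illustrative sanity check (not the original power analysis), the quoted main-effect power can be reproduced under the assumption that a main effect of Cohen's d = 0.3 corresponds to a two-group contrast with Cohen's f = d/2:

from statsmodels.stats.power import FTestAnovaPower

d = 0.3              # assumed standardized effect size (Cohen's d)
f = d / 2            # Cohen's f for a two-group comparison
power = FTestAnovaPower().power(effect_size=f, nobs=800, alpha=0.05, k_groups=2)
print(round(power, 3))  # ~0.99, consistent with the quoted "exceeds 97%"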
IRB

Institutional Review Boards (IRBs)

IRB Name
Bocconi University Ethics Committee
IRB Approval Date
2025-10-16
IRB Approval Number
EA001075
Analysis Plan

Not available to the public; access may be requested through the Registry.