
Do AI-Generated Assessments Affect Social Impact Scoring by Human Evaluators?

Last registered on December 26, 2025

Pre-Trial

Trial Information

General Information

Title
Do AI-Generated Assessments Affect Social Impact Scoring by Human Evaluators?
RCT ID
AEARCTR-0017476
Initial registration date
December 16, 2025

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
December 26, 2025, 2:34 AM EST

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
Humboldt-Universität zu Berlin

Other Primary Investigator(s)

Additional Trial Information

Status
In development
Start date
2025-12-01
End date
2026-02-28
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This study examines whether exposure to AI-generated evaluative text influences how human assessors score the social impact of project proposals. Assessors are drawn from three groups with varying levels of expertise. In an online evaluation setting, assessors first review a proposal and provide an initial score focusing exclusively on its expected social impact. They are then shown AI-generated text assessing the social impact of the same proposal and are given the opportunity to revise their initial score and report their confidence in their decision. The study evaluates whether assessors update their social impact scores after seeing AI input and how confident they are in such updates.
External Link(s)

Registration Citation

Citation
Firpo, Teo. 2025. "Do AI-Generated Assessments Affect Social Impact Scoring by Human Evaluators?." AEA RCT Registry. December 26. https://doi.org/10.1257/rct.17476-1.0
Experimental Details

Interventions

Intervention(s)
AI evaluations are randomized into three treatments:
1. Evidence-only
2. Argument-focused
3. Pure control
Intervention Start Date
2025-12-01
Intervention End Date
2026-01-31

Primary Outcomes

Primary Outcomes (end points)
1. Binary update decision: whether the assessor updates their initial social impact score after exposure to AI-generated text (yes/no).
2. Confidence in update: self-reported confidence associated with the update decision.
3. Direction and magnitude of changes in social impact scores following AI exposure.
4. Differences in updating behavior and confidence across assessor expertise groups.
5. Predictive accuracy of scores (human and AI-generated) against real-world outcomes.
Primary Outcomes (explanation)
Scores are constructed on a scale from 1 to 10 (10 being highest), following the agency's actual scoring approach. Confidence is measured on a 5-point scale, with 5 being the highest confidence.
For a portion of the proposals used in the study, data were collected from surveys on the real-world outcomes of the funded applications. Using several indices created from these outcomes, we will test the accuracy of the scores (both human and AI-generated) in predicting those outcomes.
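As an illustration only, one simple way to carry out this predictive-accuracy test would be a regression of each outcome index on the score, with standard errors clustered at the proposal level; the variable names below are hypothetical and the registration does not commit to this particular specification.

* Hypothetical sketch in Stata: does the (human or AI) social impact score predict the outcome index?
regress outcome_index impact_score, vce(cluster proposal_id)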

Secondary Outcomes

Secondary Outcomes (end points)
Assessors' self-assessment of the accuracy of their own scoring with respect to real-world outcomes, and their assessment of the accuracy of the AI's scoring with respect to real-world outcomes.
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
Assessors evaluate multiple project proposals in an online survey environment. For each proposal, assessors first provide an initial score assessing only the proposal’s expected social impact. Assessors are then shown AI-generated text evaluating the social impact of the same proposal and given the opportunity to revise their score and indicate their confidence. Assignment of AI-generated text is randomized at the proposal level. The experiment includes three groups of assessors with varying levels of expertise. Analysis compares updating behavior and confidence across assessors, proposals, and expertise groups.
Experimental Design Details
Not available
Randomization Method
Randomization of proposals, proposal order, and treatments at the proposal-assessor level, conducted in Stata using the randtreat command with the date of randomization as the seed.
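As a minimal sketch of how such an assignment could be implemented, the Stata lines below assume a dataset with one row per proposal-assessor pair; the file name, variable names, seed format, and randtreat options are illustrative assumptions, not the actual randomization code.

* Hypothetical illustration: one row per proposal-assessor pair
use "proposal_assessor_pairs.dta", clear
* Randomize the order in which each assessor sees proposals
set seed 20251201 // randomization date written as an eight-digit integer (assumed format)
generate double u = runiform()
bysort assessor_id (u): generate proposal_order = _n
* Assign the three AI-text treatments (evidence-only, argument-focused, pure control)
randtreat, generate(ai_treatment) multiple(3) setseed(20251201) misfits(global)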
Randomization Unit
Proposal (the same proposals are randomized to different assessors in varying order and under different treatments). Each assessor sees proposals under different treatments; only a small subset of assessors sees the pure control treatment.
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
There are three groups of assessors:
1. External experts (EAs): the plan is to recruit 30 experts.
2. Internal assessors with prior expertise in evaluating proposals under the same program (technical assessors, or TAs): 42 assessors.
3. Volunteer internal assessors who may or may not have evaluated proposals in the past but were not recently involved (volunteer assessors, VAs): 77 assessors.
Sample size: planned number of observations
EAs: 7 proposals each, for a total of 210 reviews.
TAs: a different number per assessor, for a total of 292 reviews.
VAs: a different number per assessor, for an estimated total of 300 reviews.
All numbers are estimates, since participants agreed to dedicate a specific amount of time to the task without a guarantee of a specific number of reviews each.
Sample size (or number of clusters) by treatment arms
149 assessors across the three groups, each reviewing proposals under the different treatments (within-subject design).
An estimated 802 total reviews.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Assuming a mean social impact score of 6.22 and an SD of 1.68 (from the historical data), we would be powered to detect changes in scores of at least ±0.11 points.
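For reference only, a conventional detectable-difference calculation can be run with Stata's power command; the sketch below uses the registered mean and SD but a simple two-group comparison that ignores the clustering and within-subject correlation reflected in the figure above, so it will not reproduce ±0.11 exactly (alpha = 0.05 and power = 0.80 are assumed).

* Illustrative only: detectable difference for a simple two-group comparison of 802 reviews
power twomeans 6.22, sd(1.68) n(802) power(0.80)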
IRB

Institutional Review Boards (IRBs)

IRB Name
GfeW ethical review procedure
IRB Approval Date
2025-12-16
IRB Approval Number
9uLGrjsA