Trust or Verify? Automation Bias in Physician-LLM Diagnostic Reasoning

Last registered on April 30, 2025

Pre-Trial

Trial Information

General Information

Title
Trust or Verify? Automation Bias in Physician-LLM Diagnostic Reasoning
RCT ID
AEARCTR-0015864
Initial registration date
April 23, 2025

First published
April 30, 2025, 8:42 AM EDT

Locations

Region

Primary Investigator

Affiliation
Lahore University of Management Sciences (LUMS)

Other Primary Investigator(s)

PI Affiliation
Lahore University of Management Sciences
PI Affiliation
King Edward Medical University
PI Affiliation
Lahore General Hospital
PI Affiliation
Children's Hospital, Lahore

Additional Trial Information

Status
In development
Start date
2025-05-20
End date
2025-07-15
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
Diagnostic errors represent a significant cause of preventable patient harm in healthcare systems worldwide. Recent advances in Large Language Models (LLMs) have shown promise in enhancing medical decision-making processes. However, there remains a critical gap in our understanding of how automation bias, the tendency to over-rely on technological suggestions, influences medical doctors' diagnostic reasoning when incorporating these AI tools into clinical practice.

This randomized controlled trial (RCT) aims to systematically measure the extent and patterns of automation bias among medical doctors when utilizing ChatGPT-4o in clinical decision-making. We will assess how access to LLM-generated information influences diagnostic reasoning through a novel methodology that precisely quantifies automation bias. In our study design, we will randomly assign participants to one of two groups. The treatment group will receive LLM-generated recommendations containing deliberately introduced errors in a subset of cases, while the control group will receive LLM-generated recommendations without such deliberately introduced errors.

Prior to participation, all medical doctors will complete a comprehensive training program covering LLM capabilities, prompt engineering techniques, and output evaluation strategies. Responses will be evaluated by blinded reviewers using a validated assessment rubric specifically designed to detect uncritical acceptance of erroneous information, with greater score disparities indicating stronger automation bias.
External Link(s)

Registration Citation

Citation
Akhtar, Muhammad Junaid et al. 2025. "Trust or Verify? Automation Bias in Physician-LLM Diagnostic Reasoning." AEA RCT Registry. April 30. https://doi.org/10.1257/rct.15864-1.0
Sponsors & Partners

There is information in this trial unavailable to the public.
Experimental Details

Interventions

Intervention(s)
The treatment group will be given access to LLM-generated recommendations, some of which will contain deliberately flawed diagnostic information.
Intervention (Hidden)
Participants will have access to a specific, commercially available LLM (ChatGPT-4o) in addition to conventional diagnostic resources. They will evaluate six clinical vignettes (three with deliberately flawed diagnostic information generated by the LLM and three with correct information), presented in random order.
Intervention Start Date
2025-05-20
Intervention End Date
2025-06-25

Primary Outcomes

Primary Outcomes (end points)
The primary outcome will be the diagnostic reasoning score, calculated as a percentage score (0-100%) for each case.
Primary Outcomes (explanation)
For each case, participants will be asked to list their three top diagnoses and, for each diagnosis, the findings from the case that support it and the findings that oppose it. Each plausible diagnosis earns 1 point. Supporting and opposing findings are graded for correctness, with 1 point for a partially correct response and 2 points for a completely correct response. Participants will then name their single top diagnosis, earning 1 point for a reasonable response and 2 points for the most correct response. Finally, participants will name up to three next steps to further evaluate the patient, with 1 point awarded for a partially correct response and 2 points for a completely correct response. The primary outcome will be compared at the case level across the randomized groups.
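For concreteness, the sketch below shows one way the per-case diagnostic reasoning score could be converted to a percentage under the rubric above. The data-structure names and the 23-point maximum (three diagnoses worth up to 5 points each, a top diagnosis worth 2 points, and three next steps worth 2 points each) are illustrative assumptions, not part of the registered scoring key.

```python
from dataclasses import dataclass
from typing import List

# Grading scale used throughout: 0 = incorrect, 1 = partially correct,
# 2 = completely correct.

@dataclass
class DifferentialEntry:
    plausible: bool           # 1 point if the proposed diagnosis is plausible
    supporting_findings: int  # 0-2 points for findings supporting the diagnosis
    opposing_findings: int    # 0-2 points for findings opposing the diagnosis

@dataclass
class CaseResponse:
    differential: List[DifferentialEntry]  # up to three diagnoses
    top_diagnosis_points: int              # 0-2 points for the single top diagnosis
    next_step_points: List[int]            # up to three next steps, 0-2 points each

def diagnostic_reasoning_score(resp: CaseResponse) -> float:
    """Return the per-case diagnostic reasoning score as a percentage (0-100)."""
    earned = sum(1 for d in resp.differential if d.plausible)
    earned += sum(d.supporting_findings + d.opposing_findings for d in resp.differential)
    earned += resp.top_diagnosis_points
    earned += sum(resp.next_step_points)

    # Assumed maximum: 3 diagnoses x (1 + 2 + 2) points, plus 2 points for the
    # top diagnosis, plus 3 next steps x 2 points = 23 points in total.
    max_points = 3 * (1 + 2 + 2) + 2 + 3 * 2
    return 100.0 * earned / max_points
```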

Secondary Outcomes

Secondary Outcomes (end points)
Final diagnosis score, calculated as a percentage score (0-100%) for each case.
Secondary Outcomes (explanation)
Final diagnosis score, calculated as a percentage score (0-100%) for each case.

Experimental Design

Experimental Design
The trial is designed as a randomized, two-arm, single-blind, parallel-group study.
Experimental Design Details
This study will be a single-blind, randomized controlled trial with two arms designed to measure automation bias:

- Intervention Arm: Participants will have access to a specific, commercially available LLM (ChatGPT-4o) in addition to conventional diagnostic resources. They will evaluate six clinical vignettes (three with deliberately flawed diagnostic information generated by the LLM and three with correct information), presented in random order.

- Control Arm: Participants will have access to a specific, commercially available LLM (ChatGPT-4o) in addition to conventional diagnostic resources. They will evaluate the same six clinical vignettes, but with LLM-generated recommendations that do not contain any deliberately introduced errors. Vignettes will be presented in random order.
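As an illustration of the within-participant ordering described above, the sketch below shuffles the six vignettes into a reproducible random presentation order for each participant. The vignette labels, participant identifiers, and seeding scheme are illustrative assumptions, not part of the registered protocol.

```python
import random

# Six vignettes per participant; in the intervention arm, three carry
# deliberately flawed LLM recommendations and three carry correct ones.
# The labels below are illustrative placeholders, not the actual case IDs.
VIGNETTES = ["V1", "V2", "V3", "V4", "V5", "V6"]

def vignette_order(participant_id: str, seed: str = "automation-bias-2025") -> list[str]:
    """Return a reproducible random presentation order for one participant."""
    rng = random.Random(f"{seed}:{participant_id}")  # per-participant random stream
    order = VIGNETTES.copy()
    rng.shuffle(order)
    return order

# Example: presentation order for one (hypothetical) participant.
print(vignette_order("doctor_07"))
```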
Randomization Method
Randomization done in office by a computer
Randomization Unit
Individual medical doctor.
Was the treatment clustered?
No
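Consistent with the computer-based, individual-level randomization described above, the following is a minimal sketch of a 1:1 allocation of doctors to the two arms. The function name, seed value, and use of a simple shuffle are illustrative assumptions.

```python
import random

def assign_arms(doctor_ids: list[str], seed: int = 20250520) -> dict[str, str]:
    """Randomly assign individual doctors 1:1 to the treatment or control arm."""
    rng = random.Random(seed)  # fixed seed keeps the allocation reproducible and auditable
    shuffled = doctor_ids.copy()
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {doc: ("treatment" if i < half else "control")
            for i, doc in enumerate(shuffled)}

# Example: 50 participating doctors -> 25 per arm.
allocation = assign_arms([f"doctor_{i:02d}" for i in range(1, 51)])
```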

Experiment Characteristics

Sample size: planned number of clusters
No clustering. Individual medical doctors will be randomized.
Sample size: planned number of observations
50 individual medical doctors (25 per arm), each evaluating six clinical vignettes, for a total of 300 case-level observations.
Sample size (or number of clusters) by treatment arms
25 medical doctors in each of the treatment and control arms (50 in total).
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
The minimum detectable effect size for the primary outcome (i.e., the difference in mean diagnostic reasoning scores) is 8 percentage points between groups.
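For reference, the sketch below applies the standard two-sample minimum-detectable-effect formula to 25 doctors per arm. The 5% two-sided significance level, 80% power, and the assumed standard deviation of roughly 10 percentage points are illustrative inputs chosen to be consistent with the stated 8-percentage-point MDE; they are not values taken from the registration.

```python
from statistics import NormalDist

def mde_two_sample(sd: float, n_per_arm: int,
                   alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable difference in means for an equal-allocation two-arm trial."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile delivering the target power
    return (z_alpha + z_beta) * sd * (2 / n_per_arm) ** 0.5

# An assumed SD of ~10.1 percentage points implies an MDE of ~8 percentage
# points with 25 doctors per arm.
print(round(mde_two_sample(sd=10.1, n_per_arm=25), 1))  # -> 8.0
```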
IRB

Institutional Review Boards (IRBs)

IRB Name
Institutional Review Board, Lahore University of Management Sciences
IRB Approval Date
2025-04-21
IRB Approval Number
IRB-0374
Analysis Plan

There is information in this trial unavailable to the public.

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials