Human vs. AI Leadership Coaching: A Randomized Controlled Trial of Coaching Modality Across Organizations

Last registered on June 23, 2026

Pre-Trial

Trial Information

General Information

Title
Human vs. AI Leadership Coaching: A Randomized Controlled Trial of Coaching Modality Across Organizations
RCT ID
AEARCTR-0018953
Initial registration date
June 21, 2026

First published
June 23, 2026, 8:36 AM EDT

Locations

This information is not available to the public.

Primary Investigator

Affiliation
UC Berkeley

Other Primary Investigator(s)

PI Affiliation
Berkeley Haas
PI Affiliation
Berkeley Haas

Additional Trial Information

Status
In development
Start date
2026-06-15
End date
2027-07-01
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This study examines whether AI-powered leadership coaching produces effects on interpersonal leadership competencies comparable to those of human coaching. Using a randomized controlled trial, we compare two conditions: (a) human coaching delivered by certified coaches from [anonymous coaching platform’s] network and (b) AI coaching delivered via [anonymous coaching platform’s] generative AI tool purpose-built for behavioral and leadership development. The study is conducted at a large financial services firm. We are also actively seeking an additional research site, which would increase statistical power and allow us to assess the extent to which effects generalize across institutional contexts; any such extension would be registered separately.

Primary outcomes include leadership self-efficacy, leadership identity, lay theories of leadership, anticipated image risk, and psychological capital, measured via validated survey instruments at baseline and post-intervention. We also include two behavioral measures: the Leadership Divergent Association Task (L-DAT), a computational measure of leadership schema breadth, and supervisor ratings of job performance collected at both waves. A secondary aim is to examine heterogeneity in coaching effects by employee tenure and organizational site.
External Link(s)

Registration Citation

Citation
De Vaan, Mathijs, Daniel Lobo and Sameer Srivastava. 2026. "Human vs. AI Leadership Coaching: A Randomized Controlled Trial of Coaching Modality Across Organizations." AEA RCT Registry. June 23. https://doi.org/10.1257/rct.18953-1.0
Experimental Details

Interventions

Intervention(s)
Intervention Start Date
2026-06-15
Intervention End Date
2027-03-15

Primary Outcomes

Primary Outcomes (end points)
The primary outcomes of interest are pre-to-post changes in the following leadership competency measures, each assessed via validated survey instruments at baseline (wave 1) and post-intervention (wave 2). For each outcome, the primary analysis follows an intent-to-treat (ITT) estimand: all randomized participants are included regardless of engagement level, and we estimate the average difference in wave 2 scores between the AI coaching and human coaching conditions, conditional on baseline scores, using an ANCOVA specification. We will run two-sided tests of differences between conditions and report 95% confidence intervals. Per-protocol analyses conditioning on minimum engagement thresholds will be reported as secondary analyses.
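The ANCOVA specification described above can be sketched as an ordinary least squares regression of the wave 2 score on a condition indicator and the wave 1 score. The following is a minimal illustration, not the registered analysis code: the function names (`solve_linear`, `ancova_itt`), the use of a normal rather than a t critical value, and the assumption of complete wave 2 data for every randomized participant are all assumptions made here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ancova_itt(post, baseline, ai_arm, level=0.95):
    """Regress wave 2 scores on [intercept, AI indicator, wave 1 score].

    Every randomized participant is included (intent-to-treat); ai_arm is 1
    for AI coaching and 0 for human coaching. Returns the adjusted
    AI-minus-human difference with a two-sided confidence interval.
    """
    rows = [[1.0, float(t), float(x)] for t, x in zip(ai_arm, baseline)]
    k, n = 3, len(post)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, post)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    resid = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, post)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    # Variance of the arm coefficient: sigma^2 times element [1][1] of (X'X)^-1.
    inv_col = solve_linear(xtx, [0.0, 1.0, 0.0])
    se = sqrt(sigma2 * inv_col[1])
    # Assumption: normal critical value; a t quantile would be slightly wider.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return {"effect": beta[1], "se": se, "ci": (beta[1] - z * se, beta[1] + z * se)}
```

A confidence interval that excludes zero corresponds to rejecting the two-sided null of no difference between conditions at the matching significance level.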

We do not specify a directional prediction for any primary outcome. Although AI coaching tools have shown promise in some areas of behavioral development, the relative efficacy of AI versus human coaching on leadership-specific outcomes — including self-efficacy, identity, and schema breadth — is theoretically and empirically unsettled. Human coaches may produce stronger effects through relational depth and personalization, while AI coaching may perform comparably or better by virtue of consistent availability and lower engagement friction. Given these competing mechanisms, we treat the direction of any difference as an open empirical question and rely on two-sided tests throughout.

Self-Report Outcomes (Waves 1 and 2)
Outcome 1 — Leadership Self-Efficacy (self-report). Six items on a 7-point scale (Cunningham, Sonday, and Ashford, AMJ 2023).

Outcome 2 — Leadership Identity (self-report). Four items on a 7-point scale (DeRue and Ashford, AMR 2010; Day and Yip, Leadership Quarterly 2011; Cunningham et al., AMJ 2023; Lanai et al., JAP 2022).

Outcome 3 — Lay Theories of Leadership (self-report). Four items on a 7-point scale measuring incremental versus entity beliefs about leadership ability (Cunningham et al., AMJ 2023; Hoyt et al., PSPB 2012).

Outcome 4 — Leadership Anticipated Image Risk (self-report). Four items on a 7-point scale measuring perceived social risk of taking on leadership roles (Cunningham et al., AMJ 2023; Zhang et al., Organization Science 2020).

Outcome 5 — Psychological Capital (self-report). Twelve items measuring optimism, self-efficacy, hope, and resilience on a 6-point scale (Luthans et al., Personnel Psychology 2007).

Behavioral Outcome (Waves 1 and 2)

Outcome 6 — Leadership Divergent Association Task / L-DAT (behavioral). A domain-adapted behavioral measure of leadership schema breadth based on the validated Divergent Association Task (Olson et al., PNAS 2021). Participants generate ten leadership-related nouns; the score reflects the mean pairwise semantic distance among the first seven valid responses, computed using pretrained GloVe word vectors. Administered at both waves; takes approximately 4 minutes each.
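The scoring rule described for the L-DAT can be sketched as follows. This is a minimal illustration under stated assumptions: cosine distance is assumed as the semantic distance metric, a response is assumed to be valid if it appears in the embedding vocabulary and is not a repeat, and `vectors` is a hypothetical dictionary mapping lowercase words to nonzero embedding vectors (for example, loaded from a pretrained GloVe file). Any rescaling of the final score is omitted.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def ldat_score(responses, vectors, n_scored=7):
    """Mean pairwise distance among the first n_scored valid responses.

    Assumption: a response is valid if it is in the embedding vocabulary
    and has not already been given. Returns None when fewer than n_scored
    valid responses are available.
    """
    valid = []
    for word in responses:
        key = word.strip().lower()
        if key in vectors and key not in valid:
            valid.append(key)
    scored = valid[:n_scored]
    if len(scored) < n_scored:
        return None
    pairs = list(combinations(scored, 2))
    total = sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs)
    return total / len(pairs)
```

With seven scored words there are 21 word pairs, so the score is the average of 21 distances; higher values indicate responses that are more semantically spread out.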

Performance Outcome (Waves 1 and 2)

Outcome 7 — Job Performance (behavioral). Supervisor ratings of participant job performance collected at both wave 1 (pre-treatment) and wave 2 (post-intervention), assessed on a 1–5 scale. Collecting performance ratings at baseline allows the primary ANCOVA specification to partial out pre-existing performance differences between conditions, increasing precision. Performance ratings provide an objective, externally-evaluated complement to the self-report and behavioral task outcomes and allow us to assess whether coaching modality effects on leadership competencies translate into observable workplace performance.
Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)
Outcome 8 — Working Alliance (Wave 2 Only). Goal and Task subscales of the Working Alliance Inventory — Short Form Revised (WAI-SR; Hatcher and Gillaspy, Psychotherapy Research 2006), adapted for the coaching context. We expect human coaching to produce higher working alliance scores than AI coaching on this relational dimension. This outcome is exploratory and will be interpreted as hypothesis-generating.
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
This is a two-arm randomized controlled trial conducted at a large financial services firm. Participants enrolled in an internal leadership development initiative are randomly assigned to one of two coaching conditions: certified human coaching or AI coaching.

We are actively seeking an additional research site; any multi-site extension will be pre-registered separately.
Experimental Design Details
Not available
Randomization Method
Randomization is performed by computer at the financial services firm where the study takes place.
Randomization Unit
Randomization is done at the individual level.
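The registration does not describe the firm's randomization algorithm. As a purely illustrative sketch, one common way to perform computer-based, individual-level assignment with equal allocation is to shuffle the participant list with a fixed seed and alternate arms. The function name `assign_conditions`, the arm labels, and the seed are hypothetical.

```python
import random


def assign_conditions(participant_ids, seed, arms=("human_coaching", "ai_coaching")):
    """Assign each participant to one arm with (near-)equal allocation.

    Sorting first makes the result independent of input order; the seeded
    generator makes the assignment reproducible for auditing.
    """
    rng = random.Random(seed)
    ids = sorted(participant_ids)
    rng.shuffle(ids)
    return {pid: arms[i % len(arms)] for i, pid in enumerate(ids)}
```

With 200 participants and two arms, this yields exactly 100 per condition, matching the planned allocation.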
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
None.
Sample size: planned number of observations
Approximately 200 participants (approximately 100 per condition).
Sample size (or number of clusters) by treatment arms
100 individuals in human coaching, 100 individuals in AI coaching.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
With N = 100 per condition (200 total), a two-sided test at α = .05, and 80% power, the study can detect a minimum standardized effect of d = 0.396, expressed in residual standard deviation units. This corresponds to approximately 0.343 scale points on the 1–7 outcome scales, or roughly 5.7% of the full scale range. The calculation assumes a residual standard deviation of 0.866, derived from a pooled outcome SD of 1.0 and a baseline-to-outcome correlation of r = 0.50: under the ANCOVA specification, the residual variance is (1 − r²) times the outcome variance. The assumed outcome SD is consistent with observed variance on self-reported leadership scales in prior work (Hoyt et al., 2012, SDs = 0.89–0.94; Cunningham et al., 2023, SDs = 0.73–1.56). Partialling out baseline variance in this way is what gives the ANCOVA specification more power than a simple post-test comparison. We are actively seeking an additional research site; adding a comparable second site would increase total N to approximately 400 and reduce the MDE to d = 0.280 (0.243 scale points, approximately 4% of the 1–7 scale range). Effects smaller than d = 0.396 may exist but would not be reliably detectable at this sample size; results should be interpreted accordingly.
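The figures above can be reproduced with a standard normal-approximation formula for a two-arm comparison, MDE = (z for 1 − α/2 plus z for power) × sqrt(2 / n per arm), rescaled by the assumed residual SD. The sketch below is an illustration of that arithmetic; the function names are hypothetical, and the use of the normal approximation (rather than a noncentral t calculation) is an assumption that happens to match the registered numbers.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_arm(n_per_arm, alpha=0.05, power=0.80):
    """Normal-approximation MDE, in SD units, for a two-sided two-arm test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)


def residual_sd(outcome_sd, baseline_corr):
    """Residual SD after adjusting for a baseline with the given correlation."""
    return outcome_sd * sqrt(1 - baseline_corr ** 2)


def summarize(n_per_arm, outcome_sd=1.0, baseline_corr=0.50, scale_range=6.0):
    """MDE in residual-SD units, in scale points, and as % of the scale range.

    scale_range=6.0 is the width of a 1-7 response scale.
    """
    d = mde_two_arm(n_per_arm)
    points = d * residual_sd(outcome_sd, baseline_corr)
    return {
        "d": round(d, 3),
        "scale_points": round(points, 3),
        "pct_of_range": round(100 * points / scale_range, 1),
    }


print(summarize(100))  # single site, 100 per condition
print(summarize(200))  # hypothetical two-site design, 200 per condition
```

Outcomes measured on other response scales (for example, a 6-point or 1–5 scale) would need a different `scale_range` and outcome SD for the scale-point conversion.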
IRB

Institutional Review Boards (IRBs)

IRB Name
Salus IRB
IRB Approval Date
2026-02-21
IRB Approval Number
25121901
Analysis Plan

This information is not available to the public.