Testing the efficacy of AI tutoring in secondary mathematics: A scaled RCT in UK classrooms

Last registered on April 13, 2026

Pre-Trial

Trial Information

General Information

Title
Testing the efficacy of AI tutoring in secondary mathematics: A scaled RCT in UK classrooms
RCT ID
AEARCTR-0018079
Initial registration date
April 09, 2026


First published
April 13, 2026, 9:33 AM EDT


Locations

There is information in this trial unavailable to the public. Use the button below to request access.


Primary Investigator

Affiliation
Eedi Labs

Other Primary Investigator(s)

Additional Trial Information

Status
In development
Start date
2026-03-18
End date
2026-08-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
One-to-one tutoring is widely regarded as the gold standard for personalised education, yet it remains prohibitively expensive to deliver at scale. Recent advances in generative AI have prompted growing interest in whether large language models (LLMs) can approximate the pedagogical effectiveness of expert human tutors. However, empirical evidence remains scarce. Most published evaluations of AI tutoring systems rely on user satisfaction metrics or short-duration laboratory tasks rather than rigorous field experimentation with validated learning outcome measures.

An exploratory randomised controlled trial (RCT) conducted in 2025 provided initial evidence that AI tutoring can support student mathematics learning at levels similar to expert human tutoring (N = 165; arxiv:2512.23633). In that trial, an AI tutor integrated into the Eedi mathematics platform and supervised by human tutors produced a 5.5 percentage-point increase in students’ likelihood of correctly answering novel questions relative to expert human tutoring alone. Supervising tutors approved 76% of the AI tutor’s drafted messages without substantive edits, and communicated general enthusiasm for the pedagogical quality and performance of the AI tutor.
The present trial scales and extends the design of the exploratory trial, with the aim of testing whether the initial findings generalise to a larger cohort, accumulate over a sustained intervention period, and translate to gains on an independent standardised assessment. We will recruit and randomise 1,200 students across UK secondary schools to four arms over a full school term. Students in the control arm will receive Eedi's standard remediation (static content and practice). Students in two AI tutoring arms will receive interactive tutoring from an AI system operating under expert supervision, with the arms differing in the contextual information provided to the AI: one will receive only session-level pedagogical context, while the other will receive longitudinal student information incorporating historical performance, prior misconceptions, and curriculum position. Students in the fourth arm will receive support from human expert tutors. Our primary outcome will be change on STAR Maths (Renaissance), a computer-adaptive standardised assessment administered at baseline and endline. Secondary outcomes will include: near-term measures of learning gains; self-report measures of student self-efficacy, motivation, and attitudes toward learning from errors; and tutor affect and experience across conditions.
External Link(s)

Registration Citation

Citation
Brazão, Vasco. 2026. "Testing the efficacy of AI tutoring in secondary mathematics: A scaled RCT in UK classrooms." AEA RCT Registry. April 13. https://doi.org/10.1257/rct.18079-1.0
Experimental Details

Interventions

Intervention(s)
Intervention Start Date
2026-04-13
Intervention End Date
2026-06-21

Primary Outcomes

Primary Outcomes (end points)
Student scores on the STAR Maths (Renaissance) assessment at endline, adjusted for baseline scores.
Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)
Misconception resolution; near-term knowledge transfer; interim quiz success
Secondary Outcomes (explanation)
Misconception resolution: each intervention is triggered by a wrong answer on an initial multiple-choice question (the “check-in” question) and followed by a similar multiple-choice question (the “check-out” question). We use a correct response on the check-out question as a proxy for misconception resolution. We will examine differences between conditions on the probability of misconception resolution.

Near-term knowledge transfer: questions are grouped into topic quizzes of 5 check-in questions. When an intervention occurs after one of the first four check-in questions, we will take the student’s response to the next question in the quiz as a measure of near-term knowledge transfer and examine differences between conditions on the students’ probability of success.

Interim quiz success: students will complete up to 2 interim quizzes throughout the experimental period. These quizzes will consist of 8 questions assessing constructs assigned by the students’ maths teacher in the previous 4 weeks of instruction. We will examine success on these quizzes between conditions, both overall and over time.
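As a rough illustration of how differences between conditions on a binary outcome such as misconception resolution could be estimated, the sketch below fits a logistic regression of check-out success on arm with the control arm as reference. All data are simulated, the arm labels are hypothetical, and a real analysis would also need to handle repeated observations per student (e.g. cluster-robust standard errors or student random effects), which this minimal sketch omits.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per triggered intervention, recording the arm
# and whether the check-out question was answered correctly (all simulated)
rng = np.random.default_rng(1)
arms = ["control", "ai_session", "ai_longitudinal", "human_tutor"]
df = pd.DataFrame({
    "arm": rng.choice(arms, size=2000),
    "checkout_correct": rng.integers(0, 2, size=2000),
})

# Logistic regression of check-out success on arm, control as the reference
# level; coefficients are log-odds differences relative to control
fit = smf.logit("checkout_correct ~ C(arm, Treatment('control'))", data=df).fit(disp=0)
```

The fitted coefficients for the three non-control arms would then be the quantities of interest for the between-condition comparisons described above.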

Experimental Design

Experimental Design
We will recruit approximately 1,200 students in Years 8, 9, and 10 from 10 UK secondary schools. We will individually randomise students across 4 arms, with each student remaining in their single assigned arm for the full duration of the trial. The arms differ in what happens when a student submits a wrong answer to a check-in question while working on a quiz. Students in arm 1 (control) will receive the standard Eedi intervention with static (non-interactive) content, aimed at resolving the student's misconception and providing practice opportunities. Students in arms 2, 3, and 4 will instead start a tutoring conversation with the same aims. In arms 2 and 3, students will interact with an LLM whose generated messages will be supervised by expert human tutors. In arm 2, the LLM will receive session-level pedagogical context (the current question, the student's answer, and the identified misconception); in arm 3, the LLM will additionally receive longitudinal information about the student, including historical performance, prior misconceptions, and curriculum position. In arm 4, students will interact directly with an expert human tutor working on their own (without LLM support).

Tutors will be allocated to different arms on a daily basis, so that each tutor will experience all arms over the course of the trial.
Experimental Design Details
Not available
Randomization Method
Students will be machine-randomised: each Eedi UserId will be assigned to one condition by randomly permuting a vector of arm labels in which each arm appears its pre-specified number of times, using an R script with a fixed seed for reproducibility.
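The registration specifies an R script for this step; a minimal Python sketch of the same permuted-vector procedure is shown below. The arm labels, seed value, and UserId format are illustrative assumptions, not taken from the registration.

```python
import random

def assign_arms(user_ids, arm_counts, seed=20260413):
    """Assign each user id to an arm by shuffling a fixed label vector."""
    # Vector of arm labels, each repeated its pre-specified number of times
    labels = [arm for arm, n in arm_counts.items() for _ in range(n)]
    assert len(labels) == len(user_ids)
    rng = random.Random(seed)  # fixed seed makes the permutation reproducible
    rng.shuffle(labels)
    return dict(zip(user_ids, labels))

# Hypothetical arm labels matching the planned allocation of 225/375/375/225
arm_counts = {"control": 225, "ai_session": 375, "ai_longitudinal": 375, "human_tutor": 225}
user_ids = [f"user-{i:04d}" for i in range(1200)]  # placeholder UserIds
assignments = assign_arms(user_ids, arm_counts)
```

Because the shuffle permutes a vector with fixed counts per arm, the design guarantees exact arm sizes rather than merely expected ones, and re-running with the same seed reproduces the assignment.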
Randomization Unit
Individual
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
1,200 students
Sample size: planned number of observations
1,200 students
Sample size (or number of clusters) by treatment arms
Arm 1: 225; Arm 2: 375; Arm 3: 375; Arm 4: 225
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Our power analysis prioritised the comparison between arms 2 and 3 on our primary outcome (STAR Maths score). Assuming covariates (notably a baseline STAR Maths assessment) explain 50% of outcome variance, we expect to have 58.6% power to detect an effect of 0.1 SD and 99.2% power for an effect of 0.2 SD.
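The variance-reduction logic behind this calculation can be sketched as follows: removing a fraction r2 of outcome variance via covariate adjustment is equivalent to inflating the standardised effect by 1/sqrt(1 - r2). The sketch below applies this to a two-sided two-sample t-test; the registered figures may rest on different analysis details (test type, sidedness, attrition assumptions) not stated here, so the function and its outputs are illustrative only.

```python
from statsmodels.stats.power import TTestIndPower

def adjusted_power(effect_sd, n_per_arm, r2=0.5, alpha=0.05):
    """Approximate power for a two-arm comparison with covariate adjustment."""
    # Covariates explaining r2 of variance shrink the residual SD by
    # sqrt(1 - r2), inflating the standardised effect by the inverse factor
    adj_effect = effect_sd / (1.0 - r2) ** 0.5
    return TTestIndPower().power(effect_size=adj_effect, nobs1=n_per_arm,
                                 alpha=alpha, ratio=1.0)

p_small = adjusted_power(0.1, 375)  # 0.1 SD effect, arms 2 vs 3
p_large = adjusted_power(0.2, 375)  # 0.2 SD effect
```

With 375 students per AI arm, the 0.2 SD effect is comfortably detectable under this approximation, while the 0.1 SD effect sits near the conventional power threshold.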
IRB

Institutional Review Boards (IRBs)

IRB Name
Human Behavioural Research Ethics Committee
IRB Approval Date
2026-03-09
IRB Approval Number
#25/003