NUMI: A Within-Class Randomized Evaluation of AI-Tutoring in Mastery-Based Computer-Assisted Math Learning

Last registered on May 18, 2026

Pre-Trial

Trial Information

General Information

Title
NUMI: A Within-Class Randomized Evaluation of AI-Tutoring in Mastery-Based Computer-Assisted Math Learning
RCT ID
AEARCTR-0018643
Initial registration date
May 14, 2026

First published
May 18, 2026, 7:22 AM EDT

Locations

Primary Investigator

Affiliation
University of Toronto

Other Primary Investigator(s)

PI Affiliation
University of Toronto

Additional Trial Information

Status
In development
Start date
2026-08-03
End date
2027-12-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
There is growing interest in whether Generative AI has the potential to transform education and offer personalized learning at scale, but early evidence suggests students often fail to engage productively with AI tutors. This study tests the causal value-added of AI tutoring over a well-implemented Computer-Assisted Learning (CAL) platform, and identifies which AI-design features increase productive engagement. We deploy NUMI, a research-purpose CAL platform with a guard-railed AI math tutor, in Grades 4–9 classrooms in the Hamilton County Department of Education (HCDE) over the 2026–27 school year. All students use NUMI in its Mastery mode (must answer three problems in a row correctly to advance) on a weekly exercise. Within each participating classroom, students are randomized to one of two sequences: AI in Trimester 1, no-AI in Trimester 2 (A-N-A); or no-AI in Trimester 1, AI in Trimester 2 (N-A-A). All students receive AI in Trimester 3 for fairness. The design supports a parallel-arm contrast in Trimester 1 and a within-student crossover estimate pooling Trimesters 1 and 2, with explicit tests for carryover. Within the AI condition, embedded randomized comparisons test pre-specified tutor design variations (e.g., conversational tone, response-mode prompts). Outcomes include immediate performance (next-question correctness, exit tickets), near-term retention (improvement checks at 2–4 week lag), and administrative outcomes (course grades, NWEA MAP, state assessments). Heterogeneity analyses focus on baseline achievement and disadvantaged subgroups.
External Link(s)

Registration Citation

Citation
Liut, Michael and Philip Oreopoulos. 2026. "NUMI: A Within-Class Randomized Evaluation of AI-Tutoring in Mastery-Based Computer-Assisted Math Learning." AEA RCT Registry. May 18. https://doi.org/10.1257/rct.18643-1.0
Sponsors & Partners

Experimental Details

Interventions

Intervention(s)
Each week, participating teachers in Grades 4–9 assign a NUMI module: one curriculum-aligned exercise with short instructional content and practice problems, ending in a brief exit ticket worth a participation grade. Unfinished work is completed at home or during extra time within the week. All students use NUMI in Mastery mode, meaning they must answer three problems in a row correctly before advancing within the exercise.
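The Mastery-mode rule can be illustrated with a minimal sketch. It assumes that any incorrect answer resets the streak to zero, which is one natural reading of "three problems in a row"; the function name and input format are hypothetical and are not taken from the NUMI platform.

```python
def problems_until_mastery(answers, streak_needed=3):
    """Return the 1-based position of the answer that completes the streak.

    `answers` is a sequence of booleans (True = correct), in the order the
    student attempted the problems. Returns None if the streak is never reached.
    Assumption: a single incorrect answer resets the streak to zero.
    """
    streak = 0
    for position, correct in enumerate(answers, start=1):
        streak = streak + 1 if correct else 0
        if streak == streak_needed:
            return position
    return None


# A student who slips on the third problem needs six attempts in total.
print(problems_until_mastery([True, True, False, True, True, True]))
```

The returned count corresponds to the "attempts-to-mastery" idea used as a mechanism outcome, under the same reset assumption.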
The intervention manipulates one feature: access to an AI tutor. In the no-AI condition, students who answer incorrectly are shown a step-by-step written solution. In the AI condition, students are offered access to NUMI, a domain-bounded AI math tutor that elicits the student's reasoning rather than revealing the final answer; walks through solution steps interactively using yes/no, multiple-choice, or short-answer prompts; offers a "Help Me Get Started" option for first-step scaffolding; and after a mistake suggests likely misconceptions and walks through the worked solution while prompting comprehension checks. The AI tutor operates within math-only content filters, has access to the canonical step-by-step solution as a grounding reference, and stores only non-identifiable session metadata.
Every 2–4 weeks, teachers administer a short in-class Improvement Check that revisits prior content to measure retention. The intervention runs across three trimesters of the 2026–27 school year.
Intervention Start Date
2026-08-03
Intervention End Date
2027-05-28

Primary Outcomes

Primary Outcomes (end points)
Improvement Check score (z-scored within classroom-week), pooled across trimesters — measures retention 2–4 weeks after content delivery

Exit ticket performance — pass/fail and z-scored continuous score, pooled across weekly assignments

End-of-trimester math course grade (administrative record)
Primary Outcomes (explanation)
The Improvement Check is our preferred primary outcome because it measures durable learning rather than immediate task completion, addressing the concern that engagement-boosting tools may inflate within-session performance without producing retention. Exit tickets provide a high-frequency, well-powered measure of curriculum-aligned mastery on each weekly assignment. End-of-trimester course grades validate that any platform effects translate into outcomes the district cares about. All three are pre-specified as confirmatory.
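The within classroom-week standardization of Improvement Check scores could be sketched as follows. The registration does not state the reference distribution, so this sketch assumes each score is standardized against all students in the same classroom and week using the population standard deviation, and that cells with no variation are set to zero. The record keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_classroom_week(records):
    """Add a 'z' field to each record, standardized within (classroom, week).

    `records` is a list of dicts with hypothetical keys
    'classroom', 'week', 'student', and 'score'.
    """
    # Collect raw scores for every classroom-week cell.
    cells = defaultdict(list)
    for r in records:
        cells[(r["classroom"], r["week"])].append(r["score"])

    out = []
    for r in records:
        scores = cells[(r["classroom"], r["week"])]
        mu, sd = mean(scores), pstdev(scores)
        # Assumption: a cell with identical scores carries no information,
        # so its z-scores are set to 0 rather than left undefined.
        z = (r["score"] - mu) / sd if sd > 0 else 0.0
        out.append({**r, "z": z})
    return out


example = [
    {"classroom": "c1", "week": 5, "student": "s1", "score": 1},
    {"classroom": "c1", "week": 5, "student": "s2", "score": 2},
    {"classroom": "c1", "week": 5, "student": "s3", "score": 3},
]
for row in zscore_within_classroom_week(example):
    print(row["student"], round(row["z"], 3))
```

Standardizing against the no-AI students only would be an equally defensible choice and would change the function in one line.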


Secondary Outcomes

Secondary Outcomes (end points)
NWEA MAP score (end of trimester, z-scored); state standardized math test score (end of school year); math course pass rate.
Mechanism outcomes: time-on-task per assignment; probability of attempting the next question after a mistake; attempts-to-mastery per exercise; persistence within exercise (questions attempted before exit); share of AI help-sessions in which the student engages reflectively; post-assignment short-survey measures of engagement and self-efficacy.
Secondary Outcomes (explanation)
NWEA MAP and the state assessment are administrative learning measures collected by HCDE and matched to student-level platform data. Mechanism outcomes are constructed from NUMI platform logs and are intended to test whether AI exposure changes the behavioral channels theory predicts (persistence after errors, reflective engagement) rather than only end-of-task performance. "Reflective AI engagement" will be operationalized prior to launch; the candidate definition is the share of help-sessions in which the student produces non-trivial text input and progresses through at least one comprehension-check prompt rather than immediately requesting the worked solution.
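The candidate definition of reflective AI engagement can be made concrete with a short sketch. All field names and the 10-character threshold for "non-trivial" input are placeholders chosen for illustration; the registration states that the final operationalization will be fixed prior to launch.

```python
from dataclasses import dataclass


@dataclass
class HelpSession:
    # Hypothetical log fields; the NUMI log schema is not described in the registration.
    typed_chars: int                 # characters of free-text input from the student
    checks_completed: int            # comprehension-check prompts the student progressed through
    requested_solution_first: bool   # True if the worked solution was requested immediately


MIN_CHARS = 10  # placeholder threshold for "non-trivial text input"


def is_reflective(session, min_chars=MIN_CHARS):
    """Apply the candidate definition to a single help-session."""
    return (
        session.typed_chars >= min_chars
        and session.checks_completed >= 1
        and not session.requested_solution_first
    )


def reflective_share(sessions, min_chars=MIN_CHARS):
    """Share of help-sessions classified as reflective; None if there are no sessions."""
    if not sessions:
        return None
    return sum(is_reflective(s, min_chars) for s in sessions) / len(sessions)


sessions = [
    HelpSession(typed_chars=25, checks_completed=2, requested_solution_first=False),
    HelpSession(typed_chars=0, checks_completed=0, requested_solution_first=True),
    HelpSession(typed_chars=40, checks_completed=1, requested_solution_first=False),
    HelpSession(typed_chars=30, checks_completed=0, requested_solution_first=False),
]
print(reflective_share(sessions))
```

Returning None for students with no help-sessions keeps the measure undefined, rather than zero, for students who never opened the tutor.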

Experimental Design

Experimental Design
Within-classroom, student-level two-period crossover design with universal phase-in, conducted across three trimesters of the 2026–27 school year.
All students use NUMI in Mastery mode on weekly math exercises. Within each participating classroom, students are randomly assigned to one of two sequences:

Sequence A-N-A: AI tutor in Trimester 1, no-AI (step-by-step solutions only) in Trimester 2, AI in Trimester 3.
Sequence N-A-A: no-AI in Trimester 1, AI in Trimester 2, AI in Trimester 3.

The Trimester 1 comparison provides a parallel-arm contrast between the two conditions. Pooling Trimesters 1 and 2 with student fixed effects provides a crossover estimate of the AI effect, with explicit testing for carryover. Trimester 3 is universal AI: it does not provide a contemporaneous treatment contrast, but ensures every student receives AI in two of three trimesters (a fairness consideration with the partnering district), and supports analyses of persistence and implementation under universal access. Within the AI condition, embedded randomized comparisons test pre-specified tutor design features.
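The three contrasts described above can be sketched for a simplified case with one outcome per student per trimester. This is an illustration under stated assumptions, not the registered estimation code: it assumes complete data for Trimesters 1 and 2 and a model with student and period fixed effects, in which case the fixed-effects estimate equals half the difference in mean within-student changes between sequences. The carryover check shown is the standard comparison of within-student totals across sequences; the registration does not specify which carryover test will be used. Key names are hypothetical.

```python
from statistics import mean


def crossover_estimates(students):
    """students: list of dicts with 'sequence' ('A-N-A' or 'N-A-A'),
    'y1' (Trimester 1 outcome) and 'y2' (Trimester 2 outcome)."""
    ana = [s for s in students if s["sequence"] == "A-N-A"]
    naa = [s for s in students if s["sequence"] == "N-A-A"]

    # Parallel-arm contrast: AI vs no-AI using Trimester 1 only.
    parallel_t1 = mean(s["y1"] for s in ana) - mean(s["y1"] for s in naa)

    # Crossover estimate: within-student differences remove student effects;
    # differencing across sequences removes the common period effect.
    d_ana = mean(s["y1"] - s["y2"] for s in ana)
    d_naa = mean(s["y1"] - s["y2"] for s in naa)
    crossover = (d_ana - d_naa) / 2

    # Carryover check: with no differential carryover, within-student totals
    # have the same expectation in both sequences.
    carryover = mean(s["y1"] + s["y2"] for s in ana) - mean(s["y1"] + s["y2"] for s in naa)

    return {"parallel_t1": parallel_t1, "crossover": crossover, "carryover": carryover}


# Noise-free toy data: AI effect 0.2, Trimester 2 period effect 0.1, no carryover.
toy = [
    {"sequence": "A-N-A", "y1": 0.2, "y2": 0.1},
    {"sequence": "A-N-A", "y1": 1.2, "y2": 1.1},
    {"sequence": "N-A-A", "y1": 0.0, "y2": 0.3},
    {"sequence": "N-A-A", "y1": 1.0, "y2": 1.3},
]
print(crossover_estimates(toy))
```

In the toy data both the parallel-arm and crossover contrasts recover the built-in effect and the carryover contrast is zero (up to floating-point rounding), which is the pattern expected when the crossover assumptions hold.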
Experimental Design Details
Not available
Randomization Method
Computer-generated pseudo-random sequence assignment performed by the NUMI platform prior to the start of Trimester 1, stratified within classroom. The seed and randomization code will be archived for replication.
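A seeded, classroom-stratified assignment of this kind could look like the sketch below. It is not the NUMI platform's code: the function name, roster format, and seed value are hypothetical, and the handling of odd class sizes (the extra student goes to a randomly chosen sequence) is an assumption.

```python
import random

SEQUENCES = ("A-N-A", "N-A-A")


def assign_sequences(roster, seed):
    """Assign each student to a sequence, balanced within classroom.

    `roster` maps classroom id -> list of student ids (ids assumed unique
    across classrooms). The same roster and seed always give the same result.
    """
    rng = random.Random(seed)
    assignment = {}
    # Sort classrooms and students so the result does not depend on input order.
    for classroom in sorted(roster):
        students = sorted(roster[classroom])
        rng.shuffle(students)
        # Randomly pick which sequence starts the alternation, so that in
        # odd-sized classrooms the extra student is not always in the same arm.
        offset = rng.randrange(2)
        for idx, student in enumerate(students):
            assignment[student] = SEQUENCES[(idx + offset) % 2]
    return assignment


roster = {
    "c1": ["s1", "s2", "s3", "s4"],
    "c2": ["t1", "t2", "t3", "t4", "t5"],
}
assignment = assign_sequences(roster, seed=20260803)  # hypothetical seed
print(assignment)
```

Alternating sequences over a shuffled class list guarantees that the two sequence counts within any classroom differ by at most one, and archiving the seed with the sorted roster is sufficient to reproduce the assignment.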
Randomization Unit
Student, within classroom. Sequence (A-N-A or N-A-A) is assigned once at baseline; the trimester switch and the universal-AI Trimester 3 are deterministic given sequence assignment.
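Because condition is fully determined by sequence and trimester, it can be represented as a simple lookup. The names below are illustrative only.

```python
# Condition in Trimesters 1, 2, 3 for each sequence, as described in the design.
SCHEDULE = {
    "A-N-A": ("AI", "no-AI", "AI"),
    "N-A-A": ("no-AI", "AI", "AI"),
}


def condition(sequence, trimester):
    """Return 'AI' or 'no-AI' for a sequence in trimester 1, 2, or 3."""
    return SCHEDULE[sequence][trimester - 1]


for seq in SCHEDULE:
    print(seq, [condition(seq, t) for t in (1, 2, 3)])
```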
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
60 classrooms (target; range 50–70) across approximately the same number of teachers in the Hamilton County Department of Education (HCDE)
Sample size: planned number of observations
1,800 students (target; range 1,500–2,100), approximately 30 students per classroom
Sample size (or number of clusters) by treatment arms
Sequence A-N-A: ~900 students (target; range 750–1,050)
Sequence N-A-A: ~900 students (target; range 750–1,050)
By condition within Trimester 1 (parallel-arm contrast): ~900 students in AI, ~900 students in no-AI.
By condition within Trimester 2: ~900 in AI, ~900 in no-AI (assignments switched).
Trimester 3: all ~1,800 students in AI (universal phase-in).
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
All estimates assume 80% power, two-sided α = 0.05, and 1 SD outcome scale.
Parallel-arm contrast in Trimester 1 (≈900/arm): MDE ≈ 0.132 SD.
Parallel-arm contrast in Trimester 1, lower-bound sample (≈750/arm): MDE ≈ 0.145 SD.
Parallel-arm contrast in Trimester 1, upper-bound sample (≈1,050/arm): MDE ≈ 0.123 SD.
Crossover (T1+T2) with student fixed effects, within-student ρ = 0.5: MDE ≈ 0.085–0.10 SD.
Exit-ticket pass/fail, stacked across weekly assignments within T1+T2: MDE ≈ 0.06 SD with >20,000 observations and assumed within-student correlation of 0.5.
Question-level next-correct after mistake: MDE < 0.05 SD given hundreds of thousands of observations.
Crossover MDEs use within-student variance σ²(1−ρ); ρ = 0.5 is illustrative, and design-specific power simulations will be conducted prior to launch using realistic within-student correlation estimates from prior CAL studies.
Effects in the 0.10–0.20 SD range are policy-relevant for district decision-making and consistent with magnitudes considered educationally meaningful for year-long interventions.
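The parallel-arm and crossover figures can be approximately reproduced with the standard normal-approximation formula for a two-arm comparison of means, MDE = (z_{1−α/2} + z_{power}) · σ · sqrt(2/n). The sketch below is one reading of the stated approach, not the authors' power code: it assumes equal arms, no covariate adjustment, and that the crossover MDE is obtained by replacing σ² with σ²(1−ρ) in the same formula. It matches the registered parallel-arm values to within about 0.001 SD.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_arm(n_per_arm, alpha=0.05, power=0.80, sd=1.0):
    """Minimum detectable effect for a two-arm difference in means (normal approximation)."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sd * sqrt(2 / n_per_arm)


def mde_crossover(n_per_arm, rho=0.5, alpha=0.05, power=0.80, sd=1.0):
    """Same formula with the outcome variance scaled to sd**2 * (1 - rho).

    Assumption: this mirrors the 'within-student variance σ²(1−ρ)' statement;
    a full crossover variance derivation could give a different value.
    """
    return mde_two_arm(n_per_arm, alpha, power, sd) * sqrt(1 - rho)


for n in (750, 900, 1050):
    print(n, round(mde_two_arm(n), 3), round(mde_crossover(n), 3))
```

Varying `rho` in `mde_crossover` shows how sensitive the crossover MDE is to the within-student correlation, which is the quantity the planned power simulations are meant to pin down.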
IRB

Institutional Review Boards (IRBs)

IRB Name
University of Toronto
IRB Approval Date
2026-02-12
IRB Approval Number
64147