NUMI: An AI-Tutor for Learning How Best to Deliver Transformative Personalized Education at Scale

Last registered on May 27, 2026

Pre-Trial

Trial Information

General Information

Title
NUMI: An AI-Tutor for Learning How Best to Deliver Transformative Personalized Education at Scale
RCT ID
AEARCTR-0018678
Initial registration date
May 24, 2026

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
May 27, 2026, 11:11 AM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
University of Toronto

Other Primary Investigator(s)

PI Affiliation
University of Toronto

Additional Trial Information

Status
In development
Start date
2026-08-01
End date
2027-06-30
Secondary IDs
J-PAL North America SPRI award number GR-8417
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
There is growing interest in whether generative AI can transform education and offer personalized learning for all. Early research, however, points to challenges in implementing these tools and in structuring their use so that students engage with them effectively. To test AI's potential and develop best practices for engaging students, we propose a student-level, within-class randomized evaluation of NUMI, an online platform designed for research that pairs mastery pacing with a guard-railed AI math tutor. In partnership with the Hamilton County Department of Education (HCDE) and 50-70 or more Grades 4-9 teachers (~1,500-2,100 students), students are randomly assigned each trimester in a 2x2 design: Mastery vs. no Mastery, crossed with AI-Tutor vs. no AI-Tutor. We will measure immediate learning on lagged "improvement checks," behavioral mechanisms such as persistence after mistakes and time-on-task, and district outcomes including course grades and, where available, standardized assessments. Embedded randomized tests within the AI arms will allow us to assess a limited number of pre-specified design variations, such as prompting style and timing of support, in order to identify which forms of AI guidance increase productive engagement and learning. The study builds on previous J-PAL collaborations and pivots to a flexible, customizable platform that will causally identify the value added of AI tutoring over well-implemented computer-assisted learning (CAL), and inform policymakers and program designers about which AI and structural features deliver the largest durable, equitable gains at scale.
External Link(s)

Registration Citation

Citation
Liut, Michael and Philip Oreopoulos. 2026. "NUMI: An AI-Tutor for Learning How Best to Deliver Transformative Personalized Education at Scale." AEA RCT Registry. May 27. https://doi.org/10.1257/rct.18678-1.0
Sponsors & Partners

There is information in this trial unavailable to the public.
Experimental Details

Interventions

Intervention(s)
The intervention is NUMI, a research-oriented computer-assisted learning (CAL) platform that delivers weekly math practice to Grades 4-9 students. Each week, teachers assign a 30-90 minute NUMI assignment consisting of short instructional videos, exercises, and a brief exit-ticket assessment that contributes to a participation grade. Students who do not finish in class complete the remainder at home or during extra time.

Students are randomized at the individual level within their classroom to one of four conditions in a 2x2 factorial design:
- CAL only (control): non-Mastery progression, no AI tutor, step-by-step solutions on mistakes.
- CAL + AI: non-Mastery progression, plus access to NUMI's guard-railed AI tutor.
- CAL + Mastery: must answer 3 problems correctly in a row before advancing, no AI tutor.
- CAL + Mastery + AI: Mastery progression plus AI tutor.

In Mastery mode, students must answer three problems in a row correctly before advancing to the next exercise; in non-Mastery mode, students decide when they feel ready to take the exit ticket. In the AI arms, the AI tutor becomes available in a safe, domain-bounded chat space when a student makes a mistake or requests help. It elicits the student's reasoning and walks through steps together without revealing the final answer. NUMI primarily poses questions requiring binary (yes/no) or multiple-choice responses, with an occasional "Help Me Get Started" option (option A/B). After a student errs, NUMI suggests the likely misconception, reviews the worked solution step-by-step, and prompts for further questions. A text box is available for open-ended questions; filters and classifiers suppress off-topic or personal content. No student inputs are stored by any external AI provider. In the non-AI arms, students see step-by-step worked solutions when they make a mistake.

Trimester rotation: Each student rotates conditions across the three trimesters, experiencing three of the four conditions, never repeating. The rotation is pre-randomized at the start of the year.
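
As an illustration only, the pre-randomized rotation described above could be drawn as follows. This is a minimal sketch, not the NUMI platform's code: the function name `draw_rotations`, the condition labels, and the seed value are assumptions, and the sketch does not enforce balance across arms within a classroom.

```python
import random

# Labels follow the four conditions listed above; the exact strings are illustrative.
CONDITIONS = ["CAL only", "CAL + AI", "CAL + Mastery", "CAL + Mastery + AI"]

def draw_rotations(student_ids, seed):
    """Return {student_id: [trimester 1, trimester 2, trimester 3 condition]}.

    Each student gets three distinct conditions in random order, so no
    condition repeats and exactly one of the four is never experienced.
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    # Sort ids so the result does not depend on the order of the input.
    return {sid: rng.sample(CONDITIONS, k=3) for sid in sorted(student_ids)}

if __name__ == "__main__":
    for sid, sequence in draw_rotations(["s1", "s2", "s3"], seed=2026).items():
        print(sid, sequence)
```

Drawing the whole three-trimester sequence at once, rather than re-randomizing each trimester, is what guarantees that no student repeats a condition.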

Embedded A/B test within AI arms (pre-specified): Students in the two AI arms are independently randomized to either standard prompting or a humor / relatable-examples variant. Additional embedded A/B tests may be added in registry updates before each trimester begins.

Improvement checks: Every 2-4 weeks, teachers deliver short in-class "improvement check" assessments that revisit prior content after a lag to gauge retention.
Intervention Start Date
2026-08-01
Intervention End Date
2027-06-30

Primary Outcomes

Primary Outcomes (end points)
Pooled exit-ticket performance on weekly NUMI assignments, stacked across all weekly assignments within each trimester. Two measures:

1. Exit-ticket pass rate (binary): indicator equal to 1 if the student passes the assignment's exit ticket on the first attempt, 0 otherwise, averaged across all weekly assignments within the trimester.
2. Exit-ticket standardized score (continuous): score on the exit ticket, standardized to mean 0 and SD 1 within each assignment, then averaged across all weekly assignments within the trimester.

Both measures are computed for each (student x trimester) and pooled across trimesters in the headline confirmatory analysis.
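
The construction of the two measures can be sketched as follows. This is a hedged illustration, not the study's analysis code: the record fields (`student`, `trimester`, `assignment`, `score`, `passed_first_attempt`) and the function name are hypothetical, and using the population SD for the within-assignment standardization is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

def summarize_exit_tickets(records):
    """Collapse exit-ticket records to one row per (student, trimester).

    records: iterable of dicts with the hypothetical keys
    student, trimester, assignment, score, passed_first_attempt.
    """
    records = list(records)

    # Mean and SD of raw scores within each assignment.
    scores_by_assignment = defaultdict(list)
    for r in records:
        scores_by_assignment[r["assignment"]].append(r["score"])
    stats = {a: (mean(s), pstdev(s)) for a, s in scores_by_assignment.items()}

    passes = defaultdict(list)
    z_scores = defaultdict(list)
    for r in records:
        key = (r["student"], r["trimester"])
        passes[key].append(1 if r["passed_first_attempt"] else 0)
        mu, sd = stats[r["assignment"]]
        if sd > 0:  # skip assignments where every student scored the same
            z_scores[key].append((r["score"] - mu) / sd)

    return {
        key: {
            "pass_rate": mean(passes[key]),
            "z_score": mean(z_scores[key]) if z_scores[key] else None,
        }
        for key in passes
    }
```

Standardizing within assignment before averaging keeps an unusually hard or easy week from dominating a student's trimester average.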
Primary Outcomes (explanation)
Exit-ticket performance is the primary outcome because: (a) it is collected from every student on every assignment, yielding the largest sample and the highest statistical power (MDE approximately 0.091 SD); (b) it is curriculum-aligned and tightly coupled to the content the platform delivers, providing a direct test of whether the intervention improves what it is designed to teach; and (c) it is available on the same time scale as the randomization rotation, allowing clean attribution of effects to each trimester's assigned condition.

Secondary Outcomes

Secondary Outcomes (end points)
Secondary outcomes are organized into four pre-specified families. Romano-Wolf stepdown corrections (Romano & Wolf 2005, 2016) are applied for family-wise error rate (FWER) control within each family.

Family A — Near-term and distal learning
- A1. Improvement-check scores (2-4 week lag retention assessments)
- A2. NWEA MAP end-of-term assessment scores
- A3. End-of-year state standardized test score
- A4. End-of-trimester math course grade

Family B — Engagement and persistence mechanisms
- B1. Time-on-task per assignment (minutes)
- B2. Probability of continuing to work after a mistake (within-exercise)
- B3. Number of attempts to mastery (Mastery arms)
- B4. Within-exercise persistence (problems attempted before disengaging)
- B5. Weekly assignment completion rate

Family C — Quality of AI use (AI arms only)
- C1. Share of help sessions with reflective (non-trivial) student responses vs. quick answer-seeking
- C2. Frequency of help requests per assignment
- C3. Average response length in AI interactions
- C4. Hint-seeking vs. answer-seeking ratio

Family D — Administrative outcomes
- D1. End-of-trimester overall math course grade
- D2. School attendance (if available from district administrative data)
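
For readers unfamiliar with the Romano-Wolf stepdown correction applied within each family above, the adjusted p-value computation can be sketched as below. This is a simplified illustration under stated assumptions, not the study's analysis code: the function name is hypothetical, and the caller is assumed to supply bootstrap statistics that have already been recentred so the null hypothesis holds.

```python
from typing import List, Sequence

def romano_wolf_pvalues(
    t_obs: Sequence[float], t_boot: Sequence[Sequence[float]]
) -> List[float]:
    """Stepdown adjusted p-values for one family of hypotheses.

    t_obs[j]     : observed test statistic for outcome j.
    t_boot[b][j] : the same statistic in bootstrap draw b, recentred
                   under the null (assumed to be done by the caller).
    """
    n_hyp = len(t_obs)
    n_boot = len(t_boot)
    abs_obs = [abs(t) for t in t_obs]

    # Step through hypotheses from most to least significant.
    order = sorted(range(n_hyp), key=lambda j: abs_obs[j], reverse=True)

    adjusted = [0.0] * n_hyp
    running_max = 0.0
    for step, j in enumerate(order):
        remaining = order[step:]  # hypotheses not yet stepped past
        # Count draws whose largest statistic among the remaining
        # hypotheses is at least as extreme as the observed one.
        exceed = sum(
            1
            for draw in t_boot
            if max(abs(draw[k]) for k in remaining) >= abs_obs[j]
        )
        p = (exceed + 1) / (n_boot + 1)
        running_max = max(running_max, p)  # keep p-values monotone
        adjusted[j] = running_max
    return adjusted
```

Comparing each statistic against the maximum over the hypotheses still in play is what controls the family-wise error rate while exploiting correlation among outcomes in the same family.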
Secondary Outcomes (explanation)
Family A tests whether immediate effects on exit tickets translate into durable, externally validated learning gains. Family B captures the behavioral mechanisms through which Mastery and AI tutoring are theorized to operate. Family C is a within-AI-arm analysis of how students use the AI tutor, distinguishing productive reflective engagement from passive answer-seeking. Family D captures real-world administrative outcomes that districts and policymakers care about.

Experimental Design

Experimental Design
A within-classroom, student-level randomized 2x2 factorial design (Mastery x AI-Tutor) with trimester-level rotation, conducted across Grades 4-9 math classes in the Hamilton County Department of Education (HCDE) in Chattanooga, Tennessee. Classrooms serve as the stratification block; students within a classroom are randomly assigned by the NUMI platform to one of the four arms (CAL-only; CAL+AI; CAL+Mastery; CAL+Mastery+AI), then rotate through three of four arms across the school year's three trimesters. A secondary, factorial randomization within the AI arms assigns students to humor / relatable-examples prompting vs. standard prompting.
Experimental Design Details
Not available
Randomization Method
Computer-generated randomization, executed server-side by the NUMI platform using a pseudo-random number generator with a fixed seed logged at the time of assignment. Stratification is by classroom (each classroom receives a balanced number of students per arm to the extent class size permits).
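
A minimal sketch of classroom-stratified assignment with a logged seed is shown below. It illustrates the general technique only and is not the NUMI platform's implementation: the function name, the roster structure, and the arm labels are assumptions, and student ids are assumed to be unique across classrooms.

```python
import random

# Arm labels follow the registry's four conditions; the strings are illustrative.
ARMS = ["CAL only", "CAL + AI", "CAL + Mastery", "CAL + Mastery + AI"]

def assign_within_classrooms(roster, seed):
    """Assign each student to an arm, balancing arms inside each classroom.

    roster: dict mapping classroom id -> list of student ids (hypothetical shape).
    seed:   integer logged at assignment time so the draw can be reproduced.
    """
    rng = random.Random(seed)
    assignment = {}
    for classroom in sorted(roster):  # fixed iteration order for reproducibility
        students = sorted(roster[classroom])
        rng.shuffle(students)
        # Shuffle the arm order too, so that when class size is not a
        # multiple of four the extra seats do not always go to the same arms.
        arm_order = ARMS[:]
        rng.shuffle(arm_order)
        for i, sid in enumerate(students):
            assignment[sid] = arm_order[i % len(arm_order)]
    return assignment
```

With this scheme, arm sizes within a classroom differ by at most one student, which matches the "to the extent class size permits" qualifier.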
Randomization Unit
Individual student, stratified by classroom.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
50-70 classrooms (target: 60), taught by 50-70 recruited HCDE teachers across Grades 4-9.
Sample size: planned number of observations
1,500-2,100 students (target: 1,800), at approximately 30 students per classroom.
Sample size (or number of clusters) by treatment arms
Approximately 375-525 students per arm per trimester in the four-arm comparison (target ~450 per arm). For the AI-vs-no-AI pooled comparison, ~750-1,050 students per pooled arm.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
- Primary outcome (pooled exit-ticket pass/fail, stacked ~8 tests/student, within-student rho = 0.5): MDE approximately 0.091 SD at 80% power, two-sided alpha = 0.05.
- Across-arm single-term comparison (50 teachers, ~1,500 students): MDE approximately 0.20 SD.
- Across-arm single-term comparison (70 teachers, ~2,100 students): MDE approximately 0.17 SD.
- AI vs. no-AI pooled (~1,050 vs. ~1,050): MDE approximately 0.12 SD.
- Repeated-measures with student fixed effects: MDE approximately 0.14-0.16 SD.

Power calculations are lower bounds; additional teacher and district recruitment is planned.
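
As a rough check on the MDE figures above, the textbook formula for a two-arm comparison of means with equal arm sizes and unit outcome variance can be sketched as follows. The registry does not state which formula was used, so this is an assumption. The function name `mde_two_arm`, the mapping of each figure to a per-arm sample size (one quarter of the student total for the across-arm comparisons), and the use of the pooled AI vs. no-AI sample for the primary outcome are likewise assumptions.

```python
from math import sqrt
from statistics import NormalDist

def mde_two_arm(n_per_arm, alpha=0.05, power=0.80, n_repeats=1, rho=0.0):
    """Minimum detectable effect in SD units for two equal-sized arms.

    n_repeats and rho describe an outcome averaged over repeated measures
    per student with within-student correlation rho; averaging scales the
    variance by (1 + (n_repeats - 1) * rho) / n_repeats.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    repeat_factor = sqrt((1 + (n_repeats - 1) * rho) / n_repeats)
    return z * sqrt(2 / n_per_arm) * repeat_factor

if __name__ == "__main__":
    print(round(mde_two_arm(375), 3))    # 1,500 students split over four arms
    print(round(mde_two_arm(525), 3))    # 2,100 students split over four arms
    print(round(mde_two_arm(1050), 3))   # AI vs. no-AI pooled
    print(round(mde_two_arm(1050, n_repeats=8, rho=0.5), 3))  # stacked exit tickets
```

Under these assumptions the formula lands within about 0.005 SD of the 0.20, 0.17, 0.12, and 0.091 SD figures listed above. It does not attempt to reproduce the student-fixed-effects range.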
IRB

Institutional Review Boards (IRBs)

IRB Name
IRB Approval Date
IRB Approval Number