Abstract
Interest is growing in whether generative AI can transform education by offering personalized learning at scale, but early evidence suggests that students often fail to engage productively with AI tutors. This study tests the causal value added of AI tutoring over a well-implemented Computer-Assisted Learning (CAL) platform and identifies which AI design features increase productive engagement. We deploy NUMI, a research-purpose CAL platform with a guard-railed AI math tutor, in Grades 4–9 classrooms in the Hamilton County Department of Education (HCDE) over the 2026–27 school year. All students use NUMI in its Mastery mode (students must answer three consecutive problems correctly to advance) on a weekly exercise. Within each participating classroom, students are randomized to one of two sequences: AI in Trimester 1 and no AI in Trimester 2 (A-N-A), or no AI in Trimester 1 and AI in Trimester 2 (N-A-A). All students receive AI in Trimester 3 for fairness. The design supports a parallel-arm contrast in Trimester 1 and a within-student crossover estimate pooling Trimesters 1 and 2, with explicit tests for carryover. Within the AI condition, embedded randomized comparisons test pre-specified tutor design variations (e.g., conversational tone, response-mode prompts). Outcomes include immediate performance (next-question correctness, exit tickets), near-term retention (improvement checks at a 2–4 week lag), and administrative outcomes (course grades, NWEA MAP, state assessments). Heterogeneity analyses focus on baseline achievement and disadvantaged subgroups.
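
As an illustration of the within-classroom randomization described above, the following is a minimal sketch, not the study's actual procedure. It assumes a balanced (blocked) assignment in which each classroom is split as evenly as possible between the two sequences. The abstract does not specify whether assignment is balanced or simple, and the function name `assign_sequences`, the roster format, and the seed are hypothetical.

```python
import random

# Trimester 1-2-3 conditions for the two sequences named in the abstract.
SEQUENCES = ("A-N-A", "N-A-A")


def assign_sequences(classrooms, seed=0):
    """Randomize students to a sequence within each classroom (block).

    classrooms: dict mapping a classroom id to a list of student ids
    (hypothetical identifiers). Returns a dict {student_id: sequence}.
    Assumption: balanced assignment, so arm sizes within a classroom
    differ by at most one student.
    """
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    assignment = {}
    for class_id in sorted(classrooms):
        students = sorted(classrooms[class_id])
        rng.shuffle(students)
        # Randomize which sequence receives the extra student
        # when the classroom has an odd number of students.
        offset = rng.randrange(2)
        for i, student in enumerate(students):
            assignment[student] = SEQUENCES[(i + offset) % 2]
    return assignment


# Example with a made-up roster of two classrooms.
roster = {
    "class_1": ["s1", "s2", "s3", "s4"],
    "class_2": ["s5", "s6", "s7"],
}
assignment = assign_sequences(roster, seed=2026)
```

Blocking on classroom in this way keeps teacher and classroom characteristics balanced across the two sequences by construction, which is what allows the Trimester 1 comparison to be read as a parallel-arm contrast within classrooms.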