Abstract
This study is designed as a randomized training and survey experiment embedded in an eight-week professional development program on the use of artificial intelligence (AI) in education. The design combines randomized exposure to training content with repeated survey-based measurements, allowing for causal inference on both the overall effects of AI training and the specific role of instruction focused on AI oversight.
Recruitment targets in-service primary and secondary school teachers nationwide. Participation in the training program is free of charge. To incentivize participation, teachers receive an internationally recognized certificate issued by the University of Crete in Greece. Eligible teachers are randomly assigned to training cohorts based on program capacity (randomization T1). Teachers assigned to the first training cycle begin the program immediately and form the treatment group, while those assigned to a subsequent cycle start after a delay and serve as the control group during the initial evaluation period.
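A minimal sketch of how the capacity-based cohort assignment (T1) could be implemented; the function name, ID format, and fixed seed are illustrative assumptions, not the program's actual procedure:

```python
import random

def assign_cohorts(applicant_ids, cycle1_capacity, seed=2024):
    """Randomly fill the first training cycle up to capacity (T1 assignment).

    Applicants not drawn for cycle 1 are deferred to a later cycle and act
    as the delayed-start control group during the initial evaluation period.
    """
    rng = random.Random(seed)        # fixed seed makes the draw reproducible
    ids = list(applicant_ids)
    rng.shuffle(ids)                 # random order, unrelated to any covariate
    cycle1 = set(ids[:cycle1_capacity])
    return {tid: ("cycle_1" if tid in cycle1 else "cycle_2") for tid in ids}

# Example with six hypothetical applicants and capacity for three
print(assign_cohorts(["t01", "t02", "t03", "t04", "t05", "t06"], 3))
```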
All teachers in the first training cycle receive a common core curriculum during weeks 1–5. This core training focuses on the use of AI tools in educational practice and includes instructional videos, practical exercises, comprehension checks, and applied assignments such as lesson planning and the development of educational materials. The objective of this phase is to establish baseline competence in AI-assisted educational tasks.
In week 5 of the program, participants in the treatment group are randomly and evenly assigned to one of two subgroups. The first subgroup (T2A) receives specialized instruction in AI oversight during weeks 6–8. This module emphasizes evaluating the quality and reliability of AI-generated outputs, identifying bias and errors, cross-checking sources, and applying human judgment when using AI recommendations. The second subgroup (T2B) receives training in alternative digital educational tools, such as online quizzes and other non-AI instructional technologies. This subgroup serves as a comparison group for isolating the causal effect of AI oversight instruction.
Randomization into the T2A and T2B subgroups is conducted at the individual level and implemented by the training platform prior to the start of week 6. Assignment is independent of teacher characteristics and baseline outcomes, ensuring that the two subgroups are comparable in expectation.
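As a concrete illustration of this step, a minimal sketch of an even individual-level split into the two subgroups; the seed and label strings are illustrative assumptions about how the platform might implement the assignment:

```python
import random

def assign_subgroups(teacher_ids, seed=6):
    """Evenly split first-cycle teachers into T2A (AI oversight) and
    T2B (alternative digital tools) before the start of week 6.

    The split uses only a fixed seed and a random shuffle, so assignment is
    independent of teacher characteristics and baseline outcomes.
    """
    rng = random.Random(seed)
    ids = list(teacher_ids)
    rng.shuffle(ids)
    half = len(ids) // 2             # even split; one extra in T2B if odd
    return {tid: ("T2A" if i < half else "T2B") for i, tid in enumerate(ids)}

print(assign_subgroups(["t01", "t02", "t03", "t04"]))
```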
Outcomes are measured through three digital survey waves: a baseline survey before the start of training (week 0), a midline survey at the end of the core curriculum (week 5), and an endline survey following completion of the full program (week 8). Each survey includes randomized task-based modules and vignette-style grading scenarios. In productivity and creativity tasks, teachers are randomly assigned to complete tasks either with or without AI assistance, allowing identification of within-person effects of AI support. In grading scenarios, teachers are randomly exposed to AI recommendations of varying strictness, enabling measurement of reliance on, and oversight of, algorithmic advice.
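To make the survey-level randomization concrete, a minimal sketch of how the per-teacher, per-wave conditions could be drawn; the condition labels, strictness levels, and seeding scheme are illustrative assumptions rather than the study's actual instrument:

```python
import random

def assign_survey_conditions(teacher_id, wave, seed=11):
    """Draw the randomized module conditions for one teacher in one wave.

    Seeding on the (teacher, wave) pair makes each draw reproducible while
    keeping conditions independent across teachers and survey waves.
    """
    rng = random.Random(f"{teacher_id}:{wave}:{seed}")
    return {
        "task_condition": rng.choice(["with_AI", "without_AI"]),
        "vignette_strictness": rng.choice(["lenient", "moderate", "strict"]),
    }

# Example: conditions for one hypothetical teacher in the midline survey
print(assign_survey_conditions("t01", "midline"))
```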