SPARKLE - STEM Practical Activities to Raise Knowledge Learning and Exploration: a cluster-randomised evaluation of a multi-component STEM, gender, and AI educational intervention in Italian middle schools.

Last registered on May 27, 2026

Pre-Trial

General Information

Title

RCT ID

AEARCTR-0018689

Initial registration date

May 20, 2026

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published

May 27, 2026, 10:21 AM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

There is information in this trial unavailable to the public.

Primary Investigator

Name

Dominique Cappelletti

Affiliation

FBK-IRVAPP

Other Primary Investigator(s)

PI Name

Martina Bazzoli

PI Affiliation

FBK-IRVAPP

PI Name

Sergiu Burlacu

PI Affiliation

FBK-IRVAPP

PI Name

Iunio Quarto Russo

PI Affiliation

FBK-IRVAPP

PI Name

Lodovica Puxeddu

PI Affiliation

FBK-IRVAPP

Additional Trial Information

Status

On going

Start date

2025-08-01

End date

2026-07-31

Keywords

Behavior, Education, Gender

Additional Keywords

Education; STEM; gender stereotypes; artificial intelligence; science education; adolescence; school choice; randomized controlled trial; Italy; scientific reasoning; AI

JEL code(s)

Secondary IDs

Prior work

This trial does not extend or rely on any prior RCTs.

Abstract

We evaluate SPARKLE, a multi-component, school-based educational intervention designed to increase middle-school students' (second and third grade of middle school) interest in and engagement with STEM subjects, careers and high-school tracks; to improve scientific reasoning and self-efficacy; to reduce gender stereotypes; and to support a more informed and critical relationship with artificial intelligence. The intervention is delivered in three modules: (i) an Inquiry-Based Science Education (IBSE) module with hands-on science activities using sensor kits developed by Fondazione Bruno Kessler and LevelUp (10 hours); (ii) a gender-coaching module on stereotypes, self-esteem and role models (6 hours, with content adapted from the DIMORE project of the Università di Verona); (iii) an AI literacy and ethics module on basic AI knowledge and critical thinking about AI outputs and applications (2 hours).

Treatment is randomised at the class level within school site × grade × specialization strata. Data are collected through pre- and post-intervention surveys administered to approximately 1,700 students in 93 classes across 15 schools.

We pre-specify primary outcomes (STEM career interest and high school track intentions / choices), expected mechanisms (scientific reasoning, self-efficacy, gender attitudes, scientific curiosity), and an additional group of secondary and exploratory outcomes (AI literacy and attitudes, perceptions of science, scientists and the scientific method). Gender is the main moderator of interest.

External Link(s)

Registration Citation

Citation

Bazzoli, Martina et al. 2026. "SPARKLE - STEM Practical Activities to Raise Knowledge Learning and Exploration: a cluster-randomised evaluation of a multi-component STEM, gender, and AI educational intervention in Italian middle schools." AEA RCT Registry. May 27. https://doi.org/10.1257/rct.18689-1.0

Sponsors & Partners

There is information in this trial unavailable to the public.

Experimental Details

Interventions

Intervention(s)

The SPARKLE intervention is a multi-component educational programme designed to increase STEM engagement among lower-secondary school students through hands-on scientific activities, exposure to artificial intelligence, and reflection on gender stereotypes in STEM. The intervention is delivered in 2025–2026 in classes of the second and third grade (grades 7 and 8) of middle school in Trentino and Veneto.

Treated classes participate in three modules, supported by teacher co-design and training (12 hours of teacher training in 4 modules / 6 sessions of 2 hours each). Control classes receive the standard curriculum and are offered the materials at the end of the school year on a waitlist basis.

The intervention consists of three student modules implemented during the school year.

1) Hands-on STEM experimentation activities:

10-hour laboratory programme developed by Fondazione Bruno Kessler (FBK) and LevelUp. Using advanced sensor kits specifically designed for the project, students explore scientific phenomena through inquiry-based science education (IBSE) and learning-by-doing activities. The programme focuses on experimentation, observation, measurement, hypothesis testing, data interpretation, and scientific reasoning. Activities are designed to encourage collaborative problem solving and active engagement with the scientific method. At the end of the programme, students participate in a collaborative challenge in which classes are asked to solve a series of scientific puzzles and experimental tasks using the concepts and tools introduced during the activities. The challenge is designed to reinforce scientific reasoning, teamwork, and problem-solving skills in an engaging setting.

The 10 hours are organised as follows:

- (a) four classroom sessions of 2 hours each (led by LevelUp tutors), where students explore the scientific method through inquiry-based science education (IBSE) and learning-by-doing activities, using a custom sensor-based science kit (radiation and gas sensors). Students pose questions, formulate hypotheses, run experiments, compare instrument readings with sensory perceptions, and interpret results;

- (b) an inter-class contest (the "mystery object" challenge) in which students use the kit to characterise an unknown object hidden in a sealed container. The five best-performing classes per province visit FBK's research laboratories for the "Science Day" (Giornata della Scienza).

2) Gender and self-esteem coaching: Approximately 6 hours, designed and delivered by coaches coordinated by the Università di Verona, building on the DIMORE intervention. The focus is on:

- raising awareness of gender stereotypes and their effects on choices and perceptions;

- developing self-esteem, aspirations and personal agency through interactive exercises, group discussions, role-model exposure, collaborative activities and self-reflection exercises. Activities also address empathy, unconscious bias and social expectations.

3) AI literacy and critical thinking: a 2-hour module, delivered by FBK-Center for Augmented Intelligence and FBK-Center for Religious Studies, introducing artificial intelligence and generative AI systems. Specifically, it covers (i) what AI and generative AI are, with examples of everyday AI applications and a basic understanding of how they work (algorithms, models, training data, the role of human design choices); (ii) limits and risks (hallucinations, data privacy, environmental costs, ethical considerations); (iii) critical thinking and responsible AI use.

Intervention Start Date

2025-09-29

Intervention End Date

2026-05-31

Primary Outcomes

Primary Outcomes (end points)

Upper Secondary-school track preferences and choices

STEM career interest

Primary Outcomes (explanation)

STEM career interest is measured with the semantic-differential scale in Christensen and Knezek (2017), which was also administered at baseline. Students rate the prospects of a career in science, technology, engineering or mathematics on a 7-point scale for a set of 5 pairs of opposing adjectives (e.g. boring - interesting, not at all important - important). The final score is the sum of the ratings on the 5 adjective pairs, with negative items reverse-coded so that higher values indicate more positive STEM career attitudes.

STEM high-school track intentions. Students allocate a fixed budget of 10 points ("hearts") across the ten high-school tracks of the Italian system. The budget-allocation format is designed to proxy proportional intensity of preference while limiting cognitive load (Giustinelli and Manski, 2018). The primary measure is the share of hearts allocated to STEM tracks: (hearts allocated to the scientific track + hearts allocated to the technical technological track) / 10. At baseline we measured preferences for each track on a scale from 0 to 10, without imposing a fixed budget.
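As an illustration of the primary measure, the share of hearts can be computed as in the minimal sketch below. The track labels (`scientific`, `technical_technological`, `classical`) and the function name `stem_share` are hypothetical and not taken from the survey instrument; the sketch also assumes that a valid response uses the full budget of 10 hearts.

```python
# Hypothetical track labels; the survey's actual coding may differ.
STEM_TRACKS = {"scientific", "technical_technological"}
BUDGET = 10  # fixed number of hearts each student allocates


def stem_share(hearts):
    """Share of the 10-heart budget allocated to the two STEM tracks.

    `hearts` maps a track label to the number of hearts given to it.
    Tracks that received no hearts may be omitted.
    """
    if any(h < 0 for h in hearts.values()):
        raise ValueError("hearts cannot be negative")
    if sum(hearts.values()) != BUDGET:
        # Assumption of this sketch: only full allocations are valid.
        raise ValueError("allocation must use exactly 10 hearts")
    stem_hearts = sum(h for track, h in hearts.items() if track in STEM_TRACKS)
    return stem_hearts / BUDGET


example = {"scientific": 4, "technical_technological": 2, "classical": 4}
print(stem_share(example))
```

The outcome lies between 0 and 1 by construction, which is why mass points at the two bounds are possible.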

Robustness and secondary outcomes. The budget-allocation task may produce a non-trivial mass at zero (i.e. students allocating 0 points to both STEM tracks) or at ten on the STEM high-school track intentions outcome. If the empirical distribution makes the outcome unsuitable for OLS, we will use models better suited to censored distributions (e.g. Tobit models) and will also dichotomize the outcome (an indicator equal to 1 if at least one STEM track is among the student's top-three most-rated tracks).
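The dichotomized outcome could be built along the following lines. This is only a sketch: the registration does not say how ties are handled or whether a track with zero hearts can count as "top three", so the two rules used here (a track is in the top three if fewer than three tracks have strictly more hearts, and it must have at least one heart) are assumptions, as are the track labels.

```python
# Hypothetical track labels; the survey's actual coding may differ.
STEM_TRACKS = ("scientific", "technical_technological")


def stem_in_top_three(hearts, stem_tracks=STEM_TRACKS):
    """Return 1 if at least one STEM track is among the top-three rated tracks.

    Assumed rules (not specified in the registration):
    - a track with zero hearts never counts as top three;
    - ties are resolved in favour of the track: it is in the top three
      when fewer than three tracks received strictly more hearts.
    """
    for track in stem_tracks:
        own = hearts.get(track, 0)
        if own <= 0:
            continue
        strictly_higher = sum(1 for value in hearts.values() if value > own)
        if strictly_higher < 3:
            return 1
    return 0


print(stem_in_top_three({"scientific": 2, "classical": 4, "linguistic": 3, "artistic": 1}))
```

A different tie rule would change the indicator only for students with tied allocations around the third rank.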

Because the intervention both addressed the fundamentals of the scientific method and involved hands-on technological/engineering activities, we expect co-movements across the two STEM tracks. For this reason, we will additionally report effects on each STEM track separately as a secondary outcome.

Realized choice (grade-3 only). In addition to the hearts task, grade-3 students are asked which school they actually chose. Since this decision was made before some of the main project activities and may not entirely reflect pupils' preferences, we report effects on this measure as a secondary outcome.

Empirical specification

For a student i in class c, stratum s, with endline outcome Y and baseline outcome Y0, we will estimate, primarily through OLS, the following specification:

Yics = a0 + a1*Tc + a2'Xics + Ds + eics

where Tc is the class-level treatment indicator and its coefficient a1 is our estimate of the intention-to-treat (ITT) effect.

Xics is a vector of covariates measured at baseline. A subset of covariates will be included in all models (gender, Raven score, STEM attitudes index, self-efficacy index, grade dummy, and the baseline level of the outcome Y0 whenever available). Additional covariates will be selected through data-driven approaches, following recent guidelines in the literature.

Ds are strata fixed effects.
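To make the specification concrete, the sketch below recovers the coefficient a1 by OLS after absorbing the strata fixed effects Ds through within-stratum demeaning (the Frisch-Waugh-Lovell result). It is an illustration under simplifying assumptions, not the analysis code: it uses plain Python, returns the point estimate only (no standard errors), and the function names `demean_within`, `solve` and `itt_estimate` are hypothetical.

```python
from collections import defaultdict


def demean_within(values, strata):
    """Subtract the stratum mean from each value (absorbs strata fixed effects)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, s in zip(values, strata):
        sums[s] += v
        counts[s] += 1
    return [v - sums[s] / counts[s] for v, s in zip(values, strata)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def itt_estimate(y, treat, covariates, strata):
    """OLS estimate of a1 in Y = a0 + a1*T + a2'X + Ds + e.

    `covariates` is a list of columns (one list of values per covariate).
    """
    cols = [demean_within(c, strata) for c in [treat] + covariates]
    yd = demean_within(y, strata)
    k = len(cols)
    xtx = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(cols[i], yd)) for i in range(k)]
    return solve(xtx, xty)[0]  # first coefficient belongs to the treatment


# Noise-free toy data with a true effect of 0.5 and two strata.
strata = ["A"] * 4 + ["B"] * 4
treat = [1, 1, 0, 0, 1, 0, 1, 0]
x = [0, 1, 2, 3, 1, 0, 3, 2]
y = [1 + 0.5 * t + 2 * xi + (3 if s == "B" else 0)
     for t, xi, s in zip(treat, x, strata)]
print(itt_estimate(y, treat, [x], strata))
```

In practice a regression package with strata dummies gives the same point estimate and also provides the standard errors the plan calls for.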

Estimand: Given the high participation rates in project activities, the ITT will be very close to the ATE. For this reason, we will not estimate the LATE.

Missing data at baseline: for the very small share of students absent at baseline, we will impute the values of the covariates using sample means or modes.

Standard errors will be clustered at student level.

Heterogeneity

The main dimension of heterogeneity is gender.

Multiple hypotheses testing

We will evaluate the robustness of our results to multiple hypotheses testing following the latest recommendations in the literature (e.g. Westfall and Young, 1993; Anderson, 2008; Romano and Wolf, 2016; Young, 2019; List, Shaikh and Xu, 2019).

Spillover analysis

Given the class-level randomization, spillovers are possible, though due to the hands-on nature of the intervention we expect them to be minimal. As a check, we can exploit the varying treatment intensities within strata created by the experimental design. In strata with an even number of classes, the randomization protocol produces a balanced number of treated and control classes; in strata with an odd number of classes, the share of treated classes within the stratum can vary from 1/3 to 2/3, creating exogenous variation in exposure to treated students. Assuming spillovers are a function of the number of treated students within the stratum, we can explore whether treatment effects differ by the share of treated classes within the stratum.

References

Anderson, M. L. (2008). Multiple inference and gender differences in the effects of early intervention: A reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects. Journal of the American Statistical Association, 103(484), 1481-1495.

Christensen, R., & Knezek, G. (2017). Relationship of middle school student STEM interest to career intent. Journal of Education in Science, Environment and Health, 3(1), 1-13.

Giustinelli, P., & Manski, C. F. (2018). Survey measures of family decision processes for econometric analysis of schooling decisions. Economic Inquiry, 56(1), 81-99.

List, J. A., Shaikh, A. M., & Xu, Y. (2019). Multiple hypothesis testing in experimental economics. Experimental Economics, 22(4), 773-793.

Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment (Vol. 279). John Wiley & Sons.

Young, A. (2019). Channeling Fisher: Randomization tests and the statistical insignificance of seemingly significant experimental results. The Quarterly Journal of Economics, 134(2), 557-598.

Secondary Outcomes

Secondary Outcomes (end points)

Mediators
- Scientific reasoning
- Scientific curiosity
- Self-efficacy in science
- Gender stereotypes in STEM

Additional Secondary outcomes
- AI literacy and attitudes toward generative AI
- Subcomponents of the scientific reasoning task

Exploratory Dimensions
- Open-ended perceptions of science and scientists.

Secondary Outcomes (explanation)

Mechanisms

Scientific reasoning
We use a shortened version of the Elementary Scientific Reasoning Questionnaire (ESRQ; Chionas and Emvalotis, 2025), validated for upper-primary students. The full ESRQ comprises eight dimensions grouped into three second-order factors: (i) Identifying Scientific Questions and Identifying Scientific Hypotheses; (ii) Data Generation: Planning Experimental Procedures, Identifying Experimental Procedures; and (iii) Data Evaluation: Evaluating with Covariation, without Covariation, with Confounded Variables, Interpreting Graphical Data. For time reasons, and given its more distinct factor loading, we drop the Interpreting Graphical Data dimension. Also because of time constraints, each student is randomly assigned to respond to two of the four blocks in the original test (the order of the blocks is also randomized at the individual level).

In computing the scores, we will follow the results in Chionas and Emvalotis (2025) and compute 3 distinct outcomes, one for each factor (the percentage of correctly answered items within that factor). However, since we do not expect the project to have a differential impact on these 3 factors, and given the strong reported correlations among them, we will compute an overall index of scientific reasoning to reduce concerns about multiple hypothesis testing (e.g. the percentage of correct answers among all items, or the inverse covariance weighting approach proposed by Anderson, 2008, applied to the scores on the 3 factors).
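The inverse covariance weighted (ICW) index mentioned above can be sketched as follows, in the spirit of Anderson (2008): each component is standardised, the components are weighted by the row sums of the inverse of their covariance matrix, and the weighted average is taken. This is a minimal plain-Python illustration; standardising on the control group's mean and standard deviation, and the names `icw_index` and `solve`, are assumptions of the sketch rather than details given in the registration.

```python
from statistics import mean, stdev


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def icw_index(components, is_control):
    """Inverse covariance weighted index of several outcome components.

    `components` is a list of K columns (one list of N scores per component);
    `is_control` flags the students used as the standardisation reference.
    """
    z = []
    for comp in components:
        ctrl = [v for v, c in zip(comp, is_control) if c]
        mu, sd = mean(ctrl), stdev(ctrl)
        z.append([(v - mu) / sd for v in comp])
    k, n = len(z), len(z[0])
    means = [mean(col) for col in z]
    cov = [[sum((z[i][t] - means[i]) * (z[j][t] - means[j]) for t in range(n)) / (n - 1)
            for j in range(k)] for i in range(k)]
    # Weights are the row sums of the inverse covariance matrix,
    # i.e. the solution w of cov @ w = vector of ones.
    w = solve(cov, [1.0] * k)
    total = sum(w)
    return [sum(w[i] * z[i][t] for i in range(k)) / total for t in range(n)]


scores_a = [1.0, -1.0, 1.0, -1.0]
scores_b = [1.0, 1.0, -1.0, -1.0]
print(icw_index([scores_a, scores_b], [True] * 4))
```

Components that are highly correlated with the others receive less weight, so the index does not double-count information shared across the three factors.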

Test performance is incentivized through a lottery design: 30 students are randomly selected to be paid based on performance, with each correct answer worth 2 euros on a gift card for a sports store.

Scientific curiosity
We develop a curiosity task, inspired by the task in Alan and Mumcu (2024). From a pre-built bank of 149 short curiosities, most of which are likely to be unknown to students (82 STEM related across animals, space, earth, body, chemistry/physics; 66 non-STEM across history, geography, arts, sports and trivia), each student is shown a random draw of 10 STEM and 10 non-STEM curiosities, presented in randomised order. Each curiosity is presented as a short title plus a teaser question (the answer is hidden). The student picks up to 10 of the 20 curiosities. We define as the main outcome variable the number of STEM curiosities picked.
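The draw and the scoring of the curiosity task can be sketched as below. The item identifiers, the function names `draw_curiosities` and `stem_picks`, and the seeding are hypothetical; the actual task is implemented in the survey software and may differ.

```python
import random


def draw_curiosities(stem_bank, non_stem_bank, rng, n_each=10):
    """Draw 10 STEM and 10 non-STEM items and present them in random order.

    Returns a list of (item, is_stem) pairs.
    """
    shown = [(item, True) for item in rng.sample(stem_bank, n_each)]
    shown += [(item, False) for item in rng.sample(non_stem_bank, n_each)]
    rng.shuffle(shown)
    return shown


def stem_picks(shown, picked, max_picks=10):
    """Main outcome: number of STEM curiosities among the student's picks."""
    if len(picked) > max_picks:
        raise ValueError("a student can pick at most 10 curiosities")
    is_stem = dict(shown)
    return sum(1 for item in picked if is_stem[item])


rng = random.Random(2026)  # arbitrary seed, for reproducibility of the sketch
stem_bank = [f"stem_{i}" for i in range(82)]
non_stem_bank = [f"other_{i}" for i in range(66)]
shown = draw_curiosities(stem_bank, non_stem_bank, rng)
picked = [item for item, _ in shown[:6]]  # a student who picks six items
print(stem_picks(shown, picked))
```

Because every student sees exactly 10 STEM and 10 non-STEM items, the outcome ranges from 0 to 10 regardless of the draw.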

Self-efficacy in science

Self-efficacy in science is measured through two distinct outcomes:
- self-reported self-efficacy in science: before performing the scientific reasoning task, students are administered the 8-item DEVISE Self-Efficacy for Science (SES) Likert scale, developed and validated by the Cornell Lab of Ornithology (www.birds.cornell.edu/citscitoolkit/evaluation/instruments), and validated for the targeted age group by Peterman, Withy and Boulay (2018).
- incentivized self-evaluation in the scientific reasoning test: after performing the scientific reasoning test, students are asked to indicate the number of items they expect to have answered correctly (from 0 to 14). The guess is incentivized: students are awarded 5 additional points if they guess correctly.

Gender Stereotypes in STEM
Gender Stereotypes in STEM is also measured through two distinct outcomes:
- suitability of occupations by gender: students are shown a list of 25 professions, classified as STEM (10), non-STEM (10) and sport (10), and asked to indicate for each whether they perceive it as being "more suited to a man", "more suited to a woman", or "suited to both". We construct as the main outcome an index of gender-STEM stereotypes, given by the share of STEM jobs the student rates as suited to both. Since the gender stereotype component of the intervention did not address only the STEM field, we will compute an additional outcome considering all 25 occupations.
- incentivized gender beliefs on top performers in the scientific reasoning test: after completing the scientific reasoning test and the self-evaluation, on a distinct survey page, students are asked to guess the number of female and male students among the top performers in the scientific reasoning task, across the entire sample of students participating in the project, giving us a continuous measure of beliefs. We expect the project to increase the expected share of female students among top performers. The guess is incentivized, awarding 10 additional points to the students whose guesses are closest to the true distribution.

Robustness checks for mediators and additional secondary outcomes

For the incentivized self-evaluation task, we will also explore in a separate analysis the gap relative to actual performance, in particular by gender, to observe whether there is evidence of over-optimism or realism by treatment status. Given the documented gender gap in self-evaluation, we expect a narrowing of the gap between actual and expected performance, in particular among girls.

As a robustness check, in the gender suitability dimension, we will construct an additional index, considering the share of students (in particular female students) indicating that the profession is suited to both or more suited to females. While we consider this unlikely, depending on the starting level of gender attitudes and self-efficacy, the intervention may have shifted a proportion of female students from selecting the inclusive option to selecting the option identifying with their gender.

Additional secondary and exploratory outcomes

AI literacy and attitudes toward AI

We list the AI dimensions as a separate group of outcomes since we do not expect them to mediate the effects on the primary outcomes. However, as a distinct analysis, we will evaluate whether the program affects AI literacy and AI attitudes. We developed a 10-item AI literacy test (with true and false response options) and a new scale on attitudes towards AI. The factor structure and internal consistency of the scale will be evaluated using only the data of control group students. Our hypothesis is that the scale will have 5 factors (utility, social relation, negative impact, ethics and critical thinking); however, we will conduct our analysis based on the observed factor structure.

Perceptions of science and scientists, thematic analysis. As an exploratory analysis, following and slightly adapting Chionas and Emvalotis (2022), we will conduct a thematic analysis of responses to five open questions: "Could you briefly describe what you think science is?"; "Could you name a great scientific discovery you know about?"; "What three adjectives would you use to describe a scientist?"; "What three adjectives would you NOT use to describe a scientist?"; "Can you name two great scientists you know?". Given the large sample size, we will explore the possibility of validating an LLM-assisted coding procedure. We stress that this analysis is purely exploratory and will only be used to supplement the main quantitative results.

Additional questions on the scientific method. As an additional exploratory analysis, we will analyze the responses to a battery of questions, designed by the program implementer in charge of the STEM module, addressing the role of the senses, aspects of the scientific method and research community, and specific knowledge items (related to sensors, radiation and other topics). Since these items touch on some of the specific topics covered during the training, identifying treatment effects would signal a strong first stage.

Additional heterogeneity

As exploratory heterogeneity analysis, we will investigate if effects vary by baseline levels of:
- STEM attitudes (index combining career aspirations, track intentions and self-efficacy)
- Proxies of STEM ability (index combining, through ICW, grades in math and science and the Raven test score)
- Gender attitudes (index combining, through ICW, the several gender scales used at baseline). Given the large gender gap in gender attitudes measured at baseline, it is likely that these results will go in line with the heterogeneity analysis by gender.

The aim of this additional heterogeneity analysis will be to complement and deepen the heterogeneity analysis by gender.

Finally, given the richness of the data collected at baseline, we will report as exploratory a data-driven heterogeneity analysis following the latest methodological developments in the literature (e.g. Chernozhukov et al., 2018; Wager and Athey, 2018; Chernozhukov, Demirer, Duflo and Fernández-Val, 2020).

References

Alan, S., & Mumcu, I. (2024). Nurturing childhood curiosity to enhance learning: Evidence from a randomized pedagogical intervention. American Economic Review, 114(4), 1173-1210.

Chernozhukov, V., Fernández‐Val, I., & Luo, Y. (2018). The sorted effects method: discovering heterogeneous effects beyond their averages. Econometrica, 86(6), 1911-1938.

Chernozhukov, V., Demirer, M., Duflo, E., & Fernández-Val, I. (2020). Generic machine learning inference on heterogenous treatment effects in randomized experiments. 2018.

Chionas, G., & Emvalotis, A. (2022). Greek upper primary grade students’ images about science and scientists: An alternative descriptive piece of the puzzle. In Frontiers in Education (Vol. 7, p. 933288). Frontiers Media SA.

Chionas, G., & Emvalotis, A. (2025). Scientific reasoning in upper primary school students: development and validation of the ESRQ. Research in Science & Technological Education, 1-24.

Peterman, K., Withy, K., & Boulay, R. (2018). Validating common measures of self-efficacy and career attitudes within informal health education for middle and high school students. CBE—Life Sciences Education, 17(2), ar26.

Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.

Experimental Design

This is a cluster randomized controlled trial conducted in lower-secondary schools in Italy. The project targets second- and third-year students. The intervention has three main components: a coaching module on gender stereotypes and self-confidence, a hands-on STEM laboratory module based on inquiry-based science education (IBSE), and a module on artificial intelligence and research ethics.

Data are collected through baseline and follow-up questionnaires administered during school hours in computer labs using oTree. Students access the questionnaires through anonymous personal tokens assigned by school referents. Researchers do not observe the mapping between student names and tokens. The same token is used across waves to link baseline and follow-up responses while preserving anonymity.

The study includes 11 comprehensive schools, 93 classes, and approximately 1,700 students. Classrooms are randomly assigned to treatment or control. The study includes 51 treated classes and 42 control classes.

Randomization is stratified by school site, track, and grade.
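A stratified class-level assignment of this kind could look like the sketch below. The registration does not describe the exact protocol, so the rule for odd-sized strata (the extra class is treated or not with equal probability), the seed, and the names `assign_classes` and the example strata are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def assign_classes(classes, seed=0):
    """Assign classes to treatment (1) or control (0) within strata.

    `classes` is a list of (class_id, stratum) pairs, where a stratum could be
    a (school site, track, grade) tuple. Returns {class_id: 0 or 1}.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for class_id, stratum in classes:
        by_stratum[stratum].append(class_id)

    assignment = {}
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])  # fixed order before shuffling
        rng.shuffle(ids)
        n_treated = len(ids) // 2
        if len(ids) % 2 == 1:
            # Assumed rule: in odd strata the extra class is treated
            # with probability one half.
            n_treated += rng.randint(0, 1)
        for position, class_id in enumerate(ids):
            assignment[class_id] = 1 if position < n_treated else 0
    return assignment


classes = [(f"site1_g2_{i}", ("site1", "standard", 2)) for i in range(4)]
classes += [(f"site1_g3_{i}", ("site1", "standard", 3)) for i in range(3)]
print(assign_classes(classes, seed=42))
```

Under this rule, even-sized strata are always balanced, while a stratum of three classes ends up with either one or two treated classes.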

Experimental Design Details

Not available

Randomization Method

In office by a computer

Randomization Unit

Classroom

Was the treatment clustered?

Yes

Experiment Characteristics

Sample size: planned number of clusters

93 classrooms

Sample size: planned number of observations

The total sample includes 1711 pupils for whom the consent form was provided by parents (out of 1856 students in total).

Sample size (or number of clusters) by treatment arms

51 classrooms treatment, 42 classrooms control

Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

Assuming an intra-cluster correlation (ICC) of 0.07, the MDES is 0.205 standard deviations at the 5% significance level and 80% power. We expect the rich set of baseline covariates to be highly predictive of outcomes. Assuming an R2 of 0.5, the MDES is lowered to 0.145.
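For readers who want to see how figures of this magnitude arise, the sketch below applies a standard textbook approximation for the MDES of a cluster-randomised design. It is not necessarily the calculation the authors used: it assumes equal cluster sizes, a normal approximation for the critical values, and an R2 that reduces the variance equally at the class and student level. The function name `mdes` is hypothetical; the inputs (93 classes, 1711 students, 51 treated classes, ICC of 0.07) come from the registration.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_clusters, n_students, share_treated, icc, r2=0.0,
         alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size in standard deviations.

    Assumes equal cluster sizes and uses normal critical values.
    """
    cluster_size = n_students / n_clusters
    normal = NormalDist()
    multiplier = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    # Variance of the treatment-effect estimator, in outcome SD units.
    design_var = (icc + (1 - icc) / cluster_size) / (
        share_treated * (1 - share_treated) * n_clusters
    )
    return multiplier * sqrt((1 - r2) * design_var)


without_covariates = mdes(93, 1711, 51 / 93, icc=0.07)
with_covariates = mdes(93, 1711, 51 / 93, icc=0.07, r2=0.5)
print(round(without_covariates, 3), round(with_covariates, 3))
```

Under these assumptions the approximation gives roughly 0.20 and 0.14 standard deviations, close to the registered values; small differences are expected if the original calculation used t-distribution critical values or accounted for unequal class sizes.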

Supporting Documents and Materials

IRB

Institutional Review Boards (IRBs)

IRB Name

CORDI – Comitato per le Ricerche con Dati sull’Individuo, Università di Verona

IRB Approval Date

2025-09-19

IRB Approval Number

prot. 2025-UNVRCLE-0401086; rep. 1997/2025, dated 18/09/2025

IRB Name

CORDI – Comitato per le Ricerche con Dati sull’Individuo, Università di Verona

IRB Approval Date

2026-04-19

IRB Approval Number

protocol 2026-UNVRCLE-0167191; rep. 789/2026

Analysis Plan