Secondary Outcomes (explanation)
Mechanisms
Scientific reasoning
We use a shortened version of the Elementary Scientific Reasoning Questionnaire (ESRQ; Chionas and Emvalotis, 2025), validated for upper-primary students. The full ESRQ comprises eight dimensions grouped into three second-order factors: (i) Identifying Scientific Questions and Identifying Scientific Hypotheses, (ii) Data Generation: Planning Experimental Procedures, Identifying Experimental Procedures and (iii) Data Evaluation: Evaluating with Covariation, without Covariation, with Confounded Variables, Interpreting Graphical Data. Because of time constraints, and given its more distinct factor loading, we drop the Interpreting Graphical Data dimension. Also because of time constraints, each student is randomly assigned to respond to two of the four blocks in the original test (the order of the blocks is also randomized at the individual level).
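The block assignment described above could be implemented along the following lines. This is a minimal sketch, not the survey platform's actual logic: the block labels, the `assign_blocks` function, the `seed` value and the student identifier are all hypothetical.

```python
import random

# Hypothetical labels for the four blocks of the original test.
BLOCKS = ["A", "B", "C", "D"]

def assign_blocks(student_id, n_blocks=2, seed=2025):
    """Return the blocks a student answers, in the order they are shown.

    Seeding a dedicated generator with the student identifier makes the
    assignment reproducible for each student.
    """
    rng = random.Random(f"{seed}-{student_id}")
    # sample() draws without replacement and returns the blocks in
    # selection order, so the presentation order is randomized as well.
    return rng.sample(BLOCKS, n_blocks)

assignment = assign_blocks("student-001")
```

Calling `assign_blocks` twice with the same identifier returns the same two blocks in the same order, which makes the assignment auditable after data collection.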
In computing the scores, we will follow the results in Chionas and Emvalotis (2025) and compute 3 distinct outcomes, one for each of the 3 second-order factors (the percentage of correctly answered items within that factor). However, since we do not expect the project to have a differential impact on these 3 factors, and given the strong reported correlations among them, we will compute an overall index of scientific reasoning in order to reduce concerns about multiple hypothesis testing (e.g. the percentage of correct answers among all items, or the inverse covariance weighting approach proposed by Anderson, 2008, applied to the scores in the 3 factors).
Test performance is incentivized through a lottery design: 30 students are randomly selected to be paid based on performance, with each correct answer worth 2 euros on a gift card for a sports store.
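One possible implementation of the inverse-covariance-weighted index is sketched below. It rests on assumptions not stated in the plan: each factor score is standardized with the control-group mean and standard deviation, the covariance matrix is estimated on the full sample, and the final index is rescaled so the control group has mean 0 and standard deviation 1. The function name `icw_index` and the synthetic data are illustrative only.

```python
import numpy as np

def icw_index(scores, is_control):
    """Inverse-covariance-weighted index of the columns of `scores`.

    scores: (n_students, n_factors) array of factor scores (e.g. % correct).
    is_control: boolean array marking control-group students.
    """
    scores = np.asarray(scores, dtype=float)
    is_control = np.asarray(is_control, dtype=bool)
    # Standardize each factor with the control-group mean and SD (assumption).
    mu = scores[is_control].mean(axis=0)
    sd = scores[is_control].std(axis=0, ddof=1)
    z = (scores - mu) / sd
    # Weights are the row sums of the inverse covariance matrix,
    # normalized to sum to one.
    sigma = np.cov(z, rowvar=False)
    weights = np.linalg.solve(sigma, np.ones(z.shape[1]))
    weights = weights / weights.sum()
    index = z @ weights
    # Rescale so the control group has mean 0 and SD 1 (assumption).
    index = (index - index[is_control].mean()) / index[is_control].std(ddof=1)
    return index, weights

# Synthetic, positively correlated scores for 200 students and 3 factors.
rng = np.random.default_rng(0)
common = rng.normal(size=200)
scores = np.column_stack([common + rng.normal(size=200) for _ in range(3)])
is_control = np.arange(200) % 2 == 0
index, weights = icw_index(scores, is_control)
```

Factors that are highly correlated with the others receive less weight, so the index gives more weight to the factors carrying independent information.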
Scientific curiosity
We develop a curiosity task inspired by the task in Alan and Mumcu (2024). From a pre-built bank of 149 short curiosities, most of which are likely to be unknown to students (82 STEM related across animals, space, earth, body, chemistry/physics; 66 non-STEM across history, geography, arts, sports and trivia), each student is shown a random draw of 10 STEM and 10 non-STEM curiosities, presented in randomised order. Each curiosity is presented as a short title plus a teaser question (the answer is hidden). The student picks up to 10 of the 20 curiosities. We define the main outcome variable as the number of STEM curiosities picked.
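The draw and the outcome variable can be illustrated with a short sketch. The function names, the placeholder item titles and the fixed seed are hypothetical; only the bank sizes by category, the 10 + 10 draw and the cap of 10 picks come from the task description.

```python
import random

def draw_curiosities(stem_bank, non_stem_bank, rng, n_each=10):
    """Draw n_each items per category and shuffle the presentation order."""
    shown = [(title, "STEM") for title in rng.sample(stem_bank, n_each)]
    shown += [(title, "non-STEM") for title in rng.sample(non_stem_bank, n_each)]
    rng.shuffle(shown)
    return shown

def count_stem_picks(shown, picked_titles, max_picks=10):
    """Main outcome: number of STEM curiosities among the student's picks."""
    if len(picked_titles) > max_picks:
        raise ValueError("a student may pick at most max_picks curiosities")
    category = dict(shown)  # title -> category, for the items shown
    return sum(category[title] == "STEM" for title in picked_titles)

# Placeholder banks with the category sizes reported in the text.
stem_bank = [f"stem_{i}" for i in range(82)]
non_stem_bank = [f"non_stem_{i}" for i in range(66)]
shown = draw_curiosities(stem_bank, non_stem_bank, random.Random(7))
```

Because every student sees the same number of STEM and non-STEM items, the count of STEM picks is comparable across students regardless of which items were drawn.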
Self-efficacy in science
Self-efficacy in science is measured through two distinct outcomes:
- self-reported self-efficacy in science: before performing the scientific reasoning task, students are administered the 8-item DEVISE Self-Efficacy for Science (SES) Likert scale, developed and validated by the Cornell Lab of Ornithology (www.birds.cornell.edu/citscitoolkit/evaluation/instruments) and validated for the targeted age group by Peterman, Withy and Boulay (2018).
- incentivized self-evaluation in the scientific reasoning test: after performing the scientific reasoning test, students are asked to indicate the number of items they expect to have answered correctly (from 0 to 14). The guess is incentivized: students are awarded 5 additional points if they guess correctly.
Gender Stereotypes in STEM
Gender stereotypes in STEM are also measured through two distinct outcomes:
- suitability of occupations by gender: students are shown a list of 25 professions, classified as STEM (10), non-STEM (10) and sport (10), and asked to indicate for each whether they perceive it as being "more suited to a man", "more suited to a woman", or "suited to both". We construct as the main outcome an index of gender-STEM stereotypes, given by the share of STEM jobs the student rates as suited to both. Since the gender stereotype component of the intervention did not address only the STEM field, we will compute an additional outcome considering all 25 occupations.
- incentivized gender beliefs on top performers in the scientific reasoning test: after completing the scientific reasoning test and the self-evaluation, on a distinct survey page, students are asked to guess the number of female and male students among the top performers in the scientific reasoning task, across the entire sample of students participating in the project, giving us a continuous measure of beliefs. We expect the project to have increased the expected share of female students among top performers. The guess was incentivized, awarding 10 additional points to the students whose guesses were closest to the true distribution.
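The two outcomes above reduce to simple shares, sketched here for a single student. The response codes (`"man"`, `"woman"`, `"both"`), the function names and the example occupations are hypothetical; the `accepted` argument is added only so that the same helper can also count other response options.

```python
def share_rated(ratings, occupations, accepted=("both",)):
    """Share of `occupations` for which the student's answer is in `accepted`."""
    answers = [ratings[occupation] for occupation in occupations]
    return sum(answer in accepted for answer in answers) / len(answers)

def expected_female_share(n_female, n_male):
    """Expected share of female students among top performers, from the guess."""
    total = n_female + n_male
    if total == 0:
        return None  # no usable guess
    return n_female / total

# Illustrative responses from one student.
ratings = {"engineer": "man", "biologist": "both",
           "nurse": "woman", "teacher": "both"}
stem_jobs = ["engineer", "biologist"]

stem_index = share_rated(ratings, stem_jobs)       # STEM occupations only
all_index = share_rated(ratings, list(ratings))    # all listed occupations
belief = expected_female_share(4, 6)
```

Passing `accepted=("both", "woman")` yields the share of occupations rated as suited to both or more suited to a woman.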
Robustness checks for mediators and additional secondary outcomes
For the incentivized self-evaluation task, we will also explore in a separate analysis the gap relative to actual performance, in particular by gender, to observe whether there is evidence of over-optimism or realism by treatment status. Given the documented gender gap in self-evaluation, we expect a narrowing of the gap between actual and expected performance, in particular among girls.
As a robustness check, in the gender suitability dimension, we will construct an additional index, considering the share of students (in particular female students) indicating that the profession is suited to both or more suited to women. While we consider this unlikely, depending on the starting level of gender attitudes and self-efficacy, the intervention may have shifted a proportion of female students from selecting the inclusive option to selecting the option identifying with their own gender.
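A minimal sketch of the gap computation follows, assuming the gap is defined as expected minus actual correct answers (positive values indicate over-optimism). The record layout, field names and example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def calibration_gap(expected_correct, actual_correct):
    """Expected minus actual number of correct answers."""
    return expected_correct - actual_correct

def mean_gap_by_group(records):
    """Average gap for each (treated, gender) cell."""
    gaps = defaultdict(list)
    for r in records:
        key = (r["treated"], r["gender"])
        gaps[key].append(calibration_gap(r["expected"], r["actual"]))
    return {key: mean(values) for key, values in gaps.items()}

# Illustrative records, not real data.
records = [
    {"treated": True, "gender": "girl", "expected": 8, "actual": 9},
    {"treated": True, "gender": "girl", "expected": 10, "actual": 9},
    {"treated": False, "gender": "girl", "expected": 5, "actual": 9},
    {"treated": False, "gender": "boy", "expected": 12, "actual": 9},
]
gaps = mean_gap_by_group(records)
```

Comparing the cell means across treatment status within gender shows whether the gap narrows; the absolute value of the gap could be used instead if only accuracy, not its direction, is of interest.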
Additional secondary and exploratory outcomes
AI literacy and attitudes toward AI
We list the AI dimensions as a separate group of outcomes since we do not expect them to mediate the effects on the primary outcomes. However, as a distinct analysis, we will evaluate whether the program affects AI literacy and attitudes toward AI. We developed a 10-item AI literacy test (with true/false response options) and a new scale on attitudes toward AI. The factor structure and internal consistency of the scale will be evaluated using only data from control group students. Our hypothesis is that the scale will have 5 factors (utility, social relation, negative impact, ethics and critical thinking); however, we will conduct our analysis based on the observed factor structure.
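The plan does not name a specific internal-consistency statistic. As an illustration only, the sketch below computes Cronbach's alpha, one common choice, on control-group responses. The record layout, the `cronbach_alpha` function and the example responses are assumptions.

```python
from statistics import variance

def cronbach_alpha(item_rows):
    """Cronbach's alpha for a list of per-student item-response lists."""
    k = len(item_rows[0])
    # Sample variance of each item across students.
    item_variances = [variance(column) for column in zip(*item_rows)]
    # Sample variance of the students' total scores.
    total_variance = variance([sum(row) for row in item_rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Illustrative Likert responses; only control-group students are kept.
data = [
    {"treated": False, "items": [1, 1, 1]},
    {"treated": False, "items": [2, 2, 2]},
    {"treated": True, "items": [5, 1, 3]},
    {"treated": False, "items": [3, 3, 3]},
    {"treated": False, "items": [4, 4, 4]},
]
control_rows = [row["items"] for row in data if not row["treated"]]
alpha = cronbach_alpha(control_rows)
```

In this toy example the three items move together perfectly among control students, so alpha equals 1; with real data it would be computed separately for each factor retained.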
Perceptions of science and scientists, thematic analysis. As an exploratory analysis, following and slightly adapting Chionas and Emvalotis (2022), we will conduct a thematic analysis of responses to five open questions: "Could you briefly describe what you think science is?"; "Could you name a great scientific discovery you know about?"; "What three adjectives would you use to describe a scientist?"; "What three adjectives would you NOT use to describe a scientist?"; "Can you name two great scientists you know?". Given the large sample size, we will explore the possibility of validating an LLM-assisted coding procedure. We stress that this analysis is purely exploratory and will only be used to supplement the main quantitative results.
Additional questions on the scientific method. As an additional exploratory analysis, we will analyze the responses to a battery of questions, designed by the program implementer in charge of the STEM module, addressing the role of the senses, aspects of the scientific method and research community, and specific knowledge items (related to sensors, radiation and other topics). Since these items touch on some of the specific topics covered during the training, identifying treatment effects would signal a strong first stage.
Additional heterogeneity
As an exploratory heterogeneity analysis, we will investigate whether effects vary by baseline levels of:
- STEM attitudes (index combining career aspirations, track intentions and self-efficacy)
- Proxies of STEM ability (index combining, through ICW, grades in math and science and the Raven test)
- Gender attitudes (index combining, through ICW, the several gender scales used at baseline). Given the large gender gap in gender attitudes measured at baseline, it is likely that these results will be in line with the heterogeneity analysis by gender.
The aim of this additional heterogeneity analysis will be to complement and deepen the heterogeneity analysis by gender.
Finally, given the richness of the data collected at baseline, we will report as exploratory a data-driven heterogeneity analysis following the latest methodological developments in the literature (Chernozhukov et al., 2018; Wager and Athey, 2018; Chernozhukov, Demirer, Duflo and Fernández-Val, 2020, among others).
References
Alan, S., & Mumcu, I. (2024). Nurturing childhood curiosity to enhance learning: Evidence from a randomized pedagogical intervention. American Economic Review, 114(4), 1173-1210.
Chernozhukov, V., Fernández‐Val, I., & Luo, Y. (2018). The sorted effects method: discovering heterogeneous effects beyond their averages. Econometrica, 86(6), 1911-1938.
Chernozhukov, V., Demirer, M., Duflo, E., & Fernández-Val, I. (2020). Generic machine learning inference on heterogeneous treatment effects in randomized experiments.
Chionas, G., & Emvalotis, A. (2022). Greek upper primary grade students’ images about science and scientists: An alternative descriptive piece of the puzzle. Frontiers in Education, 7, 933288.
Chionas, G., & Emvalotis, A. (2025). Scientific reasoning in upper primary school students: development and validation of the ESRQ. Research in Science & Technological Education, 1-24.
Peterman, K., Withy, K., & Boulay, R. (2018). Validating common measures of self-efficacy and career attitudes within informal health education for middle and high school students. CBE—Life Sciences Education, 17(2), ar26.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.