Developing Simple Methods for Artificial Intelligence Learning and Education (SMaILE) through a gamified educational app: an RCT with middle school pupils in Italy

Last registered on April 26, 2023


Trial Information

General Information

Developing Simple Methods for Artificial Intelligence Learning and Education (SMaILE) through a gamified educational app: an RCT with middle school pupils in Italy
Initial registration date
April 21, 2023

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
April 26, 2023, 5:16 PM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.



Primary Investigator

Royal Holloway University of London

Other Primary Investigator(s)

PI Affiliation
Politecnico di Torino
PI Affiliation
Research Institute for the Evaluation of Public Policies (FBK-IRVAPP)
PI Affiliation
Politecnico di Torino

Additional Trial Information

Ongoing
Start date
End date
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Artificial Intelligence (AI) has increasingly become part of everyday life, so it is crucial that young people start familiarizing themselves with it early. The successful use of AI for humanity strongly relies on the abilities and competencies of the people who develop, implement, and use it. Thus, a fundamental prerequisite for addressing the profound changes that our society faces is educating people with solid digital competencies, among which AI plays a central role. It will become increasingly important to understand the mechanisms behind AI and to develop informed attitudes towards it and knowledge of how it works. We evaluate a randomized educational intervention that aims to familiarize middle school pupils with AI concepts and stimulate interest in STEM-related subjects and careers. A randomly selected sub-sample of classes is granted access to an innovative app for mobile devices (phones and tablets), the “SMaILEApp”. The app aims to teach complex AI techniques to pupils through gamification, breaking down complex concepts into straightforward applications accessible to a broad spectrum of young users. In line with gamification techniques, the app uses points, credits, rewards, challenges and virtual goods to ensure maximum engagement of the users. The SMaILEApp thus allows students to learn AI by playing. The app is an educational macro-game that contains micro-games, each of which focuses on a specific AI topic, e.g. machine learning, planning, optimization, etc.

A primary objective of this app is to change attitudes towards AI: the students learn that there are multiple AI sub-areas, each with its strengths and weaknesses, instead of treating AI as a monolithic magic “black box”.

The SMaILE app, as an educational device, has been tested in 57 middle-school classes (second grade of middle school) in 20 schools in the Piemonte region, in Northern Italy. Within each school, half of the classes are randomly assigned the app from the 2022-23 academic year; a pre-test and a post-test are carried out, and administrative information from school registries will be collected at the end of the school year to complement the analysis. We plan to measure how the app changes attitudes and perceptions towards AI and, in particular, whether students show a more informed view of the role of AI in society after understanding it better.
External Link(s)

Registration Citation

Ballatore, Maria Giulia et al. 2023. "Developing Simple Methods for Artificial Intelligence Learning and Education (SMaILE) through a gamified educational app: an RCT with middle school pupils in Italy." AEA RCT Registry. April 26.
Experimental Details


The SMaILEApp aims to teach complex AI techniques to pupils through gamification, breaking down complex concepts into straightforward applications accessible to a broad spectrum of young users. SMaILEApp starts from simplified real-world problems that could benefit from AI solutions and assists users in gradually developing the building blocks required to design and implement such a solution. The main game mode is city building. Users take the role of the mayor and are tasked with building a city through a tile-placement game. The aim is to make the city as sustainable as possible by building efficient parks, roads, schools, hospitals and other facilities. Once the city is built and populated, a high level of sustainability needs to be maintained. The mayor can do this by instructing the various aldermen, who are the AI engines that assist him/her. In line with gamification techniques, the app uses points, credits, rewards, challenges and virtual goods to ensure maximum engagement of the users. The SMaILEApp thus allows students to learn artificial intelligence by playing. The app is an educational macro-game that contains micro-games, each of which focuses on a specific AI topic, e.g. machine learning, planning, optimization, etc.
Intervention Start Date
Intervention End Date

Primary Outcomes

Primary Outcomes (end points)
AI attitudes
AI theoretical knowledge
AI reasoning
Heterogeneous treatment effects will be systematically explored by gender.
Primary Outcomes (explanation)
The first group of primary outcomes of interest, most directly related to the intervention, is represented by:

- AI theoretical knowledge (test developed by the research team). The variable was not measured at baseline; however, it likely correlates with several of the other outcomes. The test is composed of 12 trials with multiple choice answers. The score will be given by the percentage of correct trials.

- AI reasoning (test developed by the research team). A shorter version of the test, with only 3 items that differ from the follow-up ones, was also administered at baseline. The test at follow-up has 9 trials with multiple choice answers and closed answers. The score will be given by the percentage of correct trials. At baseline, we additionally measured computational abilities using a test adapted from Bati (2018), which we expect to correlate strongly with AI reasoning.

- Attitudes towards AI (measured through the test in Kim and Lee, 2020). The test has 17 items on a 5-point Likert scale ranging from “Strongly Disagree” to “Strongly Agree”. After reversing some of the items, the final score will be computed by summing the individual items (normalized and/or standardized). The variable was also measured at baseline.
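The scoring rules described above can be sketched as follows. This is only an illustration: the function names, the answer key and the indices of the reverse-coded items are hypothetical (the registration does not say which items are reversed), and the final normalization or standardization step is not shown.

```python
from typing import Iterable, Sequence


def percent_correct(answers: Sequence[str], key: Sequence[str]) -> float:
    """Share of trials answered correctly, expressed as a percentage."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    correct = sum(a == k for a, k in zip(answers, key))
    return 100.0 * correct / len(key)


def likert_sum(responses: Sequence[int],
               reversed_items: Iterable[int] = (),
               points: int = 5) -> int:
    """Sum Likert responses after reverse-coding the listed items.

    `reversed_items` holds 0-based item positions (hypothetical here).
    On a 5-point scale a reversed response r becomes 6 - r.
    """
    to_reverse = set(reversed_items)
    total = 0
    for position, response in enumerate(responses):
        if not 1 <= response <= points:
            raise ValueError(f"response {response} outside 1..{points}")
        total += (points + 1 - response) if position in to_reverse else response
    return total


# Hypothetical 12-trial knowledge test with 11 correct answers.
knowledge = percent_correct(list("ABCDABCDABCD"), list("ABCDABCDABCA"))

# Hypothetical 17-item attitude scale; items 2 and 5 assumed reverse-coded.
attitudes = likert_sum([4] * 17, reversed_items={2, 5})
```

With these made-up inputs, the attitude score is 15 items scored 4 plus 2 reversed items scored 2.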

Given female students' lower scores in AI attitudes and reasoning, also measured at baseline, we expect potentially stronger effects for female than for male students (e.g. a narrowing of the gap).

Secondary Outcomes

Secondary Outcomes (end points)
Attitudes towards a STEM career

High school track intentions

Mental rotation abilities

Middle school grades in math and science (if made available by the school)

Environmental identity
Secondary Outcomes (explanation)
The first group of secondary outcomes of interest is represented by attitudes and interest in a STEM career and high-school track. We expect the intervention to possibly affect these through several channels capturing STEM attitudes and knowledge (AI-related attitudes and abilities, see the primary outcomes; rotational abilities and improved grades, see the following groups of secondary outcomes). However, given the light-touch nature of the intervention and the fact that career and high-school track intentions are heavily influenced by other factors (parents, peers, teachers, etc.) that the intervention does not address directly, there is a risk that any effects cannot be captured with sufficient statistical power.

- Attitudes towards a STEM career (measured through the scale in Christensen and Knezek, 2017). Pupils are asked to rate the prospects of a career in science, technology, engineering and mathematics on a 7-point scale, for a set of 5 pairs of opposing adjectives (e.g. boring - interesting, not at all important - important). The final score is computed by summing the ratings on the 5 pairs of adjectives. The scale was also administered at baseline.

- STEM high-school track intentions. Pupils are asked to indicate, on a scale from 0 to 10, how interested they are in pursuing each of 7 possible high-school tracks in the Italian educational system. Two of these have a stronger STEM focus: the so-called Scientific high school and Technical high school. The ratings for these two tracks will be averaged. However, we will also look at the individual tracks, given that the intervention may push students away from the Technical track towards the more prestigious Scientific track. The variable was also measured at baseline. A further question will ask students to indicate their most preferred choice from the list of 7 tracks. Given that the intervention is implemented in the second-to-last year of middle school, we will only be able to measure their actual choice in a follow-up study a year later, if schools and parents agree to it.

- Mental rotation abilities, an additional STEM-related ability (measured through the shortened validated test in Yoon, 2011, and also measured at baseline). The test consists of 10 trials with multiple choice answers, to be completed in 10 minutes. The test score is the number of trials answered correctly.
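As a small worked illustration of how these secondary outcomes could be scored, the sketch below sums the 5 adjective-pair ratings and averages the two STEM track ratings. The function names and dictionary keys are hypothetical; only the Scientific and Technical tracks are named in this registration, so the other tracks are represented by a placeholder.

```python
from typing import Mapping, Sequence


def stem_career_score(ratings: Sequence[int], points: int = 7) -> int:
    """Sum of the ratings on the 5 adjective pairs (each from 1 to `points`)."""
    if len(ratings) != 5:
        raise ValueError("expected ratings for exactly 5 adjective pairs")
    if any(not 1 <= r <= points for r in ratings):
        raise ValueError(f"ratings must lie in 1..{points}")
    return sum(ratings)


def stem_track_interest(interest: Mapping[str, float],
                        stem_tracks: Sequence[str] = ("scientific", "technical")) -> float:
    """Average 0-10 interest across the STEM-focused tracks."""
    values = [interest[track] for track in stem_tracks]
    if any(not 0 <= v <= 10 for v in values):
        raise ValueError("interest ratings must lie in 0..10")
    return sum(values) / len(values)


# Hypothetical pupil: the key "other_track" stands in for the unnamed tracks.
career = stem_career_score([7, 6, 5, 4, 3])                                        # 25
tracks = stem_track_interest({"scientific": 8, "technical": 4, "other_track": 6})  # 6.0
```

Keeping the two track ratings available separately, as the text notes, allows checking whether a stable average hides a shift from the Technical to the Scientific track.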

The third group of secondary outcomes is given by middle school grades in math, technology and science (if made available by the school). To ensure privacy, schools were provided with a database containing the unique project student IDs for each class, in which they are expected to report grades in several subjects from the previous school year (pre-intervention) and at the end of the current school year (post-intervention). If, for some reason, schools do not systematically provide such data, this outcome will be excluded from the analysis. Note also that some teachers may grade students on the class curve, which would make grades not comparable across classes.

As for the primary outcomes, we expect potentially larger effects for female students also on the secondary outcomes. With respect to STEM-related attitudes and high-school track intentions, computational and mental rotation abilities, we measured a gap in favor of male students at baseline. As a result, the main dimension of heterogeneity explored will be gender.

The last secondary outcome analyzed is Environmental Identity, a shortened (4-item) scale taken from the original scale in Panzone et al. (2018) and used in Fanghella and Thøgersen (2022). The score on the scale (after reversing one item) will be given by the sum of all the items on the scale (scored on a 9-point Likert scale from “totally disagree” to “totally agree”). We include this outcome because the app narrative relates to the creation of a smart and environmentally sustainable city. By playing each mini-game, the users increase different scores, one of which is called "sustainability" and depends on the type of actions taken inside the game itself.

Our a priori belief is that given groups of variables should be similarly affected by the intervention. While we do not expect these effects to be of the same magnitude or statistical significance, the direction is expected to be the same. Aggregating such groups of variables into one index has the advantage of reducing noise and dealing with multiple hypotheses testing concerns. We will follow Anderson (2008) in computing the indices for each group of variables. Besides the effects on the aggregated indices, we will always report the effects of each component and provide potential explanations in case there are significant differences in the estimated effects, while accounting for multiple hypothesis testing.
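A minimal sketch of a summary index in the spirit of Anderson (2008) is given below. It is an assumption-laden illustration, not the registered analysis code: the function name is hypothetical, outcomes are assumed to be already oriented so that higher is "better", missing values are not handled, and the covariance matrix is estimated on the full sample.

```python
import numpy as np


def anderson_index(outcomes: np.ndarray, is_control: np.ndarray) -> np.ndarray:
    """Inverse-covariance-weighted index of several outcomes.

    outcomes:   (n_students, n_outcomes) array, higher = better in every column.
    is_control: boolean array of length n_students.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    is_control = np.asarray(is_control, dtype=bool)
    control = outcomes[is_control]

    # Step 1: standardize each outcome with the control-group mean and SD.
    z = (outcomes - control.mean(axis=0)) / control.std(axis=0, ddof=1)

    # Step 2: weight each outcome by the row sum of the inverse covariance
    # matrix, so highly correlated outcomes receive less weight.
    sigma_inv = np.linalg.inv(np.atleast_2d(np.cov(z, rowvar=False)))
    weights = sigma_inv.sum(axis=1)
    index = z @ weights / weights.sum()

    # Step 3: rescale so the control group has mean 0 and SD 1.
    control_index = index[is_control]
    return (index - control_index.mean()) / control_index.std(ddof=1)


# Simulated data only: 200 students, 3 outcomes, half in control.
rng = np.random.default_rng(0)
simulated = rng.normal(size=(200, 3))
control_flag = np.arange(200) < 100
summary = anderson_index(simulated, control_flag)
```

Because the index is expressed in control-group standard deviations, a treatment coefficient on it reads directly as an effect size.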

In addition, we will evaluate the robustness of our results to multiple hypotheses testing following the latest recommendations in the literature (e.g. Westfall and Young, 1993; Anderson, 2008; Romano and Wolf, 2016; Young, 2019).
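To make the idea concrete, here is a deliberately simplified sketch of a single-step max-statistic adjustment, in the spirit of Westfall and Young (1993), combined with re-randomization of class-level treatment within schools. It is not the stepdown procedure of Romano and Wolf (2016) and not the registered analysis: it uses unstudentized differences in means, assumes outcomes are already standardized, and all names are hypothetical.

```python
import numpy as np


def diff_in_means(y: np.ndarray, treated: np.ndarray) -> np.ndarray:
    """Treated-minus-control mean for every outcome column."""
    return y[treated].mean(axis=0) - y[~treated].mean(axis=0)


def maxt_adjusted_pvalues(y, class_id, school_of_class, treated_class,
                          n_perm=1000, seed=0):
    """Single-step max-statistic adjusted p-values.

    y:               (n_students, n_outcomes) standardized outcomes.
    class_id:        (n_students,) class index of each student, 0..J-1.
    school_of_class: (J,) school label of each class (the strata).
    treated_class:   (J,) boolean treatment status of each class.
    """
    rng = np.random.default_rng(seed)
    observed = np.abs(diff_in_means(y, treated_class[class_id]))
    exceed = np.zeros(y.shape[1])
    for _ in range(n_perm):
        # Re-draw the assignment within each school, keeping the number
        # of treated classes per school fixed.
        placebo = treated_class.copy()
        for school in np.unique(school_of_class):
            members = np.flatnonzero(school_of_class == school)
            placebo[members] = rng.permutation(treated_class[members])
        max_stat = np.abs(diff_in_means(y, placebo[class_id])).max()
        exceed += max_stat >= observed
    return (1 + exceed) / (1 + n_perm)


# Simulated data only: 4 schools x 4 classes x 10 students, 2 outcomes,
# with an effect on the first outcome and none on the second.
rng = np.random.default_rng(1)
school_of_class = np.repeat(np.arange(4), 4)
treated_class = np.tile([True, True, False, False], 4)
class_id = np.repeat(np.arange(16), 10)
y = rng.normal(size=(160, 2))
y[treated_class[class_id], 0] += 1.5
adjusted = maxt_adjusted_pvalues(y, class_id, school_of_class, treated_class, n_perm=500)
```

Comparing every outcome against the permutation distribution of the largest statistic controls the family-wise error rate across the outcomes tested together.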

Heterogeneity will be investigated, only in an exploratory way, along the dimensions listed below. We will adopt a data-driven approach following the latest methodological developments in the literature (Chernozhukov et al., 2018; Wager and Athey, 2018; Chernozhukov, Demirer, Duflo and Fernández-Val, 2020).
- The baseline level of each outcome
- STEM abilities and math self-efficacy at baseline
- Combined scores of the different tests at baseline (rotational, computational, etc.)
- Initial high school track choice and STEM-career intentions
- Socio-economic background

Anderson, M. L. (2008). Multiple inference and gender differences in the effects of early intervention: A reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects. Journal of the American Statistical Association, 103(484), 1481-1495.
Bati, K. (2018). Computational thinking test (CTT) for middle school students. Mediterranean Journal of Educational Research, 12, 89-101.
Chernozhukov, V., Fernández‐Val, I., & Luo, Y. (2018). The sorted effects method: discovering heterogeneous effects beyond their averages. Econometrica, 86(6), 1911-1938.
Chernozhukov, V., Demirer, M., Duflo, E., & Fernández-Val, I. (2020). Generic machine learning inference on heterogeneous treatment effects in randomized experiments. 2018.
Christensen, R., & Knezek, G. (2017). Relationship of middle school student STEM interest to career intent. Journal of Education in Science, Environment and Health, 3(1), 1-13.
Fanghella, V., & Thøgersen, J. (2022). Experimental evidence of moral cleansing in the interpersonal and environmental domains. Journal of Behavioral and Experimental Economics, 97, 101838.
Kim, S.-W., & Lee, Y. (2020). Development of test tool of attitude toward artificial intelligence for middle school students. The Journal of Korean Association of Computer Education, 23, 17-30. doi:10.32431/kace.2020.23.3.003.
Panzone, L. A., Ulph, A., Zizzo, D. J., Hilton, D., & Clear, A. (2021). The impact of environmental recall and carbon taxation on the carbon footprint of supermarket shopping. Journal of Environmental Economics and Management, 109, 102137.
Romano, J. P., & Wolf, M. (2016). Efficient computation of adjusted p-values for resampling-based stepdown multiple testing. Statistics & Probability Letters, 113, 38-40.
Yoon, S. Y. (2011). Psychometric properties of the Revised Purdue Spatial Visualization Tests: Visualization of Rotations (The Revised PSVT:R) (Doctoral Dissertation). Retrieved from ProQuest Dissertations and Theses.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment (Vol. 279). John Wiley & Sons.
Young, A. (2019). Channeling Fisher: Randomization tests and the statistical insignificance of seemingly significant experimental results. The Quarterly Journal of Economics, 134(2), 557-598.

Experimental Design

Experimental Design
Treatment was assigned at the class level within each school, with at least one control and at least one treated class per school. No school participated with fewer than two classes. The sample of schools was recruited on a voluntary basis by the Ufficio Scolastico Regionale for the Piemonte Region from the population of middle schools in the region. The treated classes received the SMaILE App starting in December 2022 and no later than January 2023, while the control classes did not receive anything, although teachers were made aware that these classes would also receive the app at the end of the school year.
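One way such a within-school assignment could be implemented is sketched below. The function name, the seed and the school and class labels are hypothetical, and the handling of schools with an odd number of classes (rounding the number of treated classes up or down at random) is an assumption, since the registration does not describe it.

```python
import random
from typing import Dict, List


def assign_classes(classes_by_school: Dict[str, List[str]], seed: int = 2022) -> Dict[str, bool]:
    """Randomly assign classes to treatment within each school (the stratum).

    Returns a mapping class -> True (treated) / False (control), with at
    least one treated and one control class in every school.
    """
    rng = random.Random(seed)
    assignment: Dict[str, bool] = {}
    for school in sorted(classes_by_school):
        classes = list(classes_by_school[school])
        if len(classes) < 2:
            raise ValueError(f"school {school!r} needs at least two classes")
        rng.shuffle(classes)
        n_treated = len(classes) // 2
        # Odd number of classes: treat the extra class with probability 1/2.
        if len(classes) % 2 == 1 and rng.random() < 0.5:
            n_treated += 1
        for position, class_name in enumerate(classes):
            assignment[class_name] = position < n_treated
    return assignment


# Hypothetical schools and classes.
example = assign_classes({
    "school_A": ["2A", "2B"],
    "school_B": ["2C", "2D", "2E"],
})
```

Fixing the seed makes the draw reproducible, which is useful when the assignment has to be documented or audited later.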

Experimental Design Details

Randomization Method
Randomization done in office by computer.
Randomization Unit
Class-level randomization stratified by school.
Was the treatment clustered?

Experiment Characteristics

Sample size: planned number of clusters
57 classes, 20 schools. The cluster is the class (unit of randomization).
Sample size: planned number of observations
The classes participating in the evaluation have a total of 1192 students; 1031 students obtained parental consent to participate in the evaluation; 995 filled in the baseline survey.
Sample size (or number of clusters) by treatment arms
28 control classes (504 control students), 29 treated classes (527 treated students)
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Our MDES at a 5% significance level and 80% power, assuming an intra-cluster correlation (ICC) of 0.1, is roughly 0.3 standard deviations (SD). At baseline, we detect an ICC lower than 0.1 on most outcomes; however, at endline it could be higher if the outcomes of pupils in the treated group become more correlated within class. The effective MDES in our main specifications will be lower given that, for most of the outcomes, we will control for their baseline levels in addition to other covariates. For instance, with an ICC of 0.1 and covariates (including the baseline level when available) explaining 40% of the variance in the outcome (R2 = 0.4), the MDES decreases to roughly 0.23 SD.
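The order of magnitude of these figures can be checked with the standard MDES formula for a cluster-randomized design. The sketch below rests on simplifying assumptions that are not stated in the registration: a normal approximation for the multiplier, equal class sizes (1031 consenting students over 57 classes), and an R2 that reduces variance equally within and between classes. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_clusters: int, cluster_size: float, icc: float, share_treated: float,
         r2: float = 0.0, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect size (in SD) for a cluster-randomized trial.

    MDES = (z_{1-alpha/2} + z_{power})
           * sqrt((1 - r2) * (icc + (1 - icc) / n) / (P * (1 - P) * J))
    with J clusters of n pupils each and a share P of clusters treated.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    variance = (1 - r2) * (icc + (1 - icc) / cluster_size)
    return multiplier * sqrt(variance / (share_treated * (1 - share_treated) * n_clusters))


# Inputs taken from this registration: 57 classes, 29 of them treated,
# 1031 students with parental consent, ICC of 0.1.
unadjusted = mdes(57, 1031 / 57, icc=0.1, share_treated=29 / 57)
with_covariates = mdes(57, 1031 / 57, icc=0.1, share_treated=29 / 57, r2=0.4)
print(f"MDES without covariates: {unadjusted:.3f} SD")
print(f"MDES with R2 = 0.4:      {with_covariates:.3f} SD")
```

Under these assumptions the formula gives about 0.29 SD without covariates and about 0.22 SD with R2 = 0.4, close to the registered values; small differences may come from the multiplier or the sample counts used by the research team.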

Institutional Review Boards (IRBs)

IRB Name
Comitato Etico per la Ricerca del Politecnico di Torino
IRB Approval Date
IRB Approval Number
n. 50012/2022 of 26/10/2022


Post Trial Information

Study Withdrawal



Is the intervention completed?
Data Collection Complete
Data Publication

Data Publication

Is public data available?

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials