Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
The power calculations for this study are based on the public-use data distributed for an evaluation of the teacher certification program in Indonesia, as detailed in "Double for Nothing? Experimental Evidence on an Unconditional Teacher Salary Increase in Indonesia" by Joppe de Ree and others (The Quarterly Journal of Economics, Volume 133, Issue 2, May 2018, Pages 993–1039). We use Table A6, which presents the results for language in primary schools, as the base specification. De Ree et al. (2018) conducted their study using a representative sample of 10 districts across Indonesia and tested almost all grades. Our calculations were performed in Stata, using the estimation files provided by de Ree et al. (2018) and J-PAL's programs for power calculations.
We assume a 100 percent take-up rate. Previous evaluation studies conducted by the Ministry indicate that there is no record of schools not participating in the training. Book delivery may be an issue: although all schools signed for receipt, a survey by PSKP (the Ministry's policy evaluation unit) indicated that many respondents were not aware that new books had arrived. During the study we will monitor and encourage timely book delivery.
With randomization at the school level, no controls, and power of 0.8, the minimum sample required to detect an effect size of 0.1 standard deviations is 451 schools per group. For an effect size of 0.2 standard deviations, this falls to 113 schools per group. Including the controls used in the aforementioned study (strata dummies and baseline test scores), the required sample size drops to 208 schools per group for a 0.1 standard deviation effect and 58 for a 0.2 standard deviation effect. These calculations assume 20 students tested per school.
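As an illustration, the unadjusted requirements above can be approximated with Stata's built-in power command for cluster-randomized designs. The intracluster correlation used in our calculations comes from the de Ree et al. (2018) estimation files and is not reported here; the value of 0.25 in the sketch below is an assumption, chosen because it approximately reproduces the 451- and 113-school figures with 20 students tested per school.

    * Illustrative sketch only: rho(0.25) is an assumed intracluster
    * correlation, not the estimate from the de Ree et al. (2018) files;
    * output may differ slightly from the figures in the text.

    * Schools per group to detect a 0.1 SD effect, no covariates:
    power twomeans 0 0.1, power(0.8) sd(1) cluster m1(20) m2(20) rho(0.25)

    * Schools per group to detect a 0.2 SD effect, no covariates:
    power twomeans 0 0.2, power(0.8) sd(1) cluster m1(20) m2(20) rho(0.25)

Since the required number of schools scales roughly with the residual variance of the outcome, the drop from 451 to 208 schools per group implies that the strata dummies and baseline test scores absorb roughly half of the relevant variance.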
We use 0.1 standard deviations as the target effect size for the literacy score. The program mostly provides inputs (reading books) and has only a light training component (4 days of centralized training for 2 delegates from each school). Comparable programs have found negligible to very small impacts on reading skills. In India, training librarians in library operations and educational literacy activities had minimal impact on language scores (Borkum et al., 2012). The provision of additional textbooks in Kenya had limited overall impact, primarily benefiting higher-scoring students, whose test scores increased by 0.14 to 0.22 standard deviations, although the initiative did produce positive changes in teacher behavior (Glewwe et al., 2009; Sabarwal et al., 2014). Conversely, a more intensive literacy program in the Philippines yielded notable improvements, increasing reading scores by 0.13 standard deviations (Abeberese, 2014).
For our study, we will not have access to student-level baseline tests. We will, however, have access to individual national assessment data for grade 5 from 2022 and 2023, which gives us an indication of the baseline literacy level in each school.
Our final study sample is randomized at the school level and contains 504 treatment and 496 control schools, spread across 10 provinces. It is therefore sufficient to detect an improvement of 0.1 standard deviations in literacy scores. As we will also be able to condition on strata fixed effects and school-level literacy scores from the baseline national assessments, we expect the study to be overpowered for the 0.1 standard deviation effect size.
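Under the same illustrative intracluster correlation of 0.25 assumed in the sketch above, the power implied by the realized sample can be checked directly; conditioning on strata fixed effects and baseline school-level scores would only raise it.

    * Rough check under the assumed rho(0.25): power to detect a 0.1 SD
    * effect with 504 treatment and 496 control schools, 20 students
    * tested in each, and no covariate adjustment.
    power twomeans 0 0.1, sd(1) cluster k1(504) k2(496) m1(20) m2(20) rho(0.25)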