Minimum detectable effect size for main outcomes (accounting for sample
design and clustering)
We are grateful to Kenya Heard and Rohit Naimpally for assistance with power calculations for this intervention. The power estimates assume that we will have 46 schools. Even under a conservative estimate of only 16 students per school, we obtain adequate power to detect a doubling effect on a vocabulary test. We assume that the assessment will ask children to identify 30 words, 15 of which are covered only in the BWC curriculum and 15 of which are more general, and that the treatment group gets at least twice as many of the first set of 15 words correct as the control group. For the intra-school correlation, we consider two commonly used values: 0.2 and 0.4. The lower value is fairly standard in education; to be safe, we also consider a value on the higher end, 0.4. Figure 1 shows the minimum detectable effects under 14 different scenarios: for each of the two intra-cluster correlation (ICC) assumptions, 7 values for the number of the 15 BWC words that the control group might get right. We require that the experiment have at least 80% power at a significance level of 0.05.
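To illustrate why the intra-school correlation matters so much here, the following minimal sketch computes the standard design effect for equal-sized clusters, 1 + (m - 1) * ICC, and the implied effective sample size for the design described above (46 schools, 16 students per school). The function names are illustrative, and the equal-cluster-size formula is an assumption; this is not necessarily the exact calculation behind Figure 1.

```python
def design_effect(students_per_school: int, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * icc.

    Assumption: every school contributes the same number of students.
    """
    return 1.0 + (students_per_school - 1) * icc


def effective_sample_size(n_schools: int, students_per_school: int, icc: float) -> float:
    """Number of independent observations the clustered sample is 'worth'."""
    total_students = n_schools * students_per_school
    return total_students / design_effect(students_per_school, icc)


# Values taken from the text: 46 schools, a conservative 16 students per school,
# and the two intra-school correlation scenarios.
N_SCHOOLS = 46
STUDENTS_PER_SCHOOL = 16

for icc in (0.2, 0.4):
    deff = design_effect(STUDENTS_PER_SCHOOL, icc)
    n_eff = effective_sample_size(N_SCHOOLS, STUDENTS_PER_SCHOOL, icc)
    print(f"ICC={icc}: design effect={deff:.1f}, effective n={n_eff:.0f}")
```

Under these assumptions the 736 sampled students carry roughly the information of 184 independent students when the ICC is 0.2 and about 105 when it is 0.4, which is why the minimum detectable effects are larger in the high-ICC scenarios.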
We then ask whether the minimum detectable effect size under each of the 14 scenarios is small enough for a doubling of the control group's score to be detected. For instance, under an ICC of 0.2, if the control group children were to get 1 out of 15 BWC words correct (~7% of the words), the experiment would be able to detect an effect equal to a 14 percentage point increase for the treatment group (translating to the treatment group getting ~3 words correct out of 15). Since this minimum detectable effect corresponds to the treatment group getting roughly three times as many words correct as the control group, which is more than a doubling, the experiment would not be sufficiently powered to detect a doubling under these conditions. However, in all cases where the control participants answer an average of 3 or more questions correctly and the treatment participants answer at least twice as many correctly, there is sufficient power. There is also sufficient power when the control participants answer 2 questions correctly on average and the intra-school correlation is 0.2. These plausible cases suggest we have adequate power, given that the main outcome will include the specific words the program tries to teach. We expect to have more than 16 students per school and more than 46 schools, both of which will increase power further. We will also try to obtain additional background variables such as gender, race, home language, and Teaching Strategies Gold scores, so that we can condition on them to reduce the intra-school correlation further.
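The comparison described above can be written out as a short check. This is a sketch only: the function names are hypothetical, and the single minimum detectable effect used (14 percentage points at 1 of 15 words correct, ICC of 0.2) is the one quoted in the text. Values for the other scenarios would have to be read from Figure 1.

```python
N_BWC_WORDS = 15  # number of BWC-specific words on the assessment


def treatment_words_at_mde(control_words: float, mde: float,
                           n_words: int = N_BWC_WORDS) -> float:
    """Words the treatment group must get right for the effect to equal the MDE.

    `mde` is expressed as a share of the word list (0.14 = 14 percentage points).
    """
    return control_words + mde * n_words


def doubling_is_detectable(control_words: float, mde: float,
                           n_words: int = N_BWC_WORDS) -> bool:
    """True if doubling the control score is at least as large as the MDE.

    Doubling raises the share correct by exactly the control share,
    so the check is: control share >= MDE.
    """
    control_share = control_words / n_words
    return control_share >= mde


# Scenario quoted in the text: ICC 0.2, control group gets 1 of 15 words right,
# minimum detectable effect of 14 percentage points.
words_needed = treatment_words_at_mde(control_words=1, mde=0.14)
print(f"Treatment words needed: {words_needed:.1f}")
print(f"Doubling detectable: {doubling_is_detectable(control_words=1, mde=0.14)}")
```

For the quoted scenario the treatment group would need about 3.1 words correct, more than double the control group's 1 word, so the check returns False, in line with the conclusion that this case is underpowered.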