Minimum detectable effect size for main outcomes (accounting for sample
design and clustering)
Our randomized controlled trial has two experimental arms: a technical arm that receives intensive training and technical support on improved kiln operation, and an incentive arm that, in addition to the intensive technical training, receives information and nudges on how to incentivize workers to correctly adopt the practices. Our proposed intervention, when successfully adopted, increases the energy efficiency of kiln operation and increases profit for kiln owners (from reduced spending on coal and greater production of higher-quality bricks). In our pilot, 60% of kilns assigned to either treatment arm adopted the two most important technical intervention components.
Using data from our pilot study in Jashore, Bangladesh, completed during the 2021-2022 brick firing season, we conduct power calculations for three outcomes that reflect energy efficiency and improved kiln operation: percent of class-1 bricks produced, CO/CO2 ratio, and specific energy consumption. Although the pilot study was too small to estimate these outcomes with precision, the point estimates do provide an empirical indication of plausible effect sizes.
Approach
Based on our pilot results, we have estimated the “intention-to-treat” (ITT) effect of each experimental arm, as well as a “treatment-on-the-treated” (TOT) effect that accounts for imperfect compliance with the intervention. Imperfect compliance arises both from kilns assigned to a treatment arm that did not take up the intervention practices and from control kilns that sought to learn the intervention practices; the TOT estimate addresses this by using random assignment to both arms as an instrument for adoption. These results for each of the three outcomes are summarized in Table 1 below. We first calculate the minimum detectable effect size (MDES) assuming both arms have equal effect sizes, a significance level of 0.05, and power of 0.9. Then, because there is suggestive evidence from our pilot that the incentive arm encouraged better adherence to the improved operating practices and resulted in better outcomes, we also calculate our statistical power for detecting differences between the incentive and technical arms.
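To illustrate the kind of calculation described above, the following is a minimal sketch of the textbook normal-approximation MDES for a two-sided comparison of two equal-sized arms. It is not necessarily the method used for the figures in this section: it ignores clustering, covariate adjustment, and any design effect, and the function name `mdes_two_arm` and the standard deviation `example_sd` are hypothetical placeholders rather than values from the pilot.

```python
from math import sqrt
from statistics import NormalDist


def mdes_two_arm(sd, n_per_arm, alpha=0.05, power=0.90):
    """Normal-approximation MDES for a two-sided test between two equal arms.

    sd is the outcome standard deviation (assumed equal in both arms).
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sd * sqrt(2 / n_per_arm)


# Hypothetical outcome standard deviation, for illustration only
# (not taken from the pilot data).
example_sd = 1.0

mdes_100 = mdes_two_arm(example_sd, 100)
mdes_25 = mdes_two_arm(example_sd, 25)

print(f"MDES at 100 kilns per arm: {mdes_100:.3f} SD")
print(f"MDES at 25 kilns per arm:  {mdes_25:.3f} SD")
```

Under these assumptions the MDES shrinks with the square root of the number of kilns per arm, so quartering the sample doubles the smallest effect that can be detected.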
These scenarios indicate that with a sample size of 100 kilns per experimental arm (300 total kilns), we are powered for all three outcomes with 90% power in most cases. For class-1 bricks the incentive arm performed much better, producing 7.12 percentage points more class-1 bricks than the control group, and we would be powered to detect an effect size of this magnitude with only 25 kilns per arm. The effect size for the technical arm was much smaller (2.1 percentage points higher than the control group), and with 100 kilns per arm we would not be powered to detect such a small difference. However, 2.1 percentage points is an extremely conservative estimate of the potential effect size. The minimum detectable effect size for 100 kilns per arm at 90% power is 3.56 percentage points. This is half the magnitude of the incentive-arm effect and still relatively conservative, particularly when considering the TOT estimate of 9.22 percentage points among adopters.
For the CO/CO2 ratio, with 100 kilns per arm we are almost powered for the more conservative ITT effect attained by the incentive arm and more than sufficiently powered to detect the larger effect size attained by the technical arm. With 100 kilns per arm at 90% power, we are powered to detect an effect size of -0.0064 in the CO/CO2 ratio, while we would need only 65 kilns per arm to detect an effect as large as -0.008, which is what the technical arm attained in the pilot. Somewhat surprisingly, the measured CO/CO2 ratio in the pilot was lower in the technical arm than in the incentive arm. This may simply reflect that the CO/CO2 ratio is a cross-sectional measure that we captured from a few hours of data in each kiln, and so may not accurately reflect performance over the whole season. Indeed, the first CO/CO2 ratio was measured before the incentive arm was even rolled out. Nevertheless, the calculations suggest that we will have sufficient power to detect changes in the CO/CO2 ratio resulting from the interventions.
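The relationship between the MDES at 100 kilns per arm and the smaller sample sizes quoted for class-1 bricks and the CO/CO2 ratio can be roughly checked with a simple scaling rule. The sketch below assumes that the MDES is inversely proportional to the square root of the number of kilns per arm, with significance level, power, and outcome variance held fixed; the helper name `kilns_needed` is hypothetical, and the calculations actually underlying this section may include further adjustments.

```python
from math import ceil


def kilns_needed(reference_mdes, reference_n, target_effect):
    """Kilns per arm needed to detect target_effect, given a known MDES.

    Assumes MDES is proportional to 1 / sqrt(n), so the required n scales
    with the squared ratio of the reference MDES to the target effect.
    """
    ratio = abs(reference_mdes) / abs(target_effect)
    # Round before taking the ceiling to guard against floating-point noise.
    return ceil(round(reference_n * ratio ** 2, 6))


# MDES values at 100 kilns per arm and pilot effect sizes quoted in the text.
class1_n = kilns_needed(3.56, 100, 7.12)       # incentive arm, class-1 bricks
co_ratio_n = kilns_needed(-0.0064, 100, -0.008)  # technical arm, CO/CO2 ratio

print("Class-1 bricks, kilns per arm:", class1_n)
print("CO/CO2 ratio, kilns per arm:", co_ratio_n)
```

Under this simplification the rule gives 25 kilns per arm for class-1 bricks and 64 for the CO/CO2 ratio, in line with the 25 and 65 reported above.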
As with the percent of class-1 bricks, our pilot results suggest that kilns assigned to the incentive arm had a much lower specific energy consumption (SEC). While we will not be powered to detect effect sizes as small as what the pilot found in the technical arm, we are powered to detect effect sizes smaller than what the incentive arm attained. With 100 kilns per arm at 90% power, we are powered to detect an effect size of -0.065 in SEC, while we would need 70 kilns per arm to detect an effect as large as -0.083, which is the ITT effect for the incentive arm compared to the control group.
Practical Considerations
We have focused on the threshold of 100 kilns per arm because of the practical and logistical demands of implementing such a complex intervention with fidelity. Since brick kilns initiate firing within about three weeks of each other, we face an outsized requirement for trained implementers to support the intervention in the few weeks leading up to kiln firing and the initial weeks of the season. Based on our pilot experience, we are confident that with the planned staff we can implement the intervention in 100 kilns per arm. Increasing the study size beyond 100 kilns per arm would not only generate more expenses than we could cover but would also compromise the fidelity of the implementation. A poorly implemented technical intervention among 200 kilns per experimental arm would be unlikely to generate effect sizes of similar magnitude to the pilot if only a small percentage of kilns adopted the intervention.
Conclusion
Balancing the practical and logistical considerations of the RCT with the power calculations, we are confident that we will be able to implement a well-executed trial among 300 kilns (with 100 kilns per arm).
Our pilot achieved 60% uptake of the two most important components of the technical intervention. We used outcomes associated with this level of uptake for these power calculations. We anticipate that by implementing lessons learned in the pilot we will achieve a higher uptake in the full trial, which will further improve power.
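The link between uptake and power can be sketched under a simple assumption: if the effect among adopters (TOT) stays constant, the ITT effect scales with the difference in adoption rates between treatment and control kilns, and the sample size needed to detect it scales with the inverse square of that difference. The function names and the 75% uptake figure below are hypothetical and chosen only for illustration; the sketch also assumes no adoption among control kilns, which did not fully hold in the pilot.

```python
def itt_from_tot(tot_effect, uptake_treated, uptake_control=0.0):
    """ITT effect implied by a constant effect among adopters."""
    return tot_effect * (uptake_treated - uptake_control)


def relative_sample_size(uptake_gap_old, uptake_gap_new):
    """Required sample size at the new uptake gap, relative to the old one.

    Required n is proportional to 1 / ITT^2, and ITT is proportional to
    the uptake gap, so n scales with (old gap / new gap) squared.
    """
    return (uptake_gap_old / uptake_gap_new) ** 2


# 60% is the pilot uptake from the text; 75% is a hypothetical improvement.
ratio = relative_sample_size(0.60, 0.75)
print(f"Relative sample size needed at 75% uptake: {ratio:.2f}")
```

In this illustration, raising uptake from 60% to 75% would cut the sample needed for a given level of power by roughly a third, which is the sense in which higher uptake improves power at a fixed 100 kilns per arm.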
Even with conservative assumptions, a 300-kiln study will be powered to detect effect sizes that are well within the bounds of what we observed in our pilot study, and would detect a difference between treatment arms in class-1 brick production with 80% power.