Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
We use results from a pilot study on Prolific to calculate the minimum detectable effect size. We assume a Type I error rate of 5%, 80% power, a mean score of 50 (out of 100), and a standard deviation of 30. To be conservative, this standard deviation is slightly higher than the one observed in the pilot study. The treatment-to-control ratio is 3:1, reflecting our design with three treatment arms and one control group.
We will target 800 to 1,000 complete responses. With a sample size of 800, the minimum detectable effect size is 10 percentage points (p.p.): the treated group must improve by at least 10 p.p. more than the control group for the effect to be detectable. The minimum detectable effect decreases as the sample size increases. It is 9 p.p. with 1,000 observations, 8 p.p. with 1,200 or 1,400 observations, and 7 p.p. with 1,600 observations.
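As a rough illustration of how the minimum detectable effect depends on sample size, the sketch below applies the textbook two-sample formula to a pooled treatment group and a control group, using the assumptions stated above (5% two-sided test, 80% power, standard deviation of 30, 3:1 allocation). The function name `mde_two_groups` and the `design_effect` parameter are hypothetical and introduced only for illustration. The sketch does not adjust for the sample design or clustering unless a design effect is supplied, so its output need not match the figures reported above.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_groups(n_total, sd=30.0, ratio=3.0, alpha=0.05, power=0.80,
                   design_effect=1.0):
    """Minimum detectable difference in means, pooled treatment vs. control.

    ratio is treated:control. design_effect is a hypothetical variance
    inflation factor (1.0 means no adjustment for clustering).
    """
    z = NormalDist()
    # Sum of the critical values for a two-sided test and the target power.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n_control = n_total / (1 + ratio)
    n_treated = n_total - n_control
    # Standard error of the difference in means, inflated by the design effect.
    se = sd * sqrt(design_effect * (1 / n_treated + 1 / n_control))
    return multiplier * se


for n in (800, 1000, 1200, 1400, 1600):
    print(n, round(mde_two_groups(n), 2))
```

Under this formula the minimum detectable effect shrinks with the square root of the sample size, so doubling the sample reduces it by about 29%.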
We plan to examine the correlation between participants' willingness to pay for each treatment and their baseline score in the financial decision-making task, then test whether these correlations differ across treatments. Because funding constraints limit our sample size, we do not expect to be adequately powered to detect differences in the slope of the demand curve across the different versions of the education material at conventional significance levels. Based on our power calculations, with a sample of 800 to 1,000 participants we can only detect differences in correlations of approximately 0.15 to 0.20 (15 to 20 percentage points) or more. That is, the correlation between the baseline financial literacy score and the willingness to pay would need to differ by at least this much across treatments to be statistically detectable. We believe such large differences are unlikely given the similar nature of the education materials. We may be able to detect differences in demand for the different types of education materials when we bin participants into groups (for example, by high and low baseline ability).
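One common way to gauge power for a difference between two independent correlations is the Fisher z-transformation. The sketch below assumes that approach, which is not necessarily the method used for the calculations above. The function name `min_detectable_corr_gap` and the group sizes in the usage loop are hypothetical, and the results need not match the range reported above.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def min_detectable_corr_gap(n1, n2, r_base=0.0, alpha=0.05, power=0.80):
    """Smallest detectable gap between two independent correlations.

    Uses the Fisher z-transformation, under which atanh(r) is approximately
    normal with variance 1 / (n - 3). r_base is the correlation in the
    reference group.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Standard error of the difference between the two transformed correlations.
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    # Shift the reference correlation on the z scale, then transform back.
    return tanh(atanh(r_base) + multiplier * se) - r_base


# Hypothetical per-group sizes, for illustration only.
for n in (200, 250, 400, 500):
    print(n, round(min_detectable_corr_gap(n, n), 3))
```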