Minimum detectable effect size for main outcomes (accounting for sample
design and clustering)
We use results from a pilot study on Prolific to calculate the minimum detectable effect size, assuming a Type I error rate of 5%, 80% power, a mean score of 50 (out of 100), and a standard deviation of 30. To be conservative, we use a slightly higher standard deviation than what was observed in the pilot study. The treatment-to-control ratio is 3:1, reflecting our design with three treatment arms and one control group.
We will target 1,000–1,200 responses. According to our power calculations, with a sample size of 800 the minimum detectable effect size is 10 percentage points (p.p.); that is, the treated group must improve by at least 10 p.p. more than the control group for the effect to be detectable. As the sample size increases, the minimum detectable effect decreases: with 1,000–1,200 observations it drops to 9 p.p.; with 1,200–1,400 observations it decreases further to 8 p.p.; and with approximately 1,600 observations it reaches 7 p.p.
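As a rough illustration of the calculation behind these figures, the sketch below applies the standard normal-approximation formula for the minimum detectable effect in a two-sample comparison of means. The group sizes shown are hypothetical; the registered numbers above additionally reflect the 3:1 treatment-to-control allocation and any further adjustments (e.g., for multiple treatment arms or clustering), so this simplified formula will not reproduce them exactly.

```python
from statistics import NormalDist

def mde(n_treat: int, n_control: int, sd: float,
        alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect for a two-sample comparison of means
    (two-sided test, normal-approximation formula)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided test
    z_power = z.inv_cdf(power)          # quantile for the target power
    return (z_alpha + z_power) * sd * (1 / n_treat + 1 / n_control) ** 0.5

# Hypothetical example: one treatment arm of 200 vs. a control arm of 200
# (i.e., four equal arms at a total N of 800), with SD = 30.
print(round(mde(200, 200, 30), 1))  # → 8.4
```

Doubling both group sizes shrinks the minimum detectable effect by a factor of roughly 1/sqrt(2), which is why the gains from additional observations flatten out at larger sample sizes.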
We plan to examine the correlation between participants' willingness to pay for each treatment and their baseline score in the financial decision-making task, and then test whether these correlations differ across treatments. Because funding constraints limit our sample size, we do not expect to be adequately powered to detect differences in the slope of the demand curve across the different versions of the education material at conventional significance levels. Based on our power calculations, with a sample of 1,000 participants we can only detect differences in correlations of approximately 15–20 percentage points or more; that is, the correlation between the baseline financial literacy score and willingness to pay would need to differ by at least 15–20 percentage points across the treatment arms for the difference to be statistically significant. We believe such large differences are unlikely, since we highlight to participants the similar nature of the education materials. We may, however, be able to detect differences in demand for the different types of education material when we bin participants into groups (for example, by high versus low baseline ability) or when pooling the AI and non-AI versions of the education material.
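One standard way to test whether two correlations differ across independent samples is the Fisher z-transformation; the sketch below illustrates it with hypothetical numbers (correlations of 0.30 vs. 0.15, i.e., a 15 p.p. difference, with 250 participants per arm). This is only an illustration of the power constraint described above: the registered analysis may instead use, for example, an interaction term in a regression of willingness to pay on the baseline score.

```python
from math import atanh, sqrt
from statistics import NormalDist

def corr_diff_pvalue(r1: float, n1: int, r2: float, n2: int) -> float:
    """Two-sided p-value for H0: rho1 == rho2 between two independent
    samples, via the Fisher z-transformation of each correlation."""
    z1, z2 = atanh(r1), atanh(r2)              # Fisher z of each correlation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))     # SE of the difference in z
    z_stat = (z1 - z2) / se
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

# Hypothetical: 0.30 vs 0.15 with 250 participants in each arm.
p = corr_diff_pvalue(0.30, 250, 0.15, 250)
print(round(p, 3))
```

With these inputs the p-value falls just short of the 5% threshold, consistent with the point above that a sample of this size can only reliably detect correlation differences at or beyond roughly 15–20 percentage points.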