Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
The data will be analyzed using simple t-tests for differences in means and linear regression with controls to increase precision. The sample frame will include all currently enrolled TAP participants. Philadelphia’s Department of Revenue provided data showing that TAP had 65,242 participants as of April 30, 2025. Because of potential data limitations and participants dropping out of the program, we use 60,000 as a conservative expected sample size. Discussions with Philadelphia confirmed that all eligible participants would be enrolled in the experiment. We have four research questions with four different outcome variables, and each research question is tested with a different treatment. For all power calculations we use the conventional thresholds of 0.05 for the size of the test and 0.80 for power. Since we do not have data in hand, we frame all power calculations as standardized minimum detectable effect sizes (MDEs). Each hypothesis test measures the average treatment effect (ATE) of receiving the behavioral intervention.
MDE for T1 (RQ1): As described above, 20,000 of the 60,000 households face a re-enrollment decision in the next year and are eligible for T1: the loss-aversion-framed social comparison that highlights the financial benefits of remaining in TAP. We will assign 10,000 of these 20,000 households to T1 and the other 10,000 to the control group (C1). The MDE for T1 vs. C1 is 0.040.
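The T1 figure can be checked with a short calculation. The sketch below is illustrative only: it assumes a two-sided test, a normal approximation to the t-test, individual-level randomization with no clustering or covariate adjustment, and an outcome standardized to unit variance. The function name `standardized_mde` is hypothetical and not part of the analysis plan.

```python
from math import sqrt
from statistics import NormalDist


def standardized_mde(n_treat, n_control, alpha=0.05, power=0.80):
    """Standardized MDE for a two-arm difference in means.

    Assumptions (not stated in the plan): two-sided test, normal
    approximation, unit-variance outcome, no clustering.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt(1 / n_treat + 1 / n_control)


# T1 vs. C1: 10,000 households per arm.
mde_t1 = standardized_mde(10_000, 10_000)
print(f"T1 vs. C1 MDE: {mde_t1:.3f}")
```

Under these assumptions the result rounds to 0.040, in line with the figure reported above. Regression controls would, if anything, reduce the MDE further.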
MDE for T2 and T3 (RQ2 and RQ3): The remaining 40,000 households will be randomized into a separate control group (C2; 16,800), the intervention targeting on-time payment (T2; 11,600), and the intervention targeting water conservation (T3; 11,600). T2 and T3 target different outcomes, so we compare each to C2 and do not test the treatments against each other. The MDEs for T2 vs. C2 and T3 vs. C2 are both 0.032.
For comparison, a systematic review of randomized trials using social comparisons for water conservation found an average effect size of 0.15. All of our MDEs are well below this benchmark, so we believe the experiment will be sufficiently powered to detect meaningful treatment effects. The MDEs for RQ4 will be the same as for RQ1 and RQ2. Adjusting the T1 and T2 MDEs for multiple hypothesis testing with the conservative Bonferroni adjustment yields MDEs of 0.044 for T1 and 0.037 for T2. Prior research achieved a match rate of over 80% using names and addresses; adjusting the MDEs for both this reduced sample and multiple hypothesis testing increases them to 0.05 and 0.042, respectively. In practice, because the hypotheses are correlated, the Bonferroni adjustment is too conservative, and we will use modern methods to correct for multiple hypothesis testing.
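The adjusted MDEs can be approximated with the following sketch. Several inputs are assumptions rather than details from the plan: the Bonferroni correction is taken over two comparisons (`N_TESTS = 2`), the match rate is set at its stated lower bound of 0.80 and applied equally to every arm, and the test is two-sided with a normal approximation and a unit-variance outcome. All names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def standardized_mde(n_treat, n_control, alpha=0.05, power=0.80):
    """Standardized MDE under a two-sided normal approximation."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt(1 / n_treat + 1 / n_control)


N_TESTS = 2        # assumption: Bonferroni over two comparisons
MATCH_RATE = 0.80  # assumption: lower bound of the "over 80%" match rate

# Arm sizes (treatment, control) as given in the text.
arms = {"T1": (10_000, 10_000), "T2": (11_600, 16_800)}

bonferroni = {}
bonferroni_matched = {}
for name, (n_t, n_c) in arms.items():
    adj_alpha = 0.05 / N_TESTS
    # Bonferroni only: full arm sizes, stricter test size.
    bonferroni[name] = standardized_mde(n_t, n_c, alpha=adj_alpha)
    # Bonferroni plus matching: each arm shrinks by the match rate.
    bonferroni_matched[name] = standardized_mde(
        n_t * MATCH_RATE, n_c * MATCH_RATE, alpha=adj_alpha
    )
    print(f"{name}: Bonferroni {bonferroni[name]:.3f}, "
          f"Bonferroni + matching {bonferroni_matched[name]:.3f}")
```

Under these assumptions the results agree with the reported adjusted MDEs to rounding. A match rate above 80% would give somewhat smaller values.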