Minimum detectable effect size for main outcomes (accounting for sample
design and clustering)
Since this pre-registration text precedes both the pilot and the full study, I rely on the literature to obtain a rough sense of the adequate sample size. I focus mainly on treatment effects taking direct questioning (condition 1) as the control condition, and each of the other three conditions as a separate treatment. I draw on two studies in which the sensitive item concerns some form of cheating, and on a general meta-analysis:
• Ocantos et al. (2012) study vote buying using the ICT. They find that only 2.4% (s.e.=0.6%) of subjects reported receiving an individual gift in exchange for their vote during a 2008 election in Nicaragua when asked directly, but 24% (s.e.=5.5%) reported having received such a gift when asked through the ICT. Assuming equal variances in the treatment and control groups, and taking the standard deviation to equal the standard error estimated for the treatment group (the larger of the two, and thus a conservative choice), power is close to 1.0 even for a sample size as small as 30, since the effect size is so large.
(Note: the formula I am using, in LaTeX code, is \beta=\Phi\left(\frac{|\mu_{t}-\mu_{c}|\sqrt{N}}{2\sigma}-\Phi^{-1}(1-\alpha/2)\right), where \beta denotes power, \Phi is the standard Normal cumulative distribution function, \Phi^{-1} is its inverse, \mu_{t} and \mu_{c} are the treatment- and control-group means, \sigma is the standard deviation of the outcome in both the treatment and the control groups, N is the total sample size (split evenly between the two groups), and \alpha is the level of statistical significance (Gerber and Green 2012). A computational check of the three cases appears after this list.)
• Van der Heijden et al. (2000) compare RRT with direct questioning in both a face-to-face survey and a computer-assisted survey. The proportion of subjects known to have engaged in income fraud who admitted to it was 43% (s.e.=6.8%) under RRT, 25% (s.e.=4.4%) under face-to-face direct questioning, and 19% (s.e.=5.8%) under computer-assisted direct questioning. The effect size here is also large enough that 30 subjects suffice to attain power close to 1.0 (assuming, for example, that the standard deviation of the outcome is 6.8%, and taking 43%-25%=18% as the magnitude of the treatment effect).
• In a meta-analysis of RRT, Lensvelt-Mulders et al. (2005) find that the mean percent underestimation of a sensitive item is 38% (s.e.=.099) under RRT, versus 42% (s.e.=.099) in face-to-face interviews, 46% (s.e.=.138) in phone interviews, 47% (s.e.=.14) in self-administered questionnaires, and 62% (s.e.=.191) in computer-assisted surveys. Comparing the rate of reporting of the sensitive behavior under RRT with that under self-administered questionnaires (a 9-percentage-point difference), and taking the standard deviation \sigma to be 0.14, a total sample size of roughly 75 is necessary to attain power of 0.8.
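
To make these calculations transparent, below is a minimal computational check of the three cases, using the Gerber and Green (2012) formula stated above. As in the calculations described in each bullet, I read the reported standard errors as the common standard deviation \sigma; the function name and structure are illustrative, not part of the registered analysis code.

```python
# Minimal sketch of the Gerber and Green (2012) power formula, applied to
# the three literature cases above. Treating the reported standard errors
# as the common standard deviation sigma follows the calculations in the text.
import math
from scipy.stats import norm

def power(mu_t, mu_c, sigma, n, alpha=0.05):
    """Power for a two-sided test of a difference in means,
    with N subjects split evenly between treatment and control."""
    return norm.cdf(abs(mu_t - mu_c) * math.sqrt(n) / (2 * sigma)
                    - norm.ppf(1 - alpha / 2))

# Ocantos et al. (2012): 24% (ICT) vs. 2.4% (direct), sigma = 5.5%, N = 30.
print(round(power(0.24, 0.024, 0.055, 30), 3))  # ~1.0

# Van der Heijden et al. (2000): 43% (RRT) vs. 25% (face-to-face), sigma = 6.8%, N = 30.
print(round(power(0.43, 0.25, 0.068, 30), 3))   # ~1.0

# Lensvelt-Mulders et al. (2005): 47% vs. 38%, sigma = 0.14, N = 75.
print(round(power(0.47, 0.38, 0.14, 75), 3))    # ~0.8
```

The last call confirms the figure in the final bullet: with a 9-percentage-point difference and \sigma = 0.14, a total N of roughly 75 (split evenly between the two conditions) yields power of about 0.8.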
In sum, while I face considerable uncertainty about effect sizes and variances before running the study, an N of 50 to 70 per treatment condition seems reasonable. It is not clear whether a small pilot study (with 15-30 respondents) would suffice to meaningfully reduce this uncertainty. To further improve power, improve covariate balance, and reduce outcome variability, I will estimate treatment effects with pre-treatment covariates added as control variables, and (alternatively) I will implement sequential blocking on pre-treatment covariates (after the data have been collected, but without using outcome data for blocking; see Moore and Moore 2011).
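
As an illustration of the covariate-adjustment strategy (not the registered analysis code), a minimal sketch on simulated data might look as follows; the covariate names and data-generating process are hypothetical placeholders, and the sequential blocking step itself would follow Moore and Moore (2011) separately.

```python
# Illustrative sketch only: covariate-adjusted estimation of a treatment
# effect via OLS on simulated data. Variable names (treat, age, female, y)
# are hypothetical placeholders, not the study's actual covariates.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 280  # e.g., roughly 70 respondents in each of the four conditions
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),   # 1 = indirect technique, 0 = direct questioning
    "age": rng.normal(40, 12, n),
    "female": rng.integers(0, 2, n),
})
# Simulated outcome: a 0.2 treatment effect plus covariate-driven noise.
df["y"] = (0.2 * df["treat"] + 0.002 * df["age"]
           + 0.05 * df["female"] + rng.normal(0, 0.3, n))

# Treatment effect estimated with pre-treatment covariates as controls,
# using a heteroskedasticity-robust (HC2) variance estimator.
fit = smf.ols("y ~ treat + age + female", data=df).fit(cov_type="HC2")
print(fit.params["treat"], fit.bse["treat"])
```

Adding pre-treatment covariates in this way leaves the treatment coefficient unbiased under random assignment while absorbing outcome variance, which is the source of the power gain described above.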