Minimum detectable effect size for main outcomes (accounting for sample
design and clustering)
Approximately 530 workers are expected to return in each of the nine treatment arms and approximately 475 per arm are expected to be in the subsample not hired by any hiring method. For each power calculation I will specify the MDE for the 530 case followed by the 475 case in parentheses.
The primary research question in this follow-up is what types of algorithms are effective at reducing perceptions of discrimination. Thus, I am most interested in comparing each of the other treatment groups to the non-blind manager. I calculate conservative MDEs by focusing on comparisons of two groups at a time without control variables; pooling arms and adding controls would improve the precision of the estimates.
With this sample size, I will be powered (significance level = 0.05, power = 80 percent) to detect an 8.5pp (9pp) difference in either direction between each treatment group and the non-blind manager group. This is based on analytical power calculations with a total sample size of 1060 (950) across the two arms being compared, and assumes that 40 percent of participants perceive discrimination in the non-blind manager group, as in the original experiment among women and racial minority men, who will make up the whole sample for the follow-up.
Given the sample size needed to obtain the power described above, I can also calculate the MDEs for the differences between other treatment groups, which depend on the rate of perceived discrimination in the less-discriminatory group. Based on the results from the main experiment, all of these comparisons would be better powered. For example, I am interested in testing whether the algorithm that uses demographics is perceived to discriminate more than the algorithm that doesn't, as well as the difference between the arms with the blind manager and the algorithm without demographics in which workers know that mostly white men were hired in the past. Here, the relevant control mean is 20 percent rather than 40 percent, so I would be powered to detect differences larger than 7.3pp (7.7pp). When one group has near zero percent of participants perceiving discrimination, I can detect differences larger than 2.5pp (2.7pp).
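As an illustration of how MDEs of this kind can be approximated, the sketch below solves for the smallest detectable increase in a proportion under a two-sided z-test with unpooled variance. The function name, the normal approximation, and the bisection solver are assumptions made for illustration; the analytical calculation behind the figures above may use a different formula or software, and the sketch ignores clustering and covariates.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_proportions(n_per_arm, baseline, alpha=0.05, power=0.80):
    """Approximate MDE (as a proportion) for a two-arm comparison.

    Hypothetical helper: normal approximation, equal arm sizes,
    unpooled variance, two-sided test at level `alpha`.
    """
    norm = NormalDist()
    # Sum of the critical value and the power quantile (about 2.80 here).
    z = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)

    # Bisection on delta: find where delta equals z * SE(delta).
    lo, hi = 0.0, 1.0 - baseline
    for _ in range(100):
        mid = (lo + hi) / 2
        p2 = baseline + mid
        se = sqrt((baseline * (1 - baseline) + p2 * (1 - p2)) / n_per_arm)
        if mid < z * se:
            lo = mid  # effect too small to reach the target power
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for base in (0.40, 0.20):
        for n in (530, 475):
            mde = mde_two_proportions(n, base)
            print(f"baseline={base:.2f}, n per arm={n}: MDE = {100 * mde:.1f}pp")
```

Under these assumptions the 40 percent and 20 percent baselines come out close to the MDEs quoted above. Results for baselines near zero are sensitive to the approximation chosen, so that case is not reproduced here.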
The second outcome is reservation wages for future work. Between the manager arms, this is a replication of the original experiment; in the algorithm arms, it will only be possible if there are still positive rates of perceived discrimination in some of those arms. Again focusing on comparing just two arms, the MDE for the effect on a continuous variable is about 0.17sd for either sample size. Alternatively, pooling the two arms where there will almost certainly be no perceived discrimination and pooling the two arms where rates of perceived discrimination will most likely be between 20 and 40 percent (in both cases based on the results of the original experiment), the MDE is about 0.12sd for either sample size (N=2120 or 1900).
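For the continuous outcome, a minimal sketch of the standard two-sample formula, MDE = (z_{1-alpha/2} + z_{power}) * sqrt(2/n) in standard deviation units, is given below. The function name is hypothetical, and the sketch assumes equal group sizes, equal variances, a normal approximation, and no clustering or covariate adjustment.

```python
from math import sqrt
from statistics import NormalDist


def mde_standardized(n_per_group, alpha=0.05, power=0.80):
    """Approximate MDE in SD units for a two-group mean comparison.

    Hypothetical helper: two-sided test, equal group sizes,
    equal variances, normal approximation.
    """
    norm = NormalDist()
    z = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    return z * sqrt(2 / n_per_group)


if __name__ == "__main__":
    # Single-arm comparisons (530 or 475 per arm) and pooled
    # comparisons (two arms per side, so 1060 or 950 per side).
    for n in (530, 475, 1060, 950):
        print(f"n per group={n}: MDE = {mde_standardized(n):.3f} sd")
```

Under these assumptions the single-arm comparisons give roughly 0.17 to 0.18 sd and the pooled comparisons roughly 0.12 to 0.13 sd, in the neighborhood of the figures above.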