Field
Power calculation: Minimum Detectable Effect Size for Main Outcomes

Before
All calculations assume power of 80 percent and a significance level of 0.05. I will recruit 3,600 participants to complete the screening survey. I expect take-up for the follow-up survey to be high, since the screening survey's initial description will indicate that there is a well-paid follow-up survey. Assuming that 80 percent of workers complete the follow-up survey and 92 percent of workers are assigned to the easier task, the final analysis sample will be around 2,640 workers assigned to the easier task by their randomly assigned mechanism or around 2,304 workers assigned to the easier task by all three mechanisms (assuming 20 percent of workers are assigned to the hard task by at least one mechanism, i.e. that the mechanisms’ decisions are slightly but not strongly correlated). The analysis might use either of these samples to deal with “selection” issues, as described above.
1. First stage regressions
1.1. Main effects. Regressions to test whether the demographic-blinded manager reduces perceived discrimination relative to the manager who knows demographics are powered to detect effects larger than 6.5 or 7 percentage points (for sample sizes of 2,640 or 2,304, respectively). Similarly, tests of whether one of the algorithm subgroups reduces perceived discrimination relative to the manager who knows demographics are powered to detect effects larger than 8 percentage points for either sample size. Given the results from a pilot study, the effect sizes are expected to be larger than these MDEs.
1.2. Treatment effect heterogeneity. Treatment effect heterogeneity is powered as follows: tests of whether the effect of the algorithm depends on whether the worker knows the race of the historically assigned workers are powered to detect differences larger than 9.5 or 10 percentage points (for sample sizes of 2,640 or 2,304, respectively). Tests of whether the effect of the demographic-blind human differs from one algorithm subgroup are powered to detect differences of 8 percentage points and tests that the effect of the demographic-blind human differs from both algorithm subgroups (which are pooled and don't differ from each other) are powered to detect differences larger than 6.5 or 7 percentage points (for sample sizes of 2,640 or 2,304, respectively).
1.3. Racial and gender heterogeneity. Racial and gender heterogeneity is powered as follows: when testing for heterogeneity in the effects of the blinded manager, gender heterogeneity among nonwhite participants and racial heterogeneity among men are powered to detect differences in the treatment effect of 15 percentage points, gender heterogeneity among white participants is powered to detect differences in the treatment effect of 18 percentage points, and racial heterogeneity among women is powered to detect differences in the treatment effect of 20 percentage points. For each group, tests for heterogeneity in the effects of the algorithm have MDEs about 3 percentage points larger than the MDEs for differences in the effects of the blinded manager. These MDEs come from simulations with a sample size of 2,640; with a sample size of 2,304 each MDE is about 1 percentage point larger.
2. Reduced form regressions
2.1. Binary outcomes. Regressions to test the effects of the demographic-blind manager on the binary measures of retention (completing only the minimum 6 paragraphs, or completing all 18 paragraphs) are powered to detect effects larger than 5 or 5.5 percentage points (for sample sizes of 2,640 or 2,304, respectively), and tests of the effect of one algorithm subgroup are powered to detect effects larger than 6 or 6.5 percentage points (for sample sizes of 2,640 or 2,304, respectively). In pilot data, 12 percent of workers completed only 6 paragraphs and 68 percent completed all 18 paragraphs; these rates are assumed in the calculations.
2.2. Continuous outcomes. All other outcomes are continuous. Regressions to test the effects of the demographic-blind manager are powered to detect effects larger than 0.14sd or 0.15sd (for sample sizes of 2,640 or 2,304, respectively), and tests of the effect of one algorithm subgroup are powered to detect effects larger than 0.17sd (for either sample size).
3. Two-stage least squares regressions
3.1. The two-stage least squares power calculations assume that the effects of the treatments on perceived discrimination are quite large, effectively taking the rate of perceived discrimination to zero in the demographic-blind manager group and both algorithm subgroups. This is consistent with piloting (though in very small samples).
3.2. Under this assumption, two-stage least squares regressions are powered to detect effects of reducing perceived discrimination that are larger than 4 or 5 percentage points on the binary outcomes (for sample sizes of 2,640 or 2,304, respectively), and effects larger than 0.12sd on the continuous outcomes (for either sample size).

After
All calculations assume power of 80 percent and a significance level of 0.05. I will recruit 3,600 participants to complete the screening survey. I expect take-up for the follow-up survey to be high, since the screening survey's initial description will indicate that there is a well-paid follow-up survey. Assuming that 80 percent of workers complete the follow-up survey and 92.5 percent of workers are assigned to the easier task, the final analysis sample will be around 2,664 workers assigned to the easier task by their randomly assigned mechanism or around 2,304 workers assigned to the easier task by all three mechanisms (assuming 20 percent of workers are assigned to the hard task by at least one mechanism, i.e. that the mechanisms’ decisions are slightly but not strongly correlated). The analysis might use either of these samples to deal with “selection” issues, as described above.
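The sample-size arithmetic in this paragraph can be reproduced with a few lines of integer arithmetic. This is only a sketch: the variable names are hypothetical, and the rates are the ones stated in the paragraph (80 percent follow-up completion, 92.5 percent assigned to the easier task by their own mechanism, 20 percent assigned to the hard task by at least one mechanism).

```python
# Sample-size arithmetic from the paragraph above, done in integers
# so that no floating-point rounding can creep in.
recruited = 3600                                  # screening-survey participants

# 80 percent are assumed to complete the follow-up survey.
completers = recruited * 80 // 100

# 92.5 percent are assumed to be assigned to the easier task
# by their own randomly assigned mechanism.
sample_own_mechanism = completers * 925 // 1000

# 20 percent are assumed to be sent to the hard task by at least one
# mechanism, so 80 percent get the easier task from all three.
sample_all_mechanisms = completers * 80 // 100

print(completers, sample_own_mechanism, sample_all_mechanisms)
```

The two final figures are the analysis-sample sizes used throughout the MDE statements that follow.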
1. First stage regressions
1.1. Main effects. Regressions to test whether the demographic-blinded manager reduces perceived discrimination relative to the manager who knows demographics are powered to detect effects larger than 6.5 or 7 percentage points (for sample sizes of 2,664 or 2,304, respectively). Similarly, tests of whether one of the algorithm subgroups reduces perceived discrimination relative to the manager who knows demographics are powered to detect effects larger than 8 percentage points for either sample size. Given the results from a pilot study, the effect sizes are expected to be larger than these MDEs.
1.2. Treatment effect heterogeneity. Treatment effect heterogeneity is powered as follows: tests of whether the effect of the algorithm depends on whether the worker knows the race of the historically assigned workers are powered to detect differences larger than 9.5 or 10 percentage points (for sample sizes of 2,664 or 2,304, respectively). Tests of whether the effect of the demographic-blind human differs from one algorithm subgroup are powered to detect differences of 8 percentage points and tests that the effect of the demographic-blind human differs from both algorithm subgroups (which are pooled and don't differ from each other) are powered to detect differences larger than 6.5 or 7 percentage points (for sample sizes of 2,664 or 2,304, respectively).
1.3. Racial and gender heterogeneity. Racial and gender heterogeneity is powered as follows: when testing for heterogeneity in the effects of the blinded manager, gender heterogeneity among nonwhite participants and racial heterogeneity among men are powered to detect differences in the treatment effect of 15 percentage points, gender heterogeneity among white participants is powered to detect differences in the treatment effect of 18 percentage points, and racial heterogeneity among women is powered to detect differences in the treatment effect of 20 percentage points. For each group, tests for heterogeneity in the effects of the algorithm have MDEs about 3 percentage points larger than the MDEs for differences in the effects of the blinded manager. These MDEs come from simulations with a sample size of 2,664; with a sample size of 2,304 each MDE is about 1 percentage point larger.
2. Reduced form regressions
2.1. Binary outcomes. Regressions to test the effects of the demographic-blind manager on the binary measures of retention (completing only the minimum 6 paragraphs, or completing all 18 paragraphs) are powered to detect effects larger than 5 or 5.5 percentage points (for sample sizes of 2,664 or 2,304, respectively), and tests of the effect of one algorithm subgroup are powered to detect effects larger than 6 or 6.5 percentage points (for sample sizes of 2,664 or 2,304, respectively). In pilot data, 12 percent of workers completed only 6 paragraphs and 68 percent completed all 18 paragraphs; these rates are assumed in the calculations.
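To illustrate how a simulation-based power check for a binary outcome might look, here is a minimal sketch. It is not the simulation code behind the figures above. It assumes, hypothetically, that the 2,664 workers split evenly across three mechanisms (888 per arm), uses a simple two-sample z-test for proportions in place of the regression specification, and takes the 68 percent pilot completion rate as the control mean. The function name and its parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulated_power(p_control, effect, n_per_arm, sims=1000, alpha=0.05, seed=0):
    """Share of simulated experiments in which a two-sided z-test for a
    difference in proportions rejects the null of no effect."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(sims):
        # Draw Bernoulli outcomes for the control and treated arms.
        y0 = sum(rng.random() < p_control for _ in range(n_per_arm))
        y1 = sum(rng.random() < p_control + effect for _ in range(n_per_arm))
        p0, p1 = y0 / n_per_arm, y1 / n_per_arm
        # Unpooled standard error of the difference in proportions.
        se = sqrt(p0 * (1 - p0) / n_per_arm + p1 * (1 - p1) / n_per_arm)
        if se > 0 and abs(p1 - p0) / se > z_crit:
            rejections += 1
    return rejections / sims


# Hypothetical inputs: 888 workers per arm, 68 percent baseline rate,
# and a 5.5 percentage point effect.
power = simulated_power(p_control=0.68, effect=0.055, n_per_arm=888)
print(f"simulated power: {power:.2f}")
```

An MDE would then be found by searching over `effect` for the smallest value at which the simulated power reaches 80 percent. Because this sketch omits covariates and the exact arm sizes, its output should not be expected to match the registered MDEs.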
2.2. Continuous outcomes. All other outcomes are continuous. Regressions to test the effects of the demographic-blind manager are powered to detect effects larger than 0.14sd or 0.15sd (for sample sizes of 2,664 or 2,304, respectively), and tests of the effect of one algorithm subgroup are powered to detect effects larger than 0.17sd (for either sample size).
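As a rough cross-check on MDEs stated in standard deviation units, the textbook closed-form approximation for a two-arm comparison can be sketched as follows. This is an analytic stand-in, not the method used to produce the figures above. The arm sizes are assumptions: the sample is taken to split evenly across three mechanisms, with the algorithm arm split evenly into two subgroups. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_in_sd(n_treat, n_control, alpha=0.05, power=0.80):
    """Analytic MDE, in outcome standard deviations, for a two-sided
    comparison of means between two arms with equal variances."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)  # about 2.8
    return multiplier * sqrt(1 / n_treat + 1 / n_control)


# Assumed arm sizes: 2,664 / 3 = 888 per mechanism, 444 per algorithm subgroup.
blind_manager_mde = mde_in_sd(n_treat=888, n_control=888)
algorithm_subgroup_mde = mde_in_sd(n_treat=444, n_control=888)

print(f"demographic-blind manager: {blind_manager_mde:.3f} sd")
print(f"one algorithm subgroup:    {algorithm_subgroup_mde:.3f} sd")
```

Under these assumptions the approximation lands in the neighbourhood of the stated MDEs, though slightly below them. An exact match should not be expected, since the approximation ignores the regression specification.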
3. Two-stage least squares regressions
3.1. The two-stage least squares power calculations assume that the effects of the treatments on perceived discrimination are quite large, effectively taking the rate of perceived discrimination to zero in the demographic-blind manager group and both algorithm subgroups. This is consistent with piloting (though in very small samples).
3.2. Under this assumption, two-stage least squares regressions are powered to detect effects of reducing perceived discrimination that are larger than 4 or 5 percentage points on the binary outcomes (for sample sizes of 2,664 or 2,304, respectively), and effects larger than 0.12sd on the continuous outcomes (for either sample size).
