Experimental Design Details

There are 15 inference-problem treatments, described briefly below. Our primary focus is on the fraction of participants whose beliefs fall into different "modes". We hypothesize that these modes will be the following: the base rate, 50-50, (close to) the Bayesian answer, and the "likelihood" (i.e., P(Signal | Hypothesis)). We also hypothesize that some respondents will answer with P(Signal & Hypothesis) (i.e., failing to renormalize by the total probability of the signal, which includes the likelihood of the signal conditional on the alternative hypothesis).
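For concreteness, the candidate modes can be computed for a single parameterization. The numbers below are purely illustrative (the actual treatments vary these parameters):

```python
# Hypothetical parameterization (illustrative only; actual treatments vary these):
prior = 0.30      # base rate: P(Hypothesis), e.g., P(Jar A)
lik_h = 0.80      # likelihood: P(Signal | Hypothesis)
lik_alt = 0.20    # P(Signal | alternative hypothesis)

joint = prior * lik_h                            # P(Signal & Hypothesis)
bayes = joint / (joint + (1 - prior) * lik_alt)  # posterior by Bayes' rule

modes = {
    "base rate": prior,                  # 0.30
    "50-50": 0.50,
    "Bayesian answer": round(bayes, 3),  # ~0.632
    "likelihood": lik_h,                 # 0.80
    "P(Signal & Hypothesis)": joint,     # ~0.24; no renormalization
}
```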

1. Balls-and-urns control condition

2. Blue-cab green-cab problem. We hypothesize that this will increase the mode at the likelihood, compared to treatment 1.

3. "Undermine" cabs. We hypothesize this will reduce the mode at the likelihood compared to treatment 2.

4. "Cabified" balls-and-urns. We hypothesize this will increase the mode at the likelihood compared to treatment 1.

5. Balls-and-urns with less extreme likelihood.

6. Balls-and-urns with more extreme likelihood. We hypothesize this will increase the modes at the likelihood and the Bayesian answer, relative to the modes at the base rate and P(Signal & Hypothesis), compared to treatment 5.

7. Complicated signal (5 green balls and 4 blue, rather than just 1 green ball). We hypothesize this will boost the mode at the base rate compared to treatment 1.

8. 2 Green Signals. We hypothesize multimodality in these beliefs, but are not comparing them to another treatment.

9. 1 Green Signal, 1 Irrelevant signal.

10. 1 Green Signal, No Irrelevant Signal. We hypothesize that treatment 9 will have an increased mode at the base rate or at 50-50 compared to this treatment.

11. Balls and urns but only explicitly asking about one hypothesis. We hypothesize that this will increase the mode at P(Signal & Hypothesis) compared to treatment 1.

12. "Small Green Urn". Base rate = 50%, P(Green | Jar A) = 50%, P(Green | Jar B) = 100%. However, the problem is described in terms of frequencies (how many marbles are in each jar): Jar B has 5 green marbles, and Jar A has 5 green and 5 blue.

13. "Big Green Urn". Same as treatment 12, but Jar B has 15 green marbles. We hypothesize a shift away from 50-50, and toward 25% (Jar A's share of the green marbles across the two jars), compared to treatment 12.

14. "Elementary description". Same statistical problem as treatment 1, but the probability of each event (e.g., a green marble from Jar A) is described individually. We hypothesize an increased mode around the Bayesian answer compared to treatment 1.

15. Elementary description with the alternative implicit. We hypothesize this will increase the mode at P(Signal & Hypothesis) compared to treatment 14.
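Treatments 12 and 13 can be checked numerically. Since P(Green | Jar B) = 100% in both, Bayes' rule gives the same posterior on Jar A (1/3) in both treatments; what changes is Jar A's share of the green marbles, which falls from 50% to 25% when Jar B grows. A sketch:

```python
def posterior_jar_a(p_a, lik_a, lik_b):
    """P(Jar A | green) by Bayes' rule."""
    return p_a * lik_a / (p_a * lik_a + (1 - p_a) * lik_b)

# Treatment 12 ("Small Green Urn"): Jar A = 5 green + 5 blue, Jar B = 5 green.
# Treatment 13 ("Big Green Urn"): same, except Jar B = 15 green.
for label, b_green in [("small", 5), ("big", 15)]:
    bayes = posterior_jar_a(0.5, 5 / 10, 1.0)  # P(Green | Jar B) = 1 in both
    marble_share = 5 / (5 + b_green)           # Jar A's share of all green marbles
    print(label, round(bayes, 3), marble_share)
```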

Our model suggests that attention to different features of the problem correlates with which mode beliefs sort into. We will measure attention using both participants' free-text descriptions of how they solved the problem and their answers to the questions asking directly which features they were paying attention to.

We hypothesize that attention to:

1. the color of the blue-vs-green marble/cab will correlate with answering with the likelihood,

2. the urn/cab company will correlate with the base rate,

3. the "match" between urn and signal, or to whether the witness's report is correct, will correlate with answering with the likelihood,

4. both color/match and urn will correlate with the Bayesian answer,

5. nothing will correlate with 50-50, and

6. the irrelevant signal will correlate with 50-50 or the base rate.

We have six gambler's fallacy treatments:

1. TH vs HH: Asks for the relative frequency of these two-flip coin sequences.

2. THTHHT vs HHHHHH: we hypothesize this will decrease the mode at 50-50 and shift the mean belief down (where lower beliefs correspond to committing the gambler's fallacy more) compared to treatment 1.

3. HHHHHT vs HHHHHH.

4. P(H | HHHHH): Same as treatment 3, except that it emphasizes that the problem asks for the probability that the final flip is heads vs tails, conditional on the first five flips all being heads (rather than just asking about the likelihood of each sequence as a whole). We hypothesize this will increase the mode at 50-50 compared to treatment 3.

5. Priming control condition. Before participants answer the main question (which will be about THTHHT vs HHHTHH), they will rate 15 pairs of sequences by how similar they are to each other, where this means how many individual flips differ between them (e.g., first flip is heads in one but tails in the other).

6. Priming share heads: Same as treatment 5 but the ratings questions will ask about what share of flips in each sequence are heads vs tails and ask participants to rate differences between them on this basis. We intend this treatment to boost attention paid to share heads in the main gambler's fallacy question (about THTHHT vs HHHTHH), which we will measure using self-reported attention. If it succeeds in sufficiently boosting attention to share heads, we hypothesize that it will reduce the fraction of participants who answer with 50-50 and lower the mean belief (where lower beliefs correspond to exhibiting the gambler's fallacy more).
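Note that in every treatment above the normative answer is 50-50: with a fair coin, any two specific sequences of the same length are equally likely. A quick check:

```python
def seq_prob(seq, p_heads=0.5):
    """Probability of observing an exact flip sequence from a fair coin."""
    return p_heads ** seq.count("H") * (1 - p_heads) ** seq.count("T")

# Pairs from treatments 1-3; each pair compares equal-length sequences.
pairs = [("TH", "HH"), ("THTHHT", "HHHHHH"), ("HHHHHT", "HHHHHH")]
for a, b in pairs:
    assert seq_prob(a) == seq_prob(b)  # both equal 0.5 ** len(a)
```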

Our model suggests that attention to the share of heads vs tails will correlate with committing the gambler's fallacy.

In April 2024, we will run additional treatments with the following design. All participants will answer two inference problems and two "compound probability reduction" problems. These will allow us to see whether answers are correlated across problems within person. The second inference problem is always an identical balls-and-urns problem. The first inference problem has either a low (60%) or a high (90%) likelihood, meant to vary the contrast of the signal and thereby boost the share of participants who respond with the likelihood. This design will allow us to test the extent to which treatment effects from altering the first inference problem spill over onto participants' answers to the second inference problem, which is held fixed. We will also vary whether the first inference problem has a balls-and-urns (N = 2,500) or taxicabs (N = 500) framing to test whether answers are more strongly correlated within than across frames. We include a larger sample for balls-and-urns to maximize precision for testing the spillover effects of varying the contrast in the first problem on the second problem (which is always a balls-and-urns problem, and we expect spillovers, if there are any, to arise within rather than across frames).

The compound probability problems are meant to test whether people's answers cluster on the modes our model predicts, as well as to investigate correlation across problems. The problems simply tell participants a prior (the odds a computer will choose the "orange" vs the "purple" deck of cards) and a likelihood for each deck (the share of cards whose suit is spades). They then ask, given these numbers, for the probability that a spade is drawn. Participants will be evenly split across 4 parameterizations.
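The "reduction" each problem calls for is the law of total probability. The four actual parameterizations are not restated here, so the numbers below are purely illustrative:

```python
# Hypothetical parameterization (illustrative; not one of the four actually used):
p_orange = 0.70      # prior: odds the computer picks the orange deck
spade_orange = 0.40  # share of spades in the orange deck
spade_purple = 0.10  # share of spades in the purple deck

# P(spade) = P(orange) * P(spade | orange) + P(purple) * P(spade | purple)
#          = 0.7 * 0.4 + 0.3 * 0.1 = 0.31
p_spade = p_orange * spade_orange + (1 - p_orange) * spade_purple
```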

In May 2024, we will run additional treatments with the following design. Participants will solve two gambler's fallacy problems separated by a distraction task (Raven's matrices). The second problem is held constant (comparing THTHHT vs HHHHHH), and we randomize the first problem (either comparing TH vs HH or HTHTTH vs HHHHHH). We will run N = 1,000 participants total, evenly split between these two first-problem treatments. Like the April experiment, this design tests for correlation in answers across problems that vary in their similarity to each other, as well as whether inducing one mode of answer in the first problem (TH vs HH tends to produce 50-50 answers more than HTHTTH vs HHHHHH) has spillover effects on later problems.