Experimental Design Details
We ran a preparatory stage, which we refer to as the "previous experiment", in which we invited participants to generate the series of sequential binary outcomes that we use in our main experiment. We used daily historical S&P500 trading data and showed subjects information about the past performance of a stock, i.e., whether the stock had gone up or down on each of 5 consecutive trading days. We then asked them to predict whether the stock would go up or down on the 6th trading day. Each participant made nine such predictions for different, randomly chosen stocks at random points in time. Analogously, we performed the same task with AI algorithms (ChatGPT and Microsoft Copilot). We also generated sequences from dice rolls, where success depended on pre-defined thresholds.
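To fix ideas, the sketch below illustrates how a single task in the preparatory stage could be coded as a binary outcome; the function names, the success criterion for the dice rolls, and the use of Python are illustrative assumptions and not part of our experimental software.

```python
import random

def stock_prediction_outcome(prediction_up: bool, sixth_day_up: bool) -> int:
    """Code a stock-prediction task as a binary outcome (1 = prediction correct).

    The predictor (human participant or AI algorithm) sees whether the stock
    went up or down on 5 consecutive trading days and predicts the 6th day;
    prediction_up and sixth_day_up encode the prediction and the realization.
    """
    return int(prediction_up == sixth_day_up)

def dice_roll_outcome(threshold: int = 4) -> int:
    """Code a dice roll as a binary outcome (1 = success).

    A roll counts as a success if it reaches a pre-defined threshold;
    the threshold of 4 is an arbitrary illustrative value.
    """
    return int(random.randint(1, 6) >= threshold)
```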
In the first stage of our main experiment, we invite a new set of participants and randomly allocate them to one of three treatments. Participants are provided with 24 sequences from the preparatory stage. The sequences are identical across treatments, but the source of the outcomes varies, i.e., sequences originate from a human, an AI algorithm, or dice rolls, depending on the treatment. Each sequence consists of 8 consecutive outcomes. Subjects see the sequences in random order and state, for each sequence, whether they expect the ninth outcome to be correct or not. In 16 of these sequences, either the first half or the second half is a streak of 4 identical outcomes; the other 8 sequences alternate between successes and failures more frequently. Each sequence is provided in normal, reversed, and inverted order. Our sequences thus differ with respect to the number of successes, whether streaks consist of successes or failures, and whether streaks fall in the first or the second half of the sequence. This allows us to investigate how reactions to these features differ across treatments.
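For illustration, the following sketch shows how the normal, reversed, and inverted variants of a sequence relate to each other; the example sequence and the 0/1 coding are illustrative assumptions rather than the sequences actually used in the experiment.

```python
def reverse(seq):
    """Reverse the temporal order of an outcome sequence."""
    return list(reversed(seq))

def invert(seq):
    """Flip successes (1) and failures (0)."""
    return [1 - x for x in seq]

# An illustrative 8-outcome sequence whose first half is a streak of
# 4 identical outcomes (successes) and whose second half alternates.
base = [1, 1, 1, 1, 0, 1, 0, 1]

variants = {
    "normal":   base,           # streak of successes in the first half
    "reversed": reverse(base),  # streak moves to the second half
    "inverted": invert(base),   # streak becomes a streak of failures
}
```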
In the second stage, we randomly allocate participants to one of two treatments. We provide feedback regarding their success in the first stage and let participants choose whether another sequence they predict originates from a human or from an algorithmic source. For half of our participants, the answer graded as correct in the first stage is aligned with the modal answer from a previously run pilot; naturally, one would expect a high rate of correct answers for participants in this treatment. For the other half, the answer graded as incorrect is aligned with the modal answer from the pilot. Note that this treatment-dependent definition of the correct answer affects only 23 of the 24 sequences: we can only vary the correct answer for sequences for which the preparatory stage produced variation in the ninth outcome (i.e., without deception). This intervention allows us to study how participants react to variation in their correct answers across treatments. We analyze whether participants in the human and algorithmic treatments switch away from their first-stage source, and we use the dice treatment as an informative benchmark.
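The alignment logic of this intervention can be summarized in the following sketch; the function and variable names are hypothetical, and the snippet abstracts from all implementation details of the actual experiment.

```python
def graded_correct_answer(modal_pilot_answer: int,
                          ninth_outcomes: list[int],
                          aligned_treatment: bool) -> int:
    """Return the answer graded as correct for one sequence (illustrative sketch).

    modal_pilot_answer: modal answer (0/1) given for this sequence in the pilot.
    ninth_outcomes: ninth outcomes observed for this sequence in the
        preparatory stage; only if both 0 and 1 occurred can the graded
        answer vary across treatments without deception.
    aligned_treatment: True if the participant is in the treatment where the
        correct answer coincides with the pilot's modal answer.
    """
    if len(set(ninth_outcomes)) < 2:
        # No variation in the preparatory stage: the graded answer is pinned
        # down by the data and cannot differ across treatments.
        return ninth_outcomes[0]
    return modal_pilot_answer if aligned_treatment else 1 - modal_pilot_answer
```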
The experiment concludes by eliciting a set of control variables. In particular, we measure statistical literacy, cognitive reflection (CRT) scores, self-reported AI expertise and beliefs, susceptibility to the gambler's fallacy, and self-reported demographics such as age and education level.