Experimental Design Details
The study is an asynchronous online survey experiment in which participants answer questions without any real-time interaction with one another. The study is conducted entirely on Qualtrics.
The experiment is based on a simple Minimum Effort Game (MEG), a coordination game with multiple Pareto-ranked equilibria. In its baseline version, each subject chooses an effort level; the lowest effort level chosen among all group members is the one implemented in the group; payoffs increase with the group effort level but decrease with individual effort (keeping group effort fixed). In our design, subjects play in groups of 5 players and can choose an effort level from 0 to 3.
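The exact payoff parameters are given in the instructions; as a purely illustrative sketch, assume a standard linear MEG payoff of the form pi_i = a + b * min_j(e_j) - c * e_i, where a, b and c below are hypothetical values, not the ones used in the experiment:

    def meg_payoff(own_effort: int, group_efforts: list[int],
                   a: float = 10.0, b: float = 4.0, c: float = 2.0) -> float:
        """Illustrative MEG payoff: pi_i = a + b * min(e) - c * e_i.

        a, b and c are hypothetical parameters. With b > c, every common
        effort level (0..3) is a Nash equilibrium, and higher common
        effort Pareto-dominates lower common effort.
        """
        group_min = min(group_efforts)  # the lowest effort is implemented
        return a + b * group_min - c * own_effort

    # Example: in a 5-player group, one low-effort member drags the
    # implemented effort down to 1 for everyone.
    efforts = [3, 3, 3, 3, 1]
    print([meg_payoff(e, efforts) for e in efforts])  # [8.0, 8.0, 8.0, 8.0, 12.0]

Under these illustrative parameters the low-effort player earns more than the others ex post, even though all playing 3 would give everyone the highest payoff (16 each here): this is the coordination risk the leader's message is meant to overcome.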
In the baseline treatment, subjects simply play the standard MEG. Then, each "experienced" player plays the MEG again, this time in the role of group leader, matched with 4 new participants who have never played the MEG before. The leaders must write a text in which they try to entice the other group members to choose the highest effort level. Then, the leaders see ChatGPT's output for the same task, and they can decide whether to keep their own text or use the one generated by the AI. There is no direct interaction between the leader and the AI: leaders are only shown ChatGPT's answer, first by means of a recorded video, then as written text. The message (written by the leader or generated by ChatGPT) is then shown to all other group members before they decide their effort level. Followers know whether the message is human-written or AI-generated, and the leader knows that followers will have this piece of information: there is complete transparency.
In this way, we can observe three different treatments:
• Baseline MEG: 1st part, played by the (future) leaders
• Leader/Human Message: 2nd part, where the leader chooses to use his/her own text
• Leader/AI Message: 2nd part, where the leader chooses to use ChatGPT's output
However, the choice between the AI and the human message is obviously endogenous. Thus, to have balanced samples and sufficient statistical power to test a potential difference between treatments, we may need to discard some observations, in particular redundant 2nd-part leaders' choices. To do so without affecting subjects' payoffs, at the beginning of the experiment we will tell the leaders that their earnings will depend on their results in either the 1st or the 2nd part.
In order to understand the mechanisms behind a potential difference between groups where the advice is human-written and those where it is AI-generated, we need to elicit subjects' beliefs. To do so, we ask followers two questions after they have chosen their effort level but before they see the result:
• Which effort level do you think the leader will choose?
• Of the other three followers, how many will choose the highest effort level (3)?
We are interested in leaders' beliefs too: in a similar way, we elicit how many followers they expect to choose the maximum effort.
The quality of the text written by leaders could be a key variable to explain both leaders' and followers' choices. To evaluate it, we will run a follow-up survey on a different sample of subjects, asking them to rate the messages written by the leaders. Each subject will rate only a subsample (e.g. 20 out of 165) of the total messages. Moreover, we will also insert the AI-generated message among those to be rated, and ask subjects to identify it (similar to a Turing test).
Regarding subjects' earnings, we will pay only 20% of the participants. The last question of the end-of-study survey will be a Beauty Contest: subjects must choose a number between 0 and 100, and the 20% of them whose choices are closest to 15 + 2/3 of the average number chosen will be selected to receive the payment.
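As a minimal sketch of this selection rule (the guesses, and tie-breaking by sort order, are invented for illustration):

    def beauty_contest_winners(guesses: dict[str, float], share: float = 0.20):
        """Select the `share` of participants whose guess is closest to
        15 + 2/3 of the average guess (illustrative implementation)."""
        target = 15 + (2 / 3) * (sum(guesses.values()) / len(guesses))
        ranked = sorted(guesses, key=lambda s: abs(guesses[s] - target))
        n_paid = max(1, round(share * len(guesses)))
        return target, ranked[:n_paid]

    # Hypothetical guesses from five participants:
    target, paid = beauty_contest_winners({"p1": 50, "p2": 33, "p3": 40,
                                           "p4": 25, "p5": 60})
    print(round(target, 2), paid)  # 42.73 ['p3']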
Follow-up update:
To obtain an exogenous measure of the persuasiveness of the messages, we will run a follow-up survey with another sample of subjects to evaluate the leaders' messages. We will recruit 64 subjects; each will evaluate the persuasiveness of 5 different messages on a scale from 1 to 10, without knowing the source of the message (AI or human). Participants will be shown the same instructions that the leaders saw in the main experiment.
We will obtain 5 persuasiveness scores for each of these 64 messages:
• 33 messages that were used in the Leader/Human Message treatment.
• 21 messages that were written by leaders who chose to use the AI-generated message.
• 10 AI-generated messages that were preregistered, among which is the AI-generated message that was actually used in the main experiment.
Moreover, the subjects will also complete an AI detection task, in which they state the probability (from 0% to 100%) that a message was generated by AI. They will do this task for 5 messages, all different from the 5 that they rated before.
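The assignment of messages to raters is not specified above. One simple scheme, sketched here as an assumption rather than the actual procedure, gives each of the 64 messages exactly 5 persuasiveness ratings (64 subjects x 5 ratings = 320 = 64 messages x 5 scores) while keeping each rater's 5 detection-task messages disjoint from the 5 messages they rated: a cyclic assignment.

    from collections import Counter

    def cyclic_assignment(n_subjects: int = 64, n_messages: int = 64, k: int = 5):
        """Subject i rates messages i..i+k-1 (mod n_messages) and does the
        AI-detection task on messages i+k..i+2k-1 (mod n_messages); the two
        sets never overlap as long as 2*k <= n_messages."""
        plan = []
        for i in range(n_subjects):
            rate = [(i + j) % n_messages for j in range(k)]
            detect = [(i + k + j) % n_messages for j in range(k)]
            plan.append((rate, detect))
        return plan

    plan = cyclic_assignment()
    # Sanity check: every message receives exactly 5 ratings
    # (64 subjects * 5 ratings / 64 messages = 5 per message).
    counts = Counter(m for rate, _ in plan for m in rate)
    assert all(c == 5 for c in counts.values())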
Participants will receive a fixed payment of 5€.