Experimental Design Details
We conduct a randomized controlled online experiment to examine the causal effects of different AI decision-support settings, namely AI explanations and AI certainty scores, on performance outcomes and users' perceived uncertainty. Participants will be recruited from the crowdworking platform Prolific, using a US-based sample. Each participant is randomly assigned to one of four between-subjects treatment conditions and therefore receives AI recommendations only, AI recommendations with AI explanations, AI recommendations with AI certainty scores, or AI recommendations with both AI explanations and AI certainty scores. This design isolates and assesses the impact of AI explanations and AI certainty scores on decision performance and user perceptions.
The online experiment employs a between-subjects design, i.e., each subject participates in exactly one of the four treatments. Subjects see the decision support of their assigned condition, but they do not know that it constitutes a treatment, nor are they aware of the other treatments. Treatments are randomized at the individual subject level.
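The individual-level randomization described above can be sketched as follows. This is a minimal illustration, not the study's actual assignment code; the condition labels and the 2 x 2 encoding (explanations x certainty scores) are our shorthand for the four treatments.

```python
import random

# The four between-subjects conditions, encoded as a 2 x 2 design:
# presence/absence of AI explanations and of AI certainty scores.
# (Labels are illustrative, not taken from the study materials.)
CONDITIONS = [
    {"explanations": False, "certainty_scores": False},  # AI recommendations only
    {"explanations": True,  "certainty_scores": False},  # + explanations
    {"explanations": False, "certainty_scores": True},   # + certainty scores
    {"explanations": True,  "certainty_scores": True},   # + both
]

def assign_condition(rng=random):
    """Assign one subject independently and uniformly to a condition,
    i.e., randomization at the individual subject level."""
    return rng.choice(CONDITIONS)
```

Because each subject is assigned independently, cell sizes are only balanced in expectation; a platform-side quota or blocked randomization would be needed for exactly equal groups.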
Participants in the experiment complete a series of credit score classification tasks. They are given applicants' credit-related information, and their goal is to predict the correct credit score category as one of three classes: Poor, Standard, or Good. The experimental task is based on a credit scoring dataset from the online platform Kaggle. For the experiment, we randomly selected 36 instances from the dataset to display to participants, ensuring a balanced set of tasks in terms of credit scores (Poor, Standard, Good) and difficulty as indicated by AI certainty scores (low and high certainty). In addition, the accuracy of the AI predictions is the same in all phases and reflects the overall accuracy of the AI system (about 71%). The selected instances are the same in each treatment but are presented in randomized order within each phase to avoid order effects.
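The balanced selection of the 36 instances can be sketched as stratified sampling over the six cells of credit score x certainty level (6 instances per cell), followed by within-phase shuffling. The field names (`credit_score`, `ai_certainty`) are assumptions for illustration, not the actual Kaggle dataset schema.

```python
import random

def select_balanced_instances(pool, n_total=36, seed=42):
    """Draw instances balanced over 3 credit-score classes and
    2 AI-certainty levels: 6 cells x 6 instances = 36 tasks."""
    rng = random.Random(seed)
    cells = [(score, certainty)
             for score in ("Poor", "Standard", "Good")
             for certainty in ("low", "high")]
    per_cell = n_total // len(cells)  # 6 per cell
    selected = []
    for score, certainty in cells:
        candidates = [x for x in pool
                      if x["credit_score"] == score
                      and x["ai_certainty"] == certainty]
        selected.extend(rng.sample(candidates, per_cell))
    return selected

def shuffle_within_phase(instances, rng):
    """Randomize display order within a phase to avoid order effects;
    the set of instances stays identical across treatments."""
    order = instances[:]
    rng.shuffle(order)
    return order
```

Seeding the selection once and re-shuffling per participant keeps the instance set constant across treatments while varying only presentation order, as described above.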
To assist participants in making accurate classifications, we present them with six key characteristics of each applicant: Age, Annual Income, Number of Credit Cards, Interest Rate on Credit Card(s), Outstanding Debt, and Occupation. Their objective is to predict the credit score of each applicant according to the ground truth in the original dataset. Participants receive a fixed participation fee of £3.00 and a bonus payment of £0.15 for each correctly classified applicant. Accordingly, total compensation ranges from a minimum of £3.00 to a maximum of £8.40. Based on findings from the pilot study, the experimental session is expected to last about 30 minutes.
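The compensation rule above is a simple linear payoff, which can be sketched as follows (computing in pence to avoid floating-point rounding of currency amounts). The function name and structure are illustrative, not taken from the study's implementation.

```python
BASE_FEE_PENCE = 300          # fixed participation fee: £3.00
BONUS_PER_CORRECT_PENCE = 15  # bonus per correct classification: £0.15
N_TASKS = 36                  # number of classification tasks

def total_payment(n_correct):
    """Total compensation in pounds for a given number of correct
    classifications (0 <= n_correct <= 36)."""
    assert 0 <= n_correct <= N_TASKS
    return (BASE_FEE_PENCE + BONUS_PER_CORRECT_PENCE * n_correct) / 100
```

With all 36 tasks correct this yields the stated maximum of £8.40 (£3.00 + 36 x £0.15), and £3.00 at the minimum.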
To maintain the quality and validity of the data, iterative recruiting will be implemented based on participants’ performance on attention checks. If a participant fails an attention check, additional participants will be recruited to replace them. This iterative approach ensures that the final sample consists of high-quality data while adhering to the study's pre-defined inclusion criteria and timeline. Recruitment for replacements will be conducted promptly to minimize delays and maintain a consistent data collection process.
Subjects who do not complete the experiment will be excluded from the analysis. Subjects with missing data will also be excluded, as missing data indicates that a technical error occurred for them during the experiment.