Experimental Design
We aim to investigate these research questions through an experiment embedded in
a realistic machine learning prediction task. Participating professional software developers are
asked to develop an ML model for real estate price estimation using a provided proprietary dataset.
The experimental design comprises four groups differentiated by (i) the availability of a pre-defined
XAI tool and (ii) the availability of participants' own mental models of the prediction task. We
administer the XAI treatment by providing participants with a pre-defined, easy-to-use XAI function
that automatically displays several kinds of intuitive model explanations.
To control the availability of mental models about the prediction task, we mask the feature names so
that participants cannot infer the meaning of the original features. To prevent participants from inferring
the meaning of the original variables from the variable distributions, we further normalize the values
in our dataset. We apply the normalization consistently across all experimental groups to maintain
comparability.
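The two dataset manipulations above can be sketched as a single preprocessing step. The function name and the choice of min-max scaling are our illustration; the design only specifies that feature names are masked and values are normalized.

```python
import pandas as pd

def mask_and_normalize(df, target="price"):
    """Illustrative sketch of the dataset preparation: replace feature
    names with neutral placeholders so participants cannot infer feature
    semantics, and rescale values so distributions do not reveal the
    original units. Min-max scaling to [0, 1] is an assumption; the
    design does not specify the normalization method."""
    features = [c for c in df.columns if c != target]
    mapping = {c: f"feature_{i + 1}" for i, c in enumerate(features)}
    out = df.rename(columns=mapping)
    for col in mapping.values():
        lo, hi = out[col].min(), out[col].max()
        # Guard against constant columns to avoid division by zero.
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    return out
```

Applying the same function to all four groups' datasets keeps the manipulation consistent across conditions, as described above.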
The experiment will be conducted by delegating the prediction task to developers via Amazon Mechanical Turk (MTurk), allowing participants to engage remotely. We developed the user interface for the task explanation, survey-based data collection, and the distribution of datasets and further materials using the Python-based oTree framework (Chen et al., 2016), to which MTurk participants are forwarded. Upon entering the oTree environment, participants are assigned to one of the four experimental groups described above using Python's itertools.cycle() function, which balances group sizes across arriving participants. Although the online setting reduces control over participants, it better simulates real-world data challenges and fosters more natural developer behavior.

The task involves predicting apartment prices based on a proprietary dataset. We compiled the dataset by scraping 5,090 apartment listings from a large online platform, focusing on the seven largest German cities during 2022. The dataset contains features such as the listing price per square meter, the number of rooms, the construction year, and the presence of balconies and basements. We further augmented the data with third-party information, namely the percentage of Green Party voters and local unemployment rates.

Before starting the prediction task, we ask participants to sort the available features in descending order of their perceived importance for apartment price predictions. This task elicits participants' prior beliefs about the influence of each variable on apartment prices, which we refer to as their mental model. Participants indicate whether a variable has a positive or negative impact and the approximate magnitude of this impact. Within the oTree environment, each participant receives a GitHub repository containing basic instructions for the prediction task and a link to a Google Colab notebook.
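The itertools.cycle()-based assignment mentioned above can be sketched as follows. The treatment labels and the helper's name are our illustration; the paper's Table 1 names only the XAI conditions A1 and B1.

```python
import itertools

# Illustrative labels for the 2x2 design (XAI availability x mental-model
# availability); only A1 and B1 are named in the text.
TREATMENTS = ["A0", "A1", "B0", "B1"]

def assign_treatments(participant_ids):
    """Round-robin assignment via itertools.cycle(): each arriving
    participant receives the next treatment cell, keeping group sizes
    balanced. With participants arriving in effectively random order,
    this yields the individual-level assignment described above."""
    cycler = itertools.cycle(TREATMENTS)
    return {pid: next(cycler) for pid in participant_ids}
```

In oTree itself, the same pattern is typically placed in a session-creation hook that iterates over players, but the core mechanism is the cycle shown here.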
Participants will develop their models within this notebook and submit their final solutions. We have prepared four distinct GitHub repositories, each corresponding to one of the experimental groups to which participants were assigned earlier. In the conditions involving XAI (Groups A1 and B1, see Table 1), we have incorporated a comprehensive XAI function directly into the Colab notebook. Participants can call this function by providing their trained prediction model along with the train and test data as inputs. The function automatically computes importance scores for all features in the form of SHAP values. SHAP values are well suited to our task because they provide reliable local and global feature-importance explanations with intuitive visualizations. The function generates an interactive dashboard in the Colab code cell output frame. Within this dashboard, participants can activate three different SHAP visualizations: the bar plot, the beeswarm plot, and the interaction plot (Ponce-Bobadilla et al., 2024). Furthermore, we provide explanation narratives based on these SHAP values, inspired by Martens et al. (2025). We introduce the function in the first Colab cell with a text summarizing all important information and guidelines for using it.

Participants will have 10 days to complete the data challenge. After the experiment, participants will complete a post-experiment survey assessing their confidence in their submitted models and other secondary measures. This subjective measure will provide insights into how the different treatment conditions affect participants' perceived performance. To assess possible changes in their mental models, we ask participants to again sort the available features in descending order of perceived importance.

We will begin with a pilot study of 10 participants to refine the experimental design and materials.
Afterward, the study will scale to approximately 200 participants, with 50 per treatment group, ensuring sufficient statistical power. Participants will be compensated above U.S. minimum wage, and the top performer in each group, evaluated on a hold-out dataset, will receive an additional prize payment.