The Social Dilemma of Big Data

Last registered on September 14, 2020

Pre-Trial

Trial Information

General Information

Title
The Social Dilemma of Big Data
RCT ID
AEARCTR-0006241
Initial registration date
September 13, 2020

The initial registration date is when the trial was registered, i.e., when the registration was submitted to the Registry to be reviewed for publication.

First published
September 14, 2020, 7:38 AM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

Region

Primary Investigator
Kirsten Hillebrand

Affiliation
University of Bremen

Other Primary Investigator(s)
Lars Hornuf

PI Affiliation
University of Bremen

Additional Trial Information

Status
In development
Start date
2020-09-13
End date
2020-09-25
Secondary IDs
Abstract
Decision-support systems can influence people in various domains of life. Firms have started implementing these systems via chatbots and other natural language-based assistants. While benefiting from these decision-support systems, individuals provide sensitive and valuable data to private industry. By using this data, companies may generate additional profits. Moreover, by making their data available, individuals may also promote the common good. Regulators have recently called for efficient tools to let the public collectively benefit from its own data. This study investigates the conditions under which people voluntarily upload their data to a database that is used to develop and operate a decision-support system promoting a public good.
External Link(s)

Registration Citation

Citation
Hillebrand, Kirsten and Lars Hornuf. 2020. "The Social Dilemma of Big Data." AEA RCT Registry. September 14. https://doi.org/10.1257/rct.6241-1.0
Experimental Details

Interventions

Intervention(s)
Intervention Start Date
2020-09-13
Intervention End Date
2020-09-25

Primary Outcomes

Primary Outcomes (end points)
Individual willingness to upload personal data to a database that is used to develop and operate a decision-support system to promote a public good.
Primary Outcomes (explanation)
Participants indicate their willingness to provide data (WPD) to a database on a 0-100 slider. The variable is the response to the following question: "How inclined are you to upload your personal data to the database?" Scales: 0% = Not at all likely; 100% = Extremely likely.

Secondary Outcomes

Secondary Outcomes (end points)
(1) Moral obligation to provide data
(2) Relative willingness to provide data, depending on which party manages the database
(3) Relative willingness to provide data, depending on the decision-support system's underlying type of algorithm
(4) Differences in willingness to provide data depending on the public good to be promoted (sustainable health system / sustainable environment)
Secondary Outcomes (explanation)
(1) Participants from each treatment combination indicate their moral obligation to provide data to a database on a 5-point Likert scale. The variable is the mean response to the following two questions:
1: "Do you think that there is a moral obligation for people to upload their personal data to the database?" Scales: 1 = It would be wrong for people to upload their personal data to the database; 3 = People don’t have to upload their personal data to the database, but it would be nice if they did; 5 = People must upload their personal data to the database.
2: "How morally wrong is it if people do not upload their personal data to the database?" Scales: 1 = Perfectly fine; 3 = Neither fine, nor wrong; 5 = Deeply wrong.

(2) + (3) Participants indicate their relative willingness to provide data (rWPD) to each database on 0-100 sliders that must sum to 100 points in total. The variable is the response to the following question: "How inclined are you to upload your personal data to each database?" Scales: 0% = Not at all likely; 100% = Extremely likely. WPD and rWPD rely on a similar question item but differ in their answer formats: for WPD, participants respond using a single slider, whereas for rWPD, participants allocate points across multiple interdependent sliders whose sum must equal 100. Hence, a participant's stated willingness to provide data to one database is, by construction, negatively related to their stated willingness to provide data to the alternative database.

(4) In both domains, sustainable health system and sustainable environment, participants indicate their willingness to provide data to a database on a 0-100 slider. The variable is the response to the following question: "How inclined are you to upload your personal data to the database?" Scales: 0% = Not at all likely; 100% = Extremely likely.

Experimental Design

Experimental Design
We plan an online study with experimental manipulations that rely on between- and within-subject designs. The experimental treatments are followed by an online survey to control for confounding variables and characteristics. The experiment follows a 3 x 3 design: participants are randomly assigned to varying levels of risk that their data will be leaked and of the impact the smart assistant has on a public good. The experiment considers two different domains, which use the identical 3 x 3 design but vary in the public good promoted by the smart assistant.
Experimental Design Details
The experiment considers three dependent variables: individuals' willingness to provide data (WPD), their perceived moral obligation to provide data (MO), and their relative willingness to provide data (rWPD) to different managing parties and different types of algorithms.

To identify the effect of the risk and impact treatments on WPD, we use ANOVAs to examine whether we can reject H0 (no significant differences in WPD means across the three treatment groups; treatment level: low, middle, high) per treatment dimension (risk / impact). If H0 is rejected, we test our alternative hypotheses (direction of mean differences) using Tukey's method.
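
The registration does not prescribe analysis software; a minimal sketch of this ANOVA and Tukey follow-up, assuming the cleaned data form a pandas DataFrame with hypothetical columns "wpd" (0-100 slider response) and "risk" (treatment level), with placeholder data for illustration only:

    import numpy as np
    import pandas as pd
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Placeholder data: one row per participant (105 per risk level, as planned).
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "risk": np.repeat(["low", "middle", "high"], 105),
        "wpd": rng.uniform(0, 100, 315),
    })

    # One-way ANOVA; H0: equal WPD means across the three risk levels.
    groups = [g["wpd"].to_numpy() for _, g in df.groupby("risk")]
    f_stat, p_val = f_oneway(*groups)
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

    # If H0 is rejected, Tukey's HSD tests the direction of mean differences.
    if p_val < 0.05:
        print(pairwise_tukeyhsd(endog=df["wpd"], groups=df["risk"], alpha=0.05))

The same test applies to the impact treatment by swapping the grouping column.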

To identify whether moral obligation mediates the effects of the risk and impact treatments on WPD, we perform a mediation analysis. We investigate to what extent the effects of the explanatory variables on WPD pass through MO in our baseline specification and whether all conditions for mediation are met.
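
The registration does not specify the mediation estimator; a minimal Baron-Kenny-style sketch with OLS, using hypothetical variable names and placeholder data:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Placeholder data: treatment intensity coded 0/1/2 (low/middle/high),
    # the proposed mediator MO, and the outcome WPD.
    rng = np.random.default_rng(1)
    n = 315
    treat = rng.integers(0, 3, n)
    mo = 2.5 + 0.3 * treat + rng.normal(0, 0.5, n)
    wpd = 40 + 5 * treat + 8 * mo + rng.normal(0, 10, n)
    df = pd.DataFrame({"treat": treat, "mo": mo, "wpd": wpd})

    # Classic three-step decomposition: total effect (c), treatment -> mediator
    # (a), and direct effect controlling for the mediator (c' and b).
    total = smf.ols("wpd ~ treat", data=df).fit()
    a_path = smf.ols("mo ~ treat", data=df).fit()
    direct = smf.ols("wpd ~ treat + mo", data=df).fit()

    indirect = a_path.params["treat"] * direct.params["mo"]  # a * b
    print(f"total c = {total.params['treat']:.2f}, "
          f"direct c' = {direct.params['treat']:.2f}, "
          f"indirect a*b = {indirect:.2f}")

In practice the indirect effect would typically be bootstrapped; the sketch only illustrates the decomposition.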

To identify differences in the WPD for a database operated by academia, private industry, or the government, we use an ANOVA to examine whether we can reject H0 (no significant differences in the mean willingness to provide data to a database managed by academia, private industry, or the government). If H0 is rejected, we test our alternative hypotheses (direction of mean differences) using Tukey's method. We further perform follow-up regression analyses to test how the risk and impact treatments affect rWPD means per managing party, considering various control variables; see the sketch after the next paragraph.

To identify differences in the WPD for a database that enables a smart assistant based on a self-learning or a human-supervised algorithm, we use a t-test to examine whether we can reject H0 (no significant differences in the mean willingness to provide data to each such database). We further perform follow-up regression analyses to test how the risk and impact treatments affect rWPD means per type of algorithm, considering various control variables.
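
Because the two rWPD shares sum to 100 by construction, comparing their means is equivalent to a one-sample t-test of one share against 50; a minimal sketch under that reading, with an illustrative follow-up regression (hypothetical column names, placeholder data, and an example control variable):

    import numpy as np
    import pandas as pd
    from scipy.stats import ttest_1samp
    import statsmodels.formula.api as smf

    # Placeholder data: the rWPD share (0-100) allocated to the self-learning
    # database; the human-supervised share is 100 minus this value.
    rng = np.random.default_rng(2)
    n = 477  # 9 risk x impact cells x 53 planned participants (one domain)
    df = pd.DataFrame({
        "rwpd_self_learning": rng.uniform(0, 100, n),
        "risk": rng.integers(0, 3, n),    # 0 = low, 1 = middle, 2 = high
        "impact": rng.integers(0, 3, n),
        "age": rng.integers(18, 70, n),   # example control variable
    })

    # H0: equal mean willingness for both databases, i.e., mean share = 50.
    t_stat, p_val = ttest_1samp(df["rwpd_self_learning"], 50)
    print(f"t = {t_stat:.2f}, p = {p_val:.4f}")

    # Follow-up regression: how the risk and impact treatments shift the
    # allocation, with the example control variable included.
    model = smf.ols("rwpd_self_learning ~ C(risk) + C(impact) + age", data=df).fit()
    print(model.summary())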

Our hypotheses will be tested with U.S. citizens. For the experimental sample, all participants will be recruited from the crowdworking platform Amazon Mechanical Turk (MTurk). The online study is aimed exclusively at workers who are U.S. citizens and over 18 years old. These characteristics are ensured by a notification in the survey description on MTurk and by additional screening questions during the study.
Randomization Method
Participants are randomly assigned to the treatments by a designated function in the survey software Unipark. Randomization is carried out with a random-number trigger: each treatment group is assigned a number, and the trigger draws a random number for each participant, who is then assigned to the group with the matching number.
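
As an illustration of the described logic only (Unipark uses its own implementation), a minimal sketch:

    import numpy as np

    # Illustration of the described random-number trigger: each of the nine
    # risk x impact combinations gets a number, and each participant draws
    # one of these numbers uniformly at random.
    rng = np.random.default_rng()
    levels = ["low", "middle", "high"]
    combinations = [(r, i) for r in levels for i in levels]

    def assign_treatment():
        """Return the (risk, impact) combination for one participant."""
        return combinations[rng.integers(0, len(combinations))]

    print(assign_treatment())
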
Randomization Unit
Randomization will be done on the participant level.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
No clusters.
Sample size: planned number of observations
Details: The sample size has been calculated using G*Power (http://gpower.hhu.de) and considers the statistical test (ANOVA), type of analysis (a priori), alpha level (0.05), failed attention-check rates (10%), and drop-out rates (35%). In step one, participants are randomly assigned to one of two domains (sustainable health system / sustainable environment). In step two, in line with the 3 x 3 design, participants are randomly assigned to one of nine treatment combinations per domain: risk (low, middle, high) x impact (low, middle, high). According to G*Power, 105 individuals need to participate in each treatment combination. In step three, participants in each treatment combination are randomly assigned to answer a question about either the managing parties or the type of algorithm, which results in a total of 36 treatment combinations. According to the G*Power test, each group requires an average sample size of 52.3 participants. Hence, the required overall sample size is at least (36 x 52.3 =) 1,883 participants. Although participants are natural persons, per-group numbers have not been rounded before extrapolating to the required total sample size, as rounding first would introduce large deviations.
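
The per-group requirement can be approximated outside G*Power as well; a sketch with statsmodels, under the assumption that the registered medium effect (d = 0.5, Cohen's medium for two-group comparisons) maps to Cohen's f = 0.25 in a three-group one-way ANOVA:

    from statsmodels.stats.power import FTestAnovaPower

    # A priori power analysis for a one-way ANOVA with three groups.
    # Assumption: the "medium" effect from Cohen (1992) enters the F-test
    # as f = 0.25 (d = 0.5 is the medium effect for two-group comparisons).
    n_total = FTestAnovaPower().solve_power(
        effect_size=0.25, alpha=0.05, power=0.80, k_groups=3
    )
    print(f"total N = {n_total:.1f}, per group = {n_total / 3:.1f}")
    # Prints roughly 52-53 per group, in line with the 52.3 reported above.
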
Sample size (or number of clusters) by treatment arms
The respective sample sizes by treatment arm are reported for one of the two domains we consider in the experiment (sustainable health system / sustainable environment). The sample sizes for the other domain are identical.

Full sample, per domain: each of the nine Risk (low / middle / high) x Impact (low / middle / high) combinations has n = 105.

Managing Parties (MP) subsample, per domain: each of the nine Risk x Impact combinations has n = 53.

Type of Algorithm (TA) subsample, per domain: each of the nine Risk x Impact combinations has n = 53.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Estimated effect size (based on Cohen, 1992): medium, d = 0.5; power: 0.80
Supporting Documents and Materials

There is information in this trial unavailable to the public.
IRB

Institutional Review Boards (IRBs)

IRB Name
Ethics Committee of the University of Bremen, Germany
IRB Approval Date
2020-06-11
IRB Approval Number
2020-05

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials