The Social Dilemma of Big Data

Last registered on September 14, 2020

Pre-Trial

Trial Information

General Information

Title
The Social Dilemma of Big Data
RCT ID
AEARCTR-0006241
Initial registration date
September 13, 2020

The initial registration date is when the trial was registered, i.e., when the registration was submitted to the Registry to be reviewed for publication.

First published
September 14, 2020, 7:38 AM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

Region

Primary Investigator
Kirsten Hillebrand

Affiliation
University of Bremen

Other Primary Investigator(s)
Lars Hornuf

PI Affiliation
University of Bremen

Additional Trial Information

Status
In development
Start date
2020-09-13
End date
2020-09-25
Secondary IDs
Abstract
Decision-support systems can influence people in various domains of life. Firms have started implementing these systems via chatbots and other natural language-based assistants. While benefiting from these decision-support systems, individuals provide sensitive and valuable data to private industry. By using this data, companies may generate additional profits. Moreover, by making their data available, individuals may also promote the common good. Regulators have recently called for efficient tools to let the public collectively benefit from its own data. This study investigates the conditions under which people voluntarily upload their data to a database that is used to develop and operate a decision-support system promoting a public good.
External Link(s)

Registration Citation

Citation
Hillebrand, Kirsten and Lars Hornuf. 2020. "The Social Dilemma of Big Data." AEA RCT Registry. September 14. https://doi.org/10.1257/rct.6241-1.0
Experimental Details

Interventions

Intervention(s)
Intervention Start Date
2020-09-13
Intervention End Date
2020-09-25

Primary Outcomes

Primary Outcomes (end points)
Individual willingness to upload personal data to a database that is used to develop and operate a decision-support system to promote a public good.
Primary Outcomes (explanation)
Participants indicate their willingness to provide data (WPD) to a database on a 0-100 slider. The variable is the response to the following question: "How inclined are you to upload your personal data to the database?" Scales: 0% = Not at all likely; 100% = Extremely likely.

Secondary Outcomes

Secondary Outcomes (end points)
(1) Moral obligation to provide data
(2) Relative willingness to provide data, depending on which party manages the database
(3) Relative willingness to provide data, depending on the decision-support system's underlying type of algorithm
(4) Differences in willingness to provide data depending on the public good to be promoted (sustainable health system / sustainable environment)
Secondary Outcomes (explanation)
(1) Participants from each treatment combination indicate their moral obligation to provide data to a database on a 5-point Likert scale. The variable is the mean response to the following two questions:
1: "Do you think that there is a moral obligation for people to upload their personal data to the database?" Scales: 1 = It would be wrong for people to upload their personal data to the database; 3 = People don’t have to upload their personal data to the database, but it would be nice if they did; 5 = People must upload their personal data to the database.
2: "How morally wrong is it if people do not upload their personal data to the database?" Scales: 1 = Perfectly fine; 3 = Neither fine, nor wrong; 5 = Deeply wrong.

(2) + (3) Participants indicate their relative willingness to provide data (rWPD) to each database on 0-100 sliders that must sum to 100 points in total. The variable is the response to the following question: "How inclined are you to upload your personal data to each database?" Scales: 0% = Not at all likely; 100% = Extremely likely. WPD and rWPD rely on a similar question item but differ in their answer formats: for WPD, participants respond using a single slider, whereas for rWPD, participants allocate points across multiple interdependent sliders whose sum must equal 100. Hence, a participant's stated willingness to provide data to one database is, by construction, negatively related to their stated willingness to provide data to the alternative database.

(4) In both domains, sustainable health system and sustainable environment, participants indicate their willingness to provide data to a database on a 0-100 slider. The variable is the response to the following question: "How inclined are you to upload your personal data to the database?" Scales: 0% = Not at all likely; 100% = Extremely likely.

Experimental Design

Experimental Design
We plan an online study with experimental manipulations that rely on between- and within-subject designs. The experimental treatments are followed by an online survey to control for confounding variables and characteristics. The experiment follows a 3 x 3 design: participants are randomly assigned to varying levels of risk that their data will be leaked and of the impact the smart assistant has on a public good. The experiment considers two different domains, which use the identical 3 x 3 design but vary in the public good promoted by the smart assistant.
Experimental Design Details
The experiment considers three dependent variables: individuals' willingness to provide data (WPD), their perceived moral obligation to provide data (MO), and their relative willingness to provide data (rWPD) to different managing parties and different types of algorithms.

To identify the effect of the risk and impact treatments on WPD, we use ANOVAs to examine whether we can reject H0 (no significant differences in WPD means across the three treatment groups; treatment level: low, middle, high) per treatment dimension (risk / impact). If H0 is rejected, we test our alternative hypotheses (direction of mean differences) using Tukey's method.
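
The registration does not prescribe analysis software; a minimal sketch of this ANOVA and Tukey follow-up, assuming the cleaned data form a pandas DataFrame with hypothetical columns "wpd" (0-100 slider response) and "risk" (treatment level), with placeholder data for illustration only:

    import numpy as np
    import pandas as pd
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Placeholder data: one row per participant (105 per risk level, as planned).
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "risk": np.repeat(["low", "middle", "high"], 105),
        "wpd": rng.uniform(0, 100, 315),
    })

    # One-way ANOVA; H0: equal WPD means across the three risk levels.
    groups = [g["wpd"].to_numpy() for _, g in df.groupby("risk")]
    f_stat, p_val = f_oneway(*groups)
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}")

    # If H0 is rejected, Tukey's HSD tests the direction of mean differences.
    if p_val < 0.05:
        print(pairwise_tukeyhsd(endog=df["wpd"], groups=df["risk"], alpha=0.05))

The same test applies to the impact treatment by swapping the grouping column.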

To identify whether moral obligation mediates the effects of the risk and impact treatments on WPD, we perform a mediation analysis. We investigate to what extent the effects of the explanatory variables on WPD pass through MO in our baseline specification and whether all conditions for mediation are met.
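
The registration does not specify the mediation estimator; a minimal Baron-Kenny-style sketch with OLS, using hypothetical variable names and placeholder data:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Placeholder data: treatment intensity coded 0/1/2 (low/middle/high),
    # the proposed mediator MO, and the outcome WPD.
    rng = np.random.default_rng(1)
    n = 315
    treat = rng.integers(0, 3, n)
    mo = 2.5 + 0.3 * treat + rng.normal(0, 0.5, n)
    wpd = 40 + 5 * treat + 8 * mo + rng.normal(0, 10, n)
    df = pd.DataFrame({"treat": treat, "mo": mo, "wpd": wpd})

    # Classic three-step decomposition: total effect (c), treatment -> mediator
    # (a), and direct effect controlling for the mediator (c' and b).
    total = smf.ols("wpd ~ treat", data=df).fit()
    a_path = smf.ols("mo ~ treat", data=df).fit()
    direct = smf.ols("wpd ~ treat + mo", data=df).fit()

    indirect = a_path.params["treat"] * direct.params["mo"]  # a * b
    print(f"total c = {total.params['treat']:.2f}, "
          f"direct c' = {direct.params['treat']:.2f}, "
          f"indirect a*b = {indirect:.2f}")

In practice the indirect effect would typically be bootstrapped; the sketch only illustrates the decomposition.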

To identify differences in the WPD for a database operated by academia, private industry, or the government, we use an ANOVA to examine whether we can reject H0 (no significant differences in the mean willingness to provide data to a database managed by academia, private industry, or the government). If H0 is rejected, we test our alternative hypotheses (direction of mean differences) using Tukey's method. We further perform follow-up regression analyses to test how the risk and impact treatments affect rWPD means per managing party, considering various control variables; see the sketch after the next paragraph.

To identify differences in the WPD for a database that enables a smart assistant based on a self-learning or a human-supervised algorithm, we use a t-test to examine whether we can reject H0 (no significant differences in the mean willingness to provide data to each such database). We further perform follow-up regression analyses to test how the risk and impact treatments affect rWPD means per type of algorithm, considering various control variables.
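
Because the two rWPD shares sum to 100 by construction, comparing their means is equivalent to a one-sample t-test of one share against 50; a minimal sketch under that reading, with an illustrative follow-up regression (hypothetical column names, placeholder data, and an example control variable):

    import numpy as np
    import pandas as pd
    from scipy.stats import ttest_1samp
    import statsmodels.formula.api as smf

    # Placeholder data: the rWPD share (0-100) allocated to the self-learning
    # database; the human-supervised share is 100 minus this value.
    rng = np.random.default_rng(2)
    n = 477  # 9 risk x impact cells x 53 planned participants (one domain)
    df = pd.DataFrame({
        "rwpd_self_learning": rng.uniform(0, 100, n),
        "risk": rng.integers(0, 3, n),    # 0 = low, 1 = middle, 2 = high
        "impact": rng.integers(0, 3, n),
        "age": rng.integers(18, 70, n),   # example control variable
    })

    # H0: equal mean willingness for both databases, i.e., mean share = 50.
    t_stat, p_val = ttest_1samp(df["rwpd_self_learning"], 50)
    print(f"t = {t_stat:.2f}, p = {p_val:.4f}")

    # Follow-up regression: how the risk and impact treatments shift the
    # allocation, with the example control variable included.
    model = smf.ols("rwpd_self_learning ~ C(risk) + C(impact) + age", data=df).fit()
    print(model.summary())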

Our hypotheses will be tested with U.S. citizens. For the experimental sample, all participants will be recruited from the crowdworking platform Amazon Mechanical Turk (MTurk). The online study is aimed exclusively at workers who are U.S. citizens and over 18 years old. These characteristics are ensured by a notification in the survey description on MTurk and by additional screening questions during the study.
Randomization Method
Participants are randomly assigned to the treatments by a designated function in the survey software Unipark. Randomization is carried out with a random-number trigger: each treatment group is assigned a number, and the trigger draws a random number for each participant, who is then assigned to the group with the matching number.
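
As an illustration of the described logic only (Unipark uses its own implementation), a minimal sketch:

    import numpy as np

    # Illustration of the described random-number trigger: each of the nine
    # risk x impact combinations gets a number, and each participant draws
    # one of these numbers uniformly at random.
    rng = np.random.default_rng()
    levels = ["low", "middle", "high"]
    combinations = [(r, i) for r in levels for i in levels]

    def assign_treatment():
        """Return the (risk, impact) combination for one participant."""
        return combinations[rng.integers(0, len(combinations))]

    print(assign_treatment())
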
Randomization Unit
Randomization will be done on the participant level.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
No clusters.
Sample size: planned number of observations
Details: The sample size has been calculated using G*Power (http://gpower.hhu.de) and considers the statistical test (ANOVA), type of analysis (a priori), alpha level (0.05), failed attention-check rates (10%), and drop-out rates (35%). In step one, participants are randomly assigned to one of two domains (sustainable health system / sustainable environment). In step two, in line with the 3 x 3 design, participants are randomly assigned to one of nine treatment combinations per domain: risk (low, middle, high) x impact (low, middle, high). According to G*Power, 105 individuals need to participate in each treatment combination. In step three, participants in each treatment combination are randomly assigned to answer a question about either the managing parties or the type of algorithm, which results in a total of 36 treatment combinations. According to the G*Power test, each group requires an average sample size of 52.3 participants. Hence, the required overall sample size is at least (36 x 52.3 =) 1,883 participants. Although participants are natural persons, per-group numbers have not been rounded before extrapolating to the required total sample size, as rounding first would introduce large deviations.
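
The per-group requirement can be approximated outside G*Power as well; a sketch with statsmodels, under the assumption that the registered medium effect (d = 0.5, Cohen's medium for two-group comparisons) maps to Cohen's f = 0.25 in a three-group one-way ANOVA:

    from statsmodels.stats.power import FTestAnovaPower

    # A priori power analysis for a one-way ANOVA with three groups.
    # Assumption: the "medium" effect from Cohen (1992) enters the F-test
    # as f = 0.25 (d = 0.5 is the medium effect for two-group comparisons).
    n_total = FTestAnovaPower().solve_power(
        effect_size=0.25, alpha=0.05, power=0.80, k_groups=3
    )
    print(f"total N = {n_total:.1f}, per group = {n_total / 3:.1f}")
    # Prints roughly 52-53 per group, in line with the 52.3 reported above.
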
Sample size (or number of clusters) by treatment arms
The respective sample sizes by treatment arm are reported for one of the two domains we consider in the experiment (sustainable health system / sustainable environment). The sample sizes for the other domain are identical.

Full sample, per domain: each of the nine Risk (low / middle / high) x Impact (low / middle / high) combinations has n = 105.

Managing Parties (MP) subsample, per domain: each of the nine Risk x Impact combinations has n = 53.

Type of Algorithm (TA) subsample, per domain: each of the nine Risk x Impact combinations has n = 53.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Estimated effect size (based on Cohen, 1992): medium, d = 0.5; power: 0.80
Supporting Documents and Materials

There is information in this trial unavailable to the public.
IRB

Institutional Review Boards (IRBs)

IRB Name
Ethics Committee of the University of Bremen, Germany
IRB Approval Date
2020-06-11
IRB Approval Number
2020-05

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials