Motivating metadata contributions for data re-use and reproducibility

Last registered on August 03, 2020

Pre-Trial

Trial Information

General Information

Title
Motivating metadata contributions for data re-use and reproducibility
RCT ID
AEARCTR-0006159
Initial registration date
August 02, 2020

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
August 03, 2020, 12:30 PM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

Region

Primary Investigator

Affiliation
University of Michigan

Other Primary Investigator(s)

PI Affiliation
University of Michigan
PI Affiliation
Cornell University
PI Affiliation
University of Michigan

Additional Trial Information

Status
In development
Start date
2020-08-03
End date
2020-09-30
Secondary IDs
Abstract
Metadata is data about data. It was introduced for archival purposes but now plays a more active role: machine-readable metadata is crucial for making documented studies "findable" by modern search engines. We investigate different motivational treatments for soliciting metadata contributions. In our experiment, we approach authors who have published papers with online data sets and ask them to provide study-level metadata for those studies. Our investigation will also shed light on crowdsourcing as an approach to supplying high-quality metadata at scale.
External Link(s)

Registration Citation

Citation
Chen, Yan et al. 2020. "Motivating metadata contributions for data re-use and reproducibility." AEA RCT Registry. August 03. https://doi.org/10.1257/rct.6159-1.0
Sponsors & Partners

Sponsors

Experimental Details

Interventions

Intervention(s)
Participants will receive personalized emails generated using different templates corresponding to different experimental conditions, all of which encourage them to contribute metadata to their published studies.
Intervention Start Date
2020-08-03
Intervention End Date
2020-09-30

Primary Outcomes

Primary Outcomes (end points)
Individual's willingness to participate;
Article-level metadata contribution.
Primary Outcomes (explanation)
We measure outcomes based on the quantity and quality of the metadata content. Subjects fill in the metadata fields for their assigned article (their own paper) through a Supplemental Metadata interface, reached through a unique link embedded in the treatment email.

Willingness to participate is recorded at the individual level. In the body of the email, subjects choose either to participate or to opt out completely. We count those who choose to proceed to our survey interface as willing to participate.

The other primary outcome is captured at the article level, as a count of the metadata fields populated by the subjects who are authors of the article.
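
The article-level count described above can be sketched as follows. This is a minimal illustration, not the study's actual scoring code: it assumes a field counts once if at least one author of the article supplied a non-empty value, and the field names in the usage example are hypothetical.

```python
def article_field_count(submissions):
    """Count distinct metadata fields populated for one article.

    submissions: list of dicts, one per responding author, mapping a
    field name to the submitted value (assumed data layout).
    Assumption: a field counts once if ANY author filled it in.
    """
    populated = set()
    for submission in submissions:
        for field, value in submission.items():
            # Treat None and whitespace-only strings as not populated.
            if value is not None and str(value).strip():
                populated.add(field)
    return len(populated)


# Two coauthors of the same article; field names are hypothetical.
example = [
    {"subject_terms": "labor economics", "geographic_coverage": ""},
    {"geographic_coverage": "United States", "time_period": None},
]
example_count = article_field_count(example)
```

Under a stricter reading, where each author's fields are counted separately, the set would be replaced by a running total.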

Secondary Outcomes

Secondary Outcomes (end points)
Willingness to directly update metadata;
Self-reported motivation for contributing metadata.
Secondary Outcomes (explanation)
The metadata for articles in this experiment is stored on a centralized platform, to which none of the subjects currently has editing access. Through the survey interface, subjects can request direct access and supply more detailed metadata along with updated datasets.

We ask subjects to self-report their motivation through a checklist after they have finished editing the metadata fields. More details are documented in the Analysis Plan document.

Experimental Design

Experimental Design
Subjects will receive a personalized email inviting them to provide metadata for one of their published studies. We use one control message and two variants of the treatment message. In the control (baseline) template, we describe the task and invite subjects to provide metadata in a customized survey interface. In treatment 1 and treatment 2, we insert one extra paragraph into the baseline template that explains how metadata affects findability. Treatment 1 and treatment 2 differ in how the new paragraph concludes.
Experimental Design Details
Starting in October 2019, AEA migrated the entire back archive of 3,073 data and
code supplements into the AEA Data and Code Repository hosted at openICPSR.
Currently, the metadata fields for the migrated deposits are sparse, making the
migrated studies hard to find.

Subjects in our experiment are organized in a coauthor network, in which we
allow multiple edges between two individuals if they have coauthored multiple
papers together. In total, there are 3,070 articles and 4,321 unique
authors. The giant component of this network has 2,004 subjects. To mitigate
spillovers through the coauthor network, we include only a subset of the
articles in this experiment, chosen so that no two selected articles share
an author. That is, we choose to include a set of independent articles.

We have one control condition and two treatment conditions. In the control
condition, we describe the task and invite subjects to provide metadata through
a customized survey interface. In treatment 1 and treatment 2, we insert one
extra paragraph into the email template used in the control condition, which
explains how metadata affects findability. Treatment 1 and treatment 2
differ in how the new paragraph concludes: in treatment 1, the individual
benefit of future citations is made salient; in treatment 2, potential benefits
to graduate students and other researchers are made salient.
Randomization Method
Randomization was done in office by a computer.

Relying on the network of coauthors, we extract a set of independent articles such that no two articles included in the experiment share an author. In practice, we remove "bridging" edges from the original network to generate a reduced network with fewer edges and more components. We then randomly pick one article from each component of the reduced network and proceed to the randomization step.
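
The selection step above can be illustrated with a short sketch. It is not the authors' implementation: the registration does not say how "bridging" edges are identified, so the sketch takes the set of bridging articles as a given, hypothetical input (`bridging`), groups authors into components with a union-find structure, and draws one article per component. The function name, data layout, and seed are likewise assumptions.

```python
import random
from collections import defaultdict


def pick_independent_articles(articles, bridging, seed=0):
    """Pick one article per component of the reduced coauthor network.

    articles: dict mapping article id -> non-empty list of author ids.
    bridging: set of article ids whose coauthor edges are dropped
              (hypothetical input; the selection rule is not specified).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Reduced network: only non-bridging articles contribute edges.
    kept = {a: au for a, au in articles.items() if a not in bridging}
    for authors in kept.values():
        find(authors[0])  # register sole authors as their own component
        for other in authors[1:]:
            union(authors[0], other)

    # Group the remaining articles by the component of their authors.
    by_component = defaultdict(list)
    for article_id, authors in kept.items():
        by_component[find(authors[0])].append(article_id)

    rng = random.Random(seed)
    return sorted(rng.choice(sorted(ids)) for ids in by_component.values())


toy = {"A1": ["ann", "bob"], "A2": ["bob", "cat"],
       "A3": ["cat", "dan"], "A4": ["eve"]}
picked = pick_independent_articles(toy, bridging={"A2"})
```

In the toy network, dropping article A2 splits the chain ann, bob, cat, dan into two components, so three articles with pairwise disjoint author lists are selected instead of two.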

As the articles are chosen from components in the reduced network, we block by the "origin" of the components as well as by the year of publication and the number of authors of each article. Within each block, we shuffle the list of articles, and we then stack the blocks on top of one another to form a master list. Lastly, we divide each article's index in the master list by 3 and assign the experimental condition based on the remainder of the division.
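
A minimal sketch of the block-shuffle-stack procedure just described, under stated assumptions: the record keys (`id`, `origin`, `year`, `n_authors`), the ordering of blocks, the seed, and the mapping from remainder to condition are all hypothetical, since the registration does not specify them.

```python
import random
from collections import defaultdict

# Assumed mapping from remainder (0, 1, 2) to experimental condition.
CONDITIONS = ("control", "treatment_1", "treatment_2")


def assign_conditions(articles, seed=0):
    """Assign each article to a condition by its index modulo 3.

    articles: list of dicts with keys "id", "origin", "year", "n_authors".
    """
    rng = random.Random(seed)

    # Block by component origin, publication year, and number of authors.
    blocks = defaultdict(list)
    for art in articles:
        key = (art["origin"], art["year"], art["n_authors"])
        blocks[key].append(art["id"])

    # Shuffle within each block, then stack the blocks into a master list.
    master = []
    for key in sorted(blocks):
        ids = blocks[key]
        rng.shuffle(ids)
        master.extend(ids)

    # The remainder of index / 3 determines the condition.
    return {aid: CONDITIONS[i % 3] for i, aid in enumerate(master)}


# Synthetic input sized like the planned sample of 1,458 articles.
synthetic = [
    {"id": i, "origin": "giant" if i % 2 else "other",
     "year": 2000 + i % 19, "n_authors": 1 + i % 4}
    for i in range(1458)
]
assignment = assign_conditions(synthetic)
```

Because conditions cycle along the stacked list, arm sizes across the whole sample differ by at most one, and consecutive articles within a block land in different arms. All authors of an article inherit the article's condition, which matches the clustering at the article level.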
Randomization Unit
We cluster treatment at the article level, where all authors of the same article are assigned to the same experimental condition.
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
1,458 articles
Sample size: planned number of observations
3,044 participants
Sample size (or number of clusters) by treatment arms
Each of the three experimental arms (one control, two treatment) has 486 clusters (articles).
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
10 percentage points
Supporting Documents and Materials

There is information in this trial unavailable to the public.
IRB

Institutional Review Boards (IRBs)

IRB Name
Health Sciences and Behavioral Sciences Institutional Review Board (IRB-HSBS)
IRB Approval Date
2020-06-01
IRB Approval Number
HUM00178056
Analysis Plan

There is information in this trial unavailable to the public.

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials