Motivating metadata contributions for data re-use and reproducibility

Last registered on May 16, 2025

Pre-Trial

Trial Information

General Information

Title

Motivating metadata contributions for data re-use and reproducibility

RCT ID

AEARCTR-0006159

Initial registration date

August 02, 2020

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published

August 03, 2020, 12:30 PM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Last updated

May 16, 2025, 11:03 AM EDT

Last updated is the most recent time when changes to the trial's registration were published.

Locations

Country

United States of America

Region

Primary Investigator

Name

Yan Chen

Affiliation

University of Michigan

Other Primary Investigator(s)

PI Name

Linfeng Li

PI Affiliation

University of Michigan

PI Name

Lars Vilhuber

PI Affiliation

Cornell University

PI Name

Margaret Levenstein

PI Affiliation

University of Michigan

Additional Trial Information

Status

Completed

Start date

2020-08-03

End date

2020-09-30

Keywords

Other

Additional Keywords

metadata, digital public goods

JEL code(s)

H49

Secondary IDs

Prior work

This trial does not extend or rely on any prior RCTs.

Abstract

Metadata is data about data. It was introduced for archival purposes but now serves a more active role: machine-readable metadata is crucial for making documented studies "findable" by modern search engines. We investigate different motivational treatments for soliciting metadata contributions. In our experiment, we approach authors who have published papers with online data sets and ask them to provide study-level metadata for those studies. Our investigation will also shed light on crowdsourcing as an approach to supplying high-quality metadata at scale.

External Link(s)

Registration Citation

Citation

Chen, Yan et al. 2025. "Motivating metadata contributions for data re-use and reproducibility." AEA RCT Registry. May 16. https://doi.org/10.1257/rct.6159-1.1

Sponsors & Partners

Interventions

Intervention(s)

Participants will receive personalized emails generated from different templates, one per experimental condition, all of which encourage them to contribute metadata for their published studies.

Intervention (Hidden)

Subjects will receive personalized emails generated from three templates corresponding to the three experimental conditions. In the baseline template, we describe the task and invite subjects to provide metadata in a customized survey interface. In treatment 1 and treatment 2, we insert one extra paragraph into the baseline template that explains the findability implications of metadata.
Treatment 1 and treatment 2 differ in how the new paragraph concludes: in treatment 1, the individual benefit of future citations is made salient; in treatment 2, potential benefits to graduate students and other researchers are made salient.

Intervention Start Date

2020-08-03

Intervention End Date

2020-09-30

Primary Outcomes

Primary Outcomes (end points)

Individual's willingness to participate;
Article-level metadata contribution.

Primary Outcomes (explanation)

We measure outcomes based on the quantity and quality of the metadata content. Subjects fill in the metadata fields for their assigned article (their own paper) through a Supplemental Metadata interface, distributed through a unique link embedded in the treatment email.

Willingness to participate is recorded on an individual basis. In the body of the email, subjects choose either to participate or to opt out completely. We count those who proceed to our survey interface as willing to participate.

The other primary outcome is captured at the article level, as a count of the metadata fields populated by all subjects who are authors of the article.
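The article-level count can be illustrated with a short sketch. It assumes, as one possible reading of the outcome definition, that a field counts once if any author of the article supplies a non-empty value; the field names and the response structure are hypothetical and do not come from the registration.

```python
def article_field_count(responses):
    """Count distinct metadata fields populated by at least one author.

    `responses` is a list with one dict per responding author, mapping a
    (hypothetical) field name to the submitted text. Treating the outcome
    as a union across authors is an assumption of this sketch.
    """
    populated = set()
    for response in responses:
        for field, value in response.items():
            # Ignore missing values and whitespace-only entries.
            if value is not None and str(value).strip():
                populated.add(field)
    return len(populated)


# Two coauthors of the same article; field names are illustrative only.
example = [
    {"geographic_coverage": "USA", "time_period": ""},
    {"time_period": "1990-2000", "unit_of_observation": None},
]
count = article_field_count(example)
```

Under this reading, a field filled in by two coauthors is counted once, so the outcome cannot exceed the number of fields in the interface.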

Secondary Outcomes

Secondary Outcomes (end points)

willingness to directly update metadata; self-reported motivation for contributing to metadata

Secondary Outcomes (explanation)

The metadata for articles in this experiment is stored on a centralized platform to which none of the subjects currently has editing access. Through the survey interface, subjects can request direct access and supply more detailed metadata along with updated datasets.

We ask subjects to self-report their motivation through a checklist after they have finished editing the metadata fields. More details are documented in the Analysis Plan document.

Experimental Design

Subjects will receive a personalized email inviting them to provide metadata for one of their published studies. We introduce one control message and two treatment messages. In the baseline (control) template, we describe the task and invite subjects to provide metadata in a customized survey interface. In treatment 1 and treatment 2, we insert one extra paragraph into the baseline template that explains the findability implications of metadata. Treatment 1 and treatment 2 differ in how the new paragraph concludes.

Experimental Design Details

Starting in October 2019, the AEA migrated its entire back archive of 3,073 data and code supplements into the AEA Data and Code Repository hosted at openICPSR. Currently, the metadata fields for the migrated deposits are sparse, making the migrated studies hard to find.

Subjects in our experiment are organized through a coauthor network, in which we allow multiple edges between two individuals if they have coauthored multiple papers together. In total, there are 3,070 articles and 4,321 unique authors. The giant component of this network has 2,004 subjects. To mitigate spillovers through the coauthor network, we include only a subset of the articles in the experiment, chosen so that no two selected articles have an author in common. That is, we include a set of independent articles.

We have one control condition and two treatment conditions. In the control condition, we describe the task and invite subjects to provide metadata through a customized survey interface. In treatment 1 and treatment 2, we insert one extra paragraph into the email template used in the control condition, which explains the findability implications of metadata. Treatment 1 and treatment 2 differ in how the new paragraph concludes: in treatment 1, the individual benefit of future citations is made salient; in treatment 2, potential benefits to graduate students and other researchers are made salient.

Randomization Method

Randomization was done in office by a computer.

Relying on the network of coauthors, we extract a set of independent articles such that no two articles included in the experiment have an author in common. In practice, we remove "bridging" edges from the original network to generate a reduced network with fewer edges. We then randomly pick one article from each component of the reduced network and proceed to the randomization step.
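The idea of picking one article per component can be sketched as follows. This is a simplified illustration rather than the procedure actually used: it works on the original coauthor network and omits the removal of "bridging" edges, which the registration does not specify in detail. The function name, the input structure, and the seed are assumptions.

```python
import random
from collections import defaultdict


def pick_independent_articles(article_authors, seed=0):
    """Pick one article from each connected component of the coauthor network.

    `article_authors` maps an article id to a non-empty list of author names.
    Authors who share an article end up in the same component, so articles
    drawn from different components cannot share an author.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Link all coauthors of each article.
    for authors in article_authors.values():
        find(authors[0])
        for other in authors[1:]:
            union(authors[0], other)

    # Group articles by the component of their first author.
    components = defaultdict(list)
    for article, authors in article_authors.items():
        components[find(authors[0])].append(article)

    rng = random.Random(seed)
    return [rng.choice(sorted(arts)) for _, arts in sorted(components.items())]


# Toy network: A1 and A2 share an author, A3 and A4 are isolated.
toy = {
    "A1": ["ann", "bob"],
    "A2": ["bob", "cat"],
    "A3": ["dan"],
    "A4": ["eve", "fay"],
}
picked = pick_independent_articles(toy)
```

Without the edge-removal step, each component contributes only one article, so a large giant component would yield a single article; removing bridging edges, as the text describes, splits the network into more components and so allows more articles to be drawn.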

Because the articles are chosen from components of the reduced network, we block by the "origin" of the component as well as by the article's year of publication and number of authors. Within each block, we shuffle the list of articles, and we then stack the shuffled blocks on top of one another to form a master list. Lastly, we divide each article's index in the master list by 3 and assign the experimental condition based on the remainder.
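The blocking, stacking, and index-based assignment described above can be sketched as follows. This is a minimal illustration, not the authors' script: the field names (`id`, `origin`, `year`, `n_authors`), the seed, the order in which blocks are stacked, and the mapping of the three possible values to control, treatment 1, and treatment 2 are all assumptions.

```python
import random
from collections import defaultdict

# Assumed mapping from index modulo 3 to experimental condition.
CONDITIONS = ["control", "treatment_1", "treatment_2"]


def assign_conditions(articles, seed=0):
    """Assign each article to a condition by position in a stacked master list.

    `articles` is a list of dicts with the (hypothetical) keys
    `id`, `origin`, `year`, and `n_authors`.
    """
    rng = random.Random(seed)

    # Group article ids into blocks defined by the three blocking variables.
    blocks = defaultdict(list)
    for art in articles:
        key = (art["origin"], art["year"], art["n_authors"])
        blocks[key].append(art["id"])

    # Shuffle within each block, then stack the blocks into one master list.
    master = []
    for key in sorted(blocks):
        ids = blocks[key]
        rng.shuffle(ids)
        master.extend(ids)

    # The position in the master list, modulo 3, determines the condition.
    return {aid: CONDITIONS[i % 3] for i, aid in enumerate(master)}


# Synthetic articles for illustration only.
synthetic = [
    {"id": i, "origin": "giant" if i % 2 else "small",
     "year": 2000 + i % 5, "n_authors": 1 + i % 3}
    for i in range(1458)
]
assignment = assign_conditions(synthetic)
```

Because consecutive positions cycle through the three conditions, arm sizes differ by at most one overall, and the arms stay nearly balanced within each block. With 1,458 articles, each arm receives exactly 486.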

Randomization Unit

We cluster treatment at the article level, where all authors of the same article are assigned to the same experimental condition.

Was the treatment clustered?

Yes

Experiment Characteristics

Sample size: planned number of clusters

1,458 articles

Sample size: planned number of observations

3,044 participants

Sample size (or number of clusters) by treatment arms

Each treatment arm has 486 clusters (articles).

Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

10 percentage points
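For readers who want to see how clustering enters a minimum detectable effect, the sketch below uses a common textbook approximation for comparing two proportions, inflated by the design effect 1 + (m - 1) × ICC. It is not the calculation from the pre-analysis plan: the baseline rate, intracluster correlation, significance level, and power are all assumptions to be supplied by the reader.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_proportions(n_per_arm, cluster_size, baseline, icc,
                        alpha=0.05, power=0.80):
    """Approximate minimum detectable difference between two proportions.

    Uses MDE = (z_{1-alpha/2} + z_{power}) * sqrt(2 p (1 - p) * DEFF / n),
    where DEFF = 1 + (cluster_size - 1) * icc is the design effect.
    All inputs other than the sample sizes are assumptions of this sketch.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    design_effect = 1 + (cluster_size - 1) * icc
    variance = 2 * baseline * (1 - baseline) * design_effect / n_per_arm
    return z * sqrt(variance)


# Planned sizes from the registration: 3,044 participants in 1,458 articles,
# split evenly over three arms. Baseline rate and ICC are placeholders.
n_per_arm = 3044 / 3
avg_cluster_size = 3044 / 1458
mde_assumed = mde_two_proportions(n_per_arm, avg_cluster_size,
                                  baseline=0.5, icc=0.5)
```

A baseline of 0.5 gives the most conservative (largest) binomial variance, and a higher intracluster correlation raises the detectable effect, since coauthors of the same article then contribute less independent information.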

Supporting Documents and Materials

Documents

Document Name

Experiment Interface Screenshots

Document Type

other

Document Description

This document captures the screenshots of the experiment interface.

File

Experiment Interface Screenshots

MD5: 68d0d86e90f8f5a51e6151e3965f740e

SHA1: c144a979e3b8d14cec7dd1b01159ef3b379fc6a4

Uploaded At: July 31, 2020

IRB

Institutional Review Boards (IRBs)

IRB Name

Health Sciences and Behavioral Sciences Institutional Review Board (IRB-HSBS)

IRB Approval Date

2020-06-01

IRB Approval Number

HUM00178056

Analysis Plan

Analysis Plan Documents

Experiment Design and Pre-Analysis Plan

MD5: f59fcb7fcbe155add5dd981fb167406b

SHA1: 2d687b86dca5410eded8a9a4ab94127ee9c208fa

Uploaded At: August 02, 2020

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?

Data Collection Complete

Data Publication

Is public data available?

Program Files

Reports, Papers & Other Materials
