Converging "Truths"? Digital Knowledge Platforms, Language Skills and Informational Barriers.

Last registered on February 21, 2025

Pre-Trial

Trial Information

General Information

Title
Converging "Truths"? Digital Knowledge Platforms, Language Skills and Informational Barriers.
RCT ID
AEARCTR-0015060
Initial registration date
December 19, 2024

First published
February 21, 2025, 6:44 AM EST

Locations

Primary Investigator

Affiliation
Universidade Nova de Lisboa, SBE

Other Primary Investigator(s)

PI Affiliation
Boston College

Additional Trial Information

Status
Withdrawn
Start date
2024-02-01
End date
2027-08-01
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract

Recent advances in generative AI and LLMs have made it possible to compare large text corpora across languages. In this paper, we leverage these tools to study language barriers and polarization on online knowledge platforms. Specifically, we study over 160,000 Wikipedia articles on 104 historic conflicts across 15 different languages. We analyze how (dis)similarities across language versions evolve over time by comparing their translations and edit histories, using the translation capabilities of LLMs and generative AI to measure article similarity across languages and over time. First results confirm an (expected) home bias: specific wars or battles are covered more extensively in the languages of involved parties than in those of uninvolved parties. Moreover, we see significant dissimilarity in content, particularly between the winning and the losing parties. This well-documented historical bias diminishes as languages and information become more interlinked. Further preliminary results on the text similarity of translated and embedded articles suggest that “winners write history”: articles in uninvolved languages are more similar to those of conflict winners than to those of conflict losers. Ongoing work highlights language barriers to the dissemination of “truth” and how translation can reduce (historical) misinformation and polarization on digital knowledge platforms.
Keywords: Open Knowledge Platforms, AI Translation, LLMs & NLP, Flow of Information, Translation Skills
External Link(s)

Registration Citation

Citation
Kummer, Michael and Sebastian Steffen. 2025. "Converging "Truths"? Digital Knowledge Platforms, Language Skills and Informational Barriers." AEA RCT Registry. February 21. https://doi.org/10.1257/rct.15060-1.0
Experimental Details

Interventions

Intervention(s)
We will measure cosine distances between Wikipedia articles' embeddings, both across articles in different languages on the same topic and among different versions of the same article in the same language.

Intervention Start Date
2026-04-01
Intervention End Date
2027-05-01

Primary Outcomes

Primary Outcomes (end points)
Cosine Similarity Differential between treated language versions of the same article.
Primary Outcomes (explanation)
We will derive multilingual LLM embedding vectors from the raw text of selected Wikipedia articles and compute their cosine distance.
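
For concreteness, here is a minimal sketch of this computation, assuming the sentence-transformers library with an off-the-shelf multilingual model; the model name and article texts are illustrative placeholders, not the project's actual pipeline.

```python
# Minimal sketch: multilingual embeddings and cosine distance for two language
# versions of the same article. Model name and texts are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical raw text of the same article in two language versions.
text_en = "The battle began in the spring of 1645 ..."
text_de = "Die Schlacht begann im Fruehjahr 1645 ..."

emb_en, emb_de = model.encode([text_en, text_de])
similarity = cosine_similarity(emb_en, emb_de)
distance = 1.0 - similarity  # cosine distance as the divergence measure
```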

Secondary Outcomes

Secondary Outcomes (end points)
Editing (activity) and convergence between articles (measured as growth in cosine similarity)
Secondary Outcomes (explanation)
Activity is a classic outcome in the area of user-generated content. Speed of convergence, rather than distance, is an important higher-order measure of reductions in content divergence, and hence a useful novel platform outcome.
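
A sketch of how convergence speed could be measured, assuming a time series of pairwise cosine similarities computed from successive article revisions; the dates and values below are made up for illustration.

```python
# Convergence as growth in cosine similarity over time: fit a linear trend to
# a (placeholder) series of pairwise similarities across revision snapshots.
import numpy as np

snapshot_years = np.array([2022.0, 2022.5, 2023.0, 2023.5, 2024.0])
pairwise_similarity = np.array([0.71, 0.73, 0.74, 0.78, 0.80])

slope, intercept = np.polyfit(snapshot_years, pairwise_similarity, deg=1)
total_growth = pairwise_similarity[-1] - pairwise_similarity[0]
print(f"Convergence speed: {slope:.3f} similarity points per year")
```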

Experimental Design

Experimental Design
We will identify suitable candidate articles, characterized by a high cosine distance between the focal article and the average article, and then randomly treat a suitable fraction of these articles with a high-visibility banner that flags their uniqueness.
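
A sketch of how such candidates could be flagged, assuming an embedding matrix with one row per article; the data and the 90th-percentile cutoff are placeholder assumptions.

```python
# Flag "unique" articles: those whose embedding is far (in cosine distance)
# from the average article embedding. Data and cutoff are placeholders.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384))      # placeholder: one row per article
mean_embedding = embeddings.mean(axis=0)

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

distances = np.array([cosine_distance(e, mean_embedding) for e in embeddings])
cutoff = np.quantile(distances, 0.9)           # assumed selection threshold
candidate_indices = np.where(distances > cutoff)[0]
```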
Experimental Design Details
Not available
Randomization Method
Computer
Randomization Unit
wikidata_id (topic) - language, with a holdout sample of completely untreated topics.
Was the treatment clustered?
Yes
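
A sketch of the clustered assignment, simplified to topic-level randomization with a holdout of untreated control topics; the wikidata_id lists, counts, and seed are illustrative assumptions.

```python
# Topic-level (clustered) assignment with a holdout of untreated control topics.
# All identifiers and counts below are placeholders.
import random

random.seed(42)                                            # assumed seed
conflict_topics = [f"Q{i}" for i in range(1, 301)]         # placeholder wikidata_ids
control_topics = [f"Q{i}" for i in range(10001, 10301)]    # holdout: never treated

random.shuffle(conflict_topics)
treated_topics = set(conflict_topics[:150])    # every language version of these topics is treated
untreated_topics = set(conflict_topics[150:])  # conflict topics left untreated
```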

Experiment Characteristics

Sample size: planned number of clusters
250 battles and control topics.
Sample size: planned number of observations
Approx. 250,000 to 500,000, depending on observation density.
Sample size (or number of clusters) by treatment arms
300 Wikipedia control topics, approx. 150 conflict topics untreated, and 150 conflict topics treated.
However, because each language pair is a unit of measurement and each topic has on average 40 language versions, we will obtain 39 distance measures on average per treated article, and thus (calculating with a conservative 30) about 4,500 treated language-distance observations resulting from the 150 treated topics.
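
A back-of-the-envelope check of these counts, using only the figures stated above.

```python
# Back-of-the-envelope check of the treated observation count stated above.
treated_topics = 150
avg_language_versions = 40
distances_per_treated_article = avg_language_versions - 1    # 39 on average
conservative_distances = 30                                  # conservative assumption
treated_observations = treated_topics * conservative_distances
print(treated_observations)  # 4500 treated language-distance observations
```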
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Assuming a baseline conversion rate of less than 5% annually, the smallest detectable lift is 1.5 percentage points at 85% power. This is conservative, because Wikipedia is now very stable and the annual convergence rate is expected to be below 5%. Moreover, if the treatment induces convergence, the rate of convergence should shift by much more than 1.5 percentage points.
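
The stated figure can be roughly reproduced with a standard two-proportion power calculation, assuming a 5% baseline rate, alpha = 0.05 (two-sided), equal arm sizes of about 4,500 language-distance observations, and no adjustment for clustering at the topic level; all of these are assumptions for illustration.

```python
# Rough check of the stated MDE/power figure under the assumptions in the
# lead-in; clustering at the topic level is not adjusted for here.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.05   # assumed annual convergence rate (upper bound stated above)
lift = 0.015           # 1.5 percentage point detectable lift
effect = proportion_effectsize(baseline_rate + lift, baseline_rate)

n_per_arm = 4500       # treated language-distance observations (see above)
power = NormalIndPower().power(effect_size=effect, nobs1=n_per_arm, alpha=0.05, ratio=1.0)
print(f"Approximate power: {power:.2f}")  # roughly 0.85-0.90 under these assumptions
```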
Supporting Documents and Materials

There is information in this trial unavailable to the public.
IRB

Institutional Review Boards (IRBs)

IRB Name
IRB Approval Date
IRB Approval Number
Analysis Plan

There is information in this trial unavailable to the public.