|
Field
Trial Title
|
Before
Can Customer Reviews Reduce Statistical Discrimination? Implications for Online Marketplaces
|
After
The impact of qualitative reviews in online markets: Empirical and experimental evidence on statistical discrimination
|
|
Field
Abstract
|
Before
We investigate the role of reviews in statistical discrimination in the sharing economy (specifically in online rental markets). Using a controlled experiment in an Airbnb-like setting, we measure how quantitative and qualitative customer review information affects accommodation demand across minority and non-minority hosts. We create fictitious listings using scraped data from Airbnb and systematically vary host characteristics (through photos and names), the number of available customer reviews, and the informativeness and quality of the available reviews. Our experimental design consists of three between-participants treatments: one varying host race (minority/non-minority) and the number of reviews (few/many, keeping quality of reviews fixed), one varying host race and informativeness of reviews when all reviews are positive, and one varying host race and review informativeness when the reviews include one negative review. This approach allows us to isolate the specific mechanisms through which customer reviews influence statistical discrimination. Our findings will provide insights for platform design to reduce racial discrimination in the sharing economy, complementing existing observational studies on discriminatory behaviour in online markets.
|
After
We investigate the role of customer reviews and host demographics in statistical discrimination within the sharing economy (specifically in online rental markets). Using a controlled experiment in an Airbnb-like setting, we measure how a host's race, a host's gender, and customer reviews interact to affect accommodation demand. We create fictitious listings using scraped data from Airbnb and systematically vary host characteristics across three primary dimensions in a fully crossed 2x2x2 factorial design: Host Race (Black/White), Host Gender (Man/Woman), and a Review factor (High/Low). To isolate specific mechanisms of review-based discrimination, the exact nature of the Review factor varies across three between-participant treatments, manipulating either review quantity, positive informativeness, or negative informativeness. Our experimental design uses a forced-choice pairwise mechanism across three budget blocks (Low, Mid, High). To ensure perfect orthogonality and counterbalancing, the pairings are drawn from a comprehensive property map, with participants assigned to one of 56 block-randomised survey versions. This approach allows us to estimate the causal main effects of each attribute, as well as their interactions, to understand whether specific types of high-quality reviews can mitigate intersectional demographic penalties. Our findings will provide insights for platform design to reduce discrimination in the sharing economy.
|
|
Field
Trial Start Date
|
Before
June 01, 2025
|
After
March 30, 2026
|
|
Field
Trial End Date
|
Before
December 31, 2025
|
After
July 31, 2026
|
|
Field
Last Published
|
Before
May 27, 2025 07:04 AM
|
After
March 24, 2026 12:57 PM
|
|
Field
Intervention Start Date
|
Before
June 01, 2025
|
After
March 30, 2026
|
|
Field
Intervention End Date
|
Before
December 31, 2025
|
After
July 31, 2026
|
|
Field
Primary Outcomes (End Points)
|
Before
Ranking of target properties
|
After
Whether target property was chosen
|
|
Field
Primary Outcomes (Explanation)
|
Before
Each participant will be presented with 4 sets of 6 fictitious properties. In each set, there will be one target property that we will vary between participants according to the treatment they are in as detailed in our description of the treatments in the Experimental design section.
The participant’s ranking of the target property in each set (a number between 1 and 6) is our primary outcome.
|
After
Each participant will be presented with 33 pairs of fictitious properties across three budget blocks (Low, Mid, High) and asked to select their most preferred property in each pair. Out of these 33 rounds, 28 are the primary experimental rounds containing the fully counterbalanced 2x2x2 attribute variations. The remaining 5 rounds are fixed filler/attention-check rounds. For the analysis, the data from the 28 experimental rounds will be reshaped into a "long format," resulting in 56 observations per participant (two competing properties per round). Our primary outcome is a binary indicator (0 or 1) for whether a specific property variant was selected by the participant.
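As an illustration of this reshaping step, the sketch below converts hypothetical wide-format choice data (one row per round, two competing variants per row) into the long format described above. All column names are placeholders rather than the registered data schema.

```python
import pandas as pd

# Hypothetical wide-format data: one row per experimental round.
wide = pd.DataFrame({
    "participant_id": [1, 1],
    "round":          [1, 2],
    "variant_left":   ["Black_Man_HighReview", "White_Woman_LowReview"],
    "variant_right":  ["White_Man_LowReview",  "Black_Woman_HighReview"],
    "choice":         ["left", "right"],  # which side was selected
})

# Long format: one row per property shown (2 per round, so 56 rows per
# participant across the 28 experimental rounds), with a binary indicator
# for whether that property was chosen.
long = wide.melt(
    id_vars=["participant_id", "round", "choice"],
    value_vars=["variant_left", "variant_right"],
    var_name="position", value_name="variant",
)
long["position"] = long["position"].str.replace("variant_", "")
long["chosen"] = (long["position"] == long["choice"]).astype(int)
print(long[["participant_id", "round", "variant", "chosen"]])
```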
|
|
Field
Experimental Design (Public)
|
Before
General structure of the experiment
The experiment will run online on Prolific. After obtaining informed consent, participants will first report details about their most recent rental experience, which will be used to customize the price range of properties shown to them. The main task consists of four rounds, where in each round participants rank six fictitious properties in order of preference. To encourage participants to reveal how they believe others perceive the properties, they are incentivised with bonus payments based on how closely their rankings align with the modal ranking (the most common ordering chosen by other participants). The experiment concludes with a post-experimental survey comprising an Implicit Association Test to measure implicit biases and basic demographic questions.
Treatments
Our experimental design consists of three treatments, each showing participants 4 sets of 6 fictitious properties. In each set, there will be one target property that we will vary between participants according to the treatment they are in. In the first treatment, we will vary the target property's host race (minority/non-minority) and review quantity (low/high, keeping quality of reviews fixed). In the second treatment, we will vary the host race and informativeness of reviews (low/high, keeping number of reviews fixed) when all reviews are positive. In the third treatment, we will vary the host race and informativeness of reviews (low/high, keeping number of reviews fixed) when one of the reviews is negative. Participants are randomly assigned to one treatment and see each target property configuration exactly once, ensuring they cannot compare different versions of the same property. Within their assigned treatment, participants evaluate four different sets of properties, with the target property's characteristics systematically varied across sets.
Hypotheses
For all treatments, a benchmark hypothesis is that properties with minority hosts will receive lower rankings than identical properties with non-minority hosts. After establishing the existence of a ranking difference due to race, we study, within each treatment, the effect of reviews on this difference.
We hypothesize that:
i. Controlling for host characteristics and review quality, the quantity of reviews will affect participants' ranking.
ii. Controlling for host characteristics and review quantity, the informativeness of reviews will affect participants' ranking.
Hypothesis ii will be tested separately for treatments 2 and 3, so we can study how informativeness affects the ranking gap in the presence and absence of a negative review. This design allows us to isolate the effects of host race, review quantity, and review quality on property rankings while minimising potential confounds.
Analysis of main effects
We will run the following regression separately for participants in the first and second treatments:
Prob(Rank_{ijt} ≤ k) = Λ(κ_k − β₀ + β₁ Minority_i + β₂ LowReviews_i + β₁₂ (Minority_i × LowReviews_i) + γ_j + δ_t)
Where:
Rank_{ijt} is the ranking (1-6) given to property i by participant j in set t
Minority_i is a dummy variable indicating whether the host is a minority
LowReviews_i is a dummy for low quantity/informativeness of reviews (1 if low, 0 if high)
γ_j are participant fixed effects
δ_t are set fixed effects
This specification would test:
1) Whether minority hosts receive lower rankings: H1: β₁ < 0
2) Whether low quantity/quality of reviews leads to lower rankings: H2: β₂ < 0
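For concreteness, the sketch below estimates this specification with statsmodels' OrderedModel on simulated placeholder data; the variable names, sample size, and effect sizes are illustrative assumptions rather than pilot values, and participant dummies are omitted for brevity (they would enter like the set dummies).

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 600  # placeholder number of property-by-set observations

# Simulated placeholder data mirroring the specification above.
df = pd.DataFrame({
    "minority":    rng.integers(0, 2, n),
    "low_reviews": rng.integers(0, 2, n),
    "set_id":      rng.integers(1, 5, n),  # four sets
})
latent = (0.4 * df["minority"] + 0.3 * df["low_reviews"]
          + 0.2 * df["minority"] * df["low_reviews"]
          + rng.logistic(size=n))
df["rank"] = pd.cut(latent, bins=6, labels=False) + 1  # ranks 1-6

X = pd.DataFrame({
    "minority":    df["minority"],
    "low_reviews": df["low_reviews"],
    "interaction": df["minority"] * df["low_reviews"],
})
# Set fixed effects as dummies (no constant: it is absorbed by the cutpoints).
X = pd.concat([X, pd.get_dummies(df["set_id"], prefix="set", drop_first=True)],
              axis=1)

res = OrderedModel(df["rank"], X.astype(float), distr="logit").fit(
    method="bfgs", disp=False)
print(res.summary())
```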
Exploratory analysis
While not a main hypothesis, our specification implicitly assumes that the baseline effect of minority host status (β₁) is consistent across the review-quantity and review-informativeness treatments. Testing this assumption could provide interesting insights into whether discrimination against minority hosts varies depending on the type of information (quantity vs. quality of reviews) being considered. Therefore, we will also test this hypothesis (H4) by comparing the β₁ coefficients across the two regressions using a statistical test (such as a Chow test or a z-test for equality of coefficients from separate regressions).
We also aim to investigate whether the main effects tested above (in H1, H2) interact with experimental variables. One plausible interaction effect would be that for non-minority hosts there is little or no statistical discrimination to start with, so higher number/quality of reviews does not change the ranking much, whereas for minority hosts, the effect may be stronger. We test this hypothesis separately for each treatment arm. For each of the treatment arms, this hypothesis is captured by:
H5: |β₂ + β₁₂| > |β₂|
- where β₂ represents the effect of low quantity/informativeness reviews for non-minority hosts, and
- (β₂ + β₁₂) represents the effect of low quantity/informativeness reviews for minority hosts
In other words, we expect the interaction term (β₁₂) to be negative and significant, indicating that minority hosts are more heavily penalised for having few or low-quality reviews than non-minority hosts.
We will apply Benjamini–Hochberg corrections to the exploratory hypotheses (H4–H5) to control the false discovery rate at α = 0.10.
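A minimal sketch of both steps, with illustrative numbers: a z-test for equality of β₁ across the two separately estimated regressions (H4), followed by a Benjamini–Hochberg adjustment of the exploratory p-values at α = 0.10. The coefficient estimates and the H5 p-value are placeholders.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Illustrative estimates of beta_1 (minority) from the two regressions.
b1, se1 = -0.15, 0.05  # treatment 1 (review quantity)
b2, se2 = -0.08, 0.05  # treatment 2 (review informativeness)

# H4: z-test for equality of coefficients from separate regressions.
z = (b1 - b2) / np.sqrt(se1**2 + se2**2)
p_h4 = 2 * (1 - stats.norm.cdf(abs(z)))

# H5 p-value would come from the interaction-term tests; placeholder here.
p_h5 = 0.03

# Benjamini-Hochberg adjustment controlling the FDR at alpha = 0.10.
reject, p_adj, _, _ = multipletests([p_h4, p_h5], alpha=0.10, method="fdr_bh")
print(z, p_adj, reject)
```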
Robustness Checks
We will assess robustness by re-estimating models without random effects (clustering SEs at the participant level), including set-level random effects, and with different covariance structures.
We will also formally test the proportional odds assumption using a Brant test. If violated, we will consider partial proportional odds models or multinomial logistic regression as alternatives.
|
After
General structure of the experiment
The experiment will run online on Prolific. After obtaining informed consent, participants will first report details about their most recent rental experience. The main task consists of 33 rounds divided into three budget blocks. In each round, participants are presented with two fictitious properties and must choose the one they would prefer to rent. To encourage truthful revelation of preferences and mitigate social desirability bias, participants are incentivised with bonus payments based on how closely their choices align with the modal choice (the option most frequently selected by other participants). The experiment concludes with a post-experimental survey comprising an Implicit Association Test to measure implicit biases and standard demographic questions.
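As an illustration of this incentive rule, the sketch below computes the modal choice in each round and counts how many of a participant's choices match it; the column names and the bonus rule are placeholders, not the registered payment formula.

```python
import pandas as pd

# Hypothetical choice data: one row per participant per round.
choices = pd.DataFrame({
    "round":          [1, 1, 1, 2, 2, 2],
    "participant_id": [1, 2, 3, 1, 2, 3],
    "chose_left":     [1, 1, 0, 0, 0, 1],
})

# Modal choice per round: the option selected by the majority of participants.
modal = (choices.groupby("round")["chose_left"]
         .agg(lambda s: int(s.mean() >= 0.5))
         .rename("modal_choice")
         .reset_index())

# A participant's bonus scales with how many of their choices match the mode.
merged = choices.merge(modal, on="round")
matches = ((merged["chose_left"] == merged["modal_choice"])
           .groupby(merged["participant_id"]).sum())
print(matches)  # e.g., bonus = rate_per_match * matches
```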
Treatments
Our experimental design consists of three between-participant treatments. In all treatments, we use a within-subjects orthogonal design in which participants evaluate 28 pairs of properties whose attributes vary along three primary dimensions.
• Treatment 1 (Quantity): We vary the host race (minority/non-minority), host gender (man/woman), and review quantity (low/high, keeping the quality of reviews fixed).
• Treatment 2 (Positive Informativeness): We vary the host race, host gender, and the informativeness of reviews (low/high, keeping the number of reviews fixed) when all reviews are positive.
• Treatment 3 (Negative Informativeness): We vary the host race, host gender, and the informativeness of reviews (low/high, keeping the number of reviews fixed) when one of the reviews is negative.
Participants are randomly assigned to one of the three treatments. To prevent participants from seeing the exact same property twice while ensuring that all combinations of traits are tested against each other, the 28 experimental rounds are constructed using a predefined "property map". To achieve perfect counterbalancing, participants within each treatment are randomly assigned to one of 56 distinct participant types. This rotation ensures that property attributes are orthogonal to the round number and budget tier, minimising order effects.
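The registration does not spell out how the 56 participant types are constructed. One scheme consistent with the numbers above (8 variants from the 2x2x2 design, C(8,2) = 28 unordered pairs giving the 28 rounds, and 28 cyclic rotations x 2 left/right orderings giving 56 versions) is sketched below; it is an illustrative assumption, not the registered property map.

```python
from itertools import combinations, product

# The 2x2x2 factorial gives 8 property variants.
variants = list(product(["Black", "White"], ["Man", "Woman"], ["High", "Low"]))
assert len(variants) == 8

# All unordered pairs of distinct variants: C(8, 2) = 28 experimental rounds.
pairs = list(combinations(range(8), 2))
assert len(pairs) == 28

def schedule(version: int):
    """Round order for one of 56 hypothetical participant types:
    28 cyclic rotations x 2 left/right orderings (version in 0..55)."""
    rotation, flip = version % 28, version // 28
    rounds = pairs[rotation:] + pairs[:rotation]
    return [(variants[b], variants[a]) if flip else (variants[a], variants[b])
            for a, b in rounds]

print(schedule(0)[:2])   # first two rounds for version 0
print(schedule(29)[:2])  # rotated and left/right-flipped version
```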
|
|
Field
Planned Number of Clusters
|
Before
5,250 (1,750 per treatment arm). Based on a pilot study, we may adjust the sample size up or down by up to 250 individuals.
|
After
1,200 individuals (400 per treatment arm)
|
|
Field
Planned Number of Observations
|
Before
21,000 (7,000 per treatment arm, 4 from each participant)
|
After
Each of the 400 participants per treatment makes 28 experimental choices, and each choice contributes two property-level observations in the long format, so the effective number of observations is 400 × 28 × 2 = 22,400 per treatment (67,200 in total).
|
|
Field
Sample size (or number of clusters) by treatment arms
|
Before
5,250 (1,750 per treatment arm). Based on a pilot study, we may adjust the sample size up or down by up to 250 individuals.
|
After
400 individuals per treatment
|
|
Field
Power calculation: Minimum Detectable Effect Size for Main Outcomes
|
Before
We simulate data following our model specification across a grid of parameter values, including the minority penalty coefficient, review quantity/quality effects, and interaction terms, for different sample sizes. For each parameter combination, we generate 1,000 synthetic datasets to estimate statistical power for detecting discrimination effects.
With the chosen sample size, we should be able to detect a minority penalty coefficient of at least 0.1 in magnitude with 80% power. We verified that smaller magnitudes of the minority penalty coefficient would correspond to a standardised effect size that is not economically meaningful.
|
After
We base our sample size on the hardest-to-detect main effect identified in our pilot (the demographic penalties, which showed a roughly 3.0 percentage-point difference in selection rates). We simulate data using the 28-round, long-format structure and the group means and standard deviations from the pilot. By running the full LPM on synthetic datasets across a grid of sample sizes, we determine that N = 400 participants per treatment arm are required to achieve 90% statistical power to detect a main-effect coefficient of 0.032 at α = 0.05.
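A simplified version of such a power simulation is sketched below: it draws binary choices with a 0.032 effect, fits an LPM with participant-clustered standard errors, and reports the rejection rate. Because it ignores within-pair dependence and the full attribute structure (and uses an assumed baseline selection rate), the number it reports will differ from the registered, pilot-calibrated calculation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

def power(n_participants, effect=0.032, base=0.5, n_rounds=28,
          sims=500, alpha=0.05):
    """Rejection rate of the LPM main-effect test with participant-clustered
    SEs. Simplified illustration: independent observations, single regressor."""
    rejections = 0
    for _ in range(sims):
        pid = np.repeat(np.arange(n_participants), n_rounds * 2)  # 56 rows each
        x = rng.integers(0, 2, pid.size).astype(float)  # e.g. a host-trait dummy
        y = rng.binomial(1, np.clip(base + effect * x, 0, 1))
        res = sm.OLS(y, sm.add_constant(x)).fit(
            cov_type="cluster", cov_kwds={"groups": pid})
        rejections += res.pvalues[1] < alpha
    return rejections / sims

print(power(400))  # simulated power under these simplified assumptions
```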
|