Welfare Gains of Online Rating Systems

Last registered on November 20, 2024

Pre-Trial

Trial Information

General Information

Title
Welfare Gains of Online Rating Systems
RCT ID
AEARCTR-0011386
Initial registration date
May 10, 2023


First published
May 17, 2023, 2:14 PM EDT


Last updated
November 20, 2024, 7:14 AM EST


Locations


Primary Investigator

Affiliation
HEC Lausanne (UNIL)

Other Primary Investigator(s)

PI Affiliation
HEC Lausanne (UNIL)
PI Affiliation
HEC Lausanne (UNIL)

Additional Trial Information

Status
Ongoing
Start date
2023-05-10
End date
2025-01-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
We study the efficacy of rating systems similar to those used in online marketplaces. We investigate how effectively rating systems help consumers select products in both vertically and horizontally differentiated markets. Additionally, we explore how grouping ratings and implementing freezing periods affect welfare in horizontally differentiated markets. Our research aims to answer the following questions:

1. Does the introduction of a rating system increase welfare in vertically differentiated markets?
2. Does the introduction of a rating system increase welfare in horizontally differentiated markets?
3. Does providing average reviews broken down by relevant individual characteristics (filtering) increase welfare in horizontally differentiated markets?
4. Does the introduction of a “freezing period” (the average rating is disclosed only when a minimum number of ratings is available) improve welfare?

***************
New treatments for November 2024:

AT1: Does the introduction of an algorithm-based recommender system increase welfare in horizontally differentiated markets? How does it compare to filtering?
AT2: Does the combination of a "classical" rating system (as implemented in the earlier treatments) and the algorithm-based recommender system increase welfare in the horizontal market, and does the effect decrease over time?
External Link(s)

Registration Citation

Citation
Bohren, Noah, Rustamdjan Hakimov and Luis Santos Pinto. 2024. "Welfare Gains of Online Rating Systems." AEA RCT Registry. November 20. https://doi.org/10.1257/rct.11386-2.0
Experimental Details

Interventions

Intervention(s)
## Sketch of Design:

We conduct an online experiment on the Prolific platform to study the efficacy of various online rating systems in vertically and horizontally differentiated markets.

We create artificial marketplaces where the goods participants can buy are tasks that they can perform. The tasks vary and determine whether a market is vertically or horizontally differentiated. Each participant is endowed with £2.5 and decides whether to buy one of two tasks or not. The two tasks are quizzes about celebrities that consist of 10 questions with 6 possible answers. Each task lasts 110 seconds. Participation in Task 1 and 2 costs £1.7 but rewards £0.45 per correct answer. If they choose not to buy a task, participants spend 110 seconds counting zeros in the matrices task, with no incentives.

We measure welfare as the average monetary outcome of participants.
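
As an illustration, the payoff rule described above can be sketched as follows. This is a minimal sketch and not the experiment's actual code: the function names are hypothetical, and it assumes that a buyer's monetary outcome is the endowment minus the task price plus the per-answer reward, while a non-buyer keeps the full endowment.

```python
import math

# Parameters taken from the design description (in GBP).
ENDOWMENT = 2.5
TASK_PRICE = 1.7
REWARD_PER_CORRECT = 0.45
N_QUESTIONS = 10


def payoff(bought: bool, correct: int = 0) -> float:
    """Monetary outcome of one participant (assumed payoff rule)."""
    if not bought:
        # Outside option: counting zeros, no incentives, endowment kept.
        return ENDOWMENT
    if not 0 <= correct <= N_QUESTIONS:
        raise ValueError("correct must be between 0 and N_QUESTIONS")
    return round(ENDOWMENT - TASK_PRICE + REWARD_PER_CORRECT * correct, 2)


def welfare(payoffs: list[float]) -> float:
    """Welfare measured as the average monetary outcome."""
    return sum(payoffs) / len(payoffs)


def break_even_correct() -> int:
    """Smallest number of correct answers that covers the task price."""
    return math.ceil(TASK_PRICE / REWARD_PER_CORRECT)
```

Under these assumptions, buying a task pays off relative to the outside option only from 4 correct answers onward (4 × £0.45 = £1.80, which exceeds the £1.70 price).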

Vertically Differentiated Markets:

In vertically differentiated markets, Task 1 is set such that it is easier than Task 2 for most participants. We call Task 1 “easy” and Task 2 “hard”.

e.g. *“Among all elected US American presidents, what was the name of the first who had African origin?”* VS *“Which of those actors won the most Oscars for acting?”*

Horizontally Differentiated Markets:

In horizontally differentiated markets, Task 1 is set such that it is easier than Task 2 for only half of the participants. To generate such variation, we recruit two types of participants: people between 18-30 years old and 50+ years old. We call Task 1 “young” and Task 2 “old”.

e.g. *“Who is the celebrity with the most subscribers on YouTube?”* VS *“In which TV show did the character of JR Ewing appear?”*



If participants buy a task, once they complete it, participants have the possibility (but not the obligation) to give a rating (1 to 5 stars) by answering the following question: *"Please give us your opinion on the task you just participated in. This information can be helpful for future participants”*

We create 6 treatments that vary the type of market (vertical or horizontal) and the rating system used (none, standard, filtered, frozen).

### Key stages

We define two key stages of the experiment.

1. At the **buying stage,** participants are informed of:

(a) The prices of each task as well as the payment scheme.

(b) A description of each task. Tasks 1 and 2 are both described as a quiz on celebrities. The outside option of not buying is described as having to count zeros in tables.

(c) Average (and number) of ratings for Task 1 and 2. (Whether this information is available or not and how it is disclosed depends on treatments)

2. At the **rating stage,** participants of Task 1 and 2 receive their scores and have an opportunity to rate the tasks.

An important consideration in our design is the evolution of ratings. When ratings are available during the buying stage (all but the baseline treatments), the observations become interdependent. This means that early ratings might influence the entire path of rating developments. To ensure several independent histories of ratings, we create 14 groups per treatment. In each group, 30 participants arrive sequentially and observe the ratings of the previous participants. Having independent groups allows us to observe potential different paths of rating evolution, which, in turn, allows us to measure whether there is a large variance in the welfare gains in the markets.



****************
Additional treatments: November 2024


To further explore the effectiveness of different systems in horizontally differentiated markets, we introduce two additional treatments: an algorithm-based recommendation and a combined rating and algorithm system. These treatments build on the existing experimental framework to examine how personalized information and rating systems interact in shaping welfare outcomes.
Intervention Start Date
2023-05-10
Intervention End Date
2025-01-31

Primary Outcomes

Primary Outcomes (end points)
Average monetary outcome of the last 15 participants in a sequence.
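
A minimal sketch of how this end point could be computed for one sequence is given below. The function name and the list-based data layout are assumptions made for illustration; the registration does not specify the analysis code.

```python
from statistics import mean


def primary_outcome(sequence_payoffs: list[float], last_n: int = 15) -> float:
    """Average monetary outcome of the last `last_n` participants.

    `sequence_payoffs` is assumed to be ordered by arrival position
    within one sequence (group) of participants.
    """
    if len(sequence_payoffs) < last_n:
        raise ValueError("sequence is shorter than last_n")
    return mean(sequence_payoffs[-last_n:])


# Hypothetical sequence of 30 payoffs: the first 15 participants take the
# outside option (2.5), the last 15 buy a task and answer 4 questions (2.6).
example_sequence = [2.5] * 15 + [2.6] * 15
example_outcome = primary_outcome(example_sequence)
```

Because each sequence is one independent observation, a treatment comparison would then use one such value per group.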

### Hypothesis

1. The introduction of a rating system increases welfare in vertically differentiated markets.
→ Average monetary outcome (welfare) will be higher in “Rating Vertical” compared to “Baseline Vertical” for the last 15 participants of each sequence (rating is established).

2. The introduction of a rating system does not increase welfare in horizontally differentiated markets.
→ Average monetary outcome (welfare) will not be different between “Rating Horizontal” and “Baseline Horizontal” for the last 15 participants of each sequence (rating is established).

3. The introduction of a “filtering” policy improves welfare and stability for horizontally differentiated goods for the last 15 participants of each sequence (rating is established).
4. The introduction of a “freezing” policy improves welfare and stability for horizontally differentiated goods for the last 15 participants of each sequence (rating is established).



************************
Additional treatments: November 2024

Treatment 1:
Hypothesis: Participants using algorithmic recommendations will achieve higher earnings compared to the baseline horizontal, rating, and filtering treatments. The algorithm is expected to offer personalized and accurate predictions that increase welfare without requiring participants to wait for ratings to accumulate.

Predictions:
1. Participants will earn more money in this treatment compared to the “Baseline Horizontal”, “Rating Horizontal” and “Filtered Rating Horizontal” treatments.

2. The average earnings for participants will be higher than those of the last 15 participants in the filtering treatment, as the algorithm offers a more efficient decision-making tool by providing tailored information.


Treatment 2:
Hypothesis: Participants exposed to both the rating system and the algorithm will earn more than those in the “Baseline Horizontal” and “Rating Horizontal” treatments, as the algorithm provides personalized recommendations. However, participants may exhibit algorithm aversion and prefer to follow the standard ratings, particularly late in the sequence, which may reduce their potential earnings. We also hypothesize that as the number of ratings increases, participants will increasingly rely on ratings rather than the algorithm, leading to a decline in welfare over time.

Predictions:
1. Participants will earn more than those in the “Baseline Horizontal” and “Rating Horizontal” treatments, as the algorithm should add value beyond the rating system.

2. Participants may not reach the same earnings as in Treatment 1 due to algorithm aversion, especially late in the sequence (last 15).

3. As more ratings accumulate, we expect participants to rely increasingly on the rating system, leading to a decline in average earnings over time as participants deviate from the algorithm’s more accurate recommendations.
Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)
**Exploratory**
1. Participants with higher self-confidence in celebrity quizzes are more likely to buy a task than participants with low self-confidence.
2. Participants with higher self-confidence will give lower ratings compared to others for a given monetary outcome.
3. The variance in welfare between independent markets is higher in “Rating horizontal” compared to “Rating vertical”
4. The variance in welfare between independent markets is lower in “Frozen rating horizontal” and “Filtered rating horizontal” compared to “Rating horizontal”

**********
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
### Treatments

We create two **baseline** treatments: one vertical (Task 1 is easy and Task 2 is hard) and one horizontal (Task 1 is young and Task 2 is old). For each, we recruit 200 participants who select a task and give a rating. The average rating from previous participants is never disclosed. This allows us to create a baseline and see how much welfare participants are able to generate **without** a rating system. (Note that we do not need participants to arrive sequentially, since ratings are never displayed.)

We then create two **rating** treatments where the average rating from previous participants is available at the buying stage. For the reasons described above, for each treatment, we generate 14 groups of 30 participants who enter the marketplace sequentially and where the average ratings are updated after each participant. This allows us to see how much additional welfare is generated by the introduction of a rating system.

We believe that rating systems are less effective when goods are horizontally differentiated. For this reason, we study two variations of rating systems that are used by some online platforms and could help improve welfare for the horizontal markets.

1. Filtering: This treatment is similar to “Rating Horizontal” but rather than displaying the average (and number) of ratings from previous participants, it is the average (and number) of ratings for each type (18-30 and 50+) that is displayed.
2. Freezing: This treatment is similar to “Rating Horizontal” but each task needs at least 5 ratings for the average (and number) of ratings to be disclosed.
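
The three disclosure rules (standard, filtered, frozen) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the platform's implementation: the `Rating` record, the function names, and the type labels `"18-30"` and `"50+"` are hypothetical, and `None` stands for "nothing is displayed".

```python
from dataclasses import dataclass
from typing import Optional

Summary = Optional[tuple[float, int]]  # (average stars, number of ratings)


@dataclass
class Rating:
    task: str        # "young" or "old" (Task 1 or Task 2)
    rater_type: str  # "18-30" or "50+"
    stars: int       # 1 to 5


def summarize(stars: list[int]) -> Summary:
    """Average and count of a list of star ratings, or None if empty."""
    if not stars:
        return None
    return (sum(stars) / len(stars), len(stars))


def standard_display(ratings: list[Rating], task: str) -> Summary:
    """Rating Horizontal: pooled average and number of ratings."""
    return summarize([r.stars for r in ratings if r.task == task])


def filtered_display(ratings: list[Rating], task: str) -> dict[str, Summary]:
    """Filtering: average and number of ratings shown per participant type."""
    return {
        t: summarize([r.stars for r in ratings
                      if r.task == task and r.rater_type == t])
        for t in ("18-30", "50+")
    }


def frozen_display(ratings: list[Rating], task: str,
                   min_ratings: int = 5) -> Summary:
    """Freezing: disclose only once the task has at least min_ratings."""
    stars = [r.stars for r in ratings if r.task == task]
    return summarize(stars) if len(stars) >= min_ratings else None
```

Since rating is optional, only participants who actually submitted a rating would enter the `ratings` list.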

### Additional exercises

In addition, for all participants and before the buying stage, we measure self-confidence in performance in celebrity quizzes. Participants have to answer the following question: *Imagine taking a quiz about celebrities. Out of 100 randomly selected people who also took the same quiz, how many do you think would perform worse than you?* This allows us to explore the effect of initial expectations on buying decisions and ratings.

Finally, we also measure risk aversion by asking the following question: “Are you generally a person who is fully prepared to take risks or do you try to avoid taking risks? [Scale from 0 to 10]”.



**********************
## Additional treatments: November 2024

#Treatment 1: Algorithm-Based Recommendation

In this treatment, we introduce an ‘algorithm’ designed to recommend the most profitable option based on participants’ individual characteristics. The algorithm is trained on data from previous participants in this experiment. The algorithm is very simple: it suggests that participants buy the age-matching task (i.e. task old for old participants and task young for young participants). The algorithm’s suggestion will be presented as: “Based on previous performance of participants similar to you, we suggest you buy task [age group].”
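
The age-matching rule can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and filling the "[age group]" placeholder with the labels "young" and "old" is an assumption, since the registration does not state the exact wording shown to participants.

```python
def recommend_task(age: int) -> str:
    """Return the age-matching task for a recruited participant.

    Only the two recruited age bands (18-30 and 50+) are handled.
    """
    if 18 <= age <= 30:
        return "young"
    if age >= 50:
        return "old"
    raise ValueError("age is outside the recruited groups (18-30, 50+)")


def recommendation_message(age: int) -> str:
    """Build the suggestion text, with an assumed label for [age group]."""
    return (
        "Based on previous performance of participants similar to you, "
        f"we suggest you buy task {recommend_task(age)}."
    )
```

For example, `recommendation_message(62)` would suggest task old under these assumptions.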

#Procedure:
Participants first complete the same introductory questionnaire as in previous treatments (including demographic information and confidence). During the buying stage, they are presented with the algorithm’s recommendation, which is tailored to their specific characteristics. Participants then choose a task or opt to count zeros, complete the chosen task, rate it, and exit the experiment.

#Sample:
We recruit 200 participants for this treatment (100 old and 100 young). Each participant makes a decision independently (without sequential interdependence of ratings).


#Treatment 2: Combined Rating and Algorithm

This treatment combines the existing “Rating Horizontal” system with the algorithm-based recommendation. Participants are provided with both the average ratings of previous participants (without filtering) and the algorithm’s recommendation during the buying stage.

#Procedure:
As in previous treatments, participants answer the introductory questionnaire and are exposed to the buying stage, where they see both the algorithm’s recommendation and the average rating (along with the number of ratings) from previous participants. After making their choice, they complete the task, rate it, and exit the experiment.

This treatment follows the same 14x30 sequential structure as in the “Rating Horizontal” treatment, with ratings evolving across groups of participants. In each sequence, 15 old and 15 young participants will be randomly assigned a position in the sequence.
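
One simple way to randomise positions within a sequence is to shuffle a balanced list of participant types. The sketch below is an assumption about how this could be done, not a description of the software actually used; the function name and the `seed` parameter are hypothetical.

```python
import random
from typing import Optional


def assign_positions(n_per_type: int = 15,
                     seed: Optional[int] = None) -> list[str]:
    """Return a random arrival order of participant types for one sequence.

    Index 0 is the first arrival. Each type appears n_per_type times.
    """
    order = ["young"] * n_per_type + ["old"] * n_per_type
    rng = random.Random(seed)  # local generator, leaves global state alone
    rng.shuffle(order)
    return order


# One independent order per group, for the 14 groups of the treatment.
sequences = [assign_positions(seed=g) for g in range(14)]
```

Seeding per group is only a convenience for reproducing the illustration.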

#Sample:
We recruit 14 groups of 30 participants (total 420 participants) who enter the marketplace sequentially and observe the ratings of previous participants.
Experimental Design Details
Not available
Randomization Method
To determine which group a participant is assigned to, we use a computer to randomise the process (for treatments).
Randomization Unit
We have 2 groups of 200 participants each (Baseline).
We have 4 treatments, each composed of 14 groups of 30 participants.

****************
Additional treatments: November 2024

We have 1 group of 200 participants.
We have 1 treatment of 14 groups of 30 participants.
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
2 groups of 200 participants, each independent observation
4*14 groups of 30 participants in sequence, each group is an independent observation

********************
Additional treatments: November 2024

1 group of 200 participants, each independent observation
14 groups of 30 participants in sequence, each group is an independent observation
Sample size: planned number of observations
2080 participants

********************
Additional treatments: November 2024

+620 participants
Sample size (or number of clusters) by treatment arms
6 treatments:
2 baseline with 200 participants each
1 horizontal rating with 14 groups of 30 participants (sequences)
1 vertical rating with 14 groups of 30 participants (sequences)
1 horizontal filtered rating with 14 groups of 30 participants (sequences)
1 horizontal frozen rating with 14 groups of 30 participants (sequences)


********************
Additional treatments: November 2024

2 treatments:
1 Algorithm-Based Recommendation with 200 participants
1 Combined Rating and Algorithm Based Recommendation with 14 groups of 30 participants (sequences)
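
The planned numbers of observations follow directly from the arm sizes listed above. The short check below reproduces that arithmetic; the dictionary keys are shorthand labels chosen for this sketch.

```python
# (number of groups, participants per group) for each treatment arm.
original_arms = {
    "baseline_vertical": (1, 200),
    "baseline_horizontal": (1, 200),
    "rating_horizontal": (14, 30),
    "rating_vertical": (14, 30),
    "filtered_rating_horizontal": (14, 30),
    "frozen_rating_horizontal": (14, 30),
}
additional_arms = {
    "algorithm_recommendation": (1, 200),
    "combined_rating_and_algorithm": (14, 30),
}


def total_participants(arms: dict[str, tuple[int, int]]) -> int:
    """Sum of groups times group size over all arms."""
    return sum(groups * size for groups, size in arms.values())


n_original = total_participants(original_arms)      # 2 * 200 + 4 * 14 * 30
n_additional = total_participants(additional_arms)  # 200 + 14 * 30
n_total = n_original + n_additional
```

This gives 2080 participants for the original six treatments and 620 for the November 2024 additions, 2700 in total.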
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
IRB

Institutional Review Boards (IRBs)

IRB Name
Ethics commission of HEC Lausanne
IRB Approval Date
2023-05-02
IRB Approval Number
RAILER