Welfare Gains of Online Rating Systems

Last registered on May 17, 2023

Pre-Trial

Trial Information

General Information

Title

Welfare Gains of Online Rating Systems

RCT ID

AEARCTR-0011386

Initial registration date

May 10, 2023

First published

May 17, 2023, 2:14 PM EDT

Locations

Country

United States of America

Region

Primary Investigator

Name

Noah Bohren

Affiliation

HEC Lausanne (UNIL)

Other Primary Investigator(s)

PI Name

Rustamdjan Hakimov

PI Affiliation

HEC Lausanne (UNIL)

PI Name

Luis Santos Pinto

PI Affiliation

HEC Lausanne (UNIL)

Additional Trial Information

Status

In development

Start date

2023-05-10

End date

2023-07-31

Keywords

Behavior, Lab, Welfare

Additional Keywords

Online Ratings

JEL code(s)

Secondary IDs

Prior work

This trial does not extend or rely on any prior RCTs.

Abstract

We study the efficacy of rating systems similar to those used in online marketplaces. We investigate how effectively rating systems help consumers select products in both vertically and horizontally differentiated markets. We also explore how grouping ratings and implementing freezing periods affect welfare in horizontally differentiated markets. Our research aims to answer the following questions:

1. Does the introduction of a rating system increase welfare in vertically differentiated markets?
2. Does the introduction of a rating system increase welfare in horizontally differentiated markets?
3. Does providing average reviews broken down by relevant individual characteristics (filtering) increase welfare in horizontally differentiated markets?
4. Does the introduction of a “freezing period” (the average rating is disclosed only once a minimum number of ratings is available) improve welfare?

External Link(s)

Registration Citation

Citation

Bohren, Noah, Rustamdjan Hakimov and Luis Santos Pinto. 2023. "Welfare Gains of Online Rating Systems." AEA RCT Registry. May 17. https://doi.org/10.1257/rct.11386-1.0

Sponsors & Partners

Experimental Details

Interventions

Intervention(s)

## Sketch of Design:

We conduct an online experiment on the Prolific platform to study the efficacy of various online rating systems in vertically and horizontally differentiated markets.

We create artificial marketplaces in which the goods participants can buy are tasks that they then perform. The choice of tasks determines whether a market is vertically or horizontally differentiated. Each participant is endowed with £2.50 and decides whether to buy one of two tasks or neither. The two tasks are quizzes about celebrities, each consisting of 10 questions with 6 possible answers. Each task lasts 110 seconds. Buying Task 1 or Task 2 costs £1.70 and pays £0.45 per correct answer. Participants who choose not to buy a task spend 110 seconds counting zeros in matrices, with no incentives.
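
The payment scheme above can be sketched as a short calculation. This is only an illustration, assuming that participants keep whatever part of the endowment they do not spend; the function and constant names are hypothetical and amounts are held in pence to avoid rounding issues.

```python
# Illustrative sketch of the payment scheme (names are hypothetical).
# Amounts are in pence so that all arithmetic stays exact.
ENDOWMENT = 250           # £2.50 given to every participant
TASK_PRICE = 170          # £1.70 to buy Task 1 or Task 2
REWARD_PER_CORRECT = 45   # £0.45 per correct answer
N_QUESTIONS = 10          # questions per quiz


def monetary_outcome(bought_task: bool, correct_answers: int = 0) -> int:
    """Final payoff in pence, assuming the unspent endowment is kept."""
    if not bought_task:
        # Outside option: counting zeros, no incentives.
        return ENDOWMENT
    if not 0 <= correct_answers <= N_QUESTIONS:
        raise ValueError("correct_answers must be between 0 and 10")
    return ENDOWMENT - TASK_PRICE + REWARD_PER_CORRECT * correct_answers


def break_even_score() -> int:
    """Smallest score at which buying is strictly better than not buying."""
    return TASK_PRICE // REWARD_PER_CORRECT + 1


print(monetary_outcome(False))     # outside option
print(monetary_outcome(True, 10))  # perfect score
print(break_even_score())
```

Under these assumptions, buying a task pays off only for a participant who answers at least 4 of the 10 questions correctly, since 4 × £0.45 = £1.80 exceeds the £1.70 price.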

We measure welfare as the average monetary outcome of participants.

Vertically Differentiated Markets:

In vertically differentiated markets, Task 1 is set such that it is easier than Task 2 for most participants. We call Task 1 “easy” and Task 2 “hard”.

e.g. *“Among all elected US American presidents, what was the name of the first who had African origin?”* VS *“Which of those actors won the most Oscars for acting?”*

Horizontally Differentiated Markets:

In horizontally differentiated markets, Task 1 is set such that it is easier than Task 2 for only half of the participants. To generate such variation, we recruit two types of participants: people between 18-30 years old and 50+ years old. We call Task 1 “young” and Task 2 “old”.

e.g. *“Who is the celebrity with the most subscribers on YouTube?”* VS *“In which TV show did the character of JR Ewing appear?”*

Participants who buy a task may, once they have completed it, give a rating of 1 to 5 stars (they are not obliged to do so) in response to the following prompt: *“Please give us your opinion on the task you just participated in. This information can be helpful for future participants”*

We create 6 treatments that vary the type of market (vertical or horizontal) and the rating system used (none, standard, filtered, or frozen).

### Key stages

We define two key stages of the experiment.

1. At the **buying stage,** participants are informed of:

(a) The prices of each task as well as the payment scheme.

(b) A description of each task. Tasks 1 and 2 are both described as a quiz on celebrities. The outside option of not buying is described as having to count zeros in tables.

(c) The average (and number) of ratings for Tasks 1 and 2. (Whether this information is available and how it is disclosed depend on the treatment.)

2. At the **rating stage,** participants of Task 1 and 2 receive their scores and have an opportunity to rate the tasks.

An important consideration in our design is the evolution of ratings. When ratings are available during the buying stage (all but the baseline treatments), the observations become interdependent: early ratings might influence the entire path of rating developments. To ensure several independent histories of ratings, we create 14 groups per treatment. In each group, 30 participants arrive sequentially and observe the ratings of the previous participants. Having independent groups allows us to observe potentially different paths of rating evolution, which, in turn, allows us to measure whether there is a large variance in the welfare gains across markets.

Intervention Start Date

2023-05-10

Intervention End Date

2023-07-31

Primary Outcomes

Primary Outcomes (end points)

Average monetary outcome of the last 15 participants in a sequence.
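
As a rough illustration of this end point, the welfare measure for one sequence could be computed as follows. This is a sketch under assumptions: the function name is hypothetical, outcomes are listed in order of arrival, and amounts are in pence.

```python
from statistics import mean


def sequence_welfare(outcomes, last_n=15):
    """Average monetary outcome of the last `last_n` participants.

    `outcomes` holds one payoff per participant, in order of arrival.
    """
    if len(outcomes) < last_n:
        raise ValueError("sequence is shorter than the evaluation window")
    return mean(outcomes[-last_n:])


# Hypothetical sequence of 30 participants: the first 15 keep their
# endowment (250p), the last 15 each end with 260p.
example = [250] * 15 + [260] * 15
print(sequence_welfare(example))
```

Only the second half of each sequence enters the measure, so the early participants, who face few or no ratings, do not affect it directly.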

### Hypotheses

1. The introduction of a rating system increases welfare in vertically differentiated markets.
→ Average monetary outcome (welfare) will be higher in “Rating Vertical” compared to “Baseline Vertical” for the last 15 participants of each sequence (rating is established).

2. The introduction of a rating system does not increase welfare in horizontally differentiated markets.
→ Average monetary outcome (welfare) will not be different between “Rating Horizontal” and “Baseline Horizontal” for the last 15 participants of each sequence (rating is established).

3. The introduction of a “filtering” policy improves welfare and stability for horizontally differentiated goods for the last 15 participants of each sequence (rating is established).
4. The introduction of a “freezing” policy improves welfare and stability for horizontally differentiated goods for the last 15 participants of each sequence (rating is established).

Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)

**Exploratory**
1. Participants with higher self-confidence in celebrity quizzes are more likely to buy a task than participants with low self-confidence.
2. Participants with higher self-confidence will give lower ratings compared to others for a given monetary outcome.
3. The variance in welfare between independent markets is higher in “Rating horizontal” than in “Rating vertical”.
4. The variance in welfare between independent markets is lower in “Frozen rating horizontal” and “Filtered rating horizontal” than in “Rating horizontal”.

Secondary Outcomes (explanation)

Experimental Design

### Treatments

We create two **baseline** treatments: one vertical (Task 1 is easy and Task 2 is hard) and one horizontal (Task 1 is young and Task 2 is old). For each, we recruit 200 participants who select a task and give a rating. The average rating from previous participants is never disclosed. This gives us a baseline showing how much welfare participants are able to generate **without** a rating system. (Note that participants do not need to arrive sequentially, since ratings are never displayed.)

We then create two **rating** treatments in which the average rating from previous participants is available at the buying stage. For the reasons described above, for each treatment we generate 14 groups of 30 participants who enter the marketplace sequentially, with the average ratings updated after each participant. This allows us to see how much additional welfare is generated by the introduction of a rating system.

We believe that rating systems are less effective when goods are horizontally differentiated. For this reason, we study two variations of rating systems that are used by some online platforms and could help improve welfare for the horizontal markets.

1. Filtering: This treatment is similar to “Rating Horizontal” but rather than displaying the average (and number) of ratings from previous participants, it is the average (and number) of ratings for each type (18-30 and 50+) that is displayed.
2. Freezing: This treatment is similar to “Rating Horizontal” but each task needs at least 5 ratings for the average (and number) of ratings to be disclosed.
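
The display rules of the rating systems can be sketched as below. This is a minimal illustration, not the experiment's actual implementation: the `Rating` class, the `displayed_rating` function and the system labels are hypothetical, and showing nothing when a task has no ratings yet is an assumption.

```python
from dataclasses import dataclass
from statistics import mean

RATER_TYPES = ("18-30", "50+")


@dataclass(frozen=True)
class Rating:
    task: int        # 1 or 2
    stars: int       # 1 to 5
    rater_type: str  # "18-30" or "50+"


def summarize(stars):
    """Return (average, number) of ratings, or None if there are none."""
    return (mean(stars), len(stars)) if stars else None


def displayed_rating(history, task, system, min_ratings=5):
    """What a new participant sees for `task` under a given rating system."""
    stars = [r.stars for r in history if r.task == task]
    if system == "none":      # baseline: ratings are never disclosed
        return None
    if system == "standard":  # average and number of all ratings
        return summarize(stars)
    if system == "frozen":    # hidden until at least `min_ratings` exist
        return summarize(stars) if len(stars) >= min_ratings else None
    if system == "filtered":  # one summary per participant type
        return {
            t: summarize([r.stars for r in history
                          if r.task == task and r.rater_type == t])
            for t in RATER_TYPES
        }
    raise ValueError(f"unknown rating system: {system}")


history = [Rating(1, 5, "18-30"), Rating(1, 5, "18-30"),
           Rating(1, 4, "50+"), Rating(1, 2, "50+")]
print(displayed_rating(history, 1, "standard"))
print(displayed_rating(history, 1, "filtered"))
print(displayed_rating(history, 1, "frozen"))
```

In this made-up history the pooled average hides the fact that the two age groups rate Task 1 very differently, which is the situation filtering is meant to address, while freezing withholds the average because only 4 ratings exist.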

### Additional exercises

In addition, for all participants and before the buying stage, we measure self-confidence in performance in celebrity quizzes. Participants have to answer the following question: *Imagine taking a quiz about celebrities. Out of 100 randomly selected people who also took the same quiz, how many do you think would perform worse than you?* This allows us to explore the effect of initial expectations on buying decisions and ratings.

Finally, we also measure risk aversion by asking the following question: “Are you generally a person who is fully prepared to take risks or do you try to avoid taking risks? [Scale from 0 to 10]”.

Experimental Design Details

Randomization Method

Participants are assigned to treatments and groups by computer randomisation.

Randomization Unit

We have 2 groups of 200 participants each (baseline treatments).
We have 4 treatments, each composed of 14 groups of 30 participants.

Was the treatment clustered?

Yes

Experiment Characteristics

Sample size: planned number of clusters

2 groups of 200 participants, each independent observation
4*14 groups of 30 participants in sequence, each group is an independent observation

Sample size: planned number of observations

2080 participants

Sample size (or number of clusters) by treatment arms

6 treatments:
2 baseline with 200 participants each
1 horizontal rating with 14 groups of 30 participants (sequences)
1 vertical rating with 14 groups of 30 participants (sequences)
1 horizontal filtered rating with 14 groups of 30 participants (sequences)
1 horizontal frozen rating with 14 groups of 30 participants (sequences)
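
The planned totals follow directly from the arm sizes listed above; the short check below reproduces the arithmetic (variable names are illustrative only).

```python
# Arithmetic check of the planned sample, using the arm sizes listed above.
BASELINE_ARMS = 2      # baseline vertical, baseline horizontal
BASELINE_SIZE = 200    # participants per baseline arm
SEQUENTIAL_ARMS = 4    # horizontal, vertical, filtered, frozen rating
GROUPS_PER_ARM = 14    # independent sequences per arm
GROUP_SIZE = 30        # participants per sequence

baseline_participants = BASELINE_ARMS * BASELINE_SIZE
sequence_count = SEQUENTIAL_ARMS * GROUPS_PER_ARM
sequential_participants = sequence_count * GROUP_SIZE
total_participants = baseline_participants + sequential_participants

print(baseline_participants, sequence_count,
      sequential_participants, total_participants)
```

This gives 400 baseline participants plus 56 sequences of 30 (1,680 participants), matching the planned 2,080 observations.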

Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

Supporting Documents and Materials

IRB