Determinants of Bias in Subjective Performance Evaluations

Last registered on November 13, 2019

Pre-Trial

Trial Information

General Information

Title

Determinants of Bias in Subjective Performance Evaluations

RCT ID

AEARCTR-0005020

Initial registration date

November 13, 2019

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published

November 13, 2019, 4:44 PM EST

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

Country

United States of America

Region

Primary Investigator

Name

David Kusterer

Affiliation

Rotterdam School of Management, Erasmus University Rotterdam

Other Primary Investigator(s)

PI Name

Dirk Sliwka

PI Affiliation

University of Cologne

Additional Trial Information

Status

In development

Start date

2019-11-18

End date

2019-12-31

Keywords

Firms & Productivity, Labor

Additional Keywords

Subjective Performance Evaluation, Leniency Bias, Centrality Bias, Ratings, Social Preferences

JEL code(s)

Secondary IDs

Abstract

We study the determinants of biases in subjective performance evaluations in an MTurk experiment. In the experiment, subjects in the role of workers work on a real effort task. Subjects in the role of supervisors observe subsamples of the workers’ output and assess their performance. We conduct 6 experimental treatments varying (i) whether workers’ pay depends on the performance evaluation, (ii) whether supervisors are paid for the accuracy of their evaluations, and (iii) the precision of the information available to supervisors. We study the effects of these treatments on rating leniency (the extent to which ratings exceed true performance) and rating compression (determined by the extent to which ratings vary with observed information).

External Link(s)

Registration Citation

Citation

Kusterer, David and Dirk Sliwka. 2019. "Determinants of Bias in Subjective Performance Evaluations." AEA RCT Registry. November 13. https://doi.org/10.1257/rct.5020-1.0

Sponsors & Partners

Experimental Details

Interventions

Intervention(s)

Intervention (Hidden)

Intervention Start Date

2019-11-18

Intervention End Date

2019-12-31

Primary Outcomes

Primary Outcomes (end points)

Performance Rating (0-100%).

Primary Outcomes (explanation)

Secondary Outcomes

Secondary Outcomes (end points)

- Belief of workers about performance (0-100%)
- Demographics
- Performance of supervisors on two example pages of real effort task (average number of correctly entered images on the two example pages, 0-100%)
- Belief of supervisor about performance on two example pages (0-100%)
- SVO of supervisor towards part 1 subjects (angle between average payment to self and other)
- Big 5
- Reciprocity
- Demographics
- Satisfaction of worker with own performance (integer scale from 0-10)
- Satisfaction of worker with rating (integer scale from 0-10)
- SVO of worker towards part 2 subject who rated this subject (angle between average payment to self and other)
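The SVO entries above report an angle derived from the average payments to self and other. The registration does not spell out the formula, so the following is only a minimal sketch that assumes the standard slider-measure computation from Murphy et al. (2011): allocations on a 0-100 scale centred at 50, and the category cutoffs reported in that paper. The function names are hypothetical.

```python
import math

def svo_angle(mean_self: float, mean_other: float) -> float:
    """SVO angle in degrees from mean allocations to self and to other.

    Assumption: allocations lie on a 0-100 scale and are centred at 50,
    as in the slider measure of Murphy et al. (2011).
    """
    return math.degrees(math.atan2(mean_other - 50.0, mean_self - 50.0))

def svo_category(angle: float) -> str:
    """Classify an SVO angle using the cutoffs from Murphy et al. (2011)."""
    if angle > 57.15:
        return "altruistic"
    if angle > 22.45:
        return "prosocial"
    if angle > -12.04:
        return "individualistic"
    return "competitive"

# Hypothetical supervisor who on average keeps 85 and gives 85.
angle = svo_angle(mean_self=85.0, mean_other=85.0)
category = svo_category(angle)
```

Under these assumptions, equal mean allocations above 50 give an angle of 45 degrees, which falls in the prosocial range.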

Secondary Outcomes (explanation)

Experimental Design

We conduct an online experiment on Amazon MTurk. Our experiment consists of three parts. Each part is finished before the next part starts. The individual parts are described in more detail below.
In part 1, subjects (called workers for the purpose of this document) work on a real effort task. After part 1 is completed, another set of subjects (called supervisors for the purpose of this document) receive noisy information about the performance of one of the workers from part 1 and rate their work. After part 2 is completed, subjects from part 1 are invited again to learn their rating. At the beginning of parts 1 and 2, participants agree to a consent form, read instructions, and answer comprehension questions.

Part 1: Entry Task
Workers perform a real effort task. The task consists of entering text contained in hard-to-read images (similar to so-called ‘captchas’). Workers see 10 pages with 10 images on each page. Each page has one of five different time limits: 17, 19, 21, 23, or 25 seconds. Each of these time limits occurs exactly twice in randomized order. The order is the same for all subjects. The time limit for the next page is announced during a 5-second countdown before the page starts. Workers are also asked to state their belief about their performance on all 10 pages.
There are no treatments in part 1. Workers learn in the instructions that their work will be rated by other MTurk worker(s) and that they will receive a payment which may depend on the rating they receive in part 2.
Workers fill in a demographics questionnaire after completing the Entry Task.

Part 2: Rating Task
First, supervisors work on two pages of the Entry Task to become familiar with the real effort task. One of the pages has the shortest time limit of 17 seconds while the other page has the longest time limit of 25 seconds. Supervisors are also asked to state their belief about their performance on the two example pages.
Then, supervisors are matched to a random worker from part 1. Matching is anonymous and participants never receive information on the identity of other subjects. Supervisors receive a noisy signal about the number of correctly entered images by the matched worker and are asked to give a rating to the worker. They are told that the rating should reflect performance on all 10 pages of the Entry Task.
We vary whether supervisors are paid according to accuracy or not (A and NA), whether workers are paid according to the rating or not (R and NR), and whether supervisors observe 1 or 4 pages out of the 10 pages the workers have worked on (S1 and S4). We conduct six treatments in total:
• A-R-S1
• NA-R-S1
• A-R-S4
• NA-R-S4
• A-NR-S1
• NA-NR-S1
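To illustrate how the S1 and S4 conditions differ in signal precision, the sketch below computes a supervisor's signal as the share of correct images on a subsample of pages. It assumes the observed pages are drawn uniformly at random, which the registration does not state; the per-page scores and function names are hypothetical.

```python
import random

IMAGES_PER_PAGE = 10  # from the design: 10 images on each of 10 pages

def true_performance(correct_per_page):
    """Share of correctly entered images over all pages, in percent."""
    total_images = IMAGES_PER_PAGE * len(correct_per_page)
    return 100.0 * sum(correct_per_page) / total_images

def noisy_signal(correct_per_page, pages_observed, rng):
    """Share of correct images (percent) on a random subsample of pages.

    Assumption: pages are sampled uniformly without replacement.
    """
    sampled = rng.sample(range(len(correct_per_page)), pages_observed)
    correct = sum(correct_per_page[p] for p in sampled)
    return 100.0 * correct / (IMAGES_PER_PAGE * pages_observed)

# Hypothetical worker: number of correct images on each of the 10 pages.
worker = [7, 9, 6, 8, 10, 5, 9, 7, 8, 6]
rng = random.Random(2019)
signal_s1 = noisy_signal(worker, 1, rng)  # S1: one page observed
signal_s4 = noisy_signal(worker, 4, rng)  # S4: four pages observed
```

Averaging over four pages rather than one reduces the sampling variance of the signal, which is the sense in which S4 gives supervisors more precise information.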
We randomly assign workers (and hence matched supervisors) to the six treatments stratifying the assignment to obtain comparable performance distributions across treatments.
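One common way to stratify on performance is to rank workers, cut the ranking into blocks of six, and randomly permute the six treatments within each block. The sketch below illustrates that approach; it is an assumption about how such stratification could be done, not the authors' documented procedure, and the worker IDs and scores are simulated.

```python
import random

TREATMENTS = ["A-R-S1", "NA-R-S1", "A-R-S4", "NA-R-S4", "A-NR-S1", "NA-NR-S1"]

def stratified_assignment(performance, rng):
    """Assign treatments within blocks of workers with similar performance.

    performance: dict mapping worker id -> part 1 performance score.
    Returns a dict mapping worker id -> treatment label.
    """
    ranked = sorted(performance, key=performance.get)
    assignment = {}
    for start in range(0, len(ranked), len(TREATMENTS)):
        block = ranked[start:start + len(TREATMENTS)]
        arms = TREATMENTS[:]
        rng.shuffle(arms)  # random order of the six arms within this block
        for worker_id, arm in zip(block, arms):
            assignment[worker_id] = arm
    return assignment

# Simulated example with the planned 780 workers (scores are made up).
rng = random.Random(42)
performance = {f"w{i}": rng.randint(0, 100) for i in range(780)}
assignment = stratified_assignment(performance, rng)
```

With 780 workers this yields 130 workers per treatment, and each treatment receives exactly one worker from every block of six adjacent ranks.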
In all treatments, supervisors receive a payment that is increasing in their matched worker’s actual performance. At the end of part 2, supervisors complete the Social Value Orientation (SVO) slider measure (Murphy et al., 2011) with a random worker (but not the one they rated) as the recipient in order to measure their social preferences towards the worker population. They also fill in a reciprocity questionnaire (Dohmen et al., 2009), the Big Five Inventory (Rammstedt and John, 2005), and a demographics questionnaire. After this, they learn their total payment and leave the study.

Part 3
Workers who completed part 1 and who received a rating in part 2 are invited by email to participate in part 3.
First, they learn their rating and their actual performance and are asked to report their satisfaction with their performance and their satisfaction with the rating. Second, they learn whether their payment depends on the rating, and learn their payment. They complete the SVO slider measure with their supervisor as the recipient to measure their social preferences towards their supervisor. After this, they learn their total payment consisting of payments from the Entry Task, the SVO they completed, and the SVO another supervisor completed in part 2. This concludes the experiment.

Exclusion criteria
We restrict participation to MTurk workers who have completed at least 1000 HITs (Human Intelligence Tasks) on MTurk and who have an approval rate of at least 98%. These restrictions are standard in the literature and ensure high data quality. Subjects are excluded from payment (and further participation in the study) if they do not enter a single image in part 1.
In parts 1 and 2, participants have to answer comprehension questions to make sure they understand the instructions. If they have not answered a question correctly after the third attempt, they are excluded from further participation.
Before participants agree to participate in the study in part 1, they are made aware that they will only receive their payment if they also participate in the third part of the study within 4 weeks of receiving the invitation email.

Pre-Analysis Plan
We will regress performance ratings on treatment dummies, the standardized aggregated signal observed by the respective supervisor, and interaction terms between the signal and the respective treatment dummies. The treatment dummies thus capture between-treatment differences in rating leniency, and the interaction terms capture between-treatment differences in rating compression. Furthermore, we will compare the average ratings between prosocial and individualistic supervisors (according to SVO) within treatments.
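The regression described above could be written as follows. The notation is an assumption introduced here for illustration: $r_i$ is the rating given by supervisor $i$, $T_{ik}$ is a dummy for treatment $k$ (with one treatment omitted as the reference category, which the registration does not specify), and $s_i$ is the standardized aggregated signal.

```latex
r_i = \alpha + \sum_{k=2}^{6} \beta_k T_{ik} + \gamma\, s_i
      + \sum_{k=2}^{6} \delta_k \,(T_{ik} \times s_i) + \varepsilon_i
```

In this form, $\beta_k$ measures the difference in rating leniency between treatment $k$ and the reference treatment at the mean signal, and $\delta_k$ measures the difference in how strongly ratings respond to the signal, so a smaller slope $\gamma + \delta_k$ indicates stronger compression.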

References
T. Dohmen, A. Falk, D. Huffman, and U. Sunde. Homo reciprocans: Survey evidence on behavioural outcomes. Economic Journal, 119(536):592–612, 2009.
R. O. Murphy, K. A. Ackermann, and M. J. J. Handgraaf. Measuring social value orientation. Judgment and Decision Making, 6(8):771–781, 2011.
B. Rammstedt and O. P. John. Kurzversion des Big Five Inventory (BFI-K): Entwicklung und Validierung eines ökonomischen Inventars zur Erfassung der fünf Faktoren der Persönlichkeit. Diagnostica, 51(4):195–206, 2005.

Experimental Design Details

Randomization Method

Randomization by computer, stratified by performance to obtain comparable performance distributions across treatments

Randomization Unit

Individual

Was the treatment clustered?

Experiment Characteristics

Sample size: planned number of clusters

No clustering

Sample size: planned number of observations

780 worker observations and 780 supervisor observations: 130 groups (1 worker and 1 supervisor) per treatment. Due to attrition, it is possible that we have a lower number of participants in part 3. We will document dropouts and test whether these are systematic (e.g., based on performance). As our main measurement is the rating in part 2 compared to the performance from part 1, even systematic attrition will not bias our main results.

Sample size (or number of clusters) by treatment arms

- A-R-S1: 130 workers and 130 supervisors
- NA-R-S1: 130 workers and 130 supervisors
- A-R-S4: 130 workers and 130 supervisors
- NA-R-S4: 130 workers and 130 supervisors
- A-NR-S1: 130 workers and 130 supervisors
- NA-NR-S1: 130 workers and 130 supervisors

Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

Assuming a standard deviation in ratings of 29 (obtained in a pilot study with 15 workers and 15 supervisors), a power of 0.8, and a significance level of 0.05, our MDES is a difference in ratings of 10.11 percentage points.
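As a rough check on this figure, the sketch below computes the minimum detectable difference between two treatment arms of 130 observations each. It assumes a two-sided two-sample comparison with equal variances and uses the normal approximation; the registration does not state which formula or software was used, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mdes_two_sample(sd, n_per_arm, alpha=0.05, power=0.8):
    """Minimum detectable difference in means between two equal-sized arms.

    Assumptions: two-sided test, equal variances, normal approximation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile for the target power
    return (z_alpha + z_power) * sd * sqrt(2 / n_per_arm)

mdes = mdes_two_sample(sd=29, n_per_arm=130)
print(round(mdes, 2))
```

Under these assumptions the result is about 10.1 percentage points, close to the registered 10.11. The small gap may reflect the use of t rather than normal quantiles.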

Supporting Documents and Materials

IRB