On Rating Scales in Subjective Performance Evaluations - Performance Effects of Kindness

Last registered on October 18, 2019

Pre-Trial

Trial Information

General Information

Title
On Rating Scales in Subjective Performance Evaluations - Performance Effects of Kindness
RCT ID
AEARCTR-0004385
Initial registration date
October 18, 2019

First published
October 18, 2019, 10:42 AM EDT

Locations

Region

Primary Investigator

Affiliation
University of Cologne

Other Primary Investigator(s)

PI Affiliation
University of Cologne
PI Affiliation
University of Cologne
PI Affiliation
University of Cologne

Additional Trial Information

Status
In development
Start date
2019-10-21
End date
2020-06-30
Secondary IDs
Abstract
To incentivize employee performance, many companies employ performance appraisals that are tied to compensation. In such ratings, employees are typically evaluated on a specific predetermined scale. It is often observed that the lowest rating categories go unused, resulting in compressed appraisals. To date, research on subjective evaluations has often attributed this behavior to rating bias. In this project, however, we study whether companies may design compressed appraisals intentionally, adding unused lower categories in order to frame rating scales and thereby avoid negatively reciprocal reactions from employees.
External Link(s)

Registration Citation

Citation
Sliwka, Dirk et al. 2019. "On Rating Scales in Subjective Performance Evaluations - Performance Effects of Kindness." AEA RCT Registry. October 18. https://doi.org/10.1257/rct.4385-1.0
Sponsors & Partners

There is information in this trial unavailable to the public.

Experimental Details

Interventions

Intervention(s)
Intervention Start Date
2019-10-21
Intervention End Date
2019-12-31

Primary Outcomes

Primary Outcomes (end points)
Points for counting the number “7” in blocks of randomly generated numbers on the individual level (individual performance).
Points for counting the number “7” in blocks of randomly generated numbers on the group level (group performance).
The number of times the time-out button was used, i.e., how often leisure time was taken.
Questionnaire data (post)
Primary Outcomes (explanation)
Points for counting the number “7”:
For each correct answer, a subject receives 2 points; for each wrong answer, 0.5 points are subtracted.
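
As a minimal sketch of this scoring rule (function and variable names are ours, not part of the experimental protocol):

    def counting_task_points(correct: int, wrong: int) -> float:
        # Each correct count of the digit "7" earns 2 points;
        # each wrong answer subtracts 0.5 points.
        return 2.0 * correct - 0.5 * wrong

    # Example: 10 correct and 2 wrong answers yield 19 points.
    assert counting_task_points(10, 2) == 19.0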

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
We conduct a laboratory experiment that involves the roles of “employers” and “employees”, resembling a standard work setting. In all treatments, subjects in the role of employees are randomly matched into groups with one subject in the role of employer. Employees work on a real-effort task and receive a performance-dependent bonus. Employers benefit from higher work effort, as their payment (a piece rate) depends on their employees’ performance. Employers determine the rating scale shown to employees when their performance is evaluated by the computer. The experiment consists of two stages. In the first stage, employers determine the rating scales shown to their groups of employees. Subsequently, employees learn the rating scale chosen by their employer and then work on a real-effort task.

We assess individuals’ ability in a pre-round in which all subjects work on the same real-effort task: counting the number “7” in blocks of randomly generated numbers (Berger et al. 2013). Based on this ability test, the best-performing subjects become employers. The remaining subjects become employees and are randomly assigned to an employer. Matching is anonymous, and participants never receive information on the identity of other subjects. Using stratified sampling, we ensure equal ex ante performance across employee groups and an equal number of employees per employer. To ensure understanding of the experimental instructions, subjects must pass a comprehension quiz. Before the main task, subjects learn their role (employer or employee). The main task consists of two consecutive stages. In the first stage, employers determine the rating scales of the performance appraisal shown to their employees. Subsequently, employees learn the rating scale chosen by their employer. In the second stage, employees work on the real-effort task of the pre-round for six rounds, each lasting 2.5 minutes. The main task is followed by a questionnaire section.
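
The registry does not spell out the stratification algorithm; the following is a minimal sketch of one standard scheme consistent with the description (all names are placeholders): subjects are ranked by pre-round score, the top performers become employers, and the remaining subjects are dealt to employer groups stratum by stratum, so that groups are equal in size and comparable in ex ante ability.

    import random

    def assign_roles(pre_scores: dict, n_employers: int, group_size: int):
        # Rank subjects by pre-round score, best first.
        ranked = sorted(pre_scores, key=pre_scores.get, reverse=True)
        employers = ranked[:n_employers]
        employees = ranked[n_employers:n_employers + n_employers * group_size]
        groups = {e: [] for e in employers}
        # Deal one employee from each ability stratum to every employer,
        # shuffling within the stratum to randomize the matching.
        for i in range(0, len(employees), n_employers):
            stratum = employees[i:i + n_employers]
            random.shuffle(stratum)
            for employer, worker in zip(groups, stratum):
                groups[employer].append(worker)
        return groups
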
Experimental Design Details
We investigate the performance effects of framing performance rating scales when ratings are used to determine wage payments. Accordingly, the performance rating scale is identical across treatments, but the framing of the scale varies. In all treatments, the computer evaluates employees’ performance on a 3-point scale. Actual ratings are based on pre-determined, absolute performance benchmarks and follow exactly the same procedure in all three treatments, such that only categories 1-3 are actually awarded. Subjects do not learn the specific details of the rating procedure. We base our performance benchmarks for assigning bonus categories on the performance distribution in Berger et al. 2013 and use the 30%, 40%, 30% rule of the initial framed field experiments.
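
A minimal sketch of the awarded categories under this benchmark rule (the thresholds t_low and t_high are hypothetical placeholders for the absolute benchmarks derived from the 30%/40%/30% split of the Berger et al. 2013 distribution):

    def rating_category(points: float, t_low: float, t_high: float) -> int:
        # Benchmarks are fixed so that roughly 30% of subjects fall below
        # t_low, 40% between the thresholds, and 30% at or above t_high.
        if points < t_low:
            return 1  # lowest awarded category
        if points < t_high:
            return 2
        return 3      # highest awarded category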

We analyze three treatment variations: Treatment “NoDummy” (ND) serves as a baseline. Subjects see the actual 3-point rating scale used by the computer. In treatment “Dummy” (D), subjects see a 4-point scale (where in fact the computer only uses the highest three categories). The scale shown in treatment “DummyTransparent” (DT) is similar to the one in treatment “Dummy”. However, the non-usage of the added rating category is disclosed.

In the first part of our investigation of unused rating categories in performance appraisals, we focused on the individual performance rating. For more details on our framed field experiments, see the pre-registered trials: https://doi.org/10.1257/rct.3029-1.0 and https://doi.org/10.1257/rct.2736-1.0. We hypothesized that adding an unused lower rating category increases performance for two reasons: First, adding a lower rating category increases the bonus spread and hence the incentive to exert effort. Accordingly, this should increase performance (incentive effect). Second, adding an unused lower rating category improves the relative individual performance evaluation. This should induce positively reciprocal effort provision and hence should also increase performance (evaluation effect).

Surprisingly, we found evidence of a reversed, negative performance effect of an unused lower rating category, which we attribute to the kindness of rating scales. Adding an (unused) lower rating category reduces the kindness of a rating scale, which induces negatively reciprocal reactions that reduce performance. Contrary to the initial hypotheses focusing on the individual performance rating, the kindness of an incentive scheme (the rating scale) seems to have a dominant overall influence on performance.

We now conduct a robustness check and test whether this “kindness of a scale” effect replicates in a laboratory study. Compared to the field environment of our first experiments, we expect the following changes under laboratory conditions: First, our experimental design emphasizes the scale choice, which should amplify the kindness-of-a-scale effect. Second, due to the cleaner laboratory conditions, we expect the incentive effect to be stronger, while reciprocal reactions should be weaker (no workplace setting). Moreover, we now analyze performance effects over time, as we employ several working rounds.

As a result, we test the following hypotheses.
H1: Average performance in treatment DT is greater than average performance in treatment D.
H2: Average performance in treatment ND is greater than average performance in treatment D.
H3: Average performance in treatment DT is greater than average performance in treatment ND.

We test H1-H3 over all periods as well as per period.
We expect the differences described in H1-H3 to decrease over the working periods, as subjects become used to the incentive scale.

H4: On average, subjects assess the rating scale used in treatment DT as kinder than the rating scale shown in treatment D.
H5: On average, subjects assess the rating scale used in treatment ND as kinder than the rating scale shown in treatment D.
H6: On average, subjects assess the rating scale used in treatment DT as kinder than the rating scale shown in treatment ND.
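
The registry does not fix a test statistic for H1-H6; as a minimal sketch, assuming two-sided two-sample t-tests on subject-level data (column names are hypothetical):

    from scipy.stats import ttest_ind

    def test_hypothesis(df, outcome, treatment_a, treatment_b):
        # 'df' is assumed to be a pandas DataFrame with one row per subject
        # and columns 'treatment' plus the outcome ('points' for H1-H3,
        # a kindness rating for H4-H6).
        a = df.loc[df["treatment"] == treatment_a, outcome]
        b = df.loc[df["treatment"] == treatment_b, outcome]
        return ttest_ind(a, b)

    # e.g. H2: performance in ND versus D, pooled over all periods
    # stat, p = test_hypothesis(data, "points", "ND", "D")

For the per-period versions of H1-H3, the same test can be run separately on each of the six working rounds.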

We introduce the three treatments ND, D and DT using the following mechanism: three employee groups, named A, B and C, are randomly matched to an employer. The employer sees three different (framings of) evaluation scales, as described for the treatments ND, D and DT above. The employer’s task is to assign each of the three evaluation scales to one of the three employee groups. No information on the employee groups other than the group names A, B and C is provided. The three groups have equal pre-round performance, and group names are assigned randomly. We thereby nudge employers into randomly assigning evaluation scales to employee groups and ensure exogenous treatment variation.
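
Because the group labels carry no information, any assignment the employer picks is exchangeable with a uniform random draw; a minimal sketch of the induced randomization:

    import itertools
    import random

    # The employer chooses one of the 3! = 6 possible assignments of the
    # scale framings {ND, D, DT} to the groups {A, B, C}; simulating the
    # choice as a uniform draw illustrates the exogenous variation.
    assignments = list(itertools.permutations(["ND", "D", "DT"]))
    chosen = dict(zip(["A", "B", "C"], random.choice(assignments)))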

After the main task, subjects answer a questionnaire on the kindness of the assigned rating scales, a reciprocity measure, a Big 5 questionnaire and demographic questions.
Randomization Method
Stratification method
Randomization Unit
Individual
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
none
Sample size: planned number of observations
480 subjects total sample size
Sample size (or number of clusters) by treatment arms
160 subjects in each treatment:
160 subjects NoDummy (control) treatment,
160 subjects Dummy treatment,
160 subjects DummyTransparent treatment
Note: due to possible no-shows, the final number of subjects per treatment may be lower.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
MDES = 2.5 [Points for counting the number “7”], standard deviation = 6.99 [Points for counting the number “7”], 10%
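
The registered numbers are consistent with a two-sided 5% significance level and a 10% Type II error rate (90% power) at 160 subjects per arm; a minimal sketch of that reading (alpha and power are our inference, not stated explicitly in the registry):

    import math
    from scipy.stats import norm

    alpha, power = 0.05, 0.90  # assumed: two-sided test, 10% Type II error
    sd = 6.99                  # points for counting the number "7"
    n_per_arm = 160

    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    mdes = z * sd * math.sqrt(2 / n_per_arm)
    print(f"MDES = {mdes:.2f} points")  # ~2.53, matching the registered 2.5
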
IRB

Institutional Review Boards (IRBs)

IRB Name
Research Ethics Review, Faculty of Management, Economics and Social Sciences, University of Cologne
IRB Approval Date
2019-07-06
IRB Approval Number
19017TV

Post-Trial

Post Trial Information

Study Withdrawal

There is information in this trial unavailable to the public.

Intervention

Is the intervention completed?
No
Data Collection Complete
Data Publication

Data Publication

Is public data available?
No

Program Files

Program Files
Reports, Papers & Other Materials

Relevant Paper(s)

Reports & Other Materials