Experimental Design Details
We investigate how the framing of performance rating scales affects performance when ratings are used to determine wage payments. Accordingly, the performance rating scale is identical across treatments, but the framing of the scale varies. In all treatments, the computer evaluates employees’ performance on a 3-point scale. Actual ratings are based on pre-determined, absolute performance benchmarks and follow exactly the same procedure in all three treatments, such that only categories 1-3 are actually awarded. Subjects do not learn the specific details of the rating procedure. We base the performance benchmarks for assigning bonus categories on the performance distribution of Berger et al. (2013) and use the 30%, 40%, 30% rule of the initial framed field experiments.
We analyze three treatment variations. Treatment “NoDummy” (ND) serves as the baseline: subjects see the actual 3-point rating scale used by the computer. In treatment “Dummy” (D), subjects see a 4-point scale, although the computer in fact only uses the highest three categories. The scale shown in treatment “DummyTransparent” (DT) is similar to the one in treatment “Dummy”; however, the non-usage of the added rating category is disclosed.
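For concreteness, the following sketch illustrates the scale framings and the computerized rating procedure. The category labels (0-3), the benchmark values, and the data structure are illustrative assumptions only; the actual cutoffs follow the Berger et al. (2013) performance distribution and the 30%, 40%, 30% rule described above.

```python
# Illustrative sketch of the treatment framings and the rating procedure.
# Category labels, benchmark values and structure are assumptions for exposition only.

TREATMENTS = {
    "ND": {  # NoDummy: subjects see the 3-point scale the computer actually uses
        "displayed_categories": [1, 2, 3],
        "awarded_categories": [1, 2, 3],
        "unused_category_disclosed": None,   # no unused category exists
    },
    "D": {   # Dummy: a 4-point scale is shown, but the lowest category is never awarded
        "displayed_categories": [0, 1, 2, 3],
        "awarded_categories": [1, 2, 3],
        "unused_category_disclosed": False,
    },
    "DT": {  # DummyTransparent: as in D, but non-usage of the added category is disclosed
        "displayed_categories": [0, 1, 2, 3],
        "awarded_categories": [1, 2, 3],
        "unused_category_disclosed": True,
    },
}

# Hypothetical absolute benchmarks, chosen such that roughly 30% / 40% / 30% of
# subjects fall into categories 1 / 2 / 3 under the reference performance distribution.
BENCHMARK_LOW = 80
BENCHMARK_HIGH = 120

def assign_rating(performance):
    """Map absolute performance to rating categories 1-3 (identical in all treatments)."""
    if performance < BENCHMARK_LOW:
        return 1
    if performance < BENCHMARK_HIGH:
        return 2
    return 3
```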
In the first part of our investigation of unused rating categories in performance appraisals, we focused on the individual performance rating. For more details on our framed field experiments, see the pre-registered trials: https://doi.org/10.1257/rct.3029-1.0 and https://doi.org/10.1257/rct.2736-1.0. We hypothesized that adding an unused lower rating category increases performance for two reasons: First, adding a lower rating category increases the bonus spread and hence the incentive to exert effort; this should increase performance (incentive effect). Second, adding an unused lower rating category improves the relative individual performance evaluation; this should induce positively reciprocal effort provision and hence should also increase performance (evaluation effect).
Surprisingly, we found evidence of a reversed, negative performance effect of an unused lower rating category, which we attribute to the kindness of rating scales: adding an (unused) lower rating category reduces the kindness of a rating scale, which induces negatively reciprocal reactions that reduce performance. Contrary to the initial hypotheses focusing on the individual performance rating, the kindness of an incentive scheme (the rating scale) seems to have a dominant overall influence on performance.
As a robustness check, we now test whether this “kindness of a scale” effect replicates in a laboratory study. Compared to the field environment of our first experiments, we expect the following changes due to the laboratory conditions: First, our experimental design emphasizes the scale choice, which should amplify the kindness of a scale effect. Second, due to the cleaner laboratory conditions we expect the incentive effect to be stronger, while at the same time the reciprocal reactions should be weaker (no workplace setting). Moreover, we now analyze performance effects over time, as we employ several working rounds.
As a result, we test the following hypotheses.
H1: Average performance in treatment DT is greater than average performance in treatment D.
H2: Average performance in treatment ND is greater than average performance in treatment D.
H3: Average performance in treatment DT is greater than average performance in treatment ND.
We test H1-H3 over all periods as well as per period.
We expect the differences described in H1-H3 to decrease over the working periods, as subjects become used to the incentive scale.
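As a hedged illustration of how H1-H3 could be tested over all periods as well as per period, consider the following sketch. It assumes a one-sided Mann-Whitney U test and hypothetical column names (treatment, period, performance); the specific test statistic and data layout are assumptions for exposition.

```python
# Hedged sketch of the planned comparisons (H1-H3); test choice and column names are assumptions.
import pandas as pd
from scipy.stats import mannwhitneyu

def compare(df, better, worse, period=None):
    """One-sided p-value that performance in treatment `better` exceeds that in `worse`."""
    if period is not None:
        df = df[df["period"] == period]
    x = df.loc[df["treatment"] == better, "performance"]
    y = df.loc[df["treatment"] == worse, "performance"]
    return mannwhitneyu(x, y, alternative="greater").pvalue

# Example: H1 (DT > D) over all periods and separately per period.
# p_all = compare(data, "DT", "D")
# p_per_period = {t: compare(data, "DT", "D", period=t) for t in sorted(data["period"].unique())}
```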
H4: On average, subjects assess the rating scale used in treatment DT as kinder than the rating scale shown in treatment D.
H5: On average, subjects assess the rating scale used in treatment ND as kinder than the rating scale shown in treatment D.
H6: On average, subjects assess the rating scale used in treatment DT as kinder than the rating scale shown in treatment ND.
We introduce the three treatments ND, D and DT using the following mechanism: Three employee groups named A, B and C are randomly matched to an employer. The employer sees three different (framings of) evaluation scales as described in treatments ND, D and DT above. The employer's task is to assign each of the three evaluation scales to one of the three employee groups. No information on the employee groups other than the group names A, B and C is provided. The three groups have equal pre-round performance, and group names are assigned randomly. We thereby nudge employers into randomly assigning evaluation scales to employee groups and ensure exogenous treatment variation.
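The following minimal sketch mirrors this matching mechanism; the names and structure are illustrative. Because the three groups are ex-ante indistinguishable, the employer's assignment of scale framings is effectively random, which the shuffle below mimics.

```python
# Illustrative sketch of the employer's scale-to-group assignment (structure assumed).
import random

def assign_scales_to_groups(seed=None):
    """Return a mapping from group name to scale framing for one employer."""
    rng = random.Random(seed)
    groups = ["A", "B", "C"]
    framings = ["ND", "D", "DT"]
    rng.shuffle(framings)  # mimics the employer's uninformed (effectively random) choice
    return dict(zip(groups, framings))
```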
After the main task, subjects complete a questionnaire on the kindness of the assigned rating scales, a reciprocity measure, a Big Five questionnaire, and demographic questions.