What Makes An Expert: An Experiment on Delegation of Agency

Last registered on March 19, 2025

Pre-Trial

Trial Information

General Information

Title
What Makes An Expert: An Experiment on Delegation of Agency
RCT ID
AEARCTR-0015539
Initial registration date
March 10, 2025

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published
March 19, 2025, 8:45 AM EDT

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
The Ohio State University

Other Primary Investigator(s)

Additional Trial Information

Status
In development
Start date
2025-03-16
End date
2025-08-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
In today's information environment, individuals both struggle to discern the views of accepted experts and give less weight to the opinions of those traditionally regarded as experts. Whether it is due to overconfidence, lack of transparency, conflicting information, or other factors, distrust of experts is growing. Regardless of the cause, expertise is now subjectively valued and lacks consensus. We investigate one possible subjective definition of an expert. Using a laboratory experiment, we estimate a heterogeneous "expertise gap": the difference between one's own accuracy on a task and the accuracy another person must have before one gives up agency over one's choice. This individual-specific measure captures the degree of superiority one must demonstrate over another to be considered an expert through delegation of agency. We also examine how variations in the information an expert possesses, which should not affect the expertise gap, influence the threshold for expertise.
External Link(s)

Registration Citation

Citation
Stelnicki, Samantha. 2025. "What Makes An Expert: An Experiment on Delegation of Agency." AEA RCT Registry. March 19. https://doi.org/10.1257/rct.15539-1.0
Experimental Details

Interventions

Intervention(s)
In the control condition, the first-phase participant has the same information as the decision maker. Our treatment conditions vary the information the first-phase participant has relative to the decision maker. These are within-subject information interventions aimed at investigating how the expertise gap changes with the information available to the other party. Within-subject, participants see seven variations of the task and, for each, report the accuracy a first-phase participant must have before they would delegate agency to that participant. The seven variations are: the baseline, where all information is the same; a variation where the other participant has more information; a variation where the other participant has less information; a variation where the other participant has the same amount of information but it is more informative; a variation where the other participant has the same amount of information but it is less informative; a variation where the two participants agreed on the answer to the task; and a variation where they disagreed.

We also determine whether a participant's own objective accuracy on the task changes the expertise gap. To do so, we have two treatments. In the first treatment, we ask participants to report their own believed accuracy on the task and do not tell them their actual accuracy. In the second treatment, participants also report their own believed accuracy, but are then told their actual accuracy on the task. This allows us to determine whether the expertise gap evolves with one's own accuracy or is instead set by some thresholding rule that does not depend on knowing one's own accuracy.

We have three different versions of the task to test for robustness across task parameters. These versions differ in the distribution of colored marbles in the jars for the task. The different versions are between-subject treatments. These are not necessarily interventions to change behavior, but rather to ensure that the task parameters themselves aren't driving any behavior we find.

Additionally, we have a secondary experiment with interventions based on demographic characteristics. These interventions are determined by whether another participant has the same race/ethnicity or gender as the individual. We have two main interventions: in the first, the participants share the demographic characteristic of interest (either race/ethnicity or gender); in the second, they do not.
Intervention Start Date
2025-03-16
Intervention End Date
2025-08-31

Primary Outcomes

Primary Outcomes (end points)
Our main variable of interest is the expertise premium (also referred to as the expertise gap): the difference between a decision maker's reported necessary accuracy to delegate their agency to the first-phase participant and the decision maker's own accuracy. One's own accuracy is simply the number of questions out of 10 that the decision maker answered correctly, so relying on one's own performance is a lottery whose probability of winning is that score times 10%. The decision maker's belief about their own accuracy is a percentage from 0% to 100%, which is a subjective lottery where their belief is the chance they win a bonus. If the decision maker instead chooses the accuracy of the first-phase participants to determine their bonus, they receive a bonus X out of 100 times, which is a lottery with an X% chance of winning; X is the number of the 100 task repetitions that the first-phase participant answered correctly. Therefore, a rational decision maker who satisfies first-order stochastic dominance should have a switch point where X% equals their own objective/subjective accuracy or their own objective/subjective accuracy + 1%. This strategy ensures the highest probability of winning a bonus in the experiment.
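As a minimal sketch of this construction (variable names and values are hypothetical, and accuracies are expressed in percent):

    # Minimal sketch of the expertise premium and the rational benchmark
    # switch point; all names and values here are hypothetical.
    def expertise_premium(necessary_accuracy_pct, own_accuracy_pct):
        """Reported necessary accuracy of the first-phase participant minus
        the decision maker's own (subjective or objective) accuracy."""
        return necessary_accuracy_pct - own_accuracy_pct

    def rational_switch_point(own_accuracy_pct):
        """A decision maker who respects first-order stochastic dominance
        should switch as soon as the other participant's accuracy exceeds
        their own, i.e. at own accuracy or own accuracy + 1 percentage point."""
        return own_accuracy_pct + 1

    # Example: 7/10 correct (70%) and a reported necessary accuracy of 85%
    # imply an expertise premium of 15 percentage points.
    print(expertise_premium(85, 70))   # 15
    print(rational_switch_point(70))   # 71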

If a decision maker has a preference for agency, the expertise premium should be larger than 1%. Our first hypothesis test of interest is therefore whether the average expertise premium is greater than 1%. We will test this for the original, non-manipulated version of the task for each difficulty level and treatment separately, for a total of six hypothesis tests, each on a different sample of participants. We will use a one-sample Wilcoxon signed-rank test to determine whether the value is greater than 1%.
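A minimal sketch of this test using scipy, assuming a hypothetical array of expertise premiums (in percentage points) for one difficulty level and treatment:

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical expertise premiums for one difficulty level and treatment.
    ep = np.array([3, 0, 12, 5, -1, 8, 2, 15, 6, 4], dtype=float)

    # One-sample Wilcoxon signed-rank test of H0: median premium <= 1%
    # against H1: median premium > 1%, implemented by shifting the data by 1.
    stat, p_value = wilcoxon(ep - 1, alternative="greater")
    print(f"W = {stat:.1f}, one-sided p = {p_value:.3f}")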

Next, within each treatment, we are interested in whether the expertise premium (EP) across difficulty levels is significantly different for the original, non-manipulated problem. To do so, we will test whether the average expertise premium is the same across the three difficulty levels for each treatment. We will use the Kruskal-Wallis test to determine if the three groups differ. If they do not significantly differ, we will pool the data for each treatment for use in all future analysis.
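A minimal sketch of this comparison, assuming hypothetical per-difficulty arrays of expertise premiums within one treatment:

    from scipy.stats import kruskal

    # Hypothetical expertise premiums by difficulty level within one treatment.
    ep_easy = [2, 5, 9, 1, 7]
    ep_medium = [4, 7, 3, 8, 6]
    ep_hard = [6, 2, 10, 5, 9]

    # Kruskal-Wallis test of equal distributions across the three levels.
    h_stat, p_value = kruskal(ep_easy, ep_medium, ep_hard)
    print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
    # If p exceeds the significance level, the difficulty levels are pooled
    # for subsequent analyses.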

We are also interested in whether knowing one's own objective accuracy changes the necessary accuracy for delegation to first-phase participants. Specifically, we are interested in whether the necessary accuracy depends on one's own perceived accuracy or instead follows some thresholding rule. For example, a decision maker might require some fixed level of performance from another person before giving up agency, regardless of their own performance. We are interested both in whether there is a constant thresholding heuristic that ignores own accuracy and in whether there is instead a consistent additional accuracy, above one's own, that is expected of first-phase participants. This analysis has two parts. First, we test whether the average reported necessary accuracy for first-phase participants differs across the Subjective Accuracy (SA) and Objective Accuracy (OA) treatments, to determine whether a constant threshold rule is used. Second, we test whether the expertise premiums differ, to determine whether there is instead an expected additional accuracy, dependent on the decision maker's own accuracy, that first-phase participants must have. Together these tests tell us whether the necessary accuracy reacts to decision makers knowing their own objective accuracy. We will use Mann-Whitney U (Wilcoxon rank-sum) tests to compare these two statistics across the two treatments. If the difficulty levels are not pooled based on the statistical test above, this will be done separately for each difficulty level.
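A minimal sketch of the two between-treatment comparisons, assuming hypothetical arrays for the SA and OA treatments:

    from scipy.stats import mannwhitneyu

    # Hypothetical reported necessary accuracies (X%) and expertise premiums.
    necessary_sa = [70, 85, 60, 90, 75]
    necessary_oa = [65, 75, 80, 70, 72]
    ep_sa = [10, 15, 5, 20, 12]
    ep_oa = [5, 8, 12, 3, 9]

    # Constant-threshold check: do reported necessary accuracies differ
    # across the SA and OA treatments?
    u1, p1 = mannwhitneyu(necessary_sa, necessary_oa, alternative="two-sided")
    # Additional-accuracy check: do expertise premiums differ across treatments?
    u2, p2 = mannwhitneyu(ep_sa, ep_oa, alternative="two-sided")
    print(f"necessary accuracy: U = {u1}, p = {p1:.3f}")
    print(f"expertise premium:  U = {u2}, p = {p2:.3f}")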

The previous hypothesis tests all determine whether an expertise gap exists in the non-manipulated treatment. If so, we also learn how the expertise gap changes with information about the decision maker's own accuracy. The following analysis determines whether the expertise gap can be made larger or smaller by changing the task only for the first-phase participant, relative to the original task performed by the decision maker. For each of the six manipulations (more information, less information, more informative, less informative, agree, and disagree), we will compare the size of the expertise gap under the manipulation to the size of the expertise gap in the original version of the task. From this, we can conclude how each manipulation changes the expertise gap and give guidance as to how changing the characteristics of the first-phase participants' questions can lead to more rational decision making.

For each manipulation, we will compare the expertise gap to the original question expertise gap using a paired Wilcoxon signed-rank test, since the manipulation and original expertise gaps are within-subject. For each subject, their own accuracy stays constant, so this essentially tells us whether the necessary accuracy of the first-phase participant is increasing or decreasing for each manipulation. Depending on the previous analysis, we will pool difficulty level and treatments. If the previous analysis finds differences, we will perform a paired Wilcoxon signed-rank test for each difficulty level and/or treatment.
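A minimal sketch of one such paired comparison, assuming hypothetical within-subject arrays aligned by participant:

    from scipy.stats import wilcoxon

    # Hypothetical within-subject expertise gaps, aligned by participant.
    ep_original = [10, 5, 20, 8, 15, 3]       # original, non-manipulated task
    ep_manipulated = [14, 6, 25, 7, 18, 9]    # e.g. a "less information" manipulation

    # Paired Wilcoxon signed-rank test on the within-subject differences.
    stat, p_value = wilcoxon(ep_manipulated, ep_original)
    print(f"W = {stat:.1f}, p = {p_value:.3f}")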

For the secondary (demographic) experiment, the incentives and payment structure are the same as in the main experiment. Similarly, the analysis is a subset of the analysis of the main experiment: we will run the same hypothesis test for variation between manipulations, and we will include the data from the main experiment's medium difficulty, Objective Accuracy decision makers as a baseline comparison. We expect that similarity between decision makers and first-phase participants will lead to smaller expertise gaps. Additionally, we are interested in the magnitude of the difference between the gender and race/ethnicity groups; specifically, whether gender or race/ethnicity similarity has a larger impact on the expertise gap, and whether one demographic group is more affected by similarity than others. To do so, we will use a regression on the difference between the expertise gap with a similar demographic group (EP_similar) and with a different demographic group (EP_different) across the two treatments. This will determine which demographic classification has the largest impact on expertise gaps. The dependent variable is, for a single decision maker, the difference between their expertise gap when they share a demographic characteristic with the first-phase participant and when they do not.
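A minimal sketch of this regression, assuming a hypothetical DataFrame with one row per decision maker, where ep_diff is EP_similar minus EP_different and characteristic records whether the decision maker was in the gender or race/ethnicity treatment:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: one row per decision maker in the secondary experiment.
    df = pd.DataFrame({
        "ep_diff": [-4.0, -2.0, -7.0, 1.0, -3.0, -5.0, -1.0, -6.0],
        "characteristic": ["gender"] * 4 + ["race_ethnicity"] * 4,
    })

    # OLS of the within-person gap difference on the demographic classification;
    # the coefficient indicates which characteristic's similarity moves the
    # expertise gap more.
    model = smf.ols("ep_diff ~ C(characteristic)", data=df).fit()
    print(model.summary())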
Primary Outcomes (explanation)
Our main variable of interest, the expertise premium, is constructed by subtracting a participant's own subjective/objective accuracy from the necessary accuracy they reported in order to delegate their choice to the first-phase participant.

Secondary Outcomes

Secondary Outcomes (end points)
First, we will look at whether the decision maker's own ability on the task affects the size of the expertise premium. There is a boundary problem: decision makers who do very well on the task mechanically cannot have as large an expertise premium as those who do not do as well. For example, someone with 50% accuracy can have an expertise premium of up to 50 percentage points, whereas someone with 90% accuracy can have a premium of at most 10 percentage points. So there is a mechanical boundary problem as decision makers' accuracy approaches 100%. To ensure this does not confound our analysis, we will use a Tobit regression with upper-limit censoring. We will set the censoring boundary for decision maker accuracy at the 95th percentile of total performance. For example, if the 95th percentile is 90%, then all decision makers whose own accuracy is greater than or equal to 90% will be censored.

To test whether a decision maker's own accuracy affects the size of the expertise premium, we run a Tobit regression. The dependent variable is the reported necessary accuracy of the first-phase participant, X%, and the independent variable is the decision maker's own accuracy, categorized by manipulation. If the dependent variable were the expertise premium itself, we would have a perfect multicollinearity issue, since the decision maker's own accuracy would appear on both sides of the regression. This categorization allows us to compare how the magnitude of the expertise gap changes with one's own accuracy across all versions of the task. We include individual-level fixed effects to account for the manipulations being within-subject. We will pool difficulty level and treatment if appropriate.
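statsmodels does not ship a Tobit estimator, so the following is a minimal maximum-likelihood sketch under one reading of the censoring rule above: observations whose own accuracy is at or above the 95th percentile contribute a right-censored term at their observed value. The data, variable names, and functional form are hypothetical, and the sketch omits the individual-level fixed effects and the manipulation-specific categorization.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def tobit_negll(params, y, X, censored):
        """Negative log-likelihood with right (upper) censoring: censored
        observations contribute P(latent y* >= observed y)."""
        beta, log_sigma = params[:-1], params[-1]
        sigma = np.exp(log_sigma)
        xb = X @ beta
        ll_unc = norm.logpdf(y[~censored], loc=xb[~censored], scale=sigma)
        ll_cen = norm.logsf(y[censored], loc=xb[censored], scale=sigma)
        return -(ll_unc.sum() + ll_cen.sum())

    # Hypothetical data: y is the reported necessary accuracy X%, own_acc the
    # decision maker's own accuracy in percent.
    rng = np.random.default_rng(0)
    own_acc = rng.integers(30, 100, size=200).astype(float)
    X = np.column_stack([np.ones(200), own_acc])
    y = np.clip(20 + 0.8 * own_acc + rng.normal(0, 8, 200), 0, 100)

    # Censor decision makers at or above the 95th percentile of own accuracy.
    censored = own_acc >= np.percentile(own_acc, 95)

    res = minimize(tobit_negll, x0=np.array([0.0, 1.0, np.log(10.0)]),
                   args=(y, X, censored), method="BFGS")
    print("beta:", res.x[:-1], "sigma:", np.exp(res.x[-1]))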

If the hypothesis test for difficulty level is rejected, we are interested in how the difficulty level of the task affects the expertise premium. To determine this, we run an OLS regression for each of the seven manipulations that compares the expertise premium across the three difficulty levels, pooling treatment if appropriate. This regression is similar to the previous one but does not include individual-level fixed effects; difficulty level enters as a categorical regressor.
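A minimal sketch of one such regression, assuming a hypothetical DataFrame with one row per decision maker for a single manipulation:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical expertise premiums for a single manipulation across
    # the three difficulty levels (one row per decision maker).
    df = pd.DataFrame({
        "ep": [12.0, 4.0, 18.0, 7.0, 9.0, 22.0, 15.0, 6.0, 11.0],
        "difficulty": ["easy", "easy", "easy",
                       "medium", "medium", "medium",
                       "hard", "hard", "hard"],
    })

    # OLS with difficulty entering as a categorical regressor; one such
    # regression is run for each of the seven manipulations.
    model = smf.ols("ep ~ C(difficulty, Treatment(reference='easy'))", data=df).fit()
    print(model.summary())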
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
There are two phases to the experiment. In the first phase, we collect data that will be used to incentivize the main experiment (also called the second phase). These first-phase participants comprise the pool from which decision makers in the main experiment may choose an expert. In the main experiment, we measure the accuracy on the experimental task that an expert needs before decision makers delegate their decision agency to the expert. The phases are between-subject, and data are collected on Prolific.

Before explaining the objectives of the two phases, we first explain the experimental task in detail. The task of interest is a balls-in-urns-style task. There are two jars, each containing 10 balls that are some combination of green, orange, and blue. The color composition of each jar is known and displayed on the screen. One of the two jars is selected at random (with 50% probability), and three balls are drawn from it without replacement. The chosen jar is not known to the participant. These three draws are not chosen randomly but are decided ex ante to manipulate task difficulty, and they are shown to the participant. A fourth ball is then drawn from the same randomly chosen jar, selected at random from the 7 balls remaining in it. The objective of the task is to guess the color of this fourth draw.
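As a hedged illustration of the Bayesian updating underlying the task (the jar compositions and draws below are hypothetical, not the parameters used in the experiment):

    from collections import Counter

    def draw_sequence_prob(jar, draws):
        """Probability of observing the shown draws, in order, without
        replacement, plus the remaining composition of the jar."""
        remaining = Counter(jar)
        total = sum(remaining.values())
        prob = 1.0
        for color in draws:
            prob *= remaining[color] / total
            remaining[color] -= 1
            total -= 1
        return prob, remaining, total

    def fourth_draw_distribution(jar_a, jar_b, draws):
        """Posterior over the two equally likely jars given the shown draws,
        then the predictive probability of each color for the fourth draw."""
        like_a, rem_a, n_a = draw_sequence_prob(jar_a, draws)
        like_b, rem_b, n_b = draw_sequence_prob(jar_b, draws)
        post_a = like_a / (like_a + like_b)
        post_b = 1.0 - post_a
        colors = set(jar_a) | set(jar_b)
        return {c: post_a * rem_a[c] / n_a + post_b * rem_b[c] / n_b
                for c in colors}

    # Hypothetical compositions: 10 balls per jar.
    jar_a = ["green"] * 6 + ["orange"] * 3 + ["blue"] * 1
    jar_b = ["green"] * 2 + ["orange"] * 5 + ["blue"] * 3
    print(fourth_draw_distribution(jar_a, jar_b, ["green", "orange", "green"]))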

There are three versions of this baseline task: an easy, a medium, and a hard version. These difficulty levels were determined prior to running the experiment based on how obvious the best answer is (i.e., how much more of one color a jar contains) and how much additional accuracy a participant can gain from correctly performing the Bayesian updating calculation compared to a participant who chooses at random. The realized difficulty ordering may differ once the data are analyzed, and it is not integral to the experiment that the versions retain their intended difficulty ranking. Rather, these different versions of the task provide robustness for our results and ensure that any findings in the main experiment are not merely a feature of a single set of parameters of the balls-in-urns task.

In the first phase of the experiment, participants (whom we also call experts) perform 100 repetitions of the experimental task. Participants only ever see one difficulty level of the task. More explicitly, they see the same two jars with the same color compositions and the same three ball draws, and they must guess the color of the fourth draw 100 times. The jar that is randomly chosen and the color of the fourth draw are random and may change between repetitions of the task. Participants in this phase are told all of this information and are informed that they will answer the same question, with potentially different outcomes, 100 times. The first phase is solely to collect the data needed to incentivize the main experiment. One of the 100 repetitions is randomly chosen, and participants receive a bonus if they answered that problem correctly.

In addition, there are two different manipulations for each difficulty level of the experimental task. The jar compositions are the same in each manipulation, but the ball draws shown vary. The first manipulation changes the number of ball draws that a participant sees. A participant may see the same first three ball draws as in the original version but, in addition, see a fifth and sixth ball draw; their objective is still to guess the color of the fourth draw. Alternatively, a participant may see the same first ball draw but not the second and third, which are replaced with grayed-out balls; their objective is still to guess the fourth ball drawn. The additional ball draws are the same for every participant, and the grayed-out balls are the same as in the original version of the task.

The second manipulation changes how informative the three shown draws are. The three balls drawn and shown to the participant are either more or less informative than in the original version of the task. The informativeness of the draws is determined by the probability, from the Bayesian updating calculation, that a given color will be the fourth draw: less informative draws yield posterior probabilities closer to chance, and more informative draws make one color more certain than in the original version. In total, there are 15 different versions of the task across all difficulty levels and their variations. Participants in the first phase are assigned to exactly one of the 15 versions and answer that version 100 times. We recruit 20 participants per version, for a total of 300 participants in the first phase.

In the main experiment, or second phase, participants first complete 10 repetitions of one difficulty level of the original, non-manipulated task. The main question of interest is whether participants would like to use their own performance on the task for payment or instead have the decisions of a first-phase participant with X% accuracy on the task replace their own. Their answer is the X% accuracy needed for the decision maker to replace their performance with that of the first-phase participant. They may answer any X% from 0% to 100% in one-percent increments. The question is displayed in list format, where the decision maker chooses the row at which to switch from preferring their own performance to preferring the first-phase participant's. We enforce a single switch point, so at any accuracy above the switch point the decision maker must also prefer the first-phase participant's choices. For example, if a decision maker states 60% accuracy as their switch point, they automatically prefer first-phase participants with 61%, 62%, ..., 100% accuracy on the task to their own performance. In addition to answering this question for the original version of the task, they report the necessary accuracy for all four manipulations of the task. There are also two additional manipulations in which we show decision makers whether the first-phase participant agreed or disagreed with their own guess on the task. This manipulation randomly chooses one of the 10 repetitions and displays the decision maker's reported answer on that repetition, as well as the answer of a first-phase participant who agreed or disagreed with them.
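As a minimal sketch of how a single switch point fills in the full list (a hypothetical helper; treating the stated row itself as delegating is an assumption of this sketch):

    def implied_choices(switch_point_pct):
        """For each listed accuracy X% (0-100 in 1% steps), return True if the
        decision maker delegates to the first-phase participant and False if
        they keep their own performance; a single switch point means every row
        at or above it delegates."""
        return {x: x >= switch_point_pct for x in range(0, 101)}

    choices = implied_choices(60)
    print(choices[59], choices[60], choices[61])  # False True True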

There are two treatments in the main experiment. In the first treatment, decision makers do not know their own performance accuracy on the task when answering the main questions. Before they answer, we elicit their beliefs about how accurate they were on the task and then display this believed accuracy to them as they answer the main questions. We call this treatment Subjective Accuracy. In the second treatment, we measure their beliefs between the task and the main questions, just as in the first treatment; however, participants are then told their actual accuracy on the task after reporting their beliefs, and their actual accuracy is displayed on the main question decision screens. We call this treatment Objective Accuracy. Treatments are between-subject, and beliefs in both treatments are incentivized using a multiple price list, with an explanation available to decision makers.

In addition to the main experiment, we extend the design to determine how gender and race/ethnicity group may affect the expertise premium. This experiment uses the same original task, with the same ball draws for first-phase participants and decision makers. We use the same first-phase participants, whose race/ethnicity group and gender we have collected through Prolific. We will not ask them to provide these in the experiment; instead, we will export these characteristics from their Prolific profiles.

We use only the medium difficulty level of the task, and the six manipulations described above are not included. We also run only the Objective Accuracy treatment. In this additional experiment, our manipulations are whether the gender and race/ethnicity group of the first-phase participant are the same as the decision maker's. We examine the effects of race/ethnicity group and gender on the expertise premium separately. For gender, our categories are men and women. For race/ethnicity group, our categories are White (non-Hispanic), Black (non-Hispanic), Asian (non-Hispanic), and Hispanic individuals; we chose these groups because together they account for roughly 96% of the United States population according to the United States Census. For gender, the manipulation is whether the gender of the first-phase participant is the same as the decision maker's gender. For race/ethnicity group, the manipulation is likewise whether the race/ethnicity group of the first-phase participant is the same as the decision maker's. As there are more than two race/ethnicity groups, we do not compare all possible combinations; in the manipulation where the groups are not the same, one of the other three groups is chosen at random.

The design of this experiment is the same as that of the main experiment. First, decision makers complete 10 repetitions of the original, medium difficulty task. We then ask them their beliefs about their own performance and reveal their actual accuracy over the 10 repetitions. They then answer the necessary accuracy questions. The only difference is that, on the necessary accuracy question screen, we include the specific demographic characteristic in an excerpt about the first-phase participants. The same excerpt appears in the main experiment, but without the characteristic of interest.
Experimental Design Details
Not available
Randomization Method
Randomization done in office by computer
Randomization Unit
Individual
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
1380 individuals
Sample size: planned number of observations
1380 individuals
Sample size (or number of clusters) by treatment arms
20 adults per first-phase treatment, 100 adults per main experiment treatment, 80 adults per demographic treatment
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Power calculations were completed for each of the intended statistical tests, and our sample size is 100 per treatment for the main experiment. The minimum necessary sample size was 87, but we rounded up to 100 to be conservative. All sample size calculations were performed for a power of 0.80 and a significance level of 0.05. Our additional experiment uses the same power of 0.80 and significance level of 0.05; its necessary sample size is 80 participants. Our first-phase data will not be used for any statistical analysis, so we do not calculate necessary sample sizes for it; we collect 20 participants per first-phase treatment.
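As a hedged illustration of one way such a calculation could be set up (the assumed effect size below is hypothetical, not the value used in the registered calculation): compute the t-test sample size for a standardized effect and inflate it by the Wilcoxon signed-rank test's asymptotic relative efficiency of roughly 0.955.

    from statsmodels.stats.power import TTestPower

    # Hypothetical standardized effect size (Cohen's d); alpha and power match
    # the registered values of 0.05 and 0.80.
    effect_size = 0.30
    n_ttest = TTestPower().solve_power(effect_size=effect_size, alpha=0.05,
                                       power=0.80, alternative="larger")
    n_wilcoxon = n_ttest / 0.955  # ARE adjustment for the signed-rank test
    print(round(n_ttest, 1), round(n_wilcoxon, 1))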
IRB

Institutional Review Boards (IRBs)

IRB Name
The Ohio State University Institutional Review Board
IRB Approval Date
2025-03-05
IRB Approval Number
2025E0339
Analysis Plan

There is information in this trial unavailable to the public.