Gender Gaps at the Top: Exam Performance and Choking Under Pressure

Last registered on August 29, 2022


Trial Information

General Information

Gender Gaps at the Top: Exam Performance and Choking Under Pressure
Initial registration date
August 29, 2022


First published
August 29, 2022, 5:14 PM EDT



There is information in this trial unavailable to the public.

Primary Investigator

Norwegian School of Economics

Other Primary Investigator(s)

PI Affiliation

Additional Trial Information

In development
Start date
End date
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Being admitted to top-quality education is highly consequential for long-term outcomes. In many countries around the world, admission to top schools is determined by performance in high-stakes exams. Yet a growing body of evidence shows that men outperform women in these important exams. To date, there is scarce evidence as to how and why gender differences in performance arise in high-pressure, high-stakes situations. We design an experiment to explore female exam underperformance when the pressure goes up in a controlled environment. In the experiment, we give Prolific workers a series of two short Raven's-type tests containing easy, middle-difficulty and hard questions. After participants complete the baseline test (Test 1), we introduce treatments in Test 2 that vary the level of pressure faced by participants in two ways: (1) we introduce a cutoff that determines whether participants receive a bonus payment, and (2) we increase the monetary payment per correct answer. We measure whether the treatments induce underperformance relative to a control group, and whether there are gender differences in underperformance. Lastly, we assess the role of an intervention providing participants with the full test structure and question difficulty. We plan to collect the data for this experiment starting in August 2022.
External Link(s)

Registration Citation

Franco, Catalina and Ingvild Lindgren Skarpeid. 2022. "Gender Gaps at the Top: Exam Performance and Choking Under Pressure." AEA RCT Registry. August 29.
Sponsors & Partners

Experimental Details


The main treatment variations of this study seek to increase the pressure that participants face in two ways. Relative to a pure control group, treated participants face the introduction of a cutoff in Test 2. To increase pressure even further, participants in another treatment arm face higher stakes through a higher monetary payment per correct answer in addition to the cutoff. We expect these treatment variations to generate substantial gender differences in performance. In a further treatment arm, we test whether knowing the level of difficulty of each question could mitigate the gender gap in performance.
Intervention Start Date
Intervention End Date

Primary Outcomes

Primary Outcomes (end points)
Fraction missing the cutoff, difference in score from Test 1 to Test 2, number of omitted questions. All specifications are OLS, regressing the outcome variable on treatment dummies, a gender dummy and the interacted treatment and gender dummies. The alternative specifications run the same OLS regression as the main specification, including a set of background controls.
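The main specification described above can be written out as a single equation. The notation is an assumption made here for illustration: the registration says only "a gender dummy", so coding it as a female indicator $F_i$ is a guess, and the three treatment dummies $T_{ik}$ assume the pure control group is the omitted category among the four conditions.

```latex
% Y_i    : outcome for participant i (missed cutoff, score change, or omitted questions)
% T_{ik} : 1 if participant i is in treatment arm k (control omitted; assumed)
% F_i    : 1 if participant i is female (assumed coding of the gender dummy)
Y_i = \alpha
    + \sum_{k=1}^{3} \beta_k \, T_{ik}
    + \gamma \, F_i
    + \sum_{k=1}^{3} \delta_k \, \left( T_{ik} \times F_i \right)
    + \varepsilon_i
```

Under this coding, $\beta_k$ is the effect of arm $k$ for the reference gender and $\delta_k$ is the gender difference in that effect. The alternative specification would add a vector of background controls, for example $X_i'\theta$, to the right-hand side.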
Primary Outcomes (explanation)
The "missed cutoff" variable equals 1 if the participant gave fewer than 5 correct responses, and zero otherwise. The change in test score from Test 1 to Test 2 is constructed as the simple difference between the score in Test 2 and the score in Test 1.
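As a minimal sketch of how these outcome variables could be constructed, the Python below encodes the two definitions. The record type and field names are hypothetical, and applying the cutoff to the Test 2 score is an assumption (the registration introduces the cutoff in Test 2 but does not state which score the variable uses).

```python
from dataclasses import dataclass

CUTOFF = 5  # cutoff of 5 correct answers, as stated in the registration


@dataclass
class Participant:
    """Hypothetical record; field names are illustrative, not the study's data."""
    score_test1: int  # number of correct answers in Test 1
    score_test2: int  # number of correct answers in Test 2


def missed_cutoff(p: Participant) -> int:
    """1 if the Test 2 score is below the cutoff, else 0 (Test 2 is assumed)."""
    return 1 if p.score_test2 < CUTOFF else 0


def score_change(p: Participant) -> int:
    """Simple difference: Test 2 score minus Test 1 score."""
    return p.score_test2 - p.score_test1


def fraction_missing_cutoff(sample: list[Participant]) -> float:
    """Share of participants in the sample who missed the cutoff."""
    return sum(missed_cutoff(p) for p in sample) / len(sample)
```

The fraction missing the cutoff is then just the mean of the indicator within a treatment-by-gender cell.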

Secondary Outcomes

Secondary Outcomes (end points)
Secondary Outcomes (explanation)

Experimental Design

Experimental Design
Participants complete two timed tests of 4 minutes each, each containing 7 Raven-style IQ questions. They are informed that one of the tests will be drawn for payout, and that they should do their best on both tests to maximize their payoff. Both tests are constructed such that questions follow a set order of difficulty (unknown to participants except in the Difficulty treatment). The first test is paid with piece-rate incentives; in the second test, participants are paid according to the incentives of their assigned treatment.

Participants can choose to respond to the questions as they appear or move questions to the end of the test. After responding to each question, they are given up to 20 seconds to answer a set of questions regarding their experience solving the question: How difficult they found it, whether they are certain that they got it right, and if they completely guessed when answering it.

In each of the tests, after the four minutes run out or after all seven questions are answered, participants complete an incentivized measure of their beliefs about the number of correct answers. They also report how much effort they exerted when answering the questions. After completing the two tests, participants answer an end questionnaire that measures neuroticism, anxiety, stress and motivation, as well as how familiar they are with these types of exercises. We also collect background information, e.g. on gender, age, geographical location, household income and educational attainment.

Participants in the pure control condition are presented with piece-rate incentives equivalent to 0.20 GBP per correct answer in both Test 1 and Test 2.

Low incentives with cutoff
We increase the pressure on participants by introducing a cutoff: participants who score fewer than the set cutoff of 5 correct answers earn zero; otherwise they earn 0.20 GBP per correct answer.

High incentives with cutoff
Participants face the same cutoff, and pressure is increased further by raising the monetary stakes. Instead of earning 0.20 GBP per correct answer, they earn 5 GBP per correct answer if their score is equal to or above the cutoff.

Difficulty treatment
In this treatment we test whether knowing the level of difficulty of each question could mitigate the gender gap in performance under pressure. We keep the incentives and cutoff as in High incentives with cutoff, and show participants the level of difficulty of the question they are currently attempting, as well as a diagram with the overview of the difficulty level of all the questions in Test 2.
Experimental Design Details
Not available
Randomization Method
Randomization is done in Qualtrics using two separate randomization procedures, one for each gender, in which participants are randomly assigned to one of the four treatment conditions.
Randomization Unit
Since the gender dimension is of particular importance to us, randomization is done at the individual level, separately for males and females, ensuring that we have equal proportions of males and females in each treatment group.
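The idea of randomizing separately by gender can be illustrated with a short sketch. This is not the Qualtrics implementation: Qualtrics assigns participants as they arrive, whereas the code below assumes a known roster, and the arm labels, function names and seed are hypothetical.

```python
import random

# Hypothetical labels for the four treatment conditions.
ARMS = ["control", "low_cutoff", "high_cutoff", "difficulty"]


def assign_within_gender(participant_ids, rng):
    """Shuffle one gender's participants, then deal them out across arms."""
    ids = list(participant_ids)
    rng.shuffle(ids)
    # Round-robin after shuffling keeps arm sizes within one of each other.
    return {pid: ARMS[i % len(ARMS)] for i, pid in enumerate(ids)}


def stratified_assignment(ids_by_gender, seed=0):
    """Run an independent randomization for each gender stratum."""
    rng = random.Random(seed)
    assignment = {}
    for gender in sorted(ids_by_gender):
        assignment.update(assign_within_gender(ids_by_gender[gender], rng))
    return assignment
```

Because each gender is dealt out across the arms on its own, every arm ends up with (almost) the same number of males and the same number of females, which is the balance property the registration describes.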
Was the treatment clustered?

Experiment Characteristics

Sample size: planned number of clusters
Sample size: planned number of observations
Approximately 3,000 participants (1,500 males and 1,500 females), budget permitting (see below)
Sample size (or number of clusters) by treatment arms
Approximately 750 in each of the four treatment conditions. For the Difficulty treatment, we will pause data collection after 200 respondents and estimate how many participants we can afford. The final number of participants will not depend on the detected effect size but on the budget.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Meaningful detectable differences
Previous literature has found that in high-stakes examinations, small margins contribute to large effects. In a study in Finland, a gender difference of one omitted question on average contributed to the gender gap in admissions (Pekkarinen 2015). Our experiment has a very limited number of questions compared with a real high-stakes examination. Any meaningful treatment differences in our experiment will therefore be small in omitted questions, cutoff levels and performance scores.

Sample size
In order to have 0.8 power to detect a 0.2 difference-in-differences, by gender and treatment, in cutoff levels between treatments, we need 771 participants in each treatment group.
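For orientation, a generic normal-approximation sample-size formula for a difference-in-differences across four equally sized cells (two genders by two arms) can be sketched as follows. The registration does not state the significance level, the outcome variance, or the units of the 0.2 effect, so the two-sided 5% level and the `sd` argument below are assumptions, and this sketch is not claimed to reproduce the figure of 771.

```python
from math import ceil
from statistics import NormalDist


def n_per_cell_for_did(delta, sd, alpha=0.05, power=0.8):
    """Observations needed in each of the four gender-by-arm cells.

    The difference-in-differences of four independent cell means, each
    based on n observations with standard deviation sd, has variance
    4 * sd**2 / n. Solving (z_alpha + z_power)**2 * 4 * sd**2 / n = delta**2
    for n gives the expression below.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided test (assumed)
    z_power = z.inv_cdf(power)
    return ceil(4 * sd**2 * (z_alpha + z_power) ** 2 / delta**2)
```

Calling `n_per_cell_for_did(0.2, sd)` with a chosen standard deviation shows how strongly the required sample depends on the assumed outcome variance: halving `sd` cuts the requirement to roughly a quarter.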

Institutional Review Boards (IRBs)

IRB Name
IRB Approval Date
IRB Approval Number
Approved under the framework agreement NHH-IRB 31/21
Analysis Plan
