Testing Identifying Assumptions in Examiner IV Designs: A Vignette Experiment with Polish Prosecutors

Last registered on June 29, 2026

Trial Information

General Information

Title
Testing Identifying Assumptions in Examiner IV Designs: A Vignette Experiment with Polish Prosecutors
RCT ID
AEARCTR-0018783
Initial registration date
June 17, 2026

First published
June 29, 2026, 8:24 AM EDT

Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
BI Norwegian Business School

Other Primary Investigator(s)

PI Affiliation
Charles University
PI Affiliation
University of Cambridge

Additional Trial Information

Status
In development
Start date
2026-06-18
End date
2026-07-15
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
Many empirical studies in economics, criminology, and public policy use the random assignment of cases to decision-makers—judges, prosecutors, bail magistrates, disability examiners, patent reviewers, and caseworkers—to identify the causal effects of their decisions. These “examiner IV” designs rely on a monotonicity assumption: examiners who are stricter on average must be weakly stricter for every individual case. Yet this assumption is essentially untestable in administrative data, because the same case is rarely decided by multiple examiners.

This study conducts a vignette experiment with up to 250 Polish public prosecutors, each of whom independently recommends a sentence, including both type and severity, for the same fixed set of criminal cases. Observing many prosecutors’ decisions on identical cases allows us to measure how often monotonicity fails when decisions are made independently, without the panel-deliberation dynamics that complicate prior evidence from judicial panels. It also allows us to compute directly the case-level weights that determine whether the standard practice of controlling for non-focal decision propensities yields a well-defined estimand when examiners make multidimensional choices.

Prosecutors are randomized into four groups that vary in the homogeneity of their case mix, ranging from all-shoplifting cases to a diverse set of crime types. This design allows us to test whether monotonicity violations are driven primarily by cross-crime heterogeneity in prosecutorial preferences or by within-crime disagreement.
External Link(s)

Registration Citation

Citation
Sigstad, Henrik, Andrzej Uhl and Michal Šoltés. 2026. "Testing Identifying Assumptions in Examiner IV Designs: A Vignette Experiment with Polish Prosecutors." AEA RCT Registry. June 29. https://doi.org/10.1257/rct.18783-1.0
Sponsors & Partners

Sponsors

Experimental Details

Interventions

Intervention(s)
This study is a vignette measurement experiment. There is no behavioral intervention. Participating prosecutors complete a short online survey hosted on Qualtrics in which they read four written criminal-case vignettes and, for each case, recommend a sentence type (unsuspended imprisonment, suspended prison sentence, restriction of liberty, or fine) and the corresponding severity (e.g., months of imprisonment, months of restriction of liberty, or the number and amount of daily fine rates). For suspended prison sentences, the survey additionally elicits the trial period and whether probation supervision is imposed.

The primary purpose of the study is to measure how often prosecutors, deciding identical cases independently, disagree in a way that is inconsistent with a single stringency ranking (the pairwise crossing rate underlying the monotonicity assumption used in examiner-IV designs).

Because each prosecutor has time for only four vignettes, the four case-mix conditions in the analysis sample cannot all be presented to every prosecutor. Prosecutors are therefore randomly allocated by the Qualtrics built-in randomizer at survey entry across these conditions so that prosecutor characteristics (stringency, experience, office level, district) are balanced across conditions in expectation. The random allocation serves as a balance device, not as a behavioral treatment. The first 40 entrants are routed to a separate condition (Arm 1) reserved for projects outside this study and not used here. Subsequent entrants are allocated with equal probability across four analysis conditions, which differ in case-mix homogeneity:

Arm 2: four shoplifting cases (homogeneous)
Arm 3: four property-crime cases (shoplifting, embezzlement, burglary)
Arm 4: a mixed set (DUI, burglary, narcotics, shoplifting)
Arm 5: four diverse crime types (fraud, narcotics, driving ban, shoplifting)

One shoplifting vignette appears as a common anchor in all four analysis conditions. Within each condition, all prosecutors face the same four vignettes in the same order. The survey also collects gender, years of experience, office level, and district. Prosecutors decide independently and receive no feedback or monetary incentive tied to responses. At the end of the survey, participants may optionally provide an email address to be contacted for a planned later wave; the present registration covers the first wave only.
Intervention Start Date
2026-06-18
Intervention End Date
2026-07-15

Primary Outcomes

Primary Outcomes (end points)
The primary endpoint is the ranking-free pairwise crossing rate of prosecutor incarceration decisions, computed for two outcome definitions: (i) a binary incarceration indicator (1 if the prosecutor recommends unsuspended imprisonment for the case, 0 otherwise) and (ii) imprisonment length in months (months of unsuspended imprisonment, with non-imprisonment sentences coded as zero).
Primary Outcomes (explanation)
For each pair of prosecutors (i,j) within an arm, let w_ij be the number of cases in which i is strictly stricter than j and w_ji the reverse, where "stricter" means that one prosecutor incarcerates while the other does not (binary outcome) or recommends strictly more months of imprisonment (continuous outcome). The pairwise violation count for the pair is min(w_ij, w_ji), the number of "crossings" in their relative stringency. The pairwise violation rate (the pairwise crossing rate named in the primary endpoint) is the sum of min(w_ij, w_ji) over all pairs with at least two disagreements, divided by the sum of (w_ij + w_ji) over the same pairs. Under perfect monotonicity this rate is exactly zero, so any strictly positive observed rate rejects the null of monotonicity in favor of H1 (monotonicity is violated). The measure is computed separately for each of Arms 2–5 and pooled across them, for both outcome definitions, and is reported with BCa 95% bootstrap confidence intervals obtained by resampling prosecutors with replacement.
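As an illustration, the pairwise violation rate defined above can be computed along the following lines. This is a minimal Python sketch, not the registered analysis code: the function name pairwise_violation_rate and the toy decision matrix are assumptions made for the example, with one row per prosecutor and one column per case (same case order for everyone).

```python
from itertools import combinations


def pairwise_violation_rate(decisions):
    """Pairwise violation rate for one arm.

    decisions: list of per-prosecutor lists, one outcome per case
    (0/1 incarceration indicator or months of imprisonment).
    """
    numerator = 0    # sum of min(w_ij, w_ji) over retained pairs
    denominator = 0  # sum of (w_ij + w_ji) over retained pairs
    for a, b in combinations(decisions, 2):
        w_ab = sum(x > y for x, y in zip(a, b))  # cases where a is strictly stricter
        w_ba = sum(y > x for x, y in zip(a, b))  # cases where b is strictly stricter
        if w_ab + w_ba >= 2:  # keep only pairs with at least two disagreements
            numerator += min(w_ab, w_ba)
            denominator += w_ab + w_ba
    return numerator / denominator if denominator else float("nan")


# Hypothetical toy data: four prosecutors, four cases, binary outcome.
toy = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
toy_rate = pairwise_violation_rate(toy)

# A perfectly monotone set of prosecutors gives a rate of zero.
monotone = [[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
monotone_rate = pairwise_violation_rate(monotone)
```

In the toy data only the first two prosecutors cross (one crossing out of two disagreements), two other pairs disagree twice in the same direction, and the remaining pairs disagree only once and are dropped, so the rate is 1/6.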

Secondary Outcomes

Secondary Outcomes (end points)
(1) Case-level Imbens–Angrist monotonicity violation rate.
(2) Average monotonicity violation rate and the implied 2SLS bias under a pre-specified severe-heterogeneity DGP.
(3) Sum of negative 2SLS weights across cases.
(4) Case-level own-weights and absolute cross-weights from the multi-treatment 2SLS decomposition at three pre-specified partitions of the sentence-type space (two, three, and four categories).
(5) Alternative monotonicity violation rates corresponding to non-LATE estimands: extreme-pair monotonicity (LATE between extreme prosecutors), lenient-prosecutor monotonicity (LATT), and stringent-prosecutor monotonicity (LATUT).
(6) Pooled-arm heterogeneity gap T = VR(4+5) − VR(2+3), comparing the pooled heterogeneous arms (4 and 5) to the pooled homogeneous arms (2 and 3), and the per-arm gradient VR_g across g ∈ {2,3,4,5}.
(7) Subsampling pattern: violation rates by simulated group size m ∈ {2,3,5,9,15,N_g} and tercile of within-subsample stringency standard deviation.
Secondary Outcomes (explanation)
Throughout, D_ik denotes prosecutor i's decision on case k (the binary incarceration indicator or the imprisonment length in months) and p_i denotes prosecutor i's stringency. Imbens–Angrist (IA) monotonicity at case k holds if no prosecutor i with stringency p_i below another prosecutor j's stringency p_j has D_ik > D_jk; we report the case-level violation rate as the share of ordered triples (i,j,k) violating this condition.

Average monotonicity at case k holds if Cov(p_i, D_ik) ≥ 0; we report the share of cases violating this and the sum over violating cases of |w_k|, where w_k = Var(p_i)^{-1} Cov(p_i, D_ik) is the 2SLS weight on case k. The implied 2SLS bias is computed under a severe-heterogeneity DGP in which defier cases (w_k < 0) have treatment effects twice the complier-case effect, yielding bias = −2|W−|/(W+ − |W−|) times the complier effect, with W+ = sum of non-negative w_k and |W−| = sum of |w_k| over cases with w_k < 0.

Multi-treatment own- and cross-weights for focal treatment t and non-focal t' are w_k^{t,own} = Cov(P̃_{t,i}, D_{t,ik})/Var(P̃_{t}) and w_k^{t,cross(t')} = Cov(P̃_{t',i}, D_{t,ik})/Var(P̃_{t'}), where P̃_{t,i} is prosecutor i's residualized propensity for treatment t. The alternative monotonicity conditions follow standard non-LATE estimand definitions. The pooled-arm gap T sums pair-level numerators and denominators of the pairwise violation rate within each arm; cross-arm pairs share no cases and are excluded.

Stringency is computed as each prosecutor's mean incarceration rate (binary) or mean imprisonment length (continuous) over their four cases, with ties broken by small random noise. All measures are reported with BCa 95% bootstrap confidence intervals (1,000 draws, resampling prosecutors). Finite-sample bias from K = 4 cases per prosecutor is characterized via a pre-specified pair-level adapted bootstrap and a calibrated probit simulation; the reported binary pairwise violation rate is a lower bound on the population rate.
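The case-level 2SLS weights w_k = Var(p_i)^{-1} Cov(p_i, D_ik) and the sum of negative weights can be sketched as follows. This is a hedged Python illustration under simplifying assumptions: the helper names (mean, covariance, case_weights) and the toy data are hypothetical, stringency is taken as each prosecutor's mean decision over their own cases as described above, and the tie-breaking noise and bootstrap are omitted because the covariance does not depend on how ties are ranked.

```python
def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance; the 1/n factor cancels in the weight ratio."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def case_weights(decisions):
    """Return w_k = Cov(p_i, D_ik) / Var(p_i) for each case k.

    decisions: list of per-prosecutor lists, one decision per case.
    """
    stringency = [mean(row) for row in decisions]  # p_i
    var_p = covariance(stringency, stringency)
    n_cases = len(decisions[0])
    return [
        covariance(stringency, [row[k] for row in decisions]) / var_p
        for k in range(n_cases)
    ]


# Hypothetical toy data: the most lenient prosecutor is the only one
# who incarcerates in the last case, so that case gets a negative weight.
toy = [
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]
weights = case_weights(toy)
negative_weight_sum = sum(-w for w in weights if w < 0)          # |W-|
violating_share = sum(w < 0 for w in weights) / len(weights)     # share of cases with Cov < 0
```

Because stringency is the mean of the same decisions, the weights in this sketch sum to the number of cases per prosecutor (here 4), which is a convenient sanity check.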

Experimental Design

Experimental Design
The study is a vignette measurement experiment with up to 250 Polish public prosecutors. The unit of random allocation is the individual prosecutor. There is no behavioral intervention and no individual-level treatment effect is estimated: the primary estimand is a within-condition measure of prosecutor disagreement (the pairwise crossing rate), and the secondary estimand is a comparison of this measure across case-mix conditions. The AEA form's treatment-vs-control frame therefore does not map onto the primary analysis. Random allocation is used because each prosecutor has time for only four vignettes, so the four case-mix conditions cannot all be presented to every prosecutor; the allocation balances prosecutor characteristics across conditions but does not itself constitute a behavioral treatment.

At survey entry each prosecutor is placed in one of five arms (Arms 1–5). The first 40 entrants are routed to Arm 1, which is reserved for projects outside this study and is not analyzed here. Subsequent entrants are allocated by the Qualtrics built-in randomizer across Arms 2–5 (~52 prosecutors each, ~208 in total), which constitute the analysis sample for this registration and differ in case-mix homogeneity:

Arm 2: four shoplifting cases (homogeneous)
Arm 3: four property-crime cases (shoplifting, embezzlement, burglary; somewhat homogeneous)
Arm 4: a mixed set (DUI, burglary, narcotics, shoplifting; heterogeneous)
Arm 5: four diverse crime types (fraud, narcotics, driving ban, shoplifting; most heterogeneous)

One shoplifting vignette appears as a common anchor in all four of Arms 2–5. Within each arm, all prosecutors face the same four vignettes in the same order; there is no within-arm randomization.

For the primary within-arm monotonicity test all four analysis arms are symmetric: no arm functions as treatment or control. For the secondary cross-arm contrast (the heterogeneity gap T = VR(4+5) − VR(2+3) and the per-arm gradient), Arms 4+5 (heterogeneous case mix) play the role of treatment and Arms 2+3 (homogeneous case mix) the role of control.

For each vignette the prosecutor recommends a sentence type (unsuspended imprisonment, suspended prison sentence, restriction of liberty, fine) and the corresponding severity. The survey collects basic background characteristics. Prosecutors decide independently and do not interact.
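The heterogeneity gap T = VR(4+5) − VR(2+3) pools pair-level numerators and denominators within each arm before taking the ratio. The Python sketch below shows one way this pooling could be done; it is an illustration only, and the function names (pair_sums, pooled_rate, heterogeneity_gap) and the toy arms are assumptions, not the registered code.

```python
from itertools import combinations


def pair_sums(decisions):
    """Numerator and denominator of the pairwise violation rate for one arm."""
    num = den = 0
    for a, b in combinations(decisions, 2):
        w_ab = sum(x > y for x, y in zip(a, b))
        w_ba = sum(y > x for x, y in zip(a, b))
        if w_ab + w_ba >= 2:  # pairs with at least two disagreements
            num += min(w_ab, w_ba)
            den += w_ab + w_ba
    return num, den


def pooled_rate(arms):
    """Pool within-arm pairs across several arms; no cross-arm pairs are formed."""
    totals = [pair_sums(arm) for arm in arms]
    num = sum(n for n, _ in totals)
    den = sum(d for _, d in totals)
    return num / den if den else float("nan")


def heterogeneity_gap(arm2, arm3, arm4, arm5):
    """T = VR(4+5) - VR(2+3)."""
    return pooled_rate([arm4, arm5]) - pooled_rate([arm2, arm3])


# Hypothetical toy arms (three prosecutors each, four cases each).
arm2 = [[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
arm3 = [[0, 0, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
arm4 = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 0]]
arm5 = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
gap = heterogeneity_gap(arm2, arm3, arm4, arm5)
```

In this toy example the homogeneous arms contain no crossings (0 out of 16 disagreements) while the heterogeneous arms contain 3 crossings out of 14 disagreements, so T = 3/14.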
Experimental Design Details
Not available
Randomization Method
Computer-generated random allocation within Qualtrics at survey entry, using the Qualtrics built-in randomizer. The allocation serves as a balance device to distribute prosecutor characteristics across the four case-mix conditions, since each prosecutor has time for only four vignettes and cannot see every condition; it does not assign a behavioral treatment. The first 40 entrants are routed deterministically to Arm 1 (reserved for projects outside this study). Each subsequent entrant is allocated with equal probability to one of Arms 2–5 until each analysis arm reaches its target of ~52 prosecutors.
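The allocation rule can be mimicked in a short simulation. The Python sketch below is not the Qualtrics randomizer itself: the function allocate, its parameters, the fixed seed, and the assumption that a full arm simply drops out of the draw are all hypothetical choices made for illustration. The source gives the per-arm target only approximately (~52), and what happens to entrants arriving after all targets are met is not specified, so the sketch marks them as unassigned.

```python
import random


def allocate(n_entrants, reserved=40, target=52, seed=0):
    """Simulate arm assignment by order of survey entry.

    The first `reserved` entrants go to Arm 1 deterministically. Later
    entrants are drawn with equal probability from the analysis arms
    (2-5) that have not yet reached `target` (assumed quota behaviour).
    """
    rng = random.Random(seed)
    counts = {arm: 0 for arm in (2, 3, 4, 5)}
    assignments = []
    for position in range(n_entrants):
        if position < reserved:
            assignments.append(1)
            continue
        open_arms = [arm for arm, n in counts.items() if n < target]
        if not open_arms:
            assignments.append(None)  # all targets met; not specified in the source
            continue
        arm = rng.choice(open_arms)
        counts[arm] += 1
        assignments.append(arm)
    return assignments


# 40 reserved entrants plus 4 x 52 analysis entrants.
assignments = allocate(248)
arm_counts = {arm: assignments.count(arm) for arm in (1, 2, 3, 4, 5)}
```

With 248 entrants every analysis arm ends at exactly 52 regardless of the seed, because each full arm is removed from the draw until all four targets are met.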
Randomization Unit
Individual prosecutor.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
Up to 250 prosecutors (no clustering; each prosecutor is their own cluster).
Sample size: planned number of observations
Up to 250 prosecutors providing 4 vignette decisions each (up to 1,000 prosecutor-vignette observations). The monotonicity analysis uses Arms 2–5 (~208 prosecutors, ~832 prosecutor-vignette observations).
Sample size (or number of clusters) by treatment arms
Caveat: this study has no behavioral treatment or control in the standard RCT sense. The figures below describe random allocation across four case-mix conditions used to measure within-condition prosecutor disagreement.

Arm 1 (reserved for projects outside this study; not analyzed in the monotonicity test): 40 prosecutors
Arm 2, four shoplifting cases (homogeneous case mix; cross-arm control role): ~52 prosecutors
Arm 3, four property-crime cases (homogeneous case mix; cross-arm control role): ~52 prosecutors
Arm 4, mixed crime types (heterogeneous case mix; cross-arm treatment role): ~52 prosecutors
Arm 5, four diverse crime types (heterogeneous case mix; cross-arm treatment role): ~52 prosecutors
Total: up to 250 prosecutors

For the within-arm primary monotonicity test all four analysis arms (Arms 2–5) are symmetric and no arm functions as treatment or control; for the secondary heterogeneity-gradient test, Arms 2+3 pooled serve as the homogeneous (control) arm and Arms 4+5 pooled as the heterogeneous (treatment) arm.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
H1 (monotonicity violated) is trivially powered: at N = 52 prosecutors per arm and K = 4 cases, the probability of observing at least one pairwise crossing exceeds 0.99 for any true pairwise violation rate of 10% or larger (and is essentially 1 at 15% or 20%). The binding precision constraint is the heterogeneity gap T = VR(4+5) − VR(2+3) (Q5). Simulations from a latent probit model calibrated to prosecutor stringency dispersion of σ_α = 0.23 and pairwise violation rates of 10%, 15%, and 20% yield an expected 95% BCa CI half-width on T of approximately 4.0 percentage points for the binary incarceration outcome and 5.1–5.7 percentage points for the continuous imprisonment-length outcome, at the planned ~52 prosecutors per arm. The half-width scales as 1/√N; increasing to 75 per arm tightens it by ~15%, to 100 per arm by ~30%. Full details are in the attached power-simulations note.
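The 1/√N scaling stated above can be checked with a few lines of arithmetic. The Python sketch below takes the 4.0 percentage-point half-width for the binary outcome at 52 prosecutors per arm as given and rescales it; the function name scaled_half_width is an assumption for the example, and the calculation relies only on the stated scaling rule, not on the underlying simulations.

```python
import math


def scaled_half_width(half_width, n_from, n_to):
    """Rescale a CI half-width from sample size n_from to n_to under 1/sqrt(N)."""
    return half_width * math.sqrt(n_from / n_to)


base = 4.0  # percentage points, binary outcome, ~52 prosecutors per arm (from the text)
hw_75 = scaled_half_width(base, 52, 75)
hw_100 = scaled_half_width(base, 52, 100)

# Proportional tightening relative to the planned design.
reduction_75 = 1 - hw_75 / base
reduction_100 = 1 - hw_100 / base
```

Under exact 1/√N scaling the reductions come out at about 17% and 28%, close to the rounded ~15% and ~30% quoted above.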
IRB

Institutional Review Boards (IRBs)

IRB Name
IRB Approval Date
IRB Approval Number
Analysis Plan

There is information in this trial unavailable to the public.