Testing Identifying Assumptions in Examiner IV Designs: A Vignette Experiment with Polish Prosecutors

Last registered on July 03, 2026

Pre-Trial

Trial Information

General Information

Title

Testing Identifying Assumptions in Examiner IV Designs: A Vignette Experiment with Polish Prosecutors

RCT ID

AEARCTR-0018783

Initial registration date

June 17, 2026

First published

June 29, 2026, 8:24 AM EDT

Last updated

July 03, 2026, 4:02 AM EDT

Locations

Country

Poland

Region

Primary Investigator

Name

Henrik Sigstad

Affiliation

BI Norwegian Business School

Other Primary Investigator(s)

PI Name

Michal Šoltés

PI Affiliation

Charles University

PI Name

Andrzej Uhl

PI Affiliation

University of Cambridge

Additional Trial Information

Status

In development

Start date

2026-06-18

End date

2026-07-15

Keywords

Crime, Violence, & Conflict, Governance

Additional Keywords

prosecutor

JEL code(s)

C26, C93, K14, K42

Secondary IDs

Prior work

This trial does not extend or rely on any prior RCTs.

Abstract

Many empirical studies in economics, criminology, and public policy use the random assignment of cases to decision-makers—judges, prosecutors, bail magistrates, disability examiners, patent reviewers, and caseworkers—to identify the causal effects of their decisions. These “examiner IV” designs rely on a monotonicity assumption: examiners who are stricter on average must be weakly stricter for every individual case. Yet this assumption is essentially untestable in administrative data, because the same case is rarely decided by multiple examiners.

This study conducts a vignette experiment with up to 250 Polish public prosecutors, each of whom independently recommends a sentence, including both type and severity, for the same fixed set of criminal cases. Observing many prosecutors’ decisions on identical cases allows us to measure how often monotonicity fails when decisions are made independently, without the panel-deliberation dynamics that complicate prior evidence from judicial panels. It also allows us to compute directly the case-level weights that determine whether the standard practice of controlling for non-focal decision propensities yields a well-defined estimand when examiners make multidimensional choices.

Prosecutors are randomized into four groups that vary in the homogeneity of their case mix, ranging from all-shoplifting cases to a diverse set of crime types. This design allows us to test whether monotonicity violations are driven primarily by cross-crime heterogeneity in prosecutorial preferences or by within-crime disagreement.

External Link(s)

Registration Citation

Citation

Sigstad, Henrik, Andrzej Uhl and Michal Šoltés. 2026. "Testing Identifying Assumptions in Examiner IV Designs: A Vignette Experiment with Polish Prosecutors." AEA RCT Registry. July 03. https://doi.org/10.1257/rct.18783-1.1

Sponsors & Partners

Interventions

Intervention(s)

This study is a vignette measurement experiment with no behavioral intervention. Participating prosecutors complete a short online survey hosted on Qualtrics in which they read four written criminal-case vignettes and, for each case, recommend a sentence type (unsuspended imprisonment, suspended prison sentence, restriction of liberty, or fine) and the corresponding severity (e.g., months of imprisonment, months of restriction of liberty, or the number and amount of daily fine rates). Suspended prison sentences additionally elicit the trial period and whether probation supervision is imposed.

The primary purpose of the study is to measure how often prosecutors, deciding identical cases independently, disagree in a way that is inconsistent with a single stringency ranking (the pairwise crossing rate underlying the monotonicity assumption used in examiner-IV designs).

Because each prosecutor has time for only four vignettes, the four case-mix conditions in the analysis sample cannot all be presented to every prosecutor. Prosecutors are therefore randomly allocated across these conditions by the Qualtrics built-in randomizer at survey entry, so that prosecutor characteristics (stringency, experience, office level, district) are balanced across conditions in expectation. The random allocation serves as a balance device, not as a behavioral treatment. The first 40 entrants are routed to a separate condition (Arm 1) reserved for projects outside this study and not used here. Subsequent entrants are allocated with equal probability across four analysis conditions, which differ in case-mix homogeneity:

Arm 2: four shoplifting cases (homogeneous).
Arm 3: four property-crime cases (shoplifting, embezzlement, burglary).
Arm 4: a mixed set (DUI, burglary, narcotics, shoplifting).
Arm 5: four diverse crime types (fraud, narcotics, driving ban, shoplifting).

One shoplifting vignette appears as a common anchor in all four analysis conditions. Within each condition, all prosecutors face the same four vignettes in the same order. The survey also collects gender, years of experience, office level, and district. Prosecutors decide independently and receive no feedback or monetary incentive tied to responses. At the end of the survey, participants may optionally provide an email to be contacted for a planned later wave; the present registration covers the first wave only.

Intervention (Hidden)

The study is an online vignette experiment with Polish public prosecutors administered via Qualtrics. Up to 250 prosecutors are recruited through Charles University's collaborators in Poland and assigned at survey entry to one of five groups. Group 1 (the first 40 prosecutors) is reserved for other projects within a broader data collection initiative on prosecutorial and judicial decision-making and is not used in the monotonicity analysis. Later entrants are randomly assigned to Groups 2–5 (approximately 52 prosecutors each, ~208 in total), which constitute the analysis sample and differ in the homogeneity of the case mix they face: Group 2 sees four shoplifting cases; Group 3 sees four property-crime cases (shoplifting, embezzlement, burglary); Group 4 sees a mixed set (DUI, burglary, narcotics, shoplifting); Group 5 sees four diverse crime types (fraud, narcotics, driving ban, shoplifting). One shoplifting vignette appears as an anchor in all four of Groups 2–5, and within each group all prosecutors face the same four vignettes in the same order; there is no within-group randomization beyond group assignment.

For each vignette, prosecutors recommend a sentence type from the menu of legally available options (imprisonment without conditional suspension, suspended prison sentence, restriction of liberty, fine) and specify the severity: months of imprisonment for unsuspended and suspended prison, months of restriction of liberty, or the number of daily rates and the rate amount for fines. Suspended prison sentences additionally elicit the trial period and whether probation supervision is imposed. The vignettes are calibrated to generate non-trivial variation in incarceration decisions across prosecutors.

Background characteristics collected include gender, years of experience, prosecutorial office level, and district of the prosecutor's office. Prosecutors decide independently and receive no feedback or monetary incentive tied to responses. An optional email is collected at the end of the survey for a planned later wave; the present registration covers the first wave only.

Intervention Start Date

2026-06-18

Intervention End Date

2026-07-15

Primary Outcomes

Primary Outcomes (end points)

The primary endpoint is the ranking-free pairwise crossing rate of prosecutor incarceration decisions, computed for two outcome definitions: (i) a binary incarceration indicator (1 if the prosecutor recommends unsuspended imprisonment for the case, 0 otherwise) and (ii) imprisonment length in months (months of unsuspended imprisonment, with non-imprisonment sentences coded as zero).

Primary Outcomes (explanation)

For each pair of prosecutors (i,j) within a group, let w_ij be the number of cases in which i is strictly stricter than j and w_ji the reverse, where "stricter" means that one prosecutor incarcerates while the other does not (binary outcome) or recommends strictly more months of imprisonment (continuous outcome). The pairwise violation count for the pair is min(w_ij, w_ji), the number of "crossings" in their relative stringency. The pairwise violation rate is the sum of min(w_ij, w_ji) over all pairs with at least two disagreements (w_ij + w_ji ≥ 2), divided by the sum of (w_ij + w_ji) over the same pairs. Under perfect monotonicity this rate is exactly zero, so any strictly positive observed rate rejects the monotonicity null in favor of H1 (monotonicity violated). The measure is computed separately for each of Groups 2–5 and pooled across them, for both outcome definitions, and is reported with BCa 95% bootstrap confidence intervals obtained by resampling prosecutors with replacement.
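The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the registered analysis code: the function name `pairwise_violation_rate`, the list-of-lists layout (one row per prosecutor, cases in a common order), and the choice to return NaN when no pair has at least two disagreements are all assumptions made for the example.

```python
from itertools import combinations


def pairwise_violation_rate(decisions):
    """Ranking-free pairwise crossing rate within one group.

    decisions: one list per prosecutor, all over the same cases in the same
    order; entries are 0/1 incarceration indicators or months of imprisonment.
    """
    crossings = 0      # sum of min(w_ij, w_ji) over qualifying pairs
    disagreements = 0  # sum of (w_ij + w_ji) over the same pairs
    for a, b in combinations(decisions, 2):
        w_ab = sum(x > y for x, y in zip(a, b))  # cases where a is strictly stricter
        w_ba = sum(y > x for x, y in zip(a, b))  # cases where b is strictly stricter
        if w_ab + w_ba >= 2:  # keep only pairs with at least two disagreements
            crossings += min(w_ab, w_ba)
            disagreements += w_ab + w_ba
    # Assumption of this sketch: the rate is undefined (NaN) with no qualifying pair.
    return crossings / disagreements if disagreements else float("nan")


# Toy data (invented): three prosecutors, four cases, binary incarceration.
toy = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
]
print(pairwise_violation_rate(toy))
```

In the toy data the first two prosecutors cross twice over four disagreements, the second and third cross once over three, and the first and third disagree only once and are dropped, so the rate is 3/7.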

Secondary Outcomes

Secondary Outcomes (end points)

(1) Case-level Imbens–Angrist monotonicity violation rate.
(2) Average monotonicity violation rate and the implied 2SLS bias under a pre-specified severe-heterogeneity DGP.
(3) Sum of negative 2SLS weights across cases.
(4) Case-level own-weights and absolute cross-weights from the multi-treatment 2SLS decomposition at three pre-specified partitions of the sentence-type space (two, three, and four categories).
(5) Alternative monotonicity violation rates corresponding to non-LATE estimands: extreme-pair monotonicity (LATE between extreme prosecutors), lenient-prosecutor monotonicity (LATT), and stringent-prosecutor monotonicity (LATUT).
(6) Pooled-arm heterogeneity gap T = VR(4+5) − VR(2+3) comparing the heterogeneous arm to the homogeneous arm, and the per-group gradient VR_g across g ∈ {2,3,4,5}.
(7) Subsampling pattern: violation rates by simulated group size m ∈ {2,3,5,9,15,N_g} and tercile of within-subsample stringency standard deviation.

Secondary Outcomes (explanation)

Imbens–Angrist (IA) monotonicity at case k holds if no prosecutor i with stringency p_i below another prosecutor j's stringency p_j has D_ik > D_jk; we report the case-level violation rate as the share of (i,j,k) ordered triples violating this. Average monotonicity at case k holds if Cov(p_i, D_ik) ≥ 0; we report the share of cases violating this and the sum over violating cases of |w_k|, where w_k = Var(p_i)^{-1} Cov(p_i, D_ik) is the 2SLS weight on case k. The implied 2SLS bias is computed under a severe-heterogeneity DGP in which defier cases (w_k < 0) have treatment effects twice the complier-case effect, yielding bias = −2|W−|/(W+ − |W−|) times the complier effect, with W+ = sum of non-negative w_k and |W−| = sum of |w_k| over cases with w_k < 0. Multi-treatment own- and cross-weights for focal treatment t and non-focal t' are w_k^{t,own} = Cov(P̃_{t,i}, D_{t,ik})/Var(P̃_{t}) and w_k^{t,cross(t')} = Cov(P̃_{t',i}, D_{t,ik})/Var(P̃_{t'}), where P̃_{t,i} is prosecutor i's residualized propensity for treatment t. The alternative monotonicity conditions follow standard non-LATE estimand definitions. The pooled-arm gap T sums pair-level numerators and denominators of the pairwise violation rate within each arm; cross-group pairs share no cases and are excluded. Stringency is computed as each prosecutor's mean incarceration rate (binary) or mean imprisonment length (continuous) over their four cases, with ties broken by small random noise. All measures are reported with BCa 95% bootstrap confidence intervals (1,000 draws, resampling prosecutors). Finite-sample bias from K = 4 cases per prosecutor is characterized via a pre-specified pair-level adapted bootstrap and a calibrated probit simulation; the reported binary pairwise violation rate is a lower bound on the population rate.
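As a rough illustration of the case-level quantities defined above, the following Python sketch computes stringency p_i, the case weights w_k = Cov(p_i, D_ik)/Var(p_i), and an IA violation share for a single group. The function names and toy data are hypothetical. Two simplifications are assumptions of the sketch rather than the registered procedure: pairs with tied stringency are skipped instead of having ties broken by random noise, and the IA share is taken over triples (i,j,k) with p_i < p_j.

```python
def avg(values):
    values = list(values)
    return sum(values) / len(values)


def stringency(decisions):
    """p_i: each prosecutor's mean decision over their own cases."""
    return [avg(row) for row in decisions]


def case_weights(decisions):
    """w_k = Cov(p_i, D_ik) / Var(p_i) for each case k (requires Var(p_i) > 0)."""
    p = stringency(decisions)
    p_bar = avg(p)
    var_p = avg((x - p_bar) ** 2 for x in p)
    weights = []
    for k in range(len(decisions[0])):
        d_k = [row[k] for row in decisions]
        d_bar = avg(d_k)
        cov = avg((x - p_bar) * (d - d_bar) for x, d in zip(p, d_k))
        weights.append(cov / var_p)
    return weights


def ia_violation_rate(decisions):
    """Share of triples (i, j, k) with p_i < p_j in which D_ik > D_jk."""
    p = stringency(decisions)
    violations = comparisons = 0
    for i, row_i in enumerate(decisions):
        for j, row_j in enumerate(decisions):
            if p[i] < p[j]:  # ties skipped: a simplification of this sketch
                for d_i, d_j in zip(row_i, row_j):
                    comparisons += 1
                    violations += d_i > d_j
    return violations / comparisons


# Toy data (invented): the most lenient prosecutor alone incarcerates in the last case.
toy = [
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
w = case_weights(toy)
negative_weight_sum = sum(abs(x) for x in w if x < 0)
print(w, negative_weight_sum, ia_violation_rate(toy))
```

With stringency defined as the mean over the same K cases, these unnormalized weights sum to K, which gives a quick sanity check on an implementation.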

Experimental Design

The study is a vignette measurement experiment with up to 250 Polish public prosecutors. The unit of random allocation is the individual prosecutor. There is no behavioral intervention and no individual-level treatment effect is estimated: the primary estimand is a within-condition measure of prosecutor disagreement (the pairwise crossing rate), and the secondary estimand is a comparison of this measure across case-mix conditions. The AEA form's treatment-vs-control frame therefore does not map onto the primary analysis. Random allocation is used because each prosecutor has time for only four vignettes, so the four case-mix conditions cannot all be presented to every prosecutor; the allocation balances prosecutor characteristics across conditions but does not itself constitute a behavioral treatment.

At survey entry each prosecutor is allocated to one of five arms (Arms 1–5). Arm 1 (the first 40 entrants) is reserved for projects outside this study and is not analyzed here. Later entrants are allocated by the Qualtrics built-in randomizer to Arms 2–5 (~52 prosecutors each, ~208 in total), which constitute the analysis sample for this registration and differ in case-mix homogeneity:

Arm 2: four shoplifting cases (homogeneous).
Arm 3: four property-crime cases (shoplifting, embezzlement, burglary; somewhat homogeneous).
Arm 4: a mixed set (DUI, burglary, narcotics, shoplifting; heterogeneous).
Arm 5: four diverse crime types (fraud, narcotics, driving ban, shoplifting; most heterogeneous).

One shoplifting vignette appears as a common anchor in all four of Arms 2–5. Within each arm, all prosecutors face the same four vignettes in the same order; there is no within-arm randomization. For the primary within-arm monotonicity test all four analysis arms are symmetric, and no arm functions as treatment or control. For the secondary cross-arm contrast (the heterogeneity gap T = VR(4+5) − VR(2+3) and the per-arm gradient), Arms 4+5 (heterogeneous case mix) play the role of treatment and Arms 2+3 (homogeneous case mix) the role of control.

For each vignette the prosecutor recommends a sentence type (unsuspended imprisonment, suspended prison, restriction of liberty, fine) and the corresponding severity. The survey collects basic background characteristics. Prosecutors decide independently and do not interact.
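The cross-arm contrast T = VR(4+5) − VR(2+3) pools pair-level numerators and denominators within each arm and never forms cross-arm pairs. A minimal Python sketch under those assumptions follows; the helper names, the dictionary keyed by arm number, and the toy data are hypothetical and only illustrate the pooling step.

```python
from itertools import combinations


def pair_sums(decisions):
    """Return (sum of min(w_ij, w_ji), sum of w_ij + w_ji) over within-arm pairs
    with at least two disagreements."""
    num = den = 0
    for a, b in combinations(decisions, 2):
        w_ab = sum(x > y for x, y in zip(a, b))
        w_ba = sum(y > x for x, y in zip(a, b))
        if w_ab + w_ba >= 2:
            num += min(w_ab, w_ba)
            den += w_ab + w_ba
    return num, den


def pooled_rate(arm_list):
    """Pool numerators and denominators across arms; pairs stay within an arm."""
    sums = [pair_sums(arm) for arm in arm_list]
    return sum(n for n, _ in sums) / sum(d for _, d in sums)


def heterogeneity_gap(arms):
    """T = VR(4+5) - VR(2+3), with `arms` mapping arm number to its decision rows."""
    return pooled_rate([arms[4], arms[5]]) - pooled_rate([arms[2], arms[3]])


# Toy data (invented): two prosecutors per arm, four cases each.
toy_arms = {
    2: [[0, 0, 0, 0], [1, 1, 0, 0]],
    3: [[0, 0, 0, 0], [1, 1, 1, 1]],
    4: [[1, 1, 0, 0], [0, 0, 1, 1]],
    5: [[1, 0, 0, 0], [0, 1, 1, 0]],
}
print(heterogeneity_gap(toy_arms))
```

Pooling sums before dividing means arms with more disagreeing pairs carry more weight than they would in a simple average of per-arm rates.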

Experimental Design Details

Same content as Intervention (Hidden) above, with additional detail on analysis: from the prosecutor-by-case decision matrix we construct two outcome variables—binary incarceration and continuous imprisonment length in months—and compute the pairwise crossing rate (primary), the IA case-level violation rate, the average monotonicity violation rate, the sum of negative 2SLS weights, multi-treatment own- and cross-weights at three pre-specified partitions of the sentence-type space, and three alternative monotonicity conditions (extreme-pair, lenient-prosecutor, stringent-prosecutor) corresponding to non-LATE estimands. The cross-group design lets us compare the pooled pairwise violation rate between a homogeneous arm (Groups 2 and 3) and a heterogeneous arm (Groups 4 and 5), and trace the per-group gradient VR_g. Inference uses BCa 95% bootstrap confidence intervals (1,000 draws) resampling prosecutors within each group. Finite-sample bias from K = 4 cases per prosecutor is characterized via a pair-level adapted bootstrap (trinomial calibrated to each pair's disagreement rate and the pooled VR) and a calibrated probit simulation that matches prosecutor- and case-level marginals; the binary pairwise violation rate is therefore reported as a lower bound on the population rate.
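The key design choice in the inference step is that whole prosecutors (rows of the decision matrix), not individual decisions, are resampled. The sketch below shows that resampling scheme with a plain percentile interval. This is a deliberately simpler stand-in for the registered BCa interval, which adds bias and acceleration corrections not shown here. The function names, the example statistic, and the toy data are hypothetical.

```python
import random


def percentile_bootstrap_ci(rows, statistic, draws=1000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling prosecutors (rows) with replacement.

    Note: a percentile interval, not BCa; shown only to illustrate the
    resampling unit.
    """
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(draws):
        sample = [rows[rng.randrange(n)] for _ in range(n)]  # resample whole rows
        estimates.append(statistic(sample))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * draws)]
    upper = estimates[int((1 + level) / 2 * draws) - 1]
    return lower, upper


def incarceration_share(rows):
    """Example statistic (hypothetical): share of all decisions that incarcerate."""
    flat = [d for row in rows for d in row]
    return sum(flat) / len(flat)


# Toy data (invented): six prosecutors, four cases, binary incarceration.
toy = [
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]
print(percentile_bootstrap_ci(toy, incarceration_share))
```

Any group-level statistic that takes a list of prosecutor rows, such as a pairwise violation rate, could be passed in place of `incarceration_share`. For BCa intervals specifically, SciPy's `scipy.stats.bootstrap` accepts `method="BCa"`.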

Randomization Method

Computer-generated random allocation within Qualtrics at survey entry, using the Qualtrics built-in randomizer. The allocation serves as a balance device to distribute prosecutor characteristics across the four case-mix conditions, since each prosecutor has time for only four vignettes and cannot see every condition; it does not assign a behavioral treatment. The first 40 entrants are routed deterministically to Arm 1 (reserved for projects outside this study). Each subsequent entrant is allocated with equal probability to one of Arms 2–5 until each analysis arm reaches its target of ~52 prosecutors.
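The allocation rule described here can be mimicked in a short simulation. This is a hypothetical Python sketch of the stated logic (first 40 entrants to Arm 1, then equal-probability allocation among Arms 2–5 that have not yet reached 52); it does not reproduce how the Qualtrics randomizer is implemented or configured, and the function names and the handling of entrants arriving after all arms are full are assumptions.

```python
import random

ANALYSIS_ARMS = (2, 3, 4, 5)


def allocate(entry_index, counts, rng, reserved=40, target=52):
    """Return the arm for the entrant at 0-based position `entry_index`."""
    if entry_index < reserved:
        return 1  # deterministic routing of the first `reserved` entrants
    open_arms = [arm for arm in ANALYSIS_ARMS if counts[arm] < target]
    if not open_arms:
        return None  # assumption: nobody is allocated once all arms are full
    return rng.choice(open_arms)  # equal probability among arms still open


def simulate(n_entrants, seed=0):
    rng = random.Random(seed)
    counts = {arm: 0 for arm in range(1, 6)}
    for idx in range(n_entrants):
        arm = allocate(idx, counts, rng)
        if arm is not None:
            counts[arm] += 1
    return counts


print(simulate(248))
```

With 248 entrants the simulated counts are 40 in Arm 1 and 52 in each analysis arm, because an arm closes as soon as it reaches its target.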

Randomization Unit

Individual prosecutor.

Was the treatment clustered?

Experiment Characteristics

Sample size: planned number of clusters

Up to 250 prosecutors (no clustering; each prosecutor is both the unit of randomization and the unit of observation).

Sample size: planned number of observations

Up to 250 prosecutors providing 4 vignette decisions each (up to 1,000 prosecutor-vignette observations). The monotonicity analysis uses Groups 2–5 (~208 prosecutors, ~832 prosecutor-vignette observations).

Sample size (or number of clusters) by treatment arms

Caveat: this study has no behavioral treatment or control in the standard RCT sense. The figures below describe random allocation across four case-mix conditions used to measure within-condition prosecutor disagreement.

Arm 1 (reserved for projects outside this study; not analyzed in the monotonicity test): 40 prosecutors.
Arm 2, four shoplifting cases (homogeneous case mix; cross-arm control role): ~52 prosecutors.
Arm 3, four property-crime cases (homogeneous case mix; cross-arm control role): ~52 prosecutors.
Arm 4, mixed crime types (heterogeneous case mix; cross-arm treatment role): ~52 prosecutors.
Arm 5, four diverse crime types (heterogeneous case mix; cross-arm treatment role): ~52 prosecutors.
Total: up to 250 prosecutors.

For the within-arm primary monotonicity test all four analysis arms (Arms 2–5) are symmetric and no arm functions as treatment or control; for the secondary heterogeneity-gradient test, Arms 2+3 pooled serve as the homogeneous (control) arm and Arms 4+5 pooled as the heterogeneous (treatment) arm.

Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

H1 (monotonicity violated) is trivially powered: at N = 52 prosecutors per group and K = 4 cases, the probability of observing at least one pairwise crossing exceeds 0.99 for any true pairwise violation rate of 10% or larger (and is essentially 1 at 15% or 20%). The binding precision constraint is the heterogeneity gap T = VR(4+5) − VR(2+3) (Q5). Simulations from a latent probit model calibrated to prosecutor stringency dispersion of σ_α = 0.23 and pairwise violation rates of 10%, 15%, and 20% yield an expected 95% BCa CI half-width on T of approximately 4.0 percentage points for the binary incarceration outcome and 5.1–5.7 percentage points for the continuous imprisonment-length outcome, at the planned ~52 prosecutors per group. The half-width scales as 1/√N; increasing to 75 per group tightens it by ~17%, and to 100 per group by ~28%. Full details in the attached power-simulations note.
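The 1/√N scaling can be checked with a few lines of arithmetic. The Python sketch below assumes pure 1/√N scaling from the planned 52 prosecutors per group and ignores any other finite-sample effects; the function name and the 4.0 percentage-point baseline (taken from the binary outcome above) are used only for illustration.

```python
import math


def scaled_half_width(base_half_width, n_new, n_base=52):
    """Half-width at n_new per group, assuming it scales as 1/sqrt(N)."""
    return base_half_width * math.sqrt(n_base / n_new)


base = 4.0  # percentage points, binary incarceration outcome at ~52 per group
for n in (75, 100):
    new = scaled_half_width(base, n)
    reduction = 1 - new / base
    print(f"N={n}: half-width {new:.2f} pp, reduction {reduction:.1%}")
```

Under this assumption the reduction is roughly 17% at 75 per group and roughly 28% at 100 per group, in line with the figures quoted above.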

Supporting Documents and Materials

IRB