Operator Gender Effects and Natural Language Processing Calibration in Violence Against Women Emergency Response: A Randomized Controlled Trial in Bogotá, Colombia

Last registered on February 19, 2026

Pre-Trial

Trial Information

General Information

Title
Operator Gender Effects and Natural Language Processing Calibration in Violence Against Women Emergency Response: A Randomized Controlled Trial in Bogotá, Colombia
RCT ID
AEARCTR-0017873
Initial registration date
February 12, 2026

First published
February 19, 2026, 6:53 AM EST

Locations

Region

Primary Investigator

Affiliation
University of Manchester

Other Primary Investigator(s)

Additional Trial Information

Status
In development
Start date
2026-03-01
End date
2026-06-30
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
This study evaluates the impact of algorithmic decision support on emergency call classification and gender disparities in risk assessment at Bogotá’s Emergency Response Center (C4). I implement a 2×2 factorial randomized controlled trial crossing operator gender with randomized access to a machine-learning decision support system. A total of 275 operators across four shifts each classify approximately 20 archived, de-identified violence against women (VAW) call transcripts (5,440 observations).

Control operators follow standard classification procedures. Treatment operators receive real-time algorithmic risk suggestions, highlighted risk factors, and confidence intervals, while retaining full override authority. Calls are randomly assigned within risk strata, and expert consensus classifications from 80 specialists provide the benchmark.

I estimate three co-primary effects: (i) the average treatment effect of algorithmic assistance on classification accuracy relative to expert consensus and on high-risk classification rates; (ii) gender differences in classification behavior; and (iii) whether algorithmic assistance mitigates gender gaps in risk classification. The design is powered to detect a 6.2 percentage point effect of algorithmic assistance on accuracy, a 4.5 percentage point gender gap in high-risk classification, and a 9.0 percentage point differential treatment effect by gender. The study provides experimental evidence on algorithmic decision support in a high-stakes public safety context.
External Link(s)

Registration Citation

Citation
Galvis Arias, Natalia. 2026. "Operator Gender Effects and Natural Language Processing Calibration in Violence Against Women Emergency Response: A Randomized Controlled Trial in Bogotá, Colombia." AEA RCT Registry. February 19. https://doi.org/10.1257/rct.17873-1.0
Sponsors & Partners

Sponsors

Experimental Details

Interventions

Intervention(s)
We implement a 2×2 factorial RCT comparing algorithmic decision support versus standard manual procedures for classifying violence against women emergency calls, crossed with naturally occurring operator gender variation.

Control Condition: Operators classify archived call transcripts using existing C4 protocols—reviewing call information, conducting manual risk assessment, and assigning Low/Moderate/High risk classifications with written justification.

Treatment Condition: Operators perform the identical classification task with real-time algorithmic assistance that displays: (1) the predicted risk category with a confidence score, (2) risk factors automatically extracted from the call text, and (3) linguistic pattern alerts (negation detection, hedging). Operators retain full override authority.

Assignment Mechanism: Stratified block randomization at operator level (strata: experience, shift, education). Each operator classifies ~20 randomly assigned archived calls. Random call assignment within operator eliminates selection bias from differential call exposure.

Training: All participants receive a standardized 1.5-hour protocol: 1 hour on VAW risk assessment principles and 30 minutes of platform instruction. No algorithm-specific training is provided, so that the estimated treatment effect reflects the decision support itself rather than differential training across arms.

Sample: 275 operators drawn from the C4 workforce of 380, generating 5,440 total classifications over the 2-month intervention.

Validation: Independent expert panels (16 panels, 80 Comisarías de Familia specialists) blindly classify a random subsample of calls to establish the accuracy benchmark. Experts are blinded to operator characteristics and treatment status.
Intervention Start Date
2026-04-01
Intervention End Date
2026-05-31

Primary Outcomes

Primary Outcomes (end points)
1. Algorithm Treatment Effect: The causal impact of algorithmic decision support on classification accuracy and risk assessment patterns.

2. Operator Gender Effect: The causal relationship between operator gender and classification decisions, conditional on observables.

3. Interaction Effect: Whether algorithmic assistance differentially affects male versus female operators' classification patterns (bias mitigation).
Primary Outcomes (explanation)
Primary Outcome 1: Algorithm Treatment Effect
The algorithm treatment effect measures the causal impact of providing algorithmic decision support on operator classification accuracy and risk assessment patterns. This effect is identified through random assignment of operators to receive either algorithmic assistance or standard protocol. By random assignment, we ensure that potential outcomes under both conditions are independent of actual treatment status, conditional on stratification variables including operator experience level, shift assignment, and education. The average treatment effect can be estimated by comparing mean outcomes between randomly assigned treatment and control groups, yielding an unbiased estimate because randomization balances both observed and unobserved characteristics across groups. To examine whether the algorithm's impact varies by operator gender, we estimate conditional average treatment effects within each gender subgroup. Randomization stratified by (or balanced across) gender ensures that gender-specific treatment effects are identified within gender strata, allowing us to assess whether algorithmic assistance affects male and female operators differently.
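
In potential-outcomes notation (introduced here only for exposition; Y_i(1) and Y_i(0) denote operator i's classification outcome with and without algorithmic assistance, and G_i denotes operator gender), the estimands described above are

\[
\tau_{\text{ATE}} = E[\,Y_i(1) - Y_i(0)\,], \qquad \widehat{\tau}_{\text{ATE}} = \bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}},
\]
\[
\tau(g) = E[\,Y_i(1) - Y_i(0) \mid G_i = g\,], \qquad g \in \{\text{male}, \text{female}\},
\]

with the difference in means taken within the stratification cells.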

Primary Outcome 2: Operator Gender Effect
The operator gender effect measures the causal relationship between operator gender and classification decisions, conditional on observable characteristics. Unlike algorithmic assistance, operator gender is not randomly assigned, so we cannot claim that potential outcomes are independent of gender without conditioning on observables. To identify gender effects, we invoke the conditional independence assumption: conditional on observed call characteristics and operator attributes, gender assignment is "as good as random" with respect to potential outcomes. This assumption is justified by several design features. First, calls are randomly assigned to operators within strata, ensuring that any gender differences in classification reflect differences in decision-making for identical calls rather than differential call exposure. Second, we control for extensive call-level characteristics extracted via NLP including weapons mentioned, escalation indicators, victim vulnerability, location, and timing. Third, we control for operator-level characteristics from administrative records including years of experience, education level, shift assignment, and prior VAW-specific training. Fourth, each operator classifies twenty calls with varying content, providing within-operator variation that differences out time-invariant operator characteristics. Under this conditional independence assumption, the gender effect can be estimated via regression adjustment controlling for call and operator observables.
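
One illustrative regression-adjustment specification consistent with this description (the variable names are placeholders, not the registered variable definitions) is

\[
Y_{ic} = \alpha + \gamma\,\text{Male}_i + X_c'\beta + W_i'\delta + \varepsilon_{ic},
\]

where Y_{ic} is operator i's classification of call c, X_c collects the NLP-extracted call characteristics, W_i the operator attributes from administrative records, and \gamma identifies the operator gender effect under the conditional independence assumption.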

Primary Outcome 3: Interaction Effect (Bias Mitigation)
The interaction effect tests whether algorithmic assistance differentially affects male versus female operators' classification patterns, representing a potential bias mitigation mechanism. This differential treatment effect is defined as the difference between the algorithm's impact on male operators versus its impact on female operators. Equivalently, it can be expressed as the change in the gender classification gap due to algorithmic assistance—comparing the gender gap in classification rates with the algorithm versus without it. Under random assignment of algorithmic assistance and the conditional independence assumption for gender, this interaction effect is identified by comparing treatment-control differences for male operators to treatment-control differences for female operators, conditional on call and operator characteristics. If the interaction effect equals zero, the algorithm equally affects male and female operators with no differential impact. If the interaction effect differs from zero, the algorithm has heterogeneous effects by gender. Of particular policy interest is the case where male operators underclassify risk in the control condition and the algorithm reduces this gender-based underclassification, indicating successful bias correction. This would manifest as a positive interaction effect that partially or fully closes the gender gap in risk classification.
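
Using the same illustrative notation, with T_i indicating algorithmic assistance, the interaction effect is the difference in treatment-control contrasts by gender:

\[
\Delta = \big( E[Y \mid T{=}1, \text{Male}] - E[Y \mid T{=}0, \text{Male}] \big) - \big( E[Y \mid T{=}1, \text{Female}] - E[Y \mid T{=}0, \text{Female}] \big),
\]

which equals the change in the male-female classification gap induced by algorithmic assistance; \Delta = 0 corresponds to no differential impact.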

Secondary Outcomes

Secondary Outcomes (end points)
Algorithm Performance (Treatment Arm):
1. Algorithm standalone accuracy vs. expert consensus
2. Precision and recall for high-risk classification
3. F1-score and AUC-ROC
4. Calibration (Brier score)
5. Fairness metrics (equalized odds, demographic parity)

Operator-Algorithm Interaction (Treatment Arm):
6. Override rate (% disagreements with algorithm)
7. Override accuracy
8. Agreement by algorithm confidence levels

Heterogeneous Treatment Effects:
9. Algorithm effect by operator experience
10. Gender effect by call ambiguity
11. Algorithm effect by shift
Secondary Outcomes (explanation)
Algorithm Performance: Evaluates standalone algorithm quality against expert consensus benchmark using treatment arm data. Standard machine learning metrics assess predictive accuracy and calibration.

Operator-Algorithm Interaction: Override rate = (disagreements with algorithm) / (total algorithm suggestions). Override accuracy tests whether operator corrections improve or degrade classification quality relative to expert consensus.
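
As a hedged illustration of how these metrics could be computed from treatment-arm data, the Python sketch below assumes a pandas DataFrame with hypothetical columns expert_label, algorithm_label, algorithm_prob_high, and operator_label (binary coding, 1 = high risk); none of these names come from the registration.

# Sketch: algorithm performance and operator-algorithm interaction metrics.
import pandas as pd
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, brier_score_loss)

def algorithm_metrics(df: pd.DataFrame) -> dict:
    # Standalone algorithm quality against the expert-consensus benchmark.
    return {
        "precision_high_risk": precision_score(df["expert_label"], df["algorithm_label"]),
        "recall_high_risk": recall_score(df["expert_label"], df["algorithm_label"]),
        "f1_high_risk": f1_score(df["expert_label"], df["algorithm_label"]),
        "auc_roc": roc_auc_score(df["expert_label"], df["algorithm_prob_high"]),
        "brier": brier_score_loss(df["expert_label"], df["algorithm_prob_high"]),
    }

def interaction_metrics(df: pd.DataFrame) -> dict:
    # Override rate: share of algorithm suggestions the operator disagreed with.
    overrides = df[df["operator_label"] != df["algorithm_label"]]
    return {
        "override_rate": len(overrides) / len(df),
        # Override accuracy: share of overrides in which the operator,
        # rather than the algorithm, matched the expert consensus.
        "override_accuracy": (overrides["operator_label"]
                              == overrides["expert_label"]).mean(),
    }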

Experimental Design

Experimental Design
Design: 2×2 factorial randomized controlled trial crossing operator gender (naturally occurring) with algorithmic assistance (randomized).

Setting: Bogotá Emergency Response Center (C4), 2-month intervention (April-May 2026).

Sample: 275 operators across 4 shifts, 5,440 call classifications (~20 per operator).

Randomization: Individual operators randomized to algorithm-assisted or standard protocol, stratified by shift. Four 6-hour shifts serve as randomization strata:

Shift 1: 5:45 AM - 11:45 AM
Shift 2: 11:45 AM - 5:45 PM
Shift 3: 5:45 PM - 9:45 PM
Shift 4: 9:45 PM - 5:45 AM

Call Assignment: Each operator receives ~20 archived, de-identified calls randomly assigned within risk-level strata.
Interventions:

Control: Standard manual classification
Treatment: Algorithmic decision support with full operator override authority

Training: Universal 1.5-hour protocol (1 hour VAW risk assessment, 30 minutes platform instruction).

Primary Outcomes:
1. Algorithm Treatment Effect: Classification accuracy vs. expert consensus; high-risk classification rate
2. Operator Gender Effect: Gender differences in high-risk classification and accuracy
3. Interaction Effect: Differential algorithm impact by gender (bias mitigation)

Identification:
1. Algorithm effect: Experimental variation via randomization
2. Gender effect: Conditional independence assumption leveraging random call assignment and rich controls
3. Interaction: Factorial design tests bias correction mechanism

Power: 275 operators, 5,440 observations, ICC=0.12, design effect=3.28 provides:
90% power for 6.2pp algorithm effect
85% power for 4.5pp gender effect
80% power for 9.0pp interaction effect

Analysis: Linear probability models with treatment, operator gender, and their interaction, estimated with shift-clustered standard errors.
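
A minimal sketch of such a model in Python, assuming a classification-level DataFrame with placeholder columns correct (1 = matches expert consensus), treated, male, and shift; the registered call-level and operator-level controls would be added to the formula:

# Sketch: linear probability model with shift-clustered standard errors.
import pandas as pd
import statsmodels.formula.api as smf

def fit_lpm(df: pd.DataFrame):
    # treated * male expands to treated + male + treated:male, so the
    # coefficients correspond to the algorithm effect, the operator gender
    # effect, and the interaction (bias-mitigation) effect; C(shift)
    # absorbs the shift strata.
    model = smf.ols("correct ~ treated * male + C(shift)", data=df)
    return model.fit(cov_type="cluster", cov_kwds={"groups": df["shift"]})
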
Experimental Design Details
Not available
Randomization Method
Computer-generated random assignment using stratified block randomization with a cryptographically secure random number generator. Randomization occurs at the operator level after informed consent and completion of the baseline assessment, with allocation concealment maintained through server-side sequence generation inaccessible to research personnel.

Technical Implementation:
Stratified by shift assignment (4 shifts) to ensure balance within shifts
Block size 4 within strata ensuring local balance
1:1 allocation ratio (approximately 137-138 operators per arm)
Fixed random seed for reproducibility
Allocation sequence generated and stored on secure server prior to enrollment
Research staff blinded to upcoming assignments during recruitment and enrollment
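
A hedged sketch of this assignment procedure in Python (illustrative only: it uses a seeded standard generator rather than the cryptographically secure, server-side generator described above, and the operator IDs and seed value are placeholders):

# Sketch: stratified block randomization, block size 4, 1:1 allocation.
import random
from typing import Dict, List

def randomize(operators_by_shift: Dict[int, List[str]], seed: int = 12345) -> Dict[str, str]:
    rng = random.Random(seed)  # fixed seed for reproducibility
    assignment: Dict[str, str] = {}
    for shift, operators in sorted(operators_by_shift.items()):
        for start in range(0, len(operators), 4):
            block = operators[start:start + 4]
            # Each full block of 4 contains exactly 2 treatment and 2 control
            # slots, shuffled into random order; a final partial block is
            # truncated to the remaining operators.
            labels = (["treatment", "control"] * 2)[: len(block)]
            rng.shuffle(labels)
            assignment.update(zip(block, labels))
    return assignment
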
Randomization Unit
Individual operators
Operators randomized to algorithm-assisted or standard protocol. Randomization stratified by shift assignment (4 shifts) to ensure balance within shifts.

Clustering:
Treatment is assigned at the operator level; multiple calls nested within each operator create clustering. Standard errors are clustered at the shift level to account for within-shift correlation.
Was the treatment clustered?
Yes

Experiment Characteristics

Sample size: planned number of clusters
4 work shifts (for clustering of standard errors)
Four 6-hour shifts serve as clusters for statistical inference, though randomization occurs at operator level.
Sample size: planned number of observations
5,440 call classifications: 275 operators × ~20 calls per operator = 5,440 total observations
Sample size (or number of clusters) by treatment arms
Treatment Assignment:

Control arm (standard protocol): ~137-138 operators, ~2,720 classifications
Treatment arm (algorithm-assisted): ~137-138 operators, ~2,720 classifications

Distribution Across Shifts (each shift balanced):

Shift 1: ~35 treatment, ~34 control
Shift 2: ~35 treatment, ~34 control
Shift 3: ~35 treatment, ~34 control
Shift 4: ~34 treatment, ~34 control
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
Design parameters:
Sample: 275 operators, 5,440 calls
ICC: 0.12
Average calls per operator: ~20
Design effect: 1 + (20-1)×0.12 = 3.28
Significance level: α = 0.05 (two-tailed)

Minimum detectable effects: 6.2 percentage points for the algorithm effect (90% power), 4.5 percentage points for the gender effect (85% power), and 9.0 percentage points for the interaction effect (80% power).
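
For reference, the clustering adjustment implied by these parameters, together with the standard two-sample minimum detectable effect formula; the outcome variance \sigma^2 is not stated in the registration and must be assumed, while the power targets above determine z_{1-\beta}:

\[
DE = 1 + (m-1)\rho = 1 + 19 \times 0.12 = 3.28, \qquad N_{\text{eff}} = \frac{5{,}440}{3.28} \approx 1{,}659,
\]
\[
\text{MDE} = \big( z_{1-\alpha/2} + z_{1-\beta} \big) \sqrt{ DE \left( \frac{\sigma^2}{n_T} + \frac{\sigma^2}{n_C} \right) },
\]

with roughly n_T \approx n_C \approx 2{,}720 classifications per arm and z_{1-\alpha/2} = 1.96 at \alpha = 0.05.
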
Supporting Documents and Materials

IRB

Institutional Review Boards (IRBs)

IRB Name
The University of Manchester. Environment, Education & Development PGR School Panel
IRB Approval Date
2026-01-30
IRB Approval Number
2026-24512-45625