Operator Gender Effects and Natural Language Processing Calibration in Violence Against Women Emergency Response: A Randomized Controlled Trial in Bogotá, Colombia

Last registered on February 19, 2026

Pre-Trial

General Information

Title

RCT ID

AEARCTR-0017873

Initial registration date

February 12, 2026

Initial registration date is when the trial was registered.

It corresponds to when the registration was submitted to the Registry to be reviewed for publication.

First published

February 19, 2026, 6:53 AM EST

First published corresponds to when the trial was first made public on the Registry after being reviewed.

Locations

Country

Colombia

Region

Bogotá, DC.

Primary Investigator

Name

Natalia Galvis Arias

Affiliation

University of Manchester

Other Primary Investigator(s)

Additional Trial Information

Status

In development

Start date

2026-03-01

End date

2026-06-30

Keywords

Crime, Violence, & Conflict, Gender, Governance

Additional Keywords

Natural Language Processing, Violence Against Women, Emergency Response Systems, Gender Bias, Algorithm-Assisted Decision Making, Public Safety

JEL code(s)

C93, J16, O33, K42

Secondary IDs

Prior work

This trial does not extend or rely on any prior RCTs.

Abstract

This study evaluates the impact of algorithmic decision support on emergency call classification and gender disparities in risk assessment at Bogotá’s Emergency Response Center (C4). I implement a 2×2 factorial randomized controlled trial crossing operator gender with randomized access to a machine-learning decision support system. A total of 275 operators across four shifts each classify approximately 20 archived, de-identified violence against women (VAW) call transcripts (5,440 observations).

Control operators follow standard classification procedures. Treatment operators receive real-time algorithmic risk suggestions, highlighted risk factors, and confidence intervals, while retaining full override authority. Calls are randomly assigned within risk strata, and expert consensus classifications from 80 specialists provide the benchmark.

I estimate three co-primary outcomes: (i) the average treatment effect of algorithmic assistance on accuracy relative to expert consensus and high-risk classification rates; (ii) gender differences in classification behavior; and (iii) whether algorithmic assistance mitigates gender gaps. The design is powered to detect a 6.2 percentage point effect on accuracy, a 4.5 percentage point gender gap in high-risk classification, and a 9.0 percentage point differential treatment effect by gender. The study provides experimental evidence on algorithmic decision support in high-stakes public safety contexts.

External Link(s)

Registration Citation

Citation

Galvis Arias, Natalia. 2026. "Operator Gender Effects and Natural Language Processing Calibration in Violence Against Women Emergency Response: A Randomized Controlled Trial in Bogotá, Colombia." AEA RCT Registry. February 19. https://doi.org/10.1257/rct.17873-1.0

Sponsors & Partners

Interventions

Intervention(s)

We implement a 2×2 factorial RCT comparing algorithmic decision support versus standard manual procedures for classifying violence against women emergency calls, crossed with naturally occurring operator gender variation.

Control Condition: Operators classify archived call transcripts using existing C4 protocols—reviewing call information, conducting manual risk assessment, and assigning Low/Moderate/High risk classifications with written justification.

Treatment Condition: Operators perform the identical classification task with real-time algorithmic assistance that displays: (1) the predicted risk category with a confidence score, (2) risk factors automatically extracted from the call text, and (3) linguistic pattern alerts (negation detection, hedging). Operators retain full override authority.

Assignment Mechanism: Stratified block randomization at operator level (strata: experience, shift, education). Each operator classifies ~20 randomly assigned archived calls. Random call assignment within operator eliminates selection bias from differential call exposure.

Training: All participants receive a standardized 1.5-hour protocol: 1 hour on VAW risk assessment principles and 30 minutes of platform instruction. No algorithm-specific training is provided, to avoid differential treatment effects.

Sample: 275 operators from the C4 workforce of 380, generating 5,440 total classifications over a 2-month intervention.

Validation: Independent expert panels (16 panels, 80 Comisarías de Familia specialists) blindly classify a random call subsample to establish the accuracy benchmark. Experts are blinded to operator characteristics and treatment status.

Intervention Start Date

2026-04-01

Intervention End Date

2026-05-31

Primary Outcomes

Primary Outcomes (end points)

1. Algorithm Treatment Effect: The causal impact of algorithmic decision support on classification accuracy and risk assessment patterns.

2. Operator Gender Effect: The causal relationship between operator gender and classification decisions, conditional on observables.

3. Interaction Effect: Whether algorithmic assistance differentially affects male versus female operators' classification patterns (bias mitigation).

Primary Outcomes (explanation)

Primary Outcome 1: Algorithm Treatment Effect
The algorithm treatment effect measures the causal impact of providing algorithmic decision support on operator classification accuracy and risk assessment patterns. This effect is identified through random assignment of operators to receive either algorithmic assistance or standard protocol. By random assignment, we ensure that potential outcomes under both conditions are independent of actual treatment status, conditional on stratification variables including operator experience level, shift assignment, and education. The average treatment effect can be estimated by comparing mean outcomes between randomly assigned treatment and control groups, yielding an unbiased estimate because randomization balances both observed and unobserved characteristics across groups. To examine whether the algorithm's impact varies by operator gender, we estimate conditional average treatment effects within each gender subgroup. Randomization stratified by (or balanced across) gender ensures that gender-specific treatment effects are identified within gender strata, allowing us to assess whether algorithmic assistance affects male and female operators differently.

Primary Outcome 2: Operator Gender Effect
The operator gender effect measures the causal relationship between operator gender and classification decisions, conditional on observable characteristics. Unlike algorithmic assistance, operator gender is not randomly assigned, so we cannot claim that potential outcomes are independent of gender without conditioning on observables. To identify gender effects, we invoke the conditional independence assumption: conditional on observed call characteristics and operator attributes, gender assignment is "as good as random" with respect to potential outcomes. This assumption is justified by several design features. First, calls are randomly assigned to operators within strata, ensuring that any gender differences in classification reflect differences in decision-making for identical calls rather than differential call exposure. Second, we control for extensive call-level characteristics extracted via NLP including weapons mentioned, escalation indicators, victim vulnerability, location, and timing. Third, we control for operator-level characteristics from administrative records including years of experience, education level, shift assignment, and prior VAW-specific training. Fourth, each operator classifies twenty calls with varying content, providing within-operator variation that differences out time-invariant operator characteristics. Under this conditional independence assumption, the gender effect can be estimated via regression adjustment controlling for call and operator observables.

Primary Outcome 3: Interaction Effect (Bias Mitigation)
The interaction effect tests whether algorithmic assistance differentially affects male versus female operators' classification patterns, representing a potential bias mitigation mechanism. This differential treatment effect is defined as the difference between the algorithm's impact on male operators versus its impact on female operators. Equivalently, it can be expressed as the change in the gender classification gap due to algorithmic assistance—comparing the gender gap in classification rates with the algorithm versus without it. Under random assignment of algorithmic assistance and the conditional independence assumption for gender, this interaction effect is identified by comparing treatment-control differences for male operators to treatment-control differences for female operators, conditional on call and operator characteristics. If the interaction effect equals zero, the algorithm equally affects male and female operators with no differential impact. If the interaction effect differs from zero, the algorithm has heterogeneous effects by gender. Of particular policy interest is the case where male operators underclassify risk in the control condition and the algorithm reduces this gender-based underclassification, indicating successful bias correction. This would manifest as a positive interaction effect that partially or fully closes the gender gap in risk classification.
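The two equivalent definitions of the interaction effect above can be illustrated with raw cell means. The following is a minimal sketch on invented toy data, not the registered estimator: it ignores the covariate adjustment and clustering described in this registration, and the field names (`male`, `treated`, `high_risk`) and the helper `make_rows` are hypothetical.

```python
from statistics import mean

def cell_rate(rows, male, treated):
    """Share of calls classified high-risk in one gender-by-arm cell."""
    cell = [r["high_risk"] for r in rows
            if r["male"] == male and r["treated"] == treated]
    return mean(cell)

def interaction_effect(rows):
    """Return (effect on male, effect on female, male effect minus female effect)."""
    effect_male = cell_rate(rows, 1, 1) - cell_rate(rows, 1, 0)
    effect_female = cell_rate(rows, 0, 1) - cell_rate(rows, 0, 0)
    return effect_male, effect_female, effect_male - effect_female

def make_rows(male, treated, outcomes):
    """Hypothetical helper: build toy classification records for one cell."""
    return [{"male": male, "treated": treated, "high_risk": y} for y in outcomes]

# Invented toy outcomes (1 = classified high-risk); not study data.
rows = (
    make_rows(1, 0, [0, 0, 1, 0])    # male, control
    + make_rows(1, 1, [1, 0, 1, 0])  # male, treatment
    + make_rows(0, 0, [1, 0, 1, 0])  # female, control
    + make_rows(0, 1, [1, 1, 0, 0])  # female, treatment
)

effect_male, effect_female, interaction = interaction_effect(rows)

# Equivalent view: change in the male-minus-female gap between arms.
gap_control = cell_rate(rows, 1, 0) - cell_rate(rows, 0, 0)
gap_treated = cell_rate(rows, 1, 1) - cell_rate(rows, 0, 1)
gap_change = gap_treated - gap_control
```

In this toy example the difference in treatment effects and the change in the gender gap coincide, which is the algebraic identity the paragraph describes.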

Secondary Outcomes

Secondary Outcomes (end points)

Algorithm Performance (Treatment Arm):
1. Algorithm standalone accuracy vs. expert consensus
2. Precision and recall for high-risk classification
3. F1-score and AUC-ROC
4. Calibration (Brier score)
5. Fairness metrics (equalized odds, demographic parity)

Operator-Algorithm Interaction (Treatment Arm):
6. Override rate (% disagreements with algorithm)
7. Override accuracy
8. Agreement by algorithm confidence levels

Heterogeneous Treatment Effects:
9. Algorithm effect by operator experience
10. Gender effect by call ambiguity
11. Algorithm effect by shift

Secondary Outcomes (explanation)

Algorithm Performance: Evaluates standalone algorithm quality against expert consensus benchmark using treatment arm data. Standard machine learning metrics assess predictive accuracy and calibration.
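As an illustration of the standard metrics listed above, the sketch below computes precision, recall, F1, and the Brier score for a binary high-risk label. It assumes the three-level classification has been collapsed to high-risk (1) versus not (0) and that the expert consensus serves as the reference label; the toy values are invented and the function names are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive (high-risk) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def brier_score(y_true, prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, prob)) / len(y_true)

# Invented toy data: expert consensus, algorithm label, algorithm probability.
expert = [1, 0, 1, 1, 0]
algo_label = [1, 0, 0, 1, 1]
algo_prob = [0.9, 0.2, 0.4, 0.8, 0.6]

precision, recall, f1 = binary_metrics(expert, algo_label)
brier = brier_score(expert, algo_prob)
```

A lower Brier score indicates better-calibrated probabilities; a score of 0.25 corresponds to always predicting 0.5.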

Operator-Algorithm Interaction: Override rate = (disagreements with algorithm) / (total algorithm suggestions). Override accuracy tests whether operator corrections improve or degrade classification quality relative to expert consensus.
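The override measures can be sketched as follows. This assumes one record per algorithm suggestion with the algorithm's label, the operator's final label, and the expert consensus label; the record keys are hypothetical. Comparing operator and algorithm accuracy on the overridden calls is one possible way to operationalize "improve or degrade", not necessarily the registered definition.

```python
def override_summary(records):
    """Return (override rate, operator accuracy on overrides, algorithm accuracy on overrides)."""
    overrides = [r for r in records if r["operator"] != r["algorithm"]]
    rate = len(overrides) / len(records)
    if not overrides:
        return rate, None, None
    n = len(overrides)
    operator_acc = sum(r["operator"] == r["expert"] for r in overrides) / n
    algorithm_acc = sum(r["algorithm"] == r["expert"] for r in overrides) / n
    return rate, operator_acc, algorithm_acc

# Invented toy records; not study data.
records = [
    {"algorithm": "High", "operator": "High", "expert": "High"},
    {"algorithm": "Low", "operator": "High", "expert": "High"},          # override, operator correct
    {"algorithm": "Moderate", "operator": "Low", "expert": "Moderate"},  # override, algorithm correct
    {"algorithm": "High", "operator": "High", "expert": "Moderate"},
]

rate, operator_acc, algorithm_acc = override_summary(records)
```

If operator accuracy on overridden calls exceeds algorithm accuracy on the same calls, overrides improve classification quality on net; the reverse indicates that overrides degrade it.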

Experimental Design

Design: 2×2 factorial randomized controlled trial crossing operator gender (naturally occurring) with algorithmic assistance (randomized).

Setting: Bogotá Emergency Response Center (C4), 2-month intervention (April-May 2026).

Sample: 275 operators across 4 shifts, 5,440 call classifications (~20 per operator).

Randomization: Individual operators randomized to algorithm-assisted or standard protocol, stratified by shift. Four 6-hour shifts serve as randomization strata:

Shift 1: 5:45 AM - 11:45 AM
Shift 2: 11:45 AM - 5:45 PM
Shift 3: 5:45 PM - 11:45 PM
Shift 4: 11:45 PM - 5:45 AM

Call Assignment: Each operator receives ~20 archived, de-identified calls randomly assigned within risk-level strata.

Interventions:

Control: Standard manual classification
Treatment: Algorithmic decision support with full operator override authority

Training: Universal 1.5-hour protocol (1 hour VAW risk assessment, 30 minutes platform instruction).

Primary Outcomes:
1. Algorithm Treatment Effect: Classification accuracy vs. expert consensus; high-risk classification rate
2. Operator Gender Effect: Gender differences in high-risk classification and accuracy
3. Interaction Effect: Differential algorithm impact by gender (bias mitigation)

Identification:
1. Algorithm effect: Experimental variation via randomization
2. Gender effect: Conditional independence assumption leveraging random call assignment and rich controls
3. Interaction: Factorial design tests bias correction mechanism

Power: With 275 operators, 5,440 observations, ICC = 0.12, and a design effect of 3.28, the design provides:
90% power for 6.2pp algorithm effect
85% power for 4.5pp gender effect
80% power for 9.0pp interaction effect

Analysis: Linear probability models with shift-clustered standard errors.
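The estimating equation is not displayed in this registration. One linear probability specification consistent with the three primary outcomes described above might look like the following; all symbols are assumptions introduced here for illustration, with $i$ indexing calls, $j$ operators, and $s$ shifts.

```latex
Y_{ijs} = \beta_0 + \beta_1 T_j + \beta_2 M_j + \beta_3 \,(T_j \times M_j)
          + X_i'\gamma + W_j'\delta + \mu_s + \varepsilon_{ijs}
```

Here $Y_{ijs}$ is a binary outcome (for example, a high-risk classification or agreement with expert consensus), $T_j$ indicates assignment to algorithmic assistance, $M_j$ indicates a male operator, $X_i$ and $W_j$ are call-level and operator-level controls, and $\mu_s$ are shift fixed effects. Under this parameterization, $\beta_1$ is the algorithm effect for female operators, $\beta_2$ is the male-minus-female gap in the control arm, and $\beta_3$ is the algorithm's effect on male operators minus its effect on female operators.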

Experimental Design Details

Not available

Randomization Method

Computer-generated random assignment using stratified block randomization with a cryptographically secure random number generator. Randomization occurs at the operator level after informed consent and completion of the baseline assessment, with allocation concealment maintained through server-side sequence generation inaccessible to research personnel.

Technical Implementation:
Stratified by shift assignment (4 shifts) to ensure balance within shifts
Block size 4 within strata ensuring local balance
1:1 allocation ratio (approximately 137-138 operators per arm)
Fixed random seed for reproducibility
Allocation sequence generated and stored on secure server prior to enrollment
Research staff blinded to upcoming assignments during recruitment and enrollment
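The procedure listed above (shift strata, blocks of 4, 1:1 allocation, fixed seed) can be sketched as below. This is an illustrative sketch, not the study's randomization code: the function name, operator IDs, and seed value are hypothetical, and the per-shift counts follow the planned distribution of 69, 69, 69, and 68 operators. Python's seeded `random.Random` is reproducible but not cryptographically secure, so it only stands in for the generator the registration describes.

```python
import random

def stratified_block_randomize(operators_by_shift, block_size=4, seed=2026):
    """Assign operators to arms within shift strata using permuted blocks."""
    rng = random.Random(seed)  # hypothetical seed; reproducible, not cryptographic
    assignment = {}
    for shift in sorted(operators_by_shift):
        operators = list(operators_by_shift[shift])
        for start in range(0, len(operators), block_size):
            block = operators[start:start + block_size]
            # Each full block holds equal numbers of each arm (1:1 allocation).
            arms = ["treatment", "control"] * (block_size // 2)
            rng.shuffle(arms)
            # A final partial block takes the first arms of the shuffled sequence.
            for operator, arm in zip(block, arms):
                assignment[operator] = arm
    return assignment

# Hypothetical operator IDs, using the planned per-shift sample sizes.
shift_sizes = {1: 69, 2: 69, 3: 69, 4: 68}
operators_by_shift = {
    shift: [f"shift{shift}_op{k}" for k in range(n)]
    for shift, n in shift_sizes.items()
}

assignment = stratified_block_randomize(operators_by_shift)
treated_per_shift = {
    shift: sum(assignment[op] == "treatment" for op in ops)
    for shift, ops in operators_by_shift.items()
}
```

With blocks of 4, any imbalance within a shift comes only from the final partial block, so the arms differ by at most two operators per shift.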

Randomization Unit

Individual operators
Operators randomized to algorithm-assisted or standard protocol. Randomization stratified by shift assignment (4 shifts) to ensure balance within shifts.

Cluster.
Treatment is assigned at the operator level. Multiple calls nested within each operator create clustering. Standard errors are clustered at the shift level to account for within-shift correlation.

Was the treatment clustered?

Yes

Experiment Characteristics

Sample size: planned number of clusters

4 work shifts (for clustering of standard errors)
Four 6-hour shifts serve as clusters for statistical inference, though randomization occurs at operator level.

Sample size: planned number of observations

5,440 call classifications
275 operators × ~20 calls per operator = 5,440 total observations

Sample size (or number of clusters) by treatment arms

Treatment Assignment:

Control arm (standard protocol): ~137-138 operators, ~2,720 classifications
Treatment arm (algorithm-assisted): ~137-138 operators, ~2,720 classifications

Distribution Across Shifts (each shift balanced):

Shift 1: ~35 treatment, ~34 control
Shift 2: ~35 treatment, ~34 control
Shift 3: ~35 treatment, ~34 control
Shift 4: ~34 treatment, ~34 control

Minimum detectable effect size for main outcomes (accounting for sample design and clustering)

Design Parameters:
Sample: 275 operators, 5,440 calls
ICC: 0.12
Average calls per operator: ~20
Design effect: 1 + (20-1)×0.12 = 3.28
Significance level: α = 0.05 (two-tailed)
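The design-effect arithmetic above can be reproduced in a few lines. The effective sample size shown is a standard derived quantity (total observations divided by the design effect) and is not a figure reported in this registration; the function name is hypothetical.

```python
def design_effect(calls_per_operator, icc):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (calls_per_operator - 1) * icc

deff = design_effect(calls_per_operator=20, icc=0.12)

# Effective number of independent observations implied by the design effect.
total_observations = 5440
effective_n = total_observations / deff

print(f"design effect = {deff:.2f}, effective n = {effective_n:.0f}")
```

The design effect treats the roughly 20 calls per operator as the cluster size, so the 5,440 classifications carry about as much information as 1,659 independent observations.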

Supporting Documents and Materials

There is information in this trial unavailable to the public.

IRB

Institutional Review Boards (IRBs)

IRB Name

The University of Manchester. Environment, Education & Development PGR School Panel

IRB Approval Date

2026-01-30

IRB Approval Number

2026-24512-45625

Analysis Plan