Machine Learning as a Tool to Detect and Validate Anomalous 2x2 Games

Last registered on April 23, 2026

Pre-Trial

Trial Information

General Information

Title
Machine Learning as a Tool to Detect and Validate Anomalous 2x2 Games
RCT ID
AEARCTR-0018316
Initial registration date
April 14, 2026


First published
April 23, 2026, 9:19 AM EDT


Locations

There is information in this trial unavailable to the public.

Primary Investigator

Affiliation
CREST, Ecole Polytechnique

Other Primary Investigator(s)

PI Affiliation
CNRS, Ecole Polytechnique
PI Affiliation
Ecole Polytechnique
PI Affiliation
Ecole Polytechnique
PI Affiliation
Ecole Polytechnique
PI Affiliation
Ecole Polytechnique
PI Affiliation
Ecole Polytechnique

Additional Trial Information

Status
In development
Start date
2026-04-19
End date
2026-07-31
Secondary IDs
Prior work
This trial does not extend or rely on any prior RCTs.
Abstract
A central question in behavioral game theory is whether standard theoretical models can accurately predict human choices in simple strategic environments. Existing work suggests that some families of 2x2 games generate especially large discrepancies between theoretical predictions and observed behavior. These disparities motivate an approach in which machine learning is used not only to fit observed choices, but also to automatically generate 2x2 games specifically designed to expose the vulnerabilities of these theoretical frameworks.
The present study compares standard games and historical anomalies drawn from an existing database (Complexity2025) with novel "anomaly" games generated by a machine learning procedure. The contribution is twofold. First, the project evaluates whether machine learning provides a more accurate predictive benchmark than standard models (Nash equilibrium, QRE, and level-k). Second, it tests whether ML-generated games reveal systematic blind spots of these standard models.
The experiment will involve participants recruited via Prolific and a pool of 120 games. These games are divided into three categories: 20 standard games, 20 historical "anomaly" games from the existing database, and 80 "anomaly" games generated by our machine-learning procedure. The games are randomly assigned to four fixed blocks of 30 games (5 standard, 5 database anomalies, and 20 ML anomalies). Each participant will play one block of 30 games. Within each block, game order will be randomized. Participants will make a strategic choice for each game and report perceived difficulty. They will also participate in a lottery game, a donation task, and a short IQ test. The main analysis will test whether database anomalies and ML-generated games produce larger discrepancies between theoretical predictions and observed behavior than standard baseline games.
External Link(s)

Registration Citation

Citation
Baron, Arthur et al. 2026. "Machine Learning as a Tool to Detect and Validate Anomalous 2x2 Games." AEA RCT Registry. April 23. https://doi.org/10.1257/rct.18316-1.0
Experimental Details

Interventions

Intervention(s)
The current experiment constitutes an out-of-sample test of whether games selected by our ML procedure also generate larger prediction errors in a new participant sample.

Each participant will play a series of two-by-two games. To avoid cognitive fatigue, each participant will play a subset of 30 games in total. The games are taken from a pool of 120 two-by-two games divided into three categories: 20 benchmark games classified as standard in the original database, 20 historical anomaly games, and 80 anomaly games generated by our ML algorithm.
Intervention Start Date
2026-04-19
Intervention End Date
2026-07-31

Primary Outcomes

Primary Outcomes (end points)
Our primary outcome is an accuracy variable equal to 1 whenever the action is correctly predicted by the model, and 0 otherwise. The unit of analysis is the subject–game–model observation.
Primary Outcomes (explanation)
For each action taken by the players and each model (Nash, Quantal Response Equilibrium, Level-k, Machine Learning), we create an accuracy variable equal to 1 whenever the action is correctly predicted by the model, and 0 otherwise.
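The accuracy variable can be constructed as in the following sketch. The column names and data values are hypothetical, used only to illustrate the subject-game-model structure of the outcome.

```python
import pandas as pd

# Hypothetical long-format data: one row per subject-game-model observation.
df = pd.DataFrame({
    "subject": [1, 1, 2, 2],
    "game": [7, 7, 7, 7],
    "model": ["nash", "qre", "nash", "qre"],
    "predicted_action": ["Up", "Down", "Up", "Up"],
    "observed_action": ["Up", "Up", "Down", "Up"],
})

# Accuracy = 1 when the model's predicted action matches the observed choice.
df["accuracy"] = (df["predicted_action"] == df["observed_action"]).astype(int)

print(df["accuracy"].tolist())  # [1, 0, 0, 1]
```

Averaging this indicator by model (and by game category) then yields the predictive-accuracy comparisons the main analysis is built on.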

Secondary Outcomes

Secondary Outcomes (end points)
Game features, perceived game complexity, and response times for each game.
Secondary Outcomes (explanation)
We examine game features that likely correlate with complexity: Dominant Solvability, Excess Dissimilarity, Levels of Iterative Rationality, Number of Nash Equilibria, Nash Equilibrium Payoff Dominance, Nash Equilibrium Pareto Dominance, Pure Motives, Max Payouts, Payoff Variances, Deviations from Zero-Sum Games, Inequality in Payouts, and Asymmetry in Payouts.
Each participant reports the perceived complexity of a game at the end of that game.
We record the response time of each participant on each game they play.

Experimental Design

Experimental Design
Each participant will play a series of two-by-two games. To avoid cognitive fatigue, each participant will play a subset of 30 games in total. The games are taken from a pool of 120 two-by-two games divided into three categories: 20 benchmark games classified as standard in the original database, 20 historical anomaly games, and 80 anomaly games generated by our ML algorithm.
To ensure uniform exposure, the 120 games are divided into 4 fixed blocks using a block-randomization design. Each block contains exactly 30 games (5 standard, 5 database anomalies, and 20 ML anomalies). Participants are randomly assigned to one of these blocks upon entry. Within each assigned block, the order of the 30 games is randomized to prevent sequence effects. Participants are randomly matched with each other, ensuring they face a different player at each game. Each player sees the game as the row player and, therefore, has to choose between playing "Up" or "Down". At the end of the survey, 4 games will be drawn at random to determine payment.
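The block construction described above can be sketched as follows. This is a minimal illustration, not the study's actual assignment code; the game IDs and seed are hypothetical, and the actual pool of 120 games is not public.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical game IDs for the three categories (illustrative only).
standard = [f"S{i}" for i in range(20)]
db_anomalies = [f"D{i}" for i in range(20)]
ml_anomalies = [f"M{i}" for i in range(80)]

# Shuffle each category, then deal games into 4 fixed blocks of
# 5 standard + 5 database anomalies + 20 ML anomalies each.
random.shuffle(standard)
random.shuffle(db_anomalies)
random.shuffle(ml_anomalies)

blocks = []
for b in range(4):
    block = (standard[5 * b:5 * (b + 1)]
             + db_anomalies[5 * b:5 * (b + 1)]
             + ml_anomalies[20 * b:20 * (b + 1)])
    blocks.append(block)

# On entry, a participant is assigned one block; within the block,
# the play order of the 30 games is randomized.
assigned = random.choice(blocks)
play_order = random.sample(assigned, k=len(assigned))
print(len(play_order))  # 30
```

Dealing shuffled category lists into fixed slices guarantees every block has exactly the 5/5/20 composition while still randomizing which games land in which block.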
At the end of each game, participants are asked to report their perceived difficulty of the game.
After completing the 30 games and rating their perceived difficulty, participants complete a lottery-choice task, a donation task, and a short IQ test to elicit risk aversion, altruism, and cognitive ability, respectively. The lottery task and the donation task are incentivized with payments.
Experimental Design Details
Not available
Randomization Method
Randomization into blocks and the matching of participants are both done by computer.
Randomization Unit
Randomizations are at the individual level.
Was the treatment clustered?
No

Experiment Characteristics

Sample size: planned number of clusters
200 participants.
Sample size: planned number of observations
We will have 24,000 observations at the player-game-model level (200 participants x 30 games x 4 models).
Sample size (or number of clusters) by treatment arms
Of all actions, 2/12 will concern standard games, 2/12 anomaly games from the original database, and 8/12 newly generated ML anomaly games.
Minimum detectable effect size for main outcomes (accounting for sample design and clustering)
The minimum sample size needed to detect an effect is estimated at 2,500 observations.
IRB

Institutional Review Boards (IRBs)

IRB Name
Institut Louis Bachelier, Institutional Review Board IRB00013336
IRB Approval Date
2026-04-02
IRB Approval Number
ILB-2026-005
Analysis Plan

There is information in this trial unavailable to the public.