Latest Validation: Evaluating anthropomorphic alignment in next-gen models. New data for Anthropic Claude Haiku 4.5 included.

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

A reusable platform that converts human-study research papers into a standardized testbed where AI agents replay human-subject experiments end-to-end, and that evaluates agent alignment with human participants at the level of scientific inference.

What is HumanStudy-Bench?

HumanStudy-Bench treats participant simulation as an agent design problem. It provides a standardized testbed with two parts: an Execution Engine that reconstructs full experimental protocols from published studies, and a Benchmark with standardized evaluation metrics. Together they replay human-subject experiments end-to-end and evaluate alignment at the level of scientific inference.

Standardized Testbed

Test different agent designs on the same experiments, run agents through real studies covering 6,000+ trials, and compare results rigorously using inferential-level metrics.

12 Foundational Studies: covering major behavioral phenomena
6,000+ Experimental Trials: replayed with AI agents
10 to 2,000+ Human Sample Range: per-study participant count
2 Evaluation Metrics: PAS and ECS for alignment

Pipeline Architecture

From published human studies to a reusable simulation environment in four stages.

Stage 1: Filter

Curates human studies that are scientifically important and practically reproducible, ensuring full experimental details, quantifiable outcomes, and simulation feasibility.

Stage 2: Extract

Extracts participants' profiles, experimental designs, statistical tests, and human ground-truth outcomes from unstructured papers into machine-executable representations.
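To make "machine-executable representation" concrete, here is a minimal sketch of what an extracted study record could look like. The class and field names (`StudySpec`, `Condition`, `StatisticalTest`) are hypothetical and chosen for illustration; the platform's actual schema is not described here, and the numeric values are placeholders rather than figures from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str           # label of the experimental condition
    instructions: str   # text shown to a participant in this condition


@dataclass(frozen=True)
class StatisticalTest:
    name: str                 # e.g. "chi_square"
    hypothesis: str           # plain-language statement of the tested effect
    human_significant: bool   # whether the original paper reported an effect


@dataclass(frozen=True)
class StudySpec:
    study_id: str
    citation: str
    domain: str
    n_participants: int
    conditions: tuple[Condition, ...]
    tests: tuple[StatisticalTest, ...] = ()


# Hypothetical record for one study in the dataset.
# n_participants and the instruction strings are placeholders.
spec = StudySpec(
    study_id="framing_of_decisions",
    citation="Tversky & Kahneman, 1981",
    domain="Individual Cognition",
    n_participants=100,  # placeholder, not taken from the paper
    conditions=(
        Condition("gain_frame", "<gain-frame instructions>"),
        Condition("loss_frame", "<loss-frame instructions>"),
    ),
    tests=(
        StatisticalTest(
            name="chi_square",
            hypothesis="Choice proportions differ between frames",
            human_significant=True,
        ),
    ),
)

print(spec.study_id, len(spec.conditions), len(spec.tests))
```

Frozen dataclasses keep an extracted record immutable once it is built, so every agent design is run against the same specification.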

Stage 3: Execute

Runs agent designs through reconstructed experimental protocols, generating trial-level data via a shared execution engine that handles agent sampling, instruction dispatch, and response collection.
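The three responsibilities named above (agent sampling, instruction dispatch, response collection) can be sketched as a single loop. This is an illustrative sketch under assumptions, not the platform's engine: the `Agent` callable signature, the `run_study` function, and the trial record layout are all hypothetical, and the stub agent stands in for a real model call.

```python
import random
from typing import Callable

# Hypothetical agent interface: (persona, instructions) -> raw text response.
Agent = Callable[[str, str], str]


def run_study(agent: Agent, personas: list[str], conditions: dict[str, str],
              n_trials: int, seed: int = 0) -> list[dict]:
    """Generate trial-level data by replaying one protocol n_trials times."""
    rng = random.Random(seed)           # seeded so a run can be repeated
    condition_names = sorted(conditions)
    trials = []
    for trial_id in range(n_trials):
        persona = rng.choice(personas)                     # agent sampling
        condition = rng.choice(condition_names)            # random assignment
        response = agent(persona, conditions[condition])   # instruction dispatch
        trials.append({                                    # response collection
            "trial_id": trial_id,
            "persona": persona,
            "condition": condition,
            "response": response,
        })
    return trials


def stub_agent(persona: str, instructions: str) -> str:
    """Stand-in for a model call; always answers 'A'."""
    return "A"


trials = run_study(
    stub_agent,
    personas=["<participant profile 1>", "<participant profile 2>"],
    conditions={"gain_frame": "<gain-frame instructions>",
                "loss_frame": "<loss-frame instructions>"},
    n_trials=6,
)
print(len(trials), trials[0]["condition"])
```

Because the loop only depends on the `Agent` callable, different agent designs can be swapped in without changing the protocol they are run through.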

Stage 4: Evaluate

Compares agent responses against human ground truth using the Probability Alignment Score (PAS) for inferential agreement and the Effect Consistency Score (ECS) for effect-size alignment.

Evaluation Metrics

PAS: Probability Alignment Score

Measures whether agents reach the same scientific conclusions as humans at the phenomenon level. It quantifies the probability that agent and human populations exhibit behavior consistent with the same hypothesis.
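One way to read "the probability that agent and human populations exhibit behavior consistent with the same hypothesis" is as an agreement probability. The sketch below is an illustration under stated assumptions, not the benchmark's published formula. It assumes the evidence from each population has already been summarized as a probability that the hypothesized effect is present (`p_human` and `p_agent`, both hypothetical names), and that the two probabilities are independent.

```python
def agreement_probability(p_human: float, p_agent: float) -> float:
    """Probability that both populations support, or both fail to support,
    the same hypothesis (assumes the two probabilities are independent)."""
    for p in (p_human, p_agent):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    both_show_effect = p_human * p_agent
    neither_shows_effect = (1.0 - p_human) * (1.0 - p_agent)
    return both_show_effect + neither_shows_effect


def mean_agreement(pairs: list[tuple[float, float]]) -> float:
    """Average agreement over several tests (an assumed aggregation rule)."""
    if not pairs:
        raise ValueError("need at least one (p_human, p_agent) pair")
    return sum(agreement_probability(h, a) for h, a in pairs) / len(pairs)


# Made-up probabilities for illustration only, not benchmark data.
print(agreement_probability(0.9, 0.8))
print(mean_agreement([(1.0, 1.0), (1.0, 0.0)]))
```

Under this reading, an agent that is maximally uncertain (`p_agent = 0.5`) scores 0.5 on every test regardless of the human result, which is why a score near 50% should not be taken as strong alignment.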

ECS: Effect Consistency Score

Measures how closely agents reproduce the magnitude and pattern of human behavioral effects at the data level. It assesses both the precision and accuracy of agent responses compared to human ground truth.
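A well-known statistic that rewards both precision (agent and human effects rise and fall together) and accuracy (they sit on the same scale) is Lin's concordance correlation coefficient, which ranges from -1 to 1. Whether ECS is computed this way is not stated here, so the sketch below should be read as an illustration of the idea rather than the benchmark's definition. The function name and the example effect sizes are made up.

```python
from statistics import fmean


def concordance_correlation(human: list[float], agent: list[float]) -> float:
    """Lin's concordance correlation coefficient between paired effect sizes."""
    if len(human) != len(agent) or len(human) < 2:
        raise ValueError("need two equal-length lists with at least two effects")
    mean_h, mean_a = fmean(human), fmean(agent)
    var_h = fmean([(h - mean_h) ** 2 for h in human])
    var_a = fmean([(a - mean_a) ** 2 for a in agent])
    cov = fmean([(h - mean_h) * (a - mean_a) for h, a in zip(human, agent)])
    # The squared mean difference penalizes systematic over- or under-shooting.
    denom = var_h + var_a + (mean_h - mean_a) ** 2
    if denom == 0:
        raise ValueError("undefined when both lists are constant and equal")
    return 2 * cov / denom


# Made-up effect sizes for illustration only, not benchmark data.
human_effects = [1.0, 2.0, 3.0]
print(concordance_correlation(human_effects, [1.0, 2.0, 3.0]))  # same pattern, same scale
print(concordance_correlation(human_effects, [2.0, 3.0, 4.0]))  # same pattern, shifted
print(concordance_correlation(human_effects, [3.0, 2.0, 1.0]))  # reversed pattern
```

Unlike a plain Pearson correlation, the shifted case scores below 1 even though the pattern is reproduced perfectly, because the magnitudes are off.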

Leaderboard

Evaluating agent design alignment with human behavior using Probability Alignment Score (PAS) and Effect Consistency Score (ECS) across 12 foundational human-subject studies.

Rank | Model | Variant | PAS (Alignment) | ECS | Cost | Tokens
1 | gemini-3-flash-preview | v3_human_plus_demo | 49.7% | 0.1593 | $2.7883 | 3,891,506
2 | gemini-3-flash-preview | v4_background | 46.5% | 0.0076 | $5.2743 | 7,786,439
3 | gpt-5-nano | v4_background | 45.9% | 0.0498 | $0.3919 | 5,197,263
4 | mistral-nemo | v3_human_plus_demo | 44.0% | 0.1311 | $0.4451 | 4,789,725
5 | qwen3-next-80b-a3b-instruct | v4_background | 43.4% | 0.1142 | $0.9527 | 4,695,595
6 | mistral-nemo | v4_background | 43.2% | 0.0389 | $0.2004 | 6,973,836
7 | mistral-nemo | v1_empty | 42.7% | 0.0700 | $0.3916 | 4,277,209
8 | gpt-oss-20b | v1_empty | 41.9% | 0.0284 | $1.4158 | 8,450,730
9 | gpt-oss-20b | v3_human_plus_demo | 41.8% | 0.0223 | $1.1677 | 7,561,478
10 | mistral-nemo | v2_human | 41.1% | 0.0339 | $0.3872 | 4,370,464
11 | ai-grok-4.1-fast-none | v3_human_plus_demo | 41.0% | 0.0030 | $0.8498 | 7,037,212
12 | gpt-5-nano | v3_human_plus_demo | 40.1% | 0.0284 | $2.6135 | 9,988,192
13 | mistral-small-creative | v3_human_plus_demo | 39.3% | 0.0124 | $0.4529 | 3,867,556
14 | claude-haiku-4.5 | v4_background | 38.9% | 0.0707 | $8.2819 | 5,970,093
15 | gpt-oss-20b | v4_background | 38.8% | 0.0674 | $1.2320 | 11,628,944
16 | gpt-5-nano | v2_human | 37.7% | 0.0078 | $2.8796 | 11,400,115
17 | deepseek-v3.2 | v4_background | 37.4% | 0.0249 | $3.0434 | 9,653,594
18 | gpt-oss-120b | v3_human_plus_demo | 37.2% | 0.0557 | $1.7409 | 6,221,295
19 | gemini-3-flash-preview | v2_human | 37.0% | 0.0962 | $2.7641 | 3,651,210
20 | gemini-3-flash-preview | v1_empty | 36.8% | 0.1657 | $2.9355 | 3,678,290
21 | mistral-small-creative | v4_background | 35.9% | 0.0832 | $0.6348 | 5,678,107
22 | gpt-5-nano | v1_empty | 35.6% | 0.0650 | $6.5044 | 19,518,351
23 | qwen3-next-80b-a3b-instruct | v3_human_plus_demo | 35.1% | 0.1445 | $0.8590 | 3,843,866
24 | qwen3-next-80b-a3b-instruct | v1_empty | 34.9% | 0.1138 | $0.8090 | 3,386,421
25 | claude-haiku-4.5 | v3_human_plus_demo | 34.0% | 0.0213 | $6.4699 | 4,450,022
26 | gpt-oss-120b | v4_background | 33.7% | 0.0074 | $0.9912 | 8,149,909
27 | deepseek-v3.2 | v2_human | 33.7% | 0.0124 | $0.8018 | 3,302,727
28 | ai-grok-4.1-fast-none | v4_background | 33.4% | -0.0279 | $1.2736 | 7,232,883
29 | gpt-oss-120b | v2_human | 33.3% | 0.0184 | $1.6862 | 6,130,775
30 | qwen3-next-80b-a3b-instruct | v2_human | 33.1% | 0.1798 | $0.8273 | 3,474,833
31 | gpt-oss-20b | v2_human | 33.0% | 0.0036 | $1.3697 | 8,117,261
32 | ai-grok-4.1-fast-none | v1_empty | 31.9% | 0.0717 | $0.5784 | 5,291,271
33 | claude-haiku-4.5 | v1_empty | 30.4% | 0.0066 | $9.2877 | 4,737,817
34 | ai-grok-4.1-fast-none | v2_human | 29.9% | 0.0057 | $0.5012 | 5,140,245
35 | deepseek-v3.2 | v3_human_plus_demo | 29.7% | 0.0516 | $1.0522 | 3,753,460
36 | claude-haiku-4.5 | v2_human | 29.3% | 0.0586 | $10.1626 | 4,988,373
37 | deepseek-v3.2 | v1_empty | 29.3% | 0.0462 | $0.8000 | 3,253,924
38 | gpt-oss-120b | v1_empty | 28.5% | -0.0136 | $1.6939 | 6,049,728
39 | mistral-small-creative | v1_empty | 25.9% | 0.0445 | $0.6975 | 4,905,549
40 | mistral-small-creative | v2_human | 12.6% | -0.0031 | $0.6422 | 4,482,800

Study Dataset

A curated collection of 12 foundational human-subject studies spanning individual cognition, strategic interaction, and social psychology, all with complete experimental materials and clearly specified statistical tests.

The False Consensus Effect

Ross et al., 1977
Individual Cognition
Phenomenon: False consensus bias

Measures of Anchoring

Jacowitz & Kahneman, 1995
Individual Cognition
Phenomenon: Anchoring effect

Framing of Decisions

Tversky & Kahneman, 1981
Individual Cognition
Phenomenon: Framing effect

Subjective Probability

Kahneman & Tversky, 1972
Individual Cognition
Phenomenon: Representativeness heuristic

Intentional Action

Knobe, 2003
Social Psychology
Phenomenon: Knobe effect

Forming Impressions

Asch, 1946
Social Psychology
Phenomenon: Primacy effect

Social Categorization

Billig & Tajfel, 1973
Social Psychology
Phenomenon: Minimal group paradigm

Pluralistic Ignorance

Prentice & Miller, 1993
Social Psychology
Phenomenon: Pluralistic ignorance

Guessing Games

Nagel, 1995
Strategic Interaction
Phenomenon: Keynesian beauty contest

Thinking through Uncertainty

Shafir & Tversky, 1992
Strategic Interaction
Phenomenon: Disjunction effect

Fairness in Bargaining

Forsythe et al., 1994
Strategic Interaction
Phenomenon: Dictator game giving

Trust and Reciprocity

Berg et al., 1995
Strategic Interaction
Phenomenon: Trust game