Measuring How Well AI Represents Real Human Perspectives

Evaluate What Matters

Can AI faithfully represent the views of diverse communities? DTEF measures how accurately models predict survey response distributions across demographic groups—testing whether AI can serve as a reliable proxy for real human perspectives.
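As a rough illustration of what "accurately predicting a response distribution" can mean, the sketch below scores a predicted distribution against an observed one. The metric (one minus total variation distance), the function name `distribution_accuracy`, and the sample numbers are all assumptions made for illustration; this page does not state which scoring rule DTEF actually uses.

```python
from typing import Mapping


def distribution_accuracy(predicted: Mapping[str, float],
                          observed: Mapping[str, float]) -> float:
    """Return 1 - total variation distance between two distributions.

    Assumption: this metric is illustrative only; the scoring rule
    DTEF really uses is not specified in this text.
    """
    options = set(predicted) | set(observed)
    p_total = sum(predicted.values())
    o_total = sum(observed.values())
    if p_total <= 0 or o_total <= 0:
        raise ValueError("each distribution needs positive total mass")
    # Normalise so that shares, percentages, or raw counts are all accepted.
    tvd = 0.5 * sum(
        abs(predicted.get(k, 0.0) / p_total - observed.get(k, 0.0) / o_total)
        for k in options
    )
    return 1.0 - tvd


# Hypothetical numbers for one question and one demographic segment.
observed = {"agree": 0.50, "neutral": 0.30, "disagree": 0.20}
predicted = {"agree": 0.40, "neutral": 0.30, "disagree": 0.30}
score = distribution_accuracy(predicted, observed)
```

Under this assumed metric, identical distributions score 1.0 and distributions with no overlap score 0.0, so a percentage-style reading of the result is natural.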

Our Methodology

Explore the Results

Browse a public library of community-contributed benchmarks in domains such as clinical advice, regional knowledge, legal reasoning, behavioural traits, and AI safety. Track model performance over time as tests re-run automatically.

Explore Featured Results

Contribute Data

Have demographic survey data? Contribute it to the DTEF blueprint repository. Survey responses are transformed into evaluation blueprints that test whether AI models can accurately predict how different groups respond.
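One way to picture the first step of that transformation is aggregating raw responses into a per-segment answer distribution that a model is later asked to predict. The sketch below assumes a hypothetical input of (segment, answer) pairs and a hypothetical helper named `response_distributions`; the actual blueprint format and pipeline are not described here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def response_distributions(
    rows: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Turn (segment, answer) pairs into per-segment answer shares."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for segment, answer in rows:
        counts[segment][answer] += 1
    # Divide each answer count by the segment total to get shares.
    return {
        segment: {a: n / sum(c.values()) for a, n in c.items()}
        for segment, c in counts.items()
    }


# Hypothetical responses to a single survey question (not real data).
rows = [
    ("Kenya", "agree"), ("Kenya", "agree"), ("Kenya", "disagree"),
    ("Japan", "agree"), ("Japan", "disagree"),
    ("Japan", "disagree"), ("Japan", "disagree"),
]
targets = response_distributions(rows)
```

Each entry in `targets` is then a ground-truth distribution for one segment, against which a model's predicted distribution could be compared.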

View Blueprint Repository

The Leaderboards

How accurately do AI models predict demographic survey response distributions?

View full analysis →

Overall Prediction Accuracy

How accurately models predict survey response distributions across all demographic segments

1. Claude Sonnet 4.5 (T:0.3): 78%
2. Claude Sonnet 4 (T:0.3): 77%
3. Claude Haiku 4.5 (T:0.3): 77%
4. GPT 4.1 (T:0.3): 77%
5. GPT 4.1 Mini (T:0.3): 76%

By Segment Type

Best performing model for each demographic category

Gender: Population Marginal, 96%
Environment: Population Marginal, 95%
Age: Population Marginal, 92%
AI Concern: Population Marginal, 88%
Religion: Population Marginal, 86%
Country: Population Marginal, 77%

Fairness & Consistency

Models ranked by consistency across demographic segments — smaller gaps mean more equitable predictions

10 models show a gap of more than 15% between their best and worst segments

1. Uniform: ±5.1%
2. GPT 4o Mini (T:0.3): ±6.0%
3. Qwen3 30b A3B Instruct 2507 (T:0.3): ±6.0%
4. Gemma 3 12b It (T:0.3): ±6.1%
5. Llama 4 Maverick (T:0.3): ±6.1%
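
The consistency figures above could be derived along the following lines. This is a sketch under stated assumptions: it treats the "±" value as the population standard deviation of a model's per-segment scores and the gap as best minus worst segment, neither of which is confirmed by this page. The function name `consistency` and the sample scores are hypothetical.

```python
from statistics import pstdev
from typing import Dict, Tuple


def consistency(segment_scores: Dict[str, float]) -> Tuple[float, float]:
    """Return (best-worst gap, population standard deviation)."""
    values = list(segment_scores.values())
    if not values:
        raise ValueError("need at least one segment score")
    return max(values) - min(values), pstdev(values)


# Hypothetical per-segment accuracies (in percent) for one model.
scores = {"Gender": 90.0, "Age": 85.0, "Religion": 80.0, "Country": 73.0}
gap, spread = consistency(scores)
# Assumed reading of the ">15% gap" threshold quoted on this page.
flagged = gap > 15.0
```

Ranking models by `spread` in ascending order would put the most consistent ones first, while `flagged` marks models whose best and worst segments differ by more than the quoted threshold.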

Featured Evaluations

Our most comprehensive and community-valued evaluations

Global Dialogues GD1 - Buddhism (with context)

DTEF: Predict response distributions for Buddhism. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)

Avg. Score: 82.9%

Global Dialogues GD1 - Female (with context)

DTEF: Predict response distributions for Female. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)

Avg. Score: 82.8%

Global Dialogues GD1 - United States (with context)

DTEF: Predict response distributions for United States. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)

Avg. Score: 80.4%

Other Evaluations

View All Evaluations »

Global Dialogues GD1 - Kenya (with context)

DTEF: Predict response distributions for Kenya. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)

Avg. Score: 78.5%

Global Dialogues GD1 - Japan (with context)

DTEF: Predict response distributions for Japan. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)

Avg. Score: 73.7%

Global Dialogues GD1 - Brazil (with context)

DTEF: Predict response distributions for Brazil. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)

Avg. Score: 76.9%

Global Dialogues GD1 - More excited than concerned (with context)

DTEF: Predict response distributions for More excited than concerned. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)

Avg. Score: 83.6%
