Measuring How Well AI Represents Real Human Perspectives
Evaluate What Matters
Can AI faithfully represent the views of diverse communities? DTEF measures how accurately models predict survey response distributions across demographic groups—testing whether AI can serve as a reliable proxy for real human perspectives.
Explore the Results
Browse a public library of community-contributed benchmarks on domains like clinical advice, regional knowledge, legal reasoning, behavioural traits, and AI safety. Track model performance over time as tests re-run automatically.
Contribute Data
Have demographic survey data? Contribute it to the DTEF blueprint repository. Survey responses are transformed into evaluation blueprints that test whether AI models can accurately predict how different groups respond.
The Leaderboards
How accurately do AI models predict demographic survey response distributions?
Overall Prediction Accuracy
How accurately models predict survey response distributions across all demographic segments
- 1. Claude Sonnet 4.5 (T:0.3): 78%
- 2. Claude Sonnet 4 (T:0.3): 77%
- 3. Claude Haiku 4.5 (T:0.3): 77%
- 4. GPT 4.1 (T:0.3): 77%
- 5. GPT 4.1 Mini (T:0.3): 76%
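The page does not state how a prediction is scored, so the following is only a minimal sketch of one plausible approach: comparing a model's predicted answer distribution with the observed survey distribution using one minus the total variation distance. The metric choice, the function names `normalize` and `distribution_similarity`, and all numbers are assumptions made for illustration, not DTEF's actual method or data.

```python
def normalize(counts):
    """Turn raw answer counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one response")
    return {option: n / total for option, n in counts.items()}


def distribution_similarity(predicted, observed):
    """Return 1 - total variation distance between two distributions.

    Assumption: this metric is illustrative only; DTEF may score differently.
    """
    options = set(predicted) | set(observed)
    tvd = 0.5 * sum(
        abs(predicted.get(o, 0.0) - observed.get(o, 0.0)) for o in options
    )
    return 1.0 - tvd


# Hypothetical survey counts for one demographic segment (not real data).
observed = normalize({"agree": 60, "neutral": 30, "disagree": 10})
# Hypothetical model prediction for the same question.
predicted = {"agree": 0.5, "neutral": 0.3, "disagree": 0.2}

score = distribution_similarity(predicted, observed)
print(f"similarity: {score:.0%}")
```

Under this sketch, identical distributions score 1.0 and distributions with no overlap score 0.0, which makes the result easy to read as a percentage.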
By Segment Type
Best performing model for each demographic category
- Gender: Population Marginal (96%)
- Environment: Population Marginal (95%)
- Age: Population Marginal (92%)
- AI Concern: Population Marginal (88%)
- Religion: Population Marginal (86%)
- Country: Population Marginal (77%)
Fairness & Consistency
Models ranked by consistency across demographic segments. Smaller gaps mean more equitable predictions.
- 1. Uniform: ±5.1%
- 2. GPT 4o Mini (T:0.3): ±6.0%
- 3. Qwen3 30b A3B Instruct 2507 (T:0.3): ±6.0%
- 4. Gemma 3 12b It (T:0.3): ±6.1%
- 5. Llama 4 Maverick (T:0.3): ±6.1%
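How the ± figure is derived is not stated on this page. As a rough sketch, one way to summarize consistency is to take a model's per-segment scores and report their spread. The code below assumes the spread is a population standard deviation and also reports the max-minus-min gap; `consistency_summary` and the segment scores are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean, pstdev


def consistency_summary(segment_scores):
    """Summarize how evenly a model scores across demographic segments.

    Assumption: spread is the population standard deviation of the
    per-segment scores; the real leaderboard may use another measure.
    """
    scores = list(segment_scores.values())
    if not scores:
        raise ValueError("segment_scores must not be empty")
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),
        "gap": max(scores) - min(scores),
    }


# Hypothetical per-segment scores for a single model (not real results).
example_scores = {"18-25": 0.80, "56-65": 0.70, "Female": 0.90}

summary = consistency_summary(example_scores)
print(f"mean {summary['mean']:.1%}, spread ±{summary['spread']:.1%}")
```

A small spread only says the scores are even across segments. It says nothing about whether they are high, so it is worth reading alongside the accuracy ranking.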
Featured Evaluations
Our most comprehensive and community-valued evaluations
Global Dialogues GD1 - Buddhism (with context)
DTEF: Predict response distributions for Buddhism. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Global Dialogues GD1 - Female (with context)
DTEF: Predict response distributions for Female. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Global Dialogues GD1 - United States (with context)
DTEF: Predict response distributions for United States. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Other Evaluations
Global Dialogues GD1 - Kenya (with context)
DTEF: Predict response distributions for Kenya. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Global Dialogues GD1 - Japan (with context)
DTEF: Predict response distributions for Japan. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Global Dialogues GD1 - Brazil (with context)
DTEF: Predict response distributions for Brazil. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Global Dialogues GD1 - More excited than concerned (with context)
DTEF: Predict response distributions for More excited than concerned. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Global Dialogues GD1 - 56-65 (with context)
DTEF: Predict response distributions for 56-65. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Global Dialogues GD1 - 18-25 (with context)
DTEF: Predict response distributions for 18-25. Source: Global Dialogues GD1 (https://github.com/collect-intel/global-dialogues)
Browse by Category
- DTEF: 9 blueprints
- Demographic: 9 blueprints
- Global Dialogues GD1: 9 blueprints
- Instruction Following & Prompt Adherence: 9 blueprints
- Reasoning: 9 blueprints
- General Knowledge: 9 blueprints
- Efficiency & Succinctness: 7 blueprints
- Factual Accuracy & Hallucination: 5 blueprints
- Mathematics & Statistics: 3 blueprints
- Problem Solving: 1 blueprint
DTEF is an open source project from the Collective Intelligence Project.