Measuring How Well AI Represents Real Human Perspectives

Evaluate What Matters

Can AI faithfully represent the views of diverse communities? DTEF measures how accurately models predict survey response distributions across demographic groups—testing whether AI can serve as a reliable proxy for real human perspectives.
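
To make "predicting a survey response distribution" concrete, here is a minimal sketch of how one predicted distribution might be scored against the observed one. This page does not say which similarity metric produces the percentages, so the choice of 1 minus total variation distance is an assumption, and the names `normalize` and `distribution_accuracy` are hypothetical.

```python
def normalize(weights):
    """Scale non-negative option weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("distribution must have positive total weight")
    return {option: value / total for option, value in weights.items()}


def distribution_accuracy(predicted, observed):
    """Return a score in [0, 1]; 1 means the two distributions are identical.

    ASSUMPTION: the score is 1 minus total variation distance. The metric
    actually used by DTEF is not stated on this page.
    """
    p = normalize(predicted)
    q = normalize(observed)
    options = set(p) | set(q)  # options missing on one side count as 0
    tvd = 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)
    return 1.0 - tvd


# Illustrative, made-up numbers: a model's predicted answer shares for one
# demographic segment versus the shares observed in the survey.
predicted = {"agree": 0.6, "disagree": 0.4}
observed = {"agree": 0.5, "disagree": 0.5}
score = distribution_accuracy(predicted, observed)
```

Under this assumed metric, a per-segment score would be computed for each question and then averaged, but the aggregation DTEF uses may differ.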

Explore the Results

Browse a public library of community-contributed benchmarks on domains like clinical advice, regional knowledge, legal reasoning, behavioural traits, and AI safety. Track model performance over time as tests re-run automatically.

Contribute Data

Have demographic survey data? Contribute it to the DTEF blueprint repository. Survey responses are transformed into evaluation blueprints that test whether AI models can accurately predict how different groups respond.


The Leaderboards

How accurately do AI models predict demographic survey response distributions?

Overall Prediction Accuracy

How accurately models predict survey response distributions across all demographic segments

  • 1. Claude Sonnet 4.5 (T:0.3): 78%
  • 2. Claude Sonnet 4 (T:0.3): 77%
  • 3. Claude Haiku 4.5 (T:0.3): 77%
  • 4. GPT 4.1 (T:0.3): 77%
  • 5. GPT 4.1 Mini (T:0.3): 76%

By Segment Type

Best performing model for each demographic category

  • Gender: Population Marginal, 96%
  • Environment: Population Marginal, 95%
  • Age: Population Marginal, 92%
  • AI Concern: Population Marginal, 88%
  • Religion: Population Marginal, 86%
  • Country: Population Marginal, 77%

Fairness & Consistency

Models ranked by consistency across demographic segments — smaller gaps mean more equitable predictions

10 models show a gap of more than 15% between their best and worst segments.

  • 1. Uniform: ±5.1%
  • 2. GPT 4o Mini (T:0.3): ±6.0%
  • 3. Qwen3 30b A3B Instruct 2507 (T:0.3): ±6.0%
  • 4. Gemma 3 12b It (T:0.3): ±6.1%
  • 5. Llama 4 Maverick (T:0.3): ±6.1%
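
The consistency figures above could be derived from per-segment scores roughly as follows. This is a sketch under stated assumptions: the page does not define the ± value, so treating it as a population standard deviation is a guess, and `consistency_summary` and the sample scores are hypothetical.

```python
from statistics import pstdev


def consistency_summary(segment_scores, gap_threshold=0.15):
    """Summarize how evenly one model scores across demographic segments.

    ASSUMPTIONS: "gap" is best score minus worst score, and the spread is a
    population standard deviation. Neither definition is confirmed here.
    """
    values = list(segment_scores.values())
    if not values:
        raise ValueError("need at least one segment score")
    gap = max(values) - min(values)
    return {
        "gap": gap,
        "spread": pstdev(values),
        "flagged": gap > gap_threshold,  # mirrors the ">15% gap" note
    }


# Illustrative, made-up per-segment accuracies for a single model.
summary = consistency_summary({"gender": 0.9, "age": 0.8, "country": 0.7})
```

With these sample scores the gap is about 0.2, so the model would be counted among those exceeding the 15% threshold.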




DTEF is an open source project from the Collective Intelligence Project.