Research Findings

Can AI models serve as “digital twins” that approximate how different communities think? Here are some preliminary observations from our ongoing research.

What is a “Digital Twin”?

Imagine asking an AI: “How would 18–25 year olds in Brazil respond to this policy question?” If the model could reliably predict their collective response distribution, it would function as a digital twin of that demographic group — a computational proxy that approximates the views of a real community.

If this were possible, it could change how we understand public opinion, design inclusive policies, and think about AI systems that claim to represent diverse perspectives. But an open question remains: can today’s AI models actually do this?

The Digital Twin Evaluation Framework (DTEF) is an early-stage research project that attempts to answer that question. Using real survey data from the Global Dialogues project — 8 rounds of surveys covering topics from AI governance to social values — we test 24 AI models on their ability to predict how specific demographic groups actually responded. These are preliminary findings from an ongoing investigation.

• 24 AI models tested across major providers
• 8 survey rounds (Global Dialogues GD1-GD7)
• 929,597 data points (segment-question scores)
• 86.8% of model pairs statistically distinguishable

1. The Baseline Challenge

Before evaluating AI models, we established three simple baselines that require no AI at all. If models can’t beat these, it suggests they may not yet be adding value for demographic-specific prediction.

Baseline / Model       Score
Population Marginal    0.833
Claude Sonnet 4.5      0.767
Random Segment         0.761
Average Model          0.730
Uniform (Random)       0.647

Key finding: No AI model yet outperforms the population marginal baseline.

The population marginal simply predicts the overall population’s answer distribution, ignoring demographics entirely. Its score of 0.833 means that knowing “what people in general think” is still a better predictor than any AI model’s attempt to account for demographic differences. The best model (Claude Sonnet 4.5) scores 0.767 — a gap of 0.066.

The random segment baseline (0.761) shuffles which demographic group’s answers go with which question — it measures how well you’d score if group identity were irrelevant. All models significantly outperform the uniform baseline (0.647), which guesses equal probability for every option. Models have learned what people in general think, but not yet how specific demographics differ from the average.
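
To make these baselines concrete, here is a minimal sketch of how they could be computed for a single question. It is illustrative rather than DTEF's actual code: it assumes responses are stored as per-segment answer-count vectors, scores predictions with 1 minus total variation distance (the report does not specify the exact similarity metric), and implements the random-segment baseline as a shuffle of segment labels within one question.

    import numpy as np

    def score(predicted, actual):
        # Similarity between two answer distributions.
        # Assumed metric: 1 - total variation distance; the actual
        # DTEF scoring rule is not specified in this report.
        p = np.asarray(predicted, float)
        a = np.asarray(actual, float)
        p, a = p / p.sum(), a / a.sum()
        return 1.0 - 0.5 * np.abs(p - a).sum()

    def baselines_for_question(counts_by_segment, rng):
        # counts_by_segment: {segment name: answer-count vector}
        segments = list(counts_by_segment)
        pooled = np.sum([counts_by_segment[s] for s in segments], axis=0)
        uniform = np.ones_like(pooled, dtype=float)
        donors = rng.permutation(segments)  # random-segment shuffle
        return {s: {"population_marginal": score(pooled, counts_by_segment[s]),
                    "random_segment": score(counts_by_segment[d], counts_by_segment[s]),
                    "uniform": score(uniform, counts_by_segment[s])}
                for s, d in zip(segments, donors)}

    # Hypothetical toy data: one question, three age segments.
    rng = np.random.default_rng(0)
    counts = {"18-25": np.array([40, 10, 5]),
              "26-40": np.array([20, 25, 10]),
              "41+":   np.array([10, 30, 15])}
    print(baselines_for_question(counts, rng))

Averaging such scores over every segment-question pair would yield the kind of aggregate baseline numbers shown above.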

2. Model Differences Are Statistically Meaningful

Despite all models falling below the population marginal baseline, their differences appear statistically meaningful. Permutation testing (10,000 iterations) with Holm-Bonferroni correction suggests that 237 of 273 model pairs (86.8%) are significantly different at the 0.05 level. However, statistical significance does not necessarily imply practical importance — many differences are small.
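
A sketch of that procedure, assuming each model has a vector of per-question scores and pairs are compared with a sign-flipping paired permutation test (the report names the test and correction, but not the exact statistic):

    import numpy as np

    def paired_permutation_pvalue(scores_a, scores_b, n_iter=10_000, seed=0):
        # Two-sided paired permutation test: randomly flip the sign of
        # each per-question score difference and compare the null
        # distribution of mean differences to the observed mean.
        rng = np.random.default_rng(seed)
        d = np.asarray(scores_a) - np.asarray(scores_b)
        observed = abs(d.mean())
        signs = rng.choice([-1.0, 1.0], size=(n_iter, d.size))
        null = np.abs((signs * d).mean(axis=1))
        return (1 + np.sum(null >= observed)) / (n_iter + 1)

    def holm_bonferroni(pvals, alpha=0.05):
        # Step-down Holm-Bonferroni: test p-values smallest-first against
        # alpha / (m - k); once one fails, all remaining ones fail too.
        pvals = np.asarray(pvals)
        reject = np.zeros(len(pvals), dtype=bool)
        for k, i in enumerate(np.argsort(pvals)):
            if pvals[i] > alpha / (len(pvals) - k):
                break
            reject[i] = True
        return reject

Running this over every model pair and counting rejections produces the kind of "N of M pairs significant" summary quoted above.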

Rank  Model              Score
1     Claude Sonnet 4.5  0.767
2     Claude 3.7 Sonnet  0.765
3     Claude Sonnet 4    0.761
4     GPT-5.1            0.761
5     Claude Haiku 4.5   0.754
6     GPT-4.1            0.753
7     GPT-4o             0.749
8     GPT-5              0.749
9     Mistral Medium 3   0.742
10    GPT-4.1 Mini       0.742

Top 10 of 24 models; the population marginal baseline (0.833) is shown for comparison. Full rankings are on the demographics page.

3. Not All Demographics Are Equal

The reliability of these evaluations depends heavily on how many survey respondents we have per demographic segment. Categories like gender (avg. 516 respondents) produce relatively stable benchmarks, while country-level segments (avg. 33 respondents) have enough sampling noise that apparent model differences may not be real. This is a significant limitation of the current dataset.

Category     Avg. Respondents  Noise Floor  Reliable Pairs
Gender       516               0.928        99.3%
AI Concern   350               0.899        100.0%
Environment  350               0.888        100.0%
Age          203               0.850        94.7%
Religion     149               0.781        73.6%
Country      33                0.640        30.7%
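
The noise floor column can be read as roughly the best score even a perfect predictor could achieve against a sample of that size. One hedged way to estimate such a floor, reusing the score function from the baseline sketch above (the report does not give DTEF's exact formula):

    def noise_floor(counts, n_iter=1_000, seed=0):
        # Resample the observed answers and score the empirical
        # distribution against each resample: with few respondents,
        # even the "true" distribution scores poorly against a sample.
        rng = np.random.default_rng(seed)
        counts = np.asarray(counts, float)
        n = int(counts.sum())
        p = counts / n
        resamples = rng.multinomial(n, p, size=n_iter)
        return float(np.mean([score(p, r) for r in resamples]))

    # Hypothetical: the same 60/30/10 answer split observed by 516 vs. 33 people.
    print(noise_floor(np.array([310, 155, 51])))  # n = 516: floor near 1
    print(noise_floor(np.array([20, 10, 3])))     # n = 33: noticeably lower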

4. Evidence-Adapting vs. Stereotype-Holding

When we give models more information about a demographic group (showing them how the group answered other survey questions), do they use that evidence to improve their predictions — or do they ignore it and rely on stereotypes?

• Zero context: the model sees only the demographic labels and must rely on its priors.
• Adding context: 5, then 10, then all other questions are revealed; does accuracy improve?
• Full context: the model sees all of the group's other survey responses (maximum evidence).

Only 9 of 24 models show statistically significant improvement with more context:

• GPT-5: +0.39% per context question (Country, Religion, Environment)
• Qwen3-32B: +0.32% per context question (Country, Environment)
• Claude 3.7 Sonnet: +0.15% per context question (Gender, Country, Environment, Religion)

Most models don’t benefit from additional context.

Of the 24 models tested, 15 show flat or negative slopes — meaning more demographic evidence doesn’t help (or slightly hurts) their predictions. One interpretation is that these models may rely on fixed assumptions about demographic groups rather than reasoning from the provided data, though other explanations are possible. At the category level, only 19 of 93 model-category pairs survive joint statistical correction.
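
A minimal sketch of how such a context-responsiveness slope could be computed, assuming a model has been scored at several context sizes (the exact regression and correction DTEF applies are not spelled out here):

    from scipy.stats import linregress

    def context_slope(context_sizes, mean_scores):
        # Fit mean score as a linear function of the number of context
        # questions; a positive, significant slope suggests the model
        # adapts to evidence rather than leaning on fixed priors.
        fit = linregress(context_sizes, mean_scores)
        return fit.slope, fit.pvalue

    # Hypothetical scores for one model at 0, 5, 10, and 20 context questions.
    slope, p = context_slope([0, 5, 10, 20], [0.740, 0.752, 0.761, 0.779])
    print(f"{slope:+.4f} per context question (p = {p:.3f})")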

5. Confidence in the Rankings

Bootstrap resampling (1,000 iterations) shows that while model ranks are broadly stable, the score differences between adjacent models are small enough that their confidence intervals overlap.

• 22 of 23 adjacent model pairs have overlapping 95% CIs: adjacent models' score differences may not be meaningful given survey sampling uncertainty.
• 0 rank changes from sample-size weighting: weighting by respondent count (√n) produces no rank changes in this dataset, so the rankings appear stable.

In other words: broad tiers may be meaningful (top performers vs. middle vs. bottom), but don’t read too much into a model being ranked #3 vs. #4. Focus on clusters rather than individual positions.
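
Such intervals can be estimated by resampling questions with replacement; a sketch, assuming a per-question score matrix rather than DTEF's actual code:

    import numpy as np

    def bootstrap_score_cis(score_matrix, model_names, n_iter=1_000, seed=0):
        # score_matrix: (n_questions, n_models) per-question scores.
        # Resample questions with replacement, collect each model's mean
        # score, and report the 2.5th/97.5th percentiles as a 95% CI.
        rng = np.random.default_rng(seed)
        n_q = score_matrix.shape[0]
        idx = rng.integers(0, n_q, size=(n_iter, n_q))
        means = score_matrix[idx].mean(axis=1)  # shape (n_iter, n_models)
        lo, hi = np.percentile(means, [2.5, 97.5], axis=0)
        return {m: (lo[j], hi[j]) for j, m in enumerate(model_names)}

Two adjacent models whose intervals overlap, as 22 of 23 pairs do here, are better treated as a tie than as a genuine ranking difference.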

Preliminary Takeaways

The Gap Is Measurable

In our tests, AI models appear to know what people in general think, but haven’t yet learned how specific demographics differ from that average. The gap between the best model (0.767) and the population marginal baseline (0.833) gives us a concrete metric to track over time.

Progress May Be Trackable

With 86.8% of model pairs statistically distinguishable, the rankings appear to carry signal. As new model versions are released, this framework could help measure whether they're getting better at representing diverse perspectives, though more work is needed to validate that the metric reliably captures real-world representational quality.

Some Models Appear to Learn

The context responsiveness test attempts to distinguish models that reason from evidence vs. those relying on fixed priors. In our data, only a few models (notably Claude 3.7 Sonnet across 4 categories) consistently improve when given more information about a demographic group. These results warrant further investigation.

Better Data Needed

Country-level evaluation is currently unreliable (only 30.7% of data points meet quality thresholds). For this framework to meaningfully assess cross-cultural representation, larger and more diverse survey samples would be needed — particularly at the country and religion level.

What’s Next

DTEF is an early-stage, ongoing research project. These findings are preliminary and represent a snapshot from February 2026. The methodology, metrics, and interpretations are all subject to revision as we learn more.

  • Expanded datasets: Integration of additional survey sources beyond Global Dialogues to broaden cultural and topical coverage.
  • Intersectional analysis: Testing demographic combinations (e.g., “young urban women”) rather than single dimensions.
  • Temporal tracking: Measuring whether models capture opinion shifts across survey rounds over time.
  • Continuous benchmarking: Automatic re-evaluation as new model versions are released.