Research Findings
Can AI models serve as “digital twins” that approximate how different communities think? Here are some preliminary observations from our ongoing research.
What is a “Digital Twin”?
Imagine asking an AI: “How would 18–25 year olds in Brazil respond to this policy question?” If the model could reliably predict their collective response distribution, it would function as a digital twin of that demographic group — a computational proxy that approximates the views of a real community.
If this were possible, it could change how we understand public opinion, design inclusive policies, and think about AI systems that claim to represent diverse perspectives. But an open question remains: can today’s AI models actually do this?
The Digital Twin Evaluation Framework (DTEF) is an early-stage research project that attempts to answer that question. Using real survey data from the Global Dialogues project — 8 rounds of surveys covering topics from AI governance to social values — we test 24 AI models on their ability to predict how specific demographic groups actually responded. These are preliminary findings from an ongoing investigation.
1. The Baseline Challenge
Before evaluating AI models, we established three simple baselines that require no AI at all. If models can’t beat these, it suggests they may not yet be adding value for demographic-specific prediction.
Key finding: No AI model yet outperforms the population marginal baseline.
The population marginal simply predicts the overall population’s answer distribution, ignoring demographics entirely. Its score of 0.833 means that knowing “what people in general think” is still a better predictor than any AI model’s attempt to account for demographic differences. The best model (Claude Sonnet 4.5) scores 0.767 — a gap of 0.066.
The random segment baseline (0.761) shuffles which demographic group’s answers go with which question — it measures how well you’d score if group identity were irrelevant. All models significantly outperform the uniform baseline (0.647), which guesses equal probability for every option. Models have learned what people in general think, but not yet how specific demographics differ from the average.
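The three baselines can be sketched concretely. The scoring metric below (1 minus total variation distance between predicted and observed answer distributions) is an illustrative assumption, not necessarily DTEF's actual scoring rule, and the distributions are made up for the example:

```python
import numpy as np

def similarity(pred, actual):
    """Score a predicted answer distribution against the observed one.
    Illustrative metric: 1 - total variation distance (1.0 = perfect match)."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return 1.0 - 0.5 * np.abs(pred - actual).sum()

# Hypothetical observed distributions for one 4-option question.
overall = np.array([0.40, 0.30, 0.20, 0.10])   # whole population
segment = np.array([0.55, 0.25, 0.15, 0.05])   # one demographic group

# Uniform baseline: equal probability on every option.
uniform = np.full(4, 0.25)

# Population marginal baseline: predict the overall distribution
# for every segment, ignoring demographics entirely.
print(similarity(overall, segment))  # marginal baseline on this segment
print(similarity(uniform, segment))  # uniform baseline on this segment
```

A model beats the population marginal only if its demographic-specific predictions land closer to each segment's distribution than the overall distribution does, averaged across all segments and questions.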
2. Model Differences Are Statistically Meaningful
Despite all models falling below the population marginal baseline, their differences appear statistically meaningful. Permutation testing (10,000 iterations) with Holm-Bonferroni correction suggests that 237 of 273 model pairs (86.8%) are significantly different at the 0.05 level. However, statistical significance does not necessarily imply practical importance — many differences are small.
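A minimal sketch of this comparison machinery, assuming each model has an array of per-question scores (the exact DTEF test statistic is not specified here, so the paired sign-flip test and the demo data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_test(a, b, n_iter=10_000):
    """Paired permutation (sign-flip) test on per-question score
    differences between two models. Returns a one-sided-style p-value
    for |mean difference| under the null of no difference."""
    diff = np.asarray(a, float) - np.asarray(b, float)
    observed = abs(diff.mean())
    signs = rng.choice([-1, 1], size=(n_iter, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    return (null >= observed).mean()

def holm_bonferroni(pvals, alpha=0.05):
    """Holm's step-down correction: returns a boolean 'significant'
    flag per hypothesis, controlling family-wise error at alpha."""
    pvals = np.asarray(pvals, float)
    order = np.argsort(pvals)
    m = pvals.size
    significant = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # step-down stops at the first failure
    return significant

# Example: only the smallest of three p-values survives correction.
print(holm_bonferroni([0.001, 0.03, 0.04]))
```

With 24 models there are hundreds of pairs, so an uncorrected 0.05 threshold would produce many false positives; the Holm step-down keeps the family-wise error rate at 0.05.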
| # | Model | Score |
|---|---|---|
| 1 | Claude Sonnet 4.5 | 0.767 |
| 2 | Claude 3.7 Sonnet | 0.765 |
| 3 | Claude Sonnet 4 | 0.761 |
| 4 | GPT-5.1 | 0.761 |
| 5 | Claude Haiku 4.5 | 0.754 |
| 6 | GPT-4.1 | 0.753 |
| 7 | GPT-4o | 0.749 |
| 8 | GPT-5 | 0.749 |
| 9 | Mistral Medium 3 | 0.742 |
| 10 | GPT-4.1 Mini | 0.742 |

Top 10 of 24 models. All scores fall below the population marginal baseline (0.833). Full rankings on the demographics page.
3. Not All Demographics Are Equal
The reliability of these evaluations depends heavily on how many survey respondents we have per demographic segment. Categories like gender (avg. 516 respondents) produce relatively stable benchmarks, while country-level segments (avg. 33 respondents) have enough sampling noise that apparent model differences may not be real. This is a significant limitation of the current dataset.
| Category | Avg. Respondents | Noise Floor | Quality |
|---|---|---|---|
| Gender | 516 | 0.928 | 99.3% |
| AI Concern | 350 | 0.899 | 100.0% |
| Environment | 350 | 0.888 | 100.0% |
| Age | 203 | 0.850 | 94.7% |
| Religion | 149 | 0.781 | 73.6% |
| Country | 33 | 0.640 | 30.7% |
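One standard way to estimate such a noise floor is split-half reliability: repeatedly split a segment's respondents in half and score one half's answer distribution against the other's. The sketch below uses that approach with 1 minus total variation distance as an illustrative score; the exact DTEF procedure may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

def noise_floor(responses, n_options, n_splits=200):
    """Split-half estimate of the best achievable score for a segment.
    `responses` is an array of chosen-option indices, one per respondent."""
    scores = []
    for _ in range(n_splits):
        shuffled = rng.permutation(responses)
        half = len(shuffled) // 2
        p = np.bincount(shuffled[:half], minlength=n_options) / half
        q = np.bincount(shuffled[half:], minlength=n_options) / (len(shuffled) - half)
        scores.append(1.0 - 0.5 * np.abs(p - q).sum())
    return float(np.mean(scores))

# Simulated segments: large samples give a higher (tighter) floor.
big   = rng.integers(0, 4, size=516)   # ~gender-sized segment
small = rng.integers(0, 4, size=33)    # ~country-sized segment
floor_big = noise_floor(big, 4)
floor_small = noise_floor(small, 4)
print(floor_big > floor_small)  # → True
```

Because sampling error shrinks roughly with the square root of sample size, a 33-respondent segment's observed distribution is itself so noisy that even a perfect model cannot score much above its floor.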
4. Evidence-Adapting vs. Stereotype-Holding
When we give models more information about a demographic group (showing them how the group answered other survey questions), do they use that evidence to improve their predictions — or do they ignore it and rely on stereotypes?
Only 9 of 24 models show statistically significant improvement with more context.
Most models don’t benefit from additional context.
Of the 24 models tested, 15 show flat or negative slopes — meaning more demographic evidence doesn’t help (or slightly hurts) their predictions. One interpretation is that these models may rely on fixed assumptions about demographic groups rather than reasoning from the provided data, though other explanations are possible. At the category level, only 19 of 93 model-category pairs survive joint statistical correction.
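The slope analysis can be sketched as fitting score against the amount of provided context and permutation-testing the slope. The function and the simulated "evidence-adapting" model below are hypothetical illustrations, not DTEF's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

def context_slope(context_sizes, scores, n_iter=5_000):
    """Fit score = a + b * context_size by least squares, then test
    b > 0 by permuting which score goes with which context size.
    Returns (slope, one-sided p-value)."""
    x = np.asarray(context_sizes, float)
    y = np.asarray(scores, float)
    slope = np.polyfit(x, y, 1)[0]
    null = np.array([np.polyfit(x, rng.permutation(y), 1)[0]
                     for _ in range(n_iter)])
    p = (null >= slope).mean()  # does more context actually help?
    return slope, p

# Simulated evidence-adapting model: scores rise with shown answers.
sizes = np.tile([0, 5, 10, 20], 25)          # in-context examples given
adapts = 0.70 + 0.002 * sizes + rng.normal(0, 0.01, sizes.size)
slope, p = context_slope(sizes, adapts)
print(slope > 0 and p < 0.05)  # → True
```

A stereotype-holding model would instead show a flat or negative slope: its predictions are fixed by its priors about the group, so extra evidence moves nothing.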
5. Confidence in the Rankings
Bootstrap resampling (1,000 iterations) shows that while model ranks are broadly stable, the score differences between adjacent models are small enough that their confidence intervals overlap.
In other words: broad tiers may be meaningful (top performers vs. middle vs. bottom), but don’t read too much into a model being ranked #3 vs. #4. Focus on clusters rather than individual positions.
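A sketch of the bootstrap procedure, with hypothetical per-item scores for two adjacent models (the real evaluation resamples actual question-segment pairs, and the score arrays here are simulated):

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_ci(per_item_scores, n_boot=1_000, alpha=0.05):
    """95% bootstrap CI on a model's mean score, resampling the
    evaluated items with replacement."""
    s = np.asarray(per_item_scores, float)
    means = np.array([rng.choice(s, size=s.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Two adjacent models ~0.004 apart on simulated per-item scores.
model_3 = rng.normal(0.761, 0.08, 400)
model_4 = rng.normal(0.757, 0.08, 400)
lo3, hi3 = bootstrap_ci(model_3)
lo4, hi4 = bootstrap_ci(model_4)
# Check whether the two intervals overlap; if they do, the #3 vs. #4
# ordering is within noise and only the broader tier is meaningful.
print(lo3 < hi4 and lo4 < hi3)
```

When adjacent intervals overlap, swapping the two models' ranks is consistent with the data, which is why the takeaway is "trust tiers, not positions."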
Preliminary Takeaways
The Gap Is Measurable
In our tests, AI models appear to know what people in general think, but haven’t yet learned how specific demographics differ from that average. The gap between the best model (0.767) and the population marginal baseline (0.833) gives us a concrete metric to track over time.
Progress May Be Trackable
With 86.8% of model pairs being statistically distinguishable, the rankings appear to carry signal. As new model versions are released, this framework could help measure whether they’re getting better at representing diverse perspectives — though more work is needed to validate that the metric reliably captures real-world representational quality.
Some Models Appear to Learn
The context responsiveness test attempts to distinguish models that reason from evidence vs. those relying on fixed priors. In our data, only a few models (notably Claude 3.7 Sonnet across 4 categories) consistently improve when given more information about a demographic group. These results warrant further investigation.
Better Data Needed
Country-level evaluation is currently unreliable (only 30.7% of data points meet quality thresholds). For this framework to meaningfully assess cross-cultural representation, larger and more diverse survey samples would be needed — particularly at the country and religion level.
What’s Next
DTEF is an early-stage, ongoing research project. These findings are preliminary and represent a snapshot from February 2026. The methodology, metrics, and interpretations are all subject to revision as we learn more.
- Expanded datasets: Integration of additional survey sources beyond Global Dialogues to broaden cultural and topical coverage.
- Intersectional analysis: Testing demographic combinations (e.g., “young urban women”) rather than single dimensions.
- Temporal tracking: Measuring whether models capture opinion shifts across survey rounds over time.
- Continuous benchmarking: Automatic re-evaluation as new model versions are released.