Research Findings
Can AI models serve as “digital twins” that approximate how different communities think? Here are some preliminary observations from our ongoing research.
What is a “Digital Twin”?
Imagine asking an AI: “How would 18–25 year olds in Brazil respond to this policy question?” If the model could reliably predict their collective response distribution, it would function as a digital twin of that demographic group — a computational proxy that approximates the views of a real community.
If this were possible, it could change how we understand public opinion, design inclusive policies, and think about AI systems that claim to represent diverse perspectives. But an open question remains: can today’s AI models actually do this?
The Digital Twin Evaluation Framework (DTEF) is an early-stage research project that attempts to answer that question. Using real survey data from the Global Dialogues project — 8 rounds of surveys covering topics from AI governance to social values — we test 24 AI models on their ability to predict how specific demographic groups actually responded. These are preliminary findings from an ongoing investigation.
1. The Baseline Challenge
Before evaluating AI models, we established three simple baselines that require no AI at all. If models can’t beat these, it suggests they may not yet be adding value for demographic-specific prediction.
Key finding: No AI model yet outperforms the population marginal baseline.
The population marginal simply predicts the overall population’s answer distribution, ignoring demographics entirely. Its score of 0.833 means that knowing “what people in general think” is still a better predictor than any AI model’s attempt to account for demographic differences. The best model (Claude Sonnet 4.5) scores 0.767 — a gap of 0.066.
The random segment baseline (0.761) shuffles which demographic group’s answers go with which question — it measures how well you’d score if group identity were irrelevant. All models significantly outperform the uniform baseline (0.647), which guesses equal probability for every option. Models have learned what people in general think, but not yet how specific demographics differ from the average.
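The three baselines can be sketched concretely. The scoring metric below (1 minus total variation distance between predicted and observed answer distributions) is an illustrative assumption, not necessarily DTEF's actual scoring rule, and the distributions are made up for the example:

```python
import numpy as np

def similarity(pred, actual):
    """Score a predicted answer distribution against the observed one.
    Illustrative metric: 1 - total variation distance (1.0 = perfect match)."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return 1.0 - 0.5 * np.abs(pred - actual).sum()

# Hypothetical observed distributions for one 4-option question.
overall = np.array([0.40, 0.30, 0.20, 0.10])   # whole population
segment = np.array([0.55, 0.25, 0.15, 0.05])   # one demographic group

# Uniform baseline: equal probability on every option.
uniform = np.full(4, 0.25)

# Population marginal baseline: predict the overall distribution
# for every segment, ignoring demographics entirely.
print(similarity(overall, segment))  # marginal baseline on this segment
print(similarity(uniform, segment))  # uniform baseline on this segment
```

A model beats the population marginal only if its demographic-specific predictions land closer to each segment's distribution than the overall distribution does, averaged across all segments and questions.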
2. Model Differences Are Statistically Meaningful
Despite all models falling below the population marginal baseline, their differences appear statistically meaningful. Permutation testing (10,000 iterations) with Holm-Bonferroni correction suggests that 237 of 273 model pairs (86.8%) are significantly different at the 0.05 level. However, statistical significance does not necessarily imply practical importance — many differences are small.
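A minimal sketch of this comparison machinery, assuming each model has an array of per-question scores (the exact DTEF test statistic is not specified here, so the paired sign-flip test and the demo data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_test(a, b, n_iter=10_000):
    """Paired permutation (sign-flip) test on per-question score
    differences between two models. Returns a one-sided-style p-value
    for |mean difference| under the null of no difference."""
    diff = np.asarray(a, float) - np.asarray(b, float)
    observed = abs(diff.mean())
    signs = rng.choice([-1, 1], size=(n_iter, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    return (null >= observed).mean()

def holm_bonferroni(pvals, alpha=0.05):
    """Holm's step-down correction: returns a boolean 'significant'
    flag per hypothesis, controlling family-wise error at alpha."""
    pvals = np.asarray(pvals, float)
    order = np.argsort(pvals)
    m = pvals.size
    significant = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # step-down stops at the first failure
    return significant

# Example: only the smallest of three p-values survives correction.
print(holm_bonferroni([0.001, 0.03, 0.04]))
```

With 24 models there are hundreds of pairs, so an uncorrected 0.05 threshold would produce many false positives; the Holm step-down keeps the family-wise error rate at 0.05.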
| # | Model | Score |
|---|---|---|
| 1 | Claude Sonnet 4.5 | 0.767 |
| 2 | Claude 3.7 Sonnet | 0.765 |
| 3 | Claude Sonnet 4 | 0.761 |
| 4 | GPT-5.1 | 0.761 |
| 5 | Claude Haiku 4.5 | 0.754 |
| 6 | GPT-4.1 | 0.753 |
| 7 | GPT-4o | 0.749 |
| 8 | GPT-5 | 0.749 |
| 9 | Mistral Medium 3 | 0.742 |
| 10 | GPT-4.1 Mini | 0.742 |

Top 10 of 24 models. All scores fall below the population marginal baseline (0.833). Full rankings on the demographics page.
3. Not All Demographics Are Equal
The reliability of these evaluations depends heavily on how many survey respondents we have per demographic segment. Categories like gender (avg. 516 respondents) produce relatively stable benchmarks, while country-level segments (avg. 33 respondents) have enough sampling noise that apparent model differences may not be real. This is a significant limitation of the current dataset.
| Category | Avg. Respondents | Noise Floor | Quality |
|---|---|---|---|
| Gender | 516 | 0.928 | 99.3% |
| AI Concern | 350 | 0.899 | 100.0% |
| Environment | 350 | 0.888 | 100.0% |
| Age | 203 | 0.850 | 94.7% |
| Religion | 149 | 0.781 | 73.6% |
| Country | 33 | 0.640 | 30.7% |
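One standard way to estimate such a noise floor is split-half reliability: repeatedly split a segment's respondents in half and score one half's answer distribution against the other's. The sketch below uses that approach with 1 minus total variation distance as an illustrative score; the exact DTEF procedure may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

def noise_floor(responses, n_options, n_splits=200):
    """Split-half estimate of the best achievable score for a segment.
    `responses` is an array of chosen-option indices, one per respondent."""
    scores = []
    for _ in range(n_splits):
        shuffled = rng.permutation(responses)
        half = len(shuffled) // 2
        p = np.bincount(shuffled[:half], minlength=n_options) / half
        q = np.bincount(shuffled[half:], minlength=n_options) / (len(shuffled) - half)
        scores.append(1.0 - 0.5 * np.abs(p - q).sum())
    return float(np.mean(scores))

# Simulated segments: large samples give a higher (tighter) floor.
big   = rng.integers(0, 4, size=516)   # ~gender-sized segment
small = rng.integers(0, 4, size=33)    # ~country-sized segment
floor_big = noise_floor(big, 4)
floor_small = noise_floor(small, 4)
print(floor_big > floor_small)  # → True
```

Because sampling error shrinks roughly with the square root of sample size, a 33-respondent segment's observed distribution is itself so noisy that even a perfect model cannot score much above its floor.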
4. Evidence-Adapting vs. Stereotype-Holding
When we give models more information about a demographic group (showing them how the group answered other survey questions), do they use that evidence to improve their predictions — or do they ignore it and rely on stereotypes?
Only 9 of 24 models show statistically significant improvement with more context.
Most models don’t benefit from additional context.
Of the 24 models tested, 15 show flat or negative slopes — meaning more demographic evidence doesn’t help (or slightly hurts) their predictions. One interpretation is that these models may rely on fixed assumptions about demographic groups rather than reasoning from the provided data, though other explanations are possible. At the category level, only 19 of 93 model-category pairs survive joint statistical correction.
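The slope analysis can be sketched as fitting score against the amount of provided context and permutation-testing the slope. The function and the simulated "evidence-adapting" model below are hypothetical illustrations, not DTEF's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

def context_slope(context_sizes, scores, n_iter=5_000):
    """Fit score = a + b * context_size by least squares, then test
    b > 0 by permuting which score goes with which context size.
    Returns (slope, one-sided p-value)."""
    x = np.asarray(context_sizes, float)
    y = np.asarray(scores, float)
    slope = np.polyfit(x, y, 1)[0]
    null = np.array([np.polyfit(x, rng.permutation(y), 1)[0]
                     for _ in range(n_iter)])
    p = (null >= slope).mean()  # does more context actually help?
    return slope, p

# Simulated evidence-adapting model: scores rise with shown answers.
sizes = np.tile([0, 5, 10, 20], 25)          # in-context examples given
adapts = 0.70 + 0.002 * sizes + rng.normal(0, 0.01, sizes.size)
slope, p = context_slope(sizes, adapts)
print(slope > 0 and p < 0.05)  # → True
```

A stereotype-holding model would instead show a flat or negative slope: its predictions are fixed by its priors about the group, so extra evidence moves nothing.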
5. Confidence in the Rankings
Bootstrap resampling (1,000 iterations) shows that while model ranks are broadly stable, the score differences between adjacent models are small enough that their confidence intervals overlap.
In other words: broad tiers may be meaningful (top performers vs. middle vs. bottom), but don’t read too much into a model being ranked #3 vs. #4. Focus on clusters rather than individual positions.
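A sketch of the bootstrap procedure, with hypothetical per-item scores for two adjacent models (the real evaluation resamples actual question-segment pairs, and the score arrays here are simulated):

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_ci(per_item_scores, n_boot=1_000, alpha=0.05):
    """95% bootstrap CI on a model's mean score, resampling the
    evaluated items with replacement."""
    s = np.asarray(per_item_scores, float)
    means = np.array([rng.choice(s, size=s.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Two adjacent models ~0.004 apart on simulated per-item scores.
model_3 = rng.normal(0.761, 0.08, 400)
model_4 = rng.normal(0.757, 0.08, 400)
lo3, hi3 = bootstrap_ci(model_3)
lo4, hi4 = bootstrap_ci(model_4)
# Check whether the two intervals overlap; if they do, the #3 vs. #4
# ordering is within noise and only the broader tier is meaningful.
print(lo3 < hi4 and lo4 < hi3)
```

When adjacent intervals overlap, swapping the two models' ranks is consistent with the data, which is why the takeaway is "trust tiers, not positions."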
Preliminary Takeaways
The Gap Is Measurable
In our tests, AI models appear to know what people in general think, but haven’t yet learned how specific demographics differ from that average. The gap between the best model (0.767) and the population marginal baseline (0.833) gives us a concrete metric to track over time.
Progress May Be Trackable
With 86.8% of model pairs being statistically distinguishable, the rankings appear to carry signal. As new model versions are released, this framework could help measure whether they’re getting better at representing diverse perspectives — though more work is needed to validate that the metric reliably captures real-world representational quality.
Some Models Appear to Learn
The context responsiveness test attempts to distinguish models that reason from evidence vs. those relying on fixed priors. In our data, only a few models (notably Claude 3.7 Sonnet across 4 categories) consistently improve when given more information about a demographic group. These results warrant further investigation.
Better Data Needed
Country-level evaluation is currently unreliable (only 30.7% of data points meet quality thresholds). For this framework to meaningfully assess cross-cultural representation, larger and more diverse survey samples would be needed — particularly at the country and religion level.
What’s Next
DTEF is an early-stage, ongoing research project. These findings are preliminary and represent a snapshot from February 2026. The methodology, metrics, and interpretations are all subject to revision as we learn more.
- Expanded datasets: Integration of additional survey sources beyond Global Dialogues to broaden cultural and topical coverage.
- Intersectional analysis: Testing demographic combinations (e.g., “young urban women”) rather than single dimensions.
- Temporal tracking: Measuring whether models capture opinion shifts across survey rounds over time.
- Continuous benchmarking: Automatic re-evaluation as new model versions are released.