Aaron Stack

Gardiner, NY



Can LLMs Emulate Human Behavior?

April 22, 2026

Kind of.

That's the honest answer from a validation study I ran on the Digital Twin project — a system I built at Area23 that generates synthetic focus groups using AI. The premise: instead of recruiting real human participants to answer surveys, you build AI-powered demographic personas and let them respond. Cheaper, faster, scalable.
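To make the premise concrete, here's a minimal sketch of the persona-polling loop, assuming an OpenAI-style chat client. The persona fields and prompt wording are illustrative stand-ins, not the actual Digital Twin prompts.

```python
# Sketch of one synthetic persona answering one survey question.
# Persona fields and prompt wording are illustrative only.
from openai import OpenAI

client = OpenAI()

def ask_persona(persona: dict, question: str, options: list[str]) -> str:
    """Ask a single synthetic persona one multiple-choice question."""
    system = (
        f"You are a {persona['age']}-year-old from {persona['region']} "
        f"with {persona['education']} education, politically {persona['politics']}. "
        "Answer survey questions as this person would."
    )
    user = f"{question}\nOptions: {'; '.join(options)}\nReply with exactly one option."
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in model name
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content.strip()
```

Run that across a few hundred personas per question and you have a synthetic answer sheet to score.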

We ran those synthetic responses against real Pew Research Center data — 60 questions covering AI, climate change, privacy, social media, and religion — and measured how well our model predicted actual human answer distributions.
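The comparison itself is mechanically simple: poll many personas per question, collapse their answers into a distribution, and line it up against Pew's published numbers. A sketch of that aggregation step (my reconstruction, not the study's actual code; the numbers are made up):

```python
from collections import Counter

def answer_distribution(answers: list[str], options: list[str]) -> dict[str, float]:
    """Fraction of synthetic respondents choosing each answer option."""
    counts = Counter(answers)
    return {opt: counts[opt] / len(answers) for opt in options}

# Illustrative data only, not real study results.
options = ["Very concerned", "Somewhat concerned", "Not concerned"]
synthetic = answer_distribution(
    ["Very concerned"] * 48 + ["Somewhat concerned"] * 37 + ["Not concerned"] * 15,
    options,
)
pew = {"Very concerned": 0.41, "Somewhat concerned": 0.37, "Not concerned": 0.22}
```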

Where it worked

On open-ended qualitative reasoning, the model was convincing. The personas talked like real people. They expressed opinions with nuance, hedged appropriately, and matched the kind of language you'd expect from their demographic profiles. If you read the responses without knowing they were synthetic, you'd believe them.

On certain topic categories, the quantitative accuracy was genuinely impressive. Health and climate questions hit 91% Top-2 accuracy. Social media questions came in at 86%. The model captured the shape of how humans distribute their opinions on familiar, well-represented topics.

Where it fell flat

Multiple choice questions were the weak point. Overall Top-2 accuracy across all 60 questions landed at 78%, meaning that in roughly 1 in 5 questions, the answer real humans chose most often wasn't among the model's top two predicted options.
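Scored per question, Top-2 is a simple check: is the humans' modal answer among the model's two highest-ranked options? A minimal version, reconstructed from the metric as described:

```python
def top2_hit(model_dist: dict[str, float], human_dist: dict[str, float]) -> bool:
    """True if the humans' most common answer is in the model's top two."""
    model_top2 = sorted(model_dist, key=model_dist.get, reverse=True)[:2]
    human_mode = max(human_dist, key=human_dist.get)
    return human_mode in model_top2

# Top-2 accuracy over all questions:
# accuracy = sum(top2_hit(m, h) for m, h in question_pairs) / len(question_pairs)
```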

Technology adoption questions were the worst, with a Spearman correlation of just 0.16 — nearly random. The model consistently overestimated how enthusiastic people are about new technology. It also struggled wherever the "right" answer isn't well-represented in training data — niche behaviors, regional attitudes, anything where the model has to extrapolate rather than recall.
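For the correlation numbers, Spearman compares how the model ranks the answer options against how humans rank them; with scipy it's one call. The shares below are made up to illustrate a near-zero case like the tech-adoption result:

```python
from scipy.stats import spearmanr

# Share of respondents picking each answer option on one question
# (illustrative numbers, not study data).
model_shares = [0.55, 0.25, 0.12, 0.08]   # model favors the enthusiastic option
human_shares = [0.20, 0.30, 0.35, 0.15]

rho, _ = spearmanr(model_shares, human_shares)
# rho near 0: the model's option ranking tells you almost nothing about
# the human ranking. rho < 0: the ranking is inverted.
```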

In 25% of questions, the model's ranking was negatively correlated with human responses. Not just wrong — inversely wrong.

The takeaway

LLMs are good at sounding like humans. They're decent at predicting how humans feel about things they have lots of data on. But they're not a reliable substitute for actual humans when precision matters — especially on preference questions where the answer distribution is what you care about.

Useful for early-stage research and directional signal. Not a replacement for a real survey panel.

↓ Download the full paper (PDF)

