The Illusion of Consensus: Why Large Language Models Fail to Capture Economic Reality

In the burgeoning field of computational social science, a seductive promise has emerged: the ability to simulate the American public’s economic mindset without the logistical nightmare and staggering costs of traditional polling. By prompting Large Language Models (LLMs) with detailed personas, researchers have found they can generate inflation expectations that mirror real-world survey medians with uncanny accuracy. If you ask an LLM to simulate 6,000 households, it will dutifully report an average inflation expectation remarkably close to the Federal Reserve Bank of New York’s Survey of Consumer Expectations (SCE).

However, a groundbreaking new study, “Can LLMs Mimic Household Surveys?” by Ami Dalloul and Mirko Pfeifer, reveals a troubling truth behind these headline-grabbing averages. While LLMs are masters of the mean, they are architects of a false consensus. Beneath the surface, these models suffer from "mode collapse"—a statistical phenomenon where the diversity of human opinion is obliterated, replaced by a narrow, robotic uniformity. As the researchers demonstrate, while the average might look right, the population behind it simply does not exist.


The Anatomy of the Simulation Gap

To understand why LLMs fail as proxies for human populations, one must look at the distribution of responses. In a typical household survey, such as the 2020 SCE, reported inflation expectations are notoriously diffuse. Humans are not monolithic; their predictions reflect a wide range of personal experiences, varying degrees of financial literacy, and differing sources of information. In 2020, human responses ranged from roughly -25% to +27%.

When Dalloul and Pfeifer subjected state-of-the-art models like Llama-3, Claude-3.7-Sonnet, and GPT-4o to the same questions, the result was a statistical graveyard. The Llama-3 model, despite hitting the median inflation expectation within a single percentage point, placed 95% of its simulated respondents within a two-percentage-point window.


This is the essence of mode collapse: the model recovers the average but discards the variance. By running a simulation with thousands of distinct LLM personas, researchers are effectively generating data from a single "representative agent" masquerading as a population. The "crowd" of 6,000 personas is, in reality, a singular, narrow voice.
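
The gap between a matching median and a missing spread can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the two samples are synthetic Gaussian draws, not SCE or LLM output, and `share_within_window` is a hypothetical helper name.

```python
import random
import statistics


def share_within_window(values, center, half_width):
    """Return the fraction of values within +/- half_width of center."""
    inside = sum(1 for v in values if abs(v - center) <= half_width)
    return inside / len(values)


# Hypothetical samples (percent inflation expectations), not survey data.
rng = random.Random(0)
human_like = [rng.gauss(3.0, 6.0) for _ in range(6000)]  # wide dispersion
collapsed = [rng.gauss(3.0, 0.4) for _ in range(6000)]   # narrow dispersion

for name, sample in [("human-like", human_like), ("collapsed", collapsed)]:
    med = statistics.median(sample)
    # A two-percentage-point window means +/- 1 point around the median.
    share = share_within_window(sample, med, 1.0)
    print(f"{name}: median={med:.2f}, share within 2pp window={share:.1%}")
```

Both samples report nearly the same median, yet only the narrow one places almost all of its respondents inside the two-point window. A check of this kind is cheap to run before trusting any synthetic survey.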


Chronology: From Optimism to "Unlearning"

The timeline of this research follows a trajectory common to AI applications: initial excitement followed by rigorous scrutiny.

  • Early 2026 (The Baseline): Studies like Zarifhonarvar (2026) established the viability of LLMs for replicating inflation expectations, noting their ability to match survey medians. This led to a surge of interest in using synthetic surveys as a low-cost complement to established surveys such as the Survey of Professional Forecasters.
  • Mid-2026 (The Critique): Dalloul and Pfeifer identified that while the central tendency of these models was robust, the second moment (the spread of the distribution) was fundamentally broken. Their benchmarking of five major LLMs revealed that in human surveys, 44% to 70% of respondents give answers at least 3 percentage points away from the mode. In LLM samples, that percentage is essentially zero.
  • The Remediation Phase: The researchers attempted standard "remedies," such as using complex, Census-derived personas and strict "knowledge-cutoff" prompts to prevent the models from accessing post-2018 data. None worked. The models’ internal knowledge—derived from vast training sets containing CPI tables, academic papers, and FRBNY releases—consistently overpowered the prompt constraints.
  • The Unlearning Breakthrough: Realizing that the problem was deep-seated in the model weights, the authors pivoted to "unlearning" techniques. By applying Gradient Ascent (GA) and Negative Preference Optimization (NPO), they explicitly forced the model to "forget" the official inflation record while retaining general reasoning capabilities.
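
The article does not spell out the two unlearning objectives, so the following is only a sketch of how they are commonly written in the unlearning literature, not a confirmed description of the authors' setup. It assumes per-sequence log-probabilities on the "forget" texts are already available from the fine-tuned model and from a frozen reference model; the function names, the example numbers, and the `beta` value are all hypothetical.

```python
import math


def gradient_ascent_loss(logp_forget):
    """GA objective: minimizing the mean log-likelihood of the forget set
    is equivalent to gradient ascent on its negative log-likelihood."""
    return sum(logp_forget) / len(logp_forget)


def npo_loss(logp_forget, logp_ref, beta=0.1):
    """NPO objective as commonly stated:
    (2 / beta) * mean(log(1 + (pi_theta / pi_ref) ** beta)),
    computed here from log-probabilities."""
    total = 0.0
    for lp, lr in zip(logp_forget, logp_ref):
        total += math.log1p(math.exp(beta * (lp - lr)))
    return (2.0 / beta) * total / len(logp_forget)


# Hypothetical log-probabilities for three forget-set sequences.
reference = [-12.0, -9.5, -14.0]   # frozen reference model
current = [-15.0, -13.0, -20.0]    # model after some unlearning steps

print("GA loss: ", gradient_ascent_loss(current))
print("NPO loss:", npo_loss(current, reference))
```

In this form the GA loss is unbounded below, so it keeps pushing probabilities down, while the NPO loss flattens out once the model assigns far less probability to the forget texts than the reference does. That difference is one plausible reason the two methods can behave differently on downstream tests.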

Supporting Data: The Power of Targeted Unlearning

The results of the unlearning experiment provide a blueprint for future AI-driven social science. By treating official inflation statistics as "negative samples" (information to be penalized during fine-tuning), the researchers successfully widened the distribution of the model’s responses.

Table 1: Tail Accuracy and Distribution Recovery

Strategy                               | Exact Match % | > ±3% Deviation | Tail Accuracy
Baseline Llama-3                       | 92%           | 0%              | 0%
Gradient Ascent (GA)                   | 24%           | 43%             | 97%
Negative Preference Optimization (NPO) | 37%           | 43%             | 98%

Tail accuracy measures how closely the model's share of responses in the tails (the "> ±3% Deviation" column) matches the dispersion found in human survey benchmarks, where the target share is 44.38%.
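
The exact formula behind tail accuracy is not given here, so the sketch below rests on an assumption: that it is the simulated tail share divided by the 44.38% human benchmark, with the tail defined as responses at least 3 percentage points from the mode. The response list is toy data constructed for illustration, and the function names are hypothetical.

```python
from statistics import mode

HUMAN_TAIL_SHARE = 0.4438  # benchmark share quoted in the text


def tail_share(responses, threshold=3.0):
    """Share of responses at least `threshold` points away from the mode."""
    center = mode([round(r) for r in responses])
    far = sum(1 for r in responses if abs(r - center) >= threshold)
    return far / len(responses)


def tail_accuracy(responses, benchmark=HUMAN_TAIL_SHARE):
    """Assumed definition: simulated tail share relative to the benchmark."""
    return tail_share(responses) / benchmark


# Toy sample of 100 simulated answers (percent), mode at 3.
responses = [3] * 57 + [0] * 10 + [6] * 13 + [10] * 12 + [-2] * 8

print(f"tail share:    {tail_share(responses):.0%}")
print(f"tail accuracy: {tail_accuracy(responses):.0%}")
```

Under this assumed definition, a 43% tail share works out to roughly 97%, which is in line with the GA row of the table. That agreement suggests, but does not confirm, that the ratio is how the figure was computed.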


As the data indicates, the baseline model effectively flatlines, whereas the unlearned models (GA and NPO) begin to recover the "tails" of the distribution—the individuals who hold views far from the average. This is crucial for policy research, as the extremes of a population often reveal critical information about economic anxiety and information processing.


Methodological Implications

The implications for the survey industry are profound. If we are to rely on synthetic agents, we must move beyond the "median-chasing" approach.

The researchers tested their unlearned models against a randomized controlled trial (RCT) replication, specifically the work of Coibion, Gorodnichenko, and Weber (2022). In this experiment, respondents are exposed to various information treatments (e.g., Fed target rates, news of gas price hikes) to see how they update their beliefs.

The baseline models failed the "treatment effect" test—they showed no sensitivity to the information provided because their "prior" knowledge was too rigid. However, the Llama-GA model (the Gradient Ascent variant) succeeded. It displayed the ability to process information and shift its expectations in a manner consistent with human respondents.
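
One simple way to quantify a "treatment effect" test is to compare the average belief revision in the treated group with that in the control group. The sketch below does so under stated assumptions: the (prior, posterior) pairs are invented for illustration, the function names are hypothetical, and the actual estimation in the study may differ.

```python
from statistics import mean


def mean_revision(respondents):
    """Average change from prior to posterior expectation."""
    return mean([post - prior for prior, post in respondents])


def treatment_effect(treated, control):
    """Difference in mean revisions between treated and control groups."""
    return mean_revision(treated) - mean_revision(control)


# Hypothetical (prior, posterior) inflation expectations in percent.
control = [(4.0, 4.0), (6.0, 5.5), (2.0, 2.5)]
responsive = [(4.0, 3.0), (6.0, 4.0), (2.0, 2.0)]  # updates after treatment
rigid = [(3.0, 3.0), (3.0, 3.0), (3.0, 3.0)]       # ignores the treatment

print("responsive model effect:", treatment_effect(responsive, control))
print("rigid model effect:     ", treatment_effect(rigid, control))
```

A model whose priors are pinned to memorized statistics would look like the rigid group: its estimated effect stays near zero regardless of the information shown.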


Yet, there is a caveat: the "one-size-fits-all" approach is insufficient. While GA worked for replicating certain demographic patterns, other methods like NPO failed to reproduce the correct within-demographic orderings. Unlearning is not a magic bullet; it is a surgical procedure that requires precision.


The Path Forward: Implications for Economics

For economists and social scientists, the lesson is twofold. First, current LLMs are inherently biased toward the "official record." Because they have been trained on the very data they are being asked to simulate, they are essentially performing retrieval tasks rather than generating authentic, belief-based responses.

Second, the future of synthetic surveys lies in the marriage of generative AI and targeted weight manipulation. If researchers treat distributional accuracy and data leakage as joint constraints, they may be able to build synthetic populations that are not just "average," but truly representative of the messy, heterogeneous, and often contradictory nature of human belief.

Key Takeaways for Practitioners:

  1. Don’t trust the mean: Matching a survey median does not by itself indicate a successful simulation; it can coexist with severe mode collapse.
  2. Evaluate the tails: The value of a survey lies in its variance. If a model cannot replicate the 10th or 90th percentile of respondents, it is not replicating a population.
  3. Unlearn to learn: Simply prompting a model to "behave" like a consumer is insufficient. To break the influence of training data, researchers must use methods like Gradient Ascent to carve out space for synthetic diversity.
  4. Validate against RCTs: A model that generates plausible-looking static data may fail when subjected to dynamic information treatments. Always test for belief updating.

As the field moves forward, the focus must shift from merely asking models to "mimic" results to understanding the underlying mechanics of how they form, hold, and revise opinions. The dream of a low-cost, high-frequency survey replacement is alive, but it requires us to stop treating LLMs as omniscient oracles and start treating them as the complex, biased systems they are: systems whose memorized knowledge can, with care, be unlearned.
