Large language models arrived not as the robots of science fiction — not as metallic intelligences making demands or exhibiting obvious alienness — but as text boxes. A prompt field, a submit button, a response. The disorientation they produced was proportional to the ordinariness of the interface: the same form factor as a search engine, the same interaction model as a chat application, and yet outputs that passed bar examinations, generated publishable code, diagnosed rare diseases from symptom descriptions, and sustained philosophical conversations that many participants experienced as genuine engagement. The technology arrived in the form of a tool and behaved, in certain moments and to certain observers, like something else entirely.

The central difficulty in thinking clearly about large language models is that the mechanism and the effect are profoundly mismatched. The mechanism is statistical: next-token prediction over training corpora of extraordinary scale, producing outputs by sampling from learned probability distributions over language. The effect is outputs that challenge categories — authorship, reasoning, creativity, understanding — that human beings have long used to define themselves against the non-human. This essay argues that LLMs are not mirrors of human intelligence but mirrors of human language — and because so much of what we call thinking is expressed in, shaped by, and sometimes constituted by language, what they reflect back is genuinely confronting. Not because they are intelligent, but because they force an encounter with how much of what we called intelligence was, all along, pattern.


The Mechanism: Compression and Decompression

A large language model is, at its technical foundation, a transformer: a neural network architecture that processes sequences of tokens and learns, through exposure to vast quantities of human-generated text, to predict a probability distribution over the token that follows any given sequence. The training objective — next-token prediction — is deceptively simple. Its consequence, applied at sufficient scale to sufficient data, is a model that has internalised an extraordinarily rich statistical map of human language: which words follow which in which contexts, which arguments accompany which conclusions, which framings precede which kinds of responses, across every genre, domain, and register present in the training data.
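
To make the objective concrete, here is a minimal sketch of the training loss in Python; `model` and `next_token_loss` are placeholder names for illustration, not any particular library's API, and the only assumption is a causal model that maps token ids to logits over the vocabulary.

```python
# Minimal sketch of the next-token prediction objective.
# Assumes `model` maps token ids to per-position logits over the vocabulary;
# all names here are illustrative, not a specific library's API.
import torch
import torch.nn.functional as F

def next_token_loss(model: torch.nn.Module, token_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between each position's prediction and its successor.

    token_ids: (batch, seq_len) integer tensor drawn from the training corpus.
    """
    inputs = token_ids[:, :-1]   # the model sees tokens 0 .. n-2
    targets = token_ids[:, 1:]   # and is scored on tokens 1 .. n-1
    logits = model(inputs)       # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

Pre-training, at whatever scale, is essentially this loss minimised over the corpus.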

The resulting capability is not understanding in any philosophically robust sense. The model does not know that Paris is in France the way a person knows it — with associated memories, geographic intuitions, and embodied experience. It knows it in the sense that the token sequence "Paris is in" is overwhelmingly followed by "France" in the distribution it has learned, and that this statistical regularity is entangled with thousands of other regularities in a structure of mutual reinforcement. The model has no beliefs, no intentions, and no referential relationship to the world. It has compressed patterns.
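
The claim is easy to check empirically. The sketch below reads off a model's next-token distribution for the prompt "Paris is in", using the small public GPT-2 model via the Hugging Face transformers library; the choice of model is one of convenience, and the exact probabilities will vary, but a token like " France" should rank near the top.

```python
# Sketch: inspect a model's next-token distribution for "Paris is in".
# GPT-2 via Hugging Face transformers is used as a small public example;
# any causal language model exposes the same structure.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("Paris is in", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # (1, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
top_probs, top_ids = probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(i.item())!r:>12}  {p.item():.3f}")
# A high rank for " France" is a learned statistical regularity,
# not a belief about geography.
```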

This framing — the LLM as a compression of human language — is precise and useful. Pre-training encodes the statistical structure of an internet-scale corpus into billions of parameters. Inference is decompression: given a prompt, the model samples from the compressed structure to produce a continuation. The output is not retrieved from memory in the way a database retrieves a record; it is generated, token by token, from the parametric structure that training produced. Fine-tuning and reinforcement learning from human feedback subsequently shape the model's output distribution toward responses that human raters judge as helpful, accurate, and safe — but the foundation remains the compressed statistical regularities of human language at scale. What comes out of an LLM is, in a genuine sense, humanity's own patterns, distilled and returned to it in response to its questions.
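
Decompression, in this framing, is nothing more than the autoregressive sampling loop. A minimal sketch follows, with `model` again standing in for any causal language model that maps a token sequence to next-token logits; the temperature parameter sharpens or flattens the learned distribution before sampling, which is why the same compressed structure can yield either conservative or surprising continuations.

```python
# Sketch of inference as decompression: sample one token at a time from
# the model's learned distribution and feed it back in. `model` is an
# assumed causal LM returning logits of shape (batch, seq_len, vocab_size).
import torch

@torch.no_grad()
def decompress(model, token_ids, steps=50, temperature=0.8):
    """token_ids: (1, seq_len) prompt; returns prompt plus continuation."""
    for _ in range(steps):
        logits = model(token_ids)[0, -1]                   # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample, not argmax
        token_ids = torch.cat([token_ids, next_id.view(1, 1)], dim=-1)
    return token_ids
```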


The Reasoning Question: Emergence, Fluency, and the Hallucination Signal

The most contested question in contemporary AI research is whether large language models reason. The empirical record is genuinely ambiguous. LLMs demonstrate apparently coherent multi-step problem solving, mathematical derivation, analogical inference, and logical argument construction. Chain-of-thought prompting — encouraging the model to produce intermediate reasoning steps before arriving at a conclusion — substantially improves performance on reasoning benchmarks. These capabilities emerged at scale without being explicitly trained for: they were not programmed as distinct functions but appeared as properties of sufficiently large models trained on sufficiently large corpora.
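
It is worth being concrete about how little machinery chain-of-thought involves: it is a change to the prompt, not to the model. A minimal sketch in the worked-example style of Wei et al. (reference 3) follows; `generate` is a placeholder for any text-completion call, not a specific API.

```python
# Sketch: chain-of-thought prompting is prompt construction, nothing more.
# The exemplar (from Wei et al.) shows the model an answer that includes
# intermediate steps, nudging it to produce steps for the new question.
# `generate` is a placeholder for any text-completion call.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def cot_prompt(question: str) -> str:
    return COT_EXEMPLAR + f"Q: {question}\nA:"

def direct_prompt(question: str) -> str:
    return f"Q: {question}\nA: The answer is"

# Same weights, different prompt:
#   generate(direct_prompt(q))  vs  generate(cot_prompt(q))
# The chain-of-thought version scores markedly higher on multi-step
# benchmarks (Wei et al., reference 3).
```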

The hallucination problem is the most honest signal available for evaluating what is actually happening. An LLM that confidently states a false fact, invents a non-existent legal case, or attributes a fabricated quotation to a real person is not malfunctioning — it is functioning exactly as designed. Its objective is to produce a plausible continuation of the prompt, and plausibility is assessed against the statistical distribution of its training data, not against the truth of the world. When the most statistically probable continuation of a sequence happens to be false, the model produces the false continuation with exactly the same confidence it would produce a true one. Hallucination is not a bug introduced by insufficient training. It is a structural consequence of a system optimised for plausibility rather than truth.

This is the sense in which the reasoning analogy is ultimately incomplete. A system that reasons in the philosophical sense has a relationship to truth — its outputs are constrained by logical validity and empirical accuracy. An LLM has a relationship to plausibility — its outputs are constrained by what human language, in the distribution represented by its training data, looks like when it discusses the relevant topic. These are not the same constraint. They produce similar outputs in the vast majority of cases — most plausible statements are also true — and diverge precisely in the cases where the distinction matters most: novel problems, edge cases, and questions where training data is sparse, biased, or absent.


The Reshaping: Three Domains of Genuine Disruption

The disruption that large language models are producing is real, but its nature requires precise characterisation. Three domains deserve particular attention.

The first is knowledge work. Tasks that previously required years of professional training — drafting legal documents, writing production code, synthesising medical literature, constructing financial models — can now be substantially assisted, accelerated, or in some cases performed by LLM-based tools. The compression is asymmetric: an experienced professional using these tools can accomplish in an hour what previously took a day; a novice using them can produce outputs that approximate, in surface form, the work of an experienced professional. The economic and professional consequences of this compression are still unfolding, but their direction is consistent: the value of knowledge that is widely distributed in human language is declining, while the value of judgement, verification, and the ability to evaluate LLM outputs critically is increasing.

The second domain is epistemology. The saturation of information environments with synthetically generated content — text, images, audio, video — is degrading the provenance of information at a structural level. When any piece of content may have been generated by a model trained on other content, the inferential chain from content to credibility becomes unreliable in ways it was not before. The collapse of authorship certainty — the inability to determine, from the text alone, whether a human or a model wrote it — undermines epistemic practices that depend on source attribution, accountability, and the asymmetry between the effort required to produce genuine expertise and the effort required to simulate it.

The third domain is identity. Humans have long defined themselves, in part, by contrast with the non-human: as the beings that use language, that create, that reason, that feel. Large language models do not do these things in the way humans do them — the mechanism is categorically different. But the outputs are, in many contexts, functionally indistinguishable. This is not a demonstration that machines are persons. It is a demonstration that the outputs humans attributed to uniquely human capacities can be produced by processes that possess none of those capacities — and this finding has implications for how those capacities are understood, valued, and cultivated.


Counter-Argument: The Stochastic Parrot and the Anthropomorphism Risk

The most substantive critical position on LLMs holds that they are, in essence, sophisticated autocomplete: systems that mimic the surface patterns of understanding without possessing any of its substance. The stochastic parrot critique — that LLMs generate plausible-sounding text by remixing patterns from training data without any grounding in meaning, reference, or world-knowledge — is mechanistically accurate. The model does not know what it is saying in the sense that a knowledgeable person knows what they are saying. The risk of treating it as if it does — of anthropomorphising outputs into evidence of understanding, sentiment, or intention — is real and consequential. Misplaced trust in LLM outputs has already produced documented harms: fabricated legal citations submitted in court, incorrect medical information acted upon, and content moderation decisions automated on the basis of flawed model judgements.

The benchmark saturation problem deepens this concern. LLMs perform well on the tests humans design to measure reasoning and knowledge because those tests are constructed from human language — and human language patterns are what the models have learned. A model that has ingested the written output of every standardised test, every academic assessment, and every professional certification in its training data is not demonstrating human-equivalent intelligence when it performs well on those assessments. It is demonstrating that its training data contained the patterns the assessments were designed to elicit.

The rebuttal to the stochastic parrot critique is not that it is wrong — it is that it is insufficient. A system that produces outputs functionally indistinguishable from understanding, in the domains and contexts where it operates, has the same consequential weight as understanding for most practical purposes. Whether the output of a diagnostic support tool reflects genuine medical understanding or extraordinarily sophisticated pattern matching over medical literature matters enormously for the philosophy of mind and rather less for the patient whose rare disease was correctly identified. The philosophical question and the practical question are distinct, and conflating them in either direction — treating the model as if it understands, or dismissing its outputs because it does not — produces its own errors.


Conclusion: The First Technology of Meaning

Every significant technology in human history has extended a human capacity: the wheel extended locomotion, the printing press extended memory and distribution, the calculator extended arithmetic. Each operated in a domain adjacent to the human but distinct from it — augmenting what humans do without entering the space of what humans are. Large language models are the first technology to operate directly in the domain of meaning. They produce language, argument, narrative, and analysis — the outputs of the activities through which humans have historically understood themselves as human.

They do not do this the way humans do. The mechanism is not cognition. The process is not understanding. But the outputs are consequential, the effects are real, and the encounter they force — with the question of what thinking is, what language is, and what the human is when those things can be replicated without the human behind them — is not one that can be deferred, whether by dismissing the technology or by overclaiming it. Large language models are, in the most precise sense available, a mirror made of our own words. What we see in it depends entirely on whether we are willing to look at what is actually reflected — and not at what we expected to find.


References

  1. Vaswani, A. et al. "Attention Is All You Need." arXiv:1706.03762. https://arxiv.org/abs/1706.03762
  2. Delétang, G. et al. "Language Modeling Is Compression." arXiv:2309.10668. https://arxiv.org/abs/2309.10668
  3. Wei, J. et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903. https://arxiv.org/abs/2201.11903
  4. OpenAI. "GPT-4 Technical Report." https://openai.com/research/gpt-4
  5. Bender, E. M. et al. "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" ACM FAccT 2021. https://dl.acm.org/doi/10.1145/3442188.3445922
  6. Bommasani, R. et al. "On the Opportunities and Risks of Foundation Models." arXiv:2108.07258. https://arxiv.org/abs/2108.07258
  7. DeepMind. "Research Overview." https://deepmind.google/research
  8. OpenAI. "Research Overview." https://openai.com/research
