
World Models Need Scientists Too
2026-03-18
In January 2026, Yann LeCun left Meta after twelve years to launch AMI Labs (pronounced "ami", French for "friend"), a Paris-based company built around the conviction that the future of AI requires moving beyond pure text prediction. Fei-Fei Li's World Labs had already shipped its first commercial product, Marble, in November 2025, and Google DeepMind had made Genie 3 available to subscribers in the US. Between World Labs ($1B), AMI Labs ($1B), and Runway ($315M) alone, well over two billion dollars had flowed into world model startups by the first quarter of 2026.
The message, at least as it circulates through interviews, social media, and investment pitches, seems to be that the era of text-based intelligence has peaked and the next frontier belongs to systems that can see, simulate, and understand physical reality. LeCun has been characteristically blunt in public. In a recent interview, he argued that his JEPA architecture "learns the underlying rules of the world from observation, like a baby learning about gravity," and that "this is the foundation for common sense."
His published technical position, however, is more nuanced. In a recent paper co-authored with Goldfeder, Wyder, and Shwartz-Ziv, LeCun argues for what he calls Superhuman Adaptable Intelligence (SAI) — systems defined not by generality but by the speed with which they adapt to new tasks. Crucially, the paper advocates "hierarchy and diversity of models and modalities," positioning world models as one component (for planning and zero-shot transfer) rather than a wholesale replacement for large language models (LLMs). "SAI doesn't dictate a specific architecture," the authors write. This is a more measured position than the rhetoric suggests, and one that is, as I'll argue, closer to the right answer than either side of the public debate seems to realise.
The risk is that the discourse — amplified by capital flows and social media — hardens into a false dichotomy that the researchers' own technical writing does not support. That dichotomy is one that a seventeenth-century Italian physicist could have helped us avoid.
Three things called "world model"
Before we can assess what world models offer, we need to acknowledge that the term is a mess. It has been stretched by the same hype that fuels speculative capital and research funding, and at least three very different things are being marketed under the same label.
The first is spatial generation. This is what World Labs does with Marble, taking text or images and producing navigable (virtual) 3D environments using techniques like Gaussian splatting. These are impressive rendering engines, but calling them "world models" is a stretch. They produce environments you can look at, not environments that behave according to learned dynamics. (You can gauge how untethered the terminology has become when a recent NVIDIA paper about correcting camera white balance artefacts in 3D reconstruction gets introduced by a popular science channel as "NVIDIA's Insane AI Found The Math Of Reality.")
The second is interactive simulation. DeepMind's Genie 3 sits here. It generates environments that agents can act within and learn from. This concept has a serious intellectual lineage that goes back through the Ha and Schmidhuber "World Models" paper (2018), Schmidhuber's earlier work on curiosity-driven world models, and Richard Sutton's Dyna architecture (1991), and ultimately to the model-based reinforcement learning (RL) tradition. More recently, MuZero demonstrated that a system can learn a dynamics model from scratch and use it to plan at superhuman level across diverse games. These are systems that learn a transition model and use it for planning.
The third is what LeCun means, and it is the most ambitious. His use of the term comes not from 3D graphics but from cognitive science and control theory. The idea is that an AI system should learn an internal model of how its environment works. This happens not at the level of pixels, but in an abstract representation space that captures the relational structure relevant to prediction and action. You never see the world model directly, but it runs in the background generating the environment that lets an agent think a few steps ahead.
It is this third sense that matters for our purposes, because it makes a claim about understanding rather than just rendering. And it's here that the public framing of world models — even more than LeCun's own technical position — becomes revealing but, I think, also misleading.
What world models can and can't do
In my previous post, I argued that Galileo's route from inclined plane experiments to a universal law of falling bodies involved three distinct phases of reasoning:
- an exploratory phase (gathering data, noticing regularities),
- an abductive phase (inventing the explanatory concept "uniform acceleration" that accounts for the regularities), and
- a testing phase (deriving predictions from the concept and checking them against further experiment).
World models, in LeCun's sense, are well-suited to the first and third of these phases. A system that has learned the dynamics of its environment can explore that environment systematically, detect statistical regularities in how states evolve, and (crucially) test hypothetical interventions by simulating their consequences before executing them. This is what V-JEPA 2 is designed to do. It learns representations from video that support prediction and, ultimately, planning and robot control.
What world models cannot do, at least as currently conceived, is the abductive phase. They cannot invent a new concept like uniform acceleration. They can update parameters within a fixed representational framework—learning, for instance, that objects in their environment tend to accelerate at a particular rate. But this is learning within a framework, not constructing a new one.
The developmental psychologist Susan Carey has argued for a deep distinction between these two kinds of learning. In her account, most learning is what she calls enrichment. This is acquiring new information within an existing conceptual structure. But at certain critical junctures, such as in child development or in the history of science, something qualitatively different happens. The conceptual structure itself is reorganised. Old concepts are replaced, not merely refined. New representational resources are constructed that are genuinely discontinuous with what came before.
Carey developed this distinction for human cognitive development, but the structural point applies to computational systems as well — some learning requires representational reorganisation rather than parametric refinement. As we saw in the previous post, Galileo's invention of uniform acceleration is a case in point. The concept doesn't exist in the Aristotelian framework he inherited — it isn't a refinement of Aristotle's notion of natural motion; it is a replacement that introduces a new kind of quantity (constant rate of change of velocity) and a new mathematical relationship (distance proportional to the square of time) that have no counterparts in the earlier framework.
A world model that learns by gradient descent over a fixed architecture can do enrichment. That is, it can get better at predicting within its existing representational vocabulary. But Carey's radical conceptual change requires the capacity to restructure the vocabulary itself. And that is a capacity that current world model architectures do not possess. Emerging work on program synthesis and meta-learning is beginning to push this boundary, but the gap between learning new library functions and inventing the concept of uniform acceleration remains vast.
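The enrichment/restructuring distinction can be made concrete with a toy sketch (illustrative only, not a claim about any particular world model architecture). Gradient descent happily tunes the parameter of a law whose form is fixed, but it never adds a new term such as t² to the vocabulary; that term must be supplied from outside the parametric family.

```python
import numpy as np

# Toy illustration: gradient descent refines parameters within a fixed
# representational vocabulary, but cannot extend that vocabulary.
rng = np.random.default_rng(0)
t = np.linspace(0.1, 2.0, 50)
d = 0.5 * 9.8 * t**2 + rng.normal(0, 0.05, t.size)  # "inclined plane" data

def fit(basis, steps=20000, lr=0.01):
    """Least-squares fit by gradient descent over a fixed set of basis terms."""
    X = np.stack(basis, axis=1)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - d) / t.size
        w -= lr * grad
    return float(np.mean((X @ w - d) ** 2))

# "Enrichment": tune the speed parameter of a linear law d = v * t.
mse_linear = fit([t])

# "Radical change": the t**2 term is a new representational resource.
# It appears here only because we wrote it in; no number of gradient
# steps on the linear law would have produced it.
mse_quadratic = fit([t, t**2])
print(mse_linear, mse_quadratic)  # the second is far smaller
```

The point of the sketch is that the dramatic improvement comes from changing the basis, not from further optimisation within it.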
Why reasoning might be linguistic
If world models can't perform abduction on their own, then what can? Here, a somewhat counterintuitive body of work from cognitive science becomes relevant.
Hugo Mercier and Dan Sperber have argued that human reasoning is not, as we might naively assume, a general-purpose mechanism for finding truth. Instead, it evolved as a social capacity for producing and evaluating arguments in the context of communication. Reasoning, on their account, is argumentative and dialogic. We reason well not when thinking alone, but when exchanging justifications with others.
This matters for the world models debate because it suggests that the abductive phase of scientific reasoning — the production of competing explanatory hypotheses and the evaluation of their relative merits — is typically a linguistic practice. Argumentation need not be purely verbal — diagrams, demonstrations, and formal notation all play a role — but the normative core of reasoning (making claims explicit, articulating implications, weighing alternatives) is strongly associated with linguistic representation. Galileo didn't just "see" that acceleration was uniform by staring at his data. He had to formulate the hypothesis, articulate its implications, and argue that it was superior to alternatives. He did this in letters to Sarpi, in the dialogues of the Discorsi, and in the marginalia of his working papers. The concept of uniform acceleration was developed through language, in conversation with interlocutors both real and imagined — the bicameral mind in action.
If Mercier and Sperber are right, then there is a deep reason why LLMs, which are trained on the accumulated residue of human argumentation, turn out to be effective at tasks that look like reasoning. It isn't an accident, and it isn't mere mimicry. Bender et al. (2021) have argued that LLMs learn distributional statistics over text rather than meaning, and they are right that distributional learning alone does not constitute understanding. But that framing understates the degree to which distributional patterns in text encode argumentative structure, such as the patterns of claim, evidence, and inference that humans have developed over millennia for the purpose of producing and evaluating explanations. That is precisely the capacity needed for the abductive phase.
But (and this is a crucial qualification) LLMs have absorbed these patterns without any grounding in the physical world. They can produce hypotheses, but they cannot meaningfully test them. They can generate plausible explanations, but they have no way of knowing whether those explanations track the actual dynamics of any real system. LLMs hallucinate because they are doing exactly what they were trained to do (i.e. producing linguistically plausible continuations) in the absence of any environmental feedback that could constrain those continuations to reality.
There is a telling exception that proves this point. When LLMs such as Anthropic's Claude are given access to a code execution environment (e.g. a compiler, a type-checker, a test suite) their reliability improves substantially. The environment (or interface) provides the model with verification feedback — the code either runs or it doesn't, the tests either pass or they fail. This isn't quite a world model, but it functions as a rudimentary one, constraining the model's outputs to something that actually works. The same system that confabulates freely about physics becomes remarkably disciplined when its hypotheses are subject to empirical test, even if that "empiricism" is just a unit test returning red or green.
The complementarity thesis
We can now state the argument precisely. World models and language models are not competing paradigms. They are complementary components of a reasoning architecture, each contributing a capacity that the other lacks. This complementarity has been argued before in the neuro-symbolic AI literature and in Lake et al.'s influential case for combining pattern recognition with structured model-building. What the Galileo framing adds is a specific account of why complementarity is needed — not just engineering convenience, but the structure of scientific reasoning itself.
World models provide sensitivity to the structure of the environment through interaction with it. They learn how the world pushes back, they register regularities, and they enable prediction and, through simulated intervention, testing. But they also operate within a fixed representational framework that cannot generate the explanatory concepts needed to interpret what they perceive.
Language models, in contrast, have the ability to generate, articulate, and evaluate explanatory hypotheses. They provide the abductive capacity, and can propose that acceleration is uniform, that trajectories are parabolic, or that the horizontal and vertical components of motion are independent. But they generate these proposals in a vacuum, with no contact with the phenomena they purport to explain.
Each alone recapitulates one of Galileo's failures. A world model without abductive capacity is folio 81r: meticulous data with no theoretical framework to make it speak. A language model without environmental grounding is the inverse: a theorist who has never entered the laboratory, generating elegant explanations that may bear no relation to how the world actually behaves (see the previous post for further details).
What Galileo achieved was the integration of these capacities. The laboratory (the inclined plane, the table, the floor) provided the structured environment. The theorist (Galileo himself, working through language and mathematics) provided the explanatory concepts. And the integration was achieved through a tight loop — the environment constraining the theory, the theory directing attention within the environment, the doveria esser standing as the point of contact between the two.
It is worth noting that Goldfeder et al.'s SAI framework implicitly supports this complementarity. Their call for modular, diverse architectures — world models for planning, self-supervised learning for general knowledge, specialised modules for specific tasks — is entirely compatible with the integrative picture I'm sketching here. But SAI has a specific gap. It defines intelligence by adaptation speed: how quickly a system can learn new tasks. In Carey's terms, this is enrichment — rapid learning within a fixed representational framework. What SAI does not address is the capacity to restructure the task space itself. Specialisation alone does not produce paradigm shifts. As Kuhn argued, revolutionary science does not arise from individual specialists adapting faster within a shared framework. It emerges at a higher level of abstraction — from scientific communities where competing explanatory frameworks are articulated, debated, and replaced through a marketplace of ideas. A collection of superhuman specialists, each excelling within its domain, will not spontaneously generate the concept of uniform acceleration. For that, you need something that can stand outside any single specialisation and ask whether the framework itself should change. That is precisely what the reconstructive dimension of the agentic digital twin taxonomy (discussed in the next post) is designed to capture.
Convergent diagnosis, divergent prescriptions
It is worth pausing to note that this diagnosis has been reached independently from a very different starting point. Zahavy (2026), writing from Google DeepMind, argues that LLMs are structurally incapable of the "abductive Jump" — the generation of novel explanatory hypotheses — because they lack the sensory grounding that Einstein's thought experiments provided. Using Einstein's formulation of general relativity as a case study, and drawing on Magnani's concept of "manipulative abduction," Zahavy frames hypothesis generation as a fundamentally embodied process rooted in physical simulation. His prescription: physically consistent world models as a "synthetic laboratory" for counterfactual reasoning. The convergence of two independent analyses — one from Galileo's inclined plane, one from Einstein's falling elevator — is itself evidence that the diagnosis is robust.
But the diagnosis is shared; the theory of abduction diverges, and the divergence matters architecturally. Zahavy, following Magnani, locates abduction in embodied simulation — a non-linguistic act. Einstein himself said that "words or the language... do not seem to play any role in my mechanism of thought." But the choice of case study matters here. Einstein's elevator is a private thought experiment; Galileo's reasoning was developed in letters to Sarpi, in the dialogues of the Discorsi, and in the marginalia of his working papers. The concept of uniform acceleration was not born in silent contemplation — it was articulated, argued for, and refined through language. If Mercier and Sperber are right that reasoning is fundamentally argumentative, then the capacity LLMs have absorbed from the accumulated residue of human argumentation is not incidental to abduction but central to it. What LLMs lack is not the abductive capacity itself but the environmental grounding to constrain it. This is a fundamentally different architectural claim: world models are necessary but not sufficient; you also need the argumentative engine.
Zahavy also treats "the Jump" as monolithic — a single cognitive leap from experience to axioms. Carey's enrichment/radical-change distinction provides finer-grained architectural requirements. Both SAI and Zahavy's account optimise within existing representational frameworks (enrichment); neither addresses the capacity to restructure the framework itself. And as argued above, Kuhn's paradigm shifts do not arise from individual specialists — nor from an individual system, however sophisticated its world model — but from scientific communities where competing frameworks are debated and replaced. Zahavy proposes a better laboratory; the question this series addresses is who does the science inside it, and how we govern the result.
The architectural question
LeCun is right that world models are essential. Without them, AI remains ungrounded — fluent but disconnected from reality. And his published work — including the SAI paper's advocacy for modular systems-of-systems — is broadly compatible with the integrative architecture I'll describe in the next post. Agentic digital twins are themselves modular architectures, and the taxonomy we propose doesn't prescribe a monolithic design but maps the design space within which diverse, specialised components can be orchestrated. The gap is not in what LeCun proposes but in what remains unaddressed: the abductive capacity that turns a collection of specialist modules into something capable of genuine scientific reasoning. A world model is not a scientist; it is a laboratory. A laboratory extends the intelligence of the scientist. But a laboratory without a scientist is just a room full of equipment.
The question, then, is not world models or language models, but what kind of architecture integrates both? What system can maintain a learned model of its environment, generate explanatory hypotheses about that environment's dynamics, derive predictions from those hypotheses, test them through simulated or actual intervention, and (most ambitiously) restructure its own representational framework when the hypotheses fail?
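As a structural sketch only, the loop that question describes might look like the following. Everything here is a toy: the "environment" stands in for a learned world model, the hypothesis vocabulary for a representational framework, and the restructuring step is hard-coded where a real architecture would need the theorist.

```python
import numpy as np

def environment(t):
    """Stands in for the learned world model / laboratory: d = 0.5 * a * t**2."""
    return 0.5 * 9.8 * t**2

t = np.linspace(0.1, 2.0, 40)
observed = environment(t)

# The representational vocabulary: candidate explanatory forms.
vocabulary = {"d ~ t": lambda t: t}  # the inherited framework

def best_fit_error(basis):
    """Enrichment: find the best parameter for a hypothesis of fixed form."""
    X = basis(t)[:, None]
    w, *_ = np.linalg.lstsq(X, observed, rcond=None)
    return float(np.mean((X @ w - observed) ** 2))

errors = {}
for step in range(2):
    # Derive predictions and test each hypothesis against the environment.
    errors = {name: best_fit_error(b) for name, b in vocabulary.items()}
    if min(errors.values()) < 1e-6:
        break
    # Every hypothesis fails: restructure the vocabulary. This is the
    # abductive step, hard-coded here; supplying it is the theorist's job.
    vocabulary["d ~ t^2"] = lambda t: t**2

print(min(errors, key=errors.get))  # -> d ~ t^2
```

The loop's weakest link is exactly the line marked as hard-coded: testing and parameter-fitting are mechanical, but the new hypothesis had to come from somewhere outside the loop.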
That question turns out to have a surprisingly concrete answer, one already being explored in the context of critical infrastructure, healthcare, and climate systems. It's the subject of my next post.
← Previous: What Galileo Knew That AI Doesn't (Yet)
Next in this series: Virtual Scientists for Real Infrastructure →