Three ways AI is learning to understand the physical world

Large language models face limitations in areas that require an understanding of the physical world—from robotics to autonomous driving to manufacturing. That limitation is pushing investors toward world models, with AMI Labs raising a $1.03 billion seed round shortly after World Labs secured $1 billion.

Large language models (LLMs) excel at processing abstract knowledge by predicting the next token, but they fundamentally lack a basis in physical causality. They cannot reliably predict the physical consequences of actions in the real world.

AI researchers and thought leaders are increasingly vocal about these limitations as the industry tries to push AI out of web browsers and into physical spaces. In an interview with podcaster Dwarkesh Patel, Turing Award winner Richard Sutton warned that LLMs are simply mimicking what people say instead of modeling the world, which limits their ability to learn from experience and adapt to changes in the world.

This is why models based on LLMs, including visual language models (VLMs), can exhibit brittle behavior and break with very small changes to their inputs.

Google DeepMind CEO Demis Hassabis echoed this sentiment in another interview, pointing out that today's AI models suffer from "jagged intelligence": they can solve olympiad-level math problems yet fail at basic physics because they lack a grounded understanding of real-world dynamics.

To solve this problem, researchers are shifting focus to building world models that act as internal simulators, allowing AI systems to safely test hypotheses before taking physical action. However, "world model" is a general term that covers several different architectural approaches.

Three distinct approaches have emerged, each with its own trade-offs.

JEPA: built for real time

The first major approach focuses on learning latent representations rather than trying to predict the dynamics of the world at the pixel level. Championed by AMI Labs, this method is largely based on the Joint Embedding Predictive Architecture (JEPA).

JEPA models try to mimic the way people make sense of the world. When we observe the world, we don’t remember every single pixel or irrelevant detail in a scene. For example, if you observe a car moving down a street, you track its trajectory and speed; you don’t calculate the exact light reflection on every single leaf of the trees in the background.

JEPA models replicate this human cognitive shortcut. Instead of forcing the neural network to predict exactly what the next frame of the video will look like, the model learns a smaller set of abstract or "latent" features. It discards irrelevant details and focuses entirely on the basic rules of how the elements in the scene interact. This makes the model robust to background noise and small changes that break other models.
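The core idea above—computing the prediction loss in a small latent space rather than in pixel space—can be illustrated with a toy sketch. This is not the AMI Labs implementation; the encoder, predictor, and all dimensions are illustrative stand-ins for learned neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(obs, W):
    """Toy encoder: projects a high-dimensional observation to a small latent vector."""
    return np.tanh(W @ obs)

# Hypothetical dimensions: 64-"pixel" observations, 8-dim latents.
obs_dim, latent_dim = 64, 8
W_enc = rng.normal(size=(latent_dim, obs_dim)) * 0.1      # shared encoder weights
W_pred = rng.normal(size=(latent_dim, latent_dim)) * 0.1  # predictor weights

frame_t = rng.normal(size=obs_dim)                    # current observation
frame_t1 = frame_t + rng.normal(size=obs_dim) * 0.01  # next observation (tiny change)

z_t = encoder(frame_t, W_enc)    # latent of current frame
z_t1 = encoder(frame_t1, W_enc)  # target: latent of next frame
z_pred = W_pred @ z_t            # predictor guesses the next latent

# JEPA-style objective: distance measured in LATENT space, never pixel space,
# so irrelevant pixel-level noise is discarded by the encoder.
latent_loss = np.mean((z_pred - z_t1) ** 2)
```

In a real system both the encoder and predictor are trained jointly, with safeguards against the encoder collapsing to a constant output.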

This architecture is highly computationally and memory efficient. By ignoring irrelevant details, it requires far fewer training examples and runs with significantly lower latency. These features make it suitable for applications where efficiency and real-time inference are essential, such as robotics, self-driving cars, and high-stakes enterprise workflows.

For example, AMI has partnered with healthcare company Nabla to use this architecture to simulate operational complexity and reduce cognitive load in fast-paced healthcare settings.

Yann LeCun, pioneer of the JEPA architecture and co-founder of AMI, explained in an interview with Newsweek that JEPA-based world models are designed to be "controllable in the sense that you can give them goals and by design the only thing they can do is achieve those goals."

Gaussian splats: built for space

The second approach uses generative models to build a complete spatial environment from scratch. Adopted by companies such as World Labs, this method takes an initial prompt (an image or a text description) and uses a generative model to create a 3D Gaussian splat. Gaussian splatting is a technique for rendering 3D scenes using millions of tiny mathematical particles that define geometry and lighting. Unlike flat video generation, these 3D representations can be imported directly into standard physics and 3D engines, such as Unreal Engine, where users and other AI agents can freely navigate and interact with them from any angle.
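To make the "millions of tiny mathematical particles" concrete, here is a minimal sketch of what a single particle in a splat scene might carry. The field names and the simplified, axis-aligned density function are illustrative assumptions, not World Labs' format.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Gaussian3D:
    """One particle of a Gaussian splat scene (hypothetical minimal fields)."""
    mean: np.ndarray      # (3,) center position in world space
    scale: np.ndarray     # (3,) ellipsoid radii along local axes
    rotation: np.ndarray  # (4,) unit quaternion orienting the ellipsoid
    color: np.ndarray     # (3,) RGB in [0, 1]
    opacity: float        # alpha in [0, 1], blended at render time

def density_at(g: Gaussian3D, point: np.ndarray) -> float:
    """Unnormalized Gaussian density at a point (rotation ignored for simplicity)."""
    d = (point - g.mean) / g.scale
    return g.opacity * float(np.exp(-0.5 * d @ d))

# A "scene" is just millions of these particles; one suffices as illustration.
g = Gaussian3D(mean=np.zeros(3), scale=np.ones(3),
               rotation=np.array([1.0, 0.0, 0.0, 0.0]),
               color=np.array([0.8, 0.2, 0.2]), opacity=0.9)
center_density = density_at(g, np.zeros(3))  # peaks at the mean: 0.9
```

Because each particle is an explicit geometric object rather than a generated video frame, an engine can rasterize, relight, or collide against the scene from any viewpoint.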

The main advantage here is a drastic reduction in the time and one-time generation costs required to create complex interactive 3D environments. It addresses the exact problem outlined by World Labs founder Fei-Fei Li, who noted that LLMs end up being "wordsmiths in the dark," possessing eloquent language but lacking spatial intelligence and physical grounding. World Labs' Marble model gives the AI that missing spatial sense.

Although this approach is not designed for split-second, real-time execution, it has enormous potential for spatial computing, interactive entertainment, industrial design, and building static robotics training environments. The enterprise value is evident in Autodesk's strong backing of World Labs, aimed at integrating these models into its industrial design applications.

End-to-end generation: built for scale

The third approach uses an end-to-end generative model to process prompts and user actions, continuously generating scene, physical dynamics, and reactions on the fly. Instead of exporting a static 3D file to an external physics engine, the model itself acts as the engine. It takes an initial prompt along with a continuous stream of user actions and generates subsequent frames of the environment in real-time, naturally computing physics, lighting, and object reactions.
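The loop described above—prompt in, then a continuous stream of actions with the model generating each subsequent frame—can be sketched as follows. The linear "model" is a stand-in for a large learned network; all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model_step(frame, action, W):
    """Stand-in for a learned generative model mapping (frame, action) -> next frame.
    In a real system this is a large neural network; here, a fixed linear map."""
    x = np.concatenate([frame.ravel(), action])
    return np.tanh(W @ x).reshape(frame.shape)

H, W_px, n_actions = 4, 4, 3                # tiny "frames" and action space
W_model = rng.normal(size=(H * W_px, H * W_px + n_actions)) * 0.1

frame = rng.normal(size=(H, W_px))          # frame generated from the initial prompt
trajectory = [frame]
for step in range(24):                      # e.g. one second at 24 frames per second
    action = np.zeros(n_actions)
    action[step % n_actions] = 1.0          # simulated user input arriving each tick
    frame = world_model_step(frame, action, W_model)  # the model IS the engine
    trajectory.append(frame)
```

Note there is no external physics engine anywhere in the loop: object behavior, lighting, and reactions all have to emerge from the model's single forward pass per frame, which is exactly why this approach is so compute-hungry.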

DeepMind’s Genie 3 and Nvidia’s Cosmos fall into this category. These models provide an extremely simple interface for generating endless interactive experiences and huge volumes of synthetic data. DeepMind showcases this with Genie 3, demonstrating how the model maintains strict object persistence and consistent physics at 24 frames per second without relying on a separate memory module.

This approach translates directly into large-scale synthetic data factories. Nvidia Cosmos uses this architecture to scale synthetic data generation and physical AI reasoning, allowing autonomous vehicle and robotics developers to synthesize rare, dangerous edge cases without the expense or risk of physical testing. Waymo (another Alphabet subsidiary) built its world model on Genie 3, adapting it to train its self-driving cars.

The disadvantage of this end-to-end generative method is the heavy computational cost of continuously rendering physics and pixels simultaneously. Still, proponents see the investment as necessary to achieve the vision laid out by Hassabis, who argues that AI needs a deep internal understanding of physical causation because current models lack the critical capabilities to operate safely in the real world.

What’s next: hybrid architectures

LLMs will continue to serve as an interface for reasoning and communication, but world models are positioned as the core infrastructure for physical and spatial data pipelines. As the underlying models mature, we are beginning to see hybrid architectures emerge that draw on the strengths of each approach.

For example, cybersecurity startup DeepTempo recently developed LogLM, a model that integrates elements of LLMs and JEPA to detect anomalies and cyberthreats from security and network logs.
