
How One AI Model Creates a Physical Intuition of Its Environment

The V-JEPA system uses ordinary videos to understand the physics of the real world.

The model demonstrates a notion of “surprise” when shown unphysical scenarios.


Introduction

Here’s a test for infants: Show them a glass of water on a desk. Hide it behind a wooden board. Now move the board toward the glass. If the board keeps going past the glass, as if it weren’t there, are they surprised? Many 6-month-olds are, and by a year, almost all children have an intuitive notion of an object’s permanence, learned through observation. Now some artificial intelligence models do too.

Researchers have developed an AI system that learns about the world via videos and demonstrates a notion of “surprise” when presented with information that goes against the knowledge it has gleaned.

The model, created by Meta and called Video Joint Embedding Predictive Architecture (V-JEPA), does not make any assumptions about the physics of the world contained in the videos. Nonetheless, it can begin to make sense of how the world works.

“Their claims are, a priori, very plausible, and the results are super interesting,” said Micha Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world.

Higher Abstractions

As the engineers who build self-driving cars know, it can be hard to get an AI system to reliably make sense of what it sees. Most systems designed to “understand” videos in order to either classify their content (“a person playing tennis,” for example) or identify the contours of an object — say, a car up ahead — work in what’s called “pixel space.” The model essentially treats every pixel in a video as equal in importance.

But these pixel-space models come with limitations. Imagine trying to make sense of a suburban street. If the scene has cars, traffic lights and trees, the model might focus too much on irrelevant details such as the motion of the leaves. It might miss the color of the traffic light, or the positions of nearby cars. “When you go to images or video, you don’t want to work in [pixel] space because there are too many details you don’t want to model,” said Randall Balestriero, a computer scientist at Brown University.


Yann LeCun, a computer scientist at New York University and the director of AI research at Meta, created JEPA, a predecessor to V-JEPA that works on still images, in 2022.


The V-JEPA architecture, released in 2024, is designed to avoid these problems. While the specifics of the various artificial neural networks that make up V-JEPA are complex, the basic concept is simple.

Ordinary pixel-space systems go through a training process that involves masking some pixels in the frames of a video and training neural networks to predict the values of those masked pixels. V-JEPA also masks portions of video frames. But it doesn’t predict what’s behind the masked regions at the level of individual pixels. Rather, it uses higher-level abstractions, or “latent” representations, to model the content.

Latent representations capture only essential details about data. For example, given line drawings of various cylinders, a neural network called an encoder can learn to convert each image into numbers representing fundamental aspects of each cylinder, such as its height, width, orientation and location. By doing so, the information contained in hundreds or thousands of pixels is converted into a handful of numbers — the latent representations. A separate neural network called a decoder then learns to convert the cylinder’s essential details into an image of the cylinder.
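To make this concrete, here is a minimal sketch of the encoder-decoder idea under toy assumptions: a tiny network compresses a 64-by-64 drawing into just four latent numbers, and a second network tries to redraw the image from those numbers alone. The layer sizes, and the choice of PyTorch, are illustrative assumptions, not details of Meta’s system.

```python
# Toy sketch of an encoder and decoder (illustrative; not Meta's code).
import torch
import torch.nn as nn

LATENT_DIM = 4  # rough stand-ins for height, width, orientation and location

encoder = nn.Sequential(
    nn.Flatten(),                      # 64x64 pixels -> 4,096 values
    nn.Linear(64 * 64, 256), nn.ReLU(),
    nn.Linear(256, LATENT_DIM),        # 4,096 pixel values -> 4 latent numbers
)

decoder = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64),
    nn.Unflatten(1, (1, 64, 64)),      # 4 latent numbers -> a 64x64 image again
)

image = torch.rand(1, 1, 64, 64)       # stand-in for one line drawing of a cylinder
latent = encoder(image)                # shape (1, 4): the compressed description
reconstruction = decoder(latent)       # the decoder's attempt to redraw the cylinder
```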

V-JEPA focuses on creating and reproducing latent representations. At a high level, the architecture is split into three parts: encoder 1, encoder 2, and a predictor. First, the training algorithm takes a set of video frames, masks the same set of pixels in all frames, and feeds the frames into encoder 1. Sometimes, the final few frames of the video are fully masked. Encoder 1 converts the masked frames into latent representations. The algorithm also feeds the unmasked frames in their entirety into encoder 2, which converts them into another set of latent representations.

Now the predictor gets into the act. It uses the latent representations produced by encoder 1 to predict the output of encoder 2. In essence, it takes latent representations generated from masked frames and predicts the latent representations generated from the unmasked frames. By re-creating the relevant latent representations, rather than the missing pixels that earlier systems predicted, the model learns to see the cars on the road and not fuss about the leaves on the trees.
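A highly simplified sketch of that training step appears below. It is not Meta’s implementation; the real V-JEPA uses vision transformers, spatiotemporal patch masking and an exponential-moving-average copy of the encoder on the target side, and the layer shapes, clip size and learning rate here are invented for illustration. What the sketch does capture is the key point: the loss is computed between latent representations, never between pixels.

```python
# Simplified JEPA-style training step (illustrative; not Meta's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 128  # latent dimension, chosen arbitrarily for this sketch

encoder_1 = nn.Sequential(nn.Flatten(), nn.Linear(16 * 64 * 64, D))  # sees masked frames
encoder_2 = nn.Sequential(nn.Flatten(), nn.Linear(16 * 64 * 64, D))  # sees full frames
predictor = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))

optimizer = torch.optim.Adam(
    list(encoder_1.parameters()) + list(predictor.parameters()), lr=1e-4
)

clip = torch.rand(1, 16, 64, 64)              # a 16-frame grayscale clip (stand-in)
mask = (torch.rand_like(clip) > 0.5).float()
masked_clip = clip * mask                     # hide the same pixels in every frame

z_masked = encoder_1(masked_clip)             # latents from the masked view
with torch.no_grad():                         # encoder 2 provides the target; no gradients
    z_target = encoder_2(clip)                # latents from the unmasked view

optimizer.zero_grad()
loss = F.mse_loss(predictor(z_masked), z_target)  # compare latents, not pixels
loss.backward()
optimizer.step()
```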

“This enables the model to discard unnecessary … information and focus on more important aspects of the video,” said Quentin Garrido, a research scientist at Meta. “Discarding unnecessary information is very important and something that V-JEPA aims at doing efficiently.”

Once this pretraining stage is complete, the next step is to tailor V-JEPA to specific tasks such as classifying images or identifying actions depicted in videos. This adaptation phase requires some human-labeled data. For example, videos have to be tagged with information about the actions they contain. Adapting to these final tasks requires far less labeled data than training the whole system end to end for each downstream task would. In addition, the same encoder and predictor networks can be adapted for different tasks.
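As a rough illustration of that adaptation step, the sketch below freezes a stand-in for the pretrained encoder and trains only a small classification head on labeled clips. The action classes, tensor shapes and hyperparameters are hypothetical, not taken from Meta’s setup.

```python
# Toy adaptation step: frozen pretrained encoder plus a small trainable head.
import torch
import torch.nn as nn

D = 128
NUM_ACTIONS = 3  # e.g. "playing tennis", "cooking", "driving" (illustrative labels)

# Stand-in for the pretrained encoder; in practice its weights come from the
# self-supervised pretraining stage and are kept frozen here.
pretrained_encoder = nn.Sequential(nn.Flatten(), nn.Linear(16 * 64 * 64, D))
for p in pretrained_encoder.parameters():
    p.requires_grad = False

head = nn.Linear(D, NUM_ACTIONS)            # only this small head is trained on labels
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

labeled_clip = torch.rand(1, 16, 64, 64)    # one labeled video clip (stand-in)
label = torch.tensor([1])                   # its human-provided action label

optimizer.zero_grad()
logits = head(pretrained_encoder(labeled_clip))
loss = loss_fn(logits, label)
loss.backward()
optimizer.step()
```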

Intuition Mimic

In February, the V-JEPA team reported how well their systems understood intuitive physical properties of the real world — properties such as object permanence, the constancy of shape and color, and the effects of gravity and collisions. On a test called IntPhys, which requires AI models to identify whether the actions happening in a video are physically plausible or implausible, V-JEPA was nearly 98% accurate. A well-known model that predicts in pixel space performed only a little better than chance.

Autonomous robots need something like a physical intuition in order to plan their movements and interact with the physical environment.


The V-JEPA team also explicitly quantified the “surprise” exhibited by their model when its predictions did not match observations. They took a V-JEPA model pretrained on natural videos, fed it new videos, then mathematically calculated the difference between what V-JEPA expected to see in future frames of the video and what actually happened. The team found that the prediction error shot up when the future frames contained physically impossible events. For example, if a ball rolled behind some occluding object and temporarily disappeared from view, the model registered a large error when the ball didn’t reappear from behind the object in future frames. The reaction was akin to the intuitive response seen in infants. V-JEPA, one could say, was surprised.
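One way to picture that measurement is as a prediction error computed in latent space, as in the hedged sketch below. The function and variable names are placeholders rather than Meta’s API; the idea is simply that a clip containing a physically impossible event yields a larger gap between the predicted and observed latents.

```python
# Placeholder sketch of a latent-space "surprise" score (not Meta's API).
import torch
import torch.nn.functional as F

def surprise_score(encoder, predictor, past_frames, future_frames):
    """Prediction error for one clip; a higher value means a bigger surprise."""
    with torch.no_grad():
        z_past = encoder(past_frames)        # latents summarizing what was seen so far
        z_predicted = predictor(z_past)      # what the model expects to come next
        z_actual = encoder(future_frames)    # what actually happened
    return F.mse_loss(z_predicted, z_actual).item()

# A clip in which the hidden ball never reappears should score higher than an
# ordinary clip: surprise_score(encoder, predictor, normal_past, impossible_future)
```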

Heilbron is impressed by V-JEPA’s ability. “We know from developmental literature that babies don’t need a lot of exposure to learn these types of intuitive physics,” he said. “It’s compelling that they show that it’s learnable in the first place, and you don’t have to come with all these innate priors.”

Karl Friston, a computational neuroscientist at University College London, thinks that V-JEPA is on the right track in terms of mimicking the “way our brains learn and model the world.” However, it still lacks some fundamental elements. “What is missing from [the] current proposal is a proper encoding of uncertainty,” he said. For example, if the information in the past frames isn’t enough to accurately predict the future frames, the prediction is uncertain, and V-JEPA doesn’t quantify this uncertainty.

In June, the V-JEPA team at Meta released their next-generation 1.2-billion-parameter model, V-JEPA 2, which was pretrained on 22 million videos. They also applied the model to robotics: They showed how to further fine-tune a new predictor network using only about 60 hours of robot data (including videos of the robot and information about its actions), then used the fine-tuned model to plan the robot’s next action. “Such a model can be used to solve simple robotic manipulation tasks and paves the way to future work in this direction,” Garrido said.

To push V-JEPA 2, the team designed a more difficult benchmark for intuitive physics understanding, called IntPhys 2. V-JEPA 2 and other models did only slightly better than chance on these tougher tests. One reason, Garrido said, is that V-JEPA 2 can handle only a few seconds of video as input and predict a few seconds into the future. Anything longer is forgotten. You could make the comparison again to infants, but Garrido had a different creature in mind. “In a sense, the model’s memory is reminiscent of a goldfish,” he said.
