
In their quest to endow AI agents with human-level performance on complex visual tasks, Meta researchers studied, among other things, how babies learn.
“Today, we’re excited to share V-JEPA 2 (Joint Embedding Predictive Architecture 2), the first world model trained on video that enables state-of-the-art understanding and prediction, as well as zero-shot planning and robot control in new environments,” Meta announced on July 11, 2025. “As we work toward our goal of achieving advanced machine intelligence (AMI), it will be important that we have AI systems that can learn about the world as humans do, plan how to execute unfamiliar tasks, and efficiently adapt to the ever-changing world around us.”
Starting at just days old, babies can discern solid objects. At 2 to 4 months, they understand object permanence, that an object or person still exists even when they can’t be seen or heard. By 6 months, they grasp causality, knowing, for example, that when they shake a rattle, it will produce noise. At around 8 months, they begin understanding the concept of gravity. And as they near their first birthday, babies develop shape constancy—the knowledge that the shape of objects doesn’t change depending on the angle from which they’re viewed.
Even a simple act such as tossing a tennis ball into the air and predicting what it will do requires physical intuition, a sense humans develop before they can form full sentences, researchers noted. The development of V-JEPA 2 was inspired by that kind of learning.
Yann LeCun, chief AI scientist at Meta, said building a world model is akin to embedding an AI agent with common sense.
“A world model is like an abstract digital twin of reality that an AI can reference to understand the world and predict consequences of its action, and therefore it would be able to plan a course of action to accomplish a given task,” LeCun said. “It does not need millions of trials to learn something new because the world model provides a fundamental understanding of how the world works. The impact of AI that can reason and plan using world models would be vast. Imagine assistive technology that helps people with visual impairment. AI agents in mixed reality could provide guidance through complex tasks, making education more personalized.”
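In code, the kind of planning LeCun describes often reduces to a search over imagined futures: roll candidate action sequences forward with the world model and keep the one whose predicted outcome lands closest to the goal. The Python sketch below is purely illustrative; the `encode` and `predict` callables, the random-shooting search, and the distance-to-goal cost are assumptions standing in for whatever a real system would use, not Meta's API.

```python
# Illustrative planning loop over a hypothetical learned world model.
# `encode` and `predict` are stand-ins, not Meta's actual interfaces.
import torch

def plan(encode, predict, current_frame, goal_frame,
         horizon=5, num_candidates=256, action_dim=4):
    """Pick the action sequence whose predicted outcome is closest to the goal."""
    z = encode(current_frame)        # current state as an embedding
    z_goal = encode(goal_frame)      # desired state as an embedding

    # Sample random candidate action sequences (a simple random-shooting planner).
    actions = torch.randn(num_candidates, horizon, action_dim)

    costs = torch.zeros(num_candidates)
    for i in range(num_candidates):
        z_pred = z
        for t in range(horizon):
            # Roll the world model forward in embedding space.
            z_pred = predict(z_pred, actions[i, t])
        # Cost: how far the predicted final state is from the goal state.
        costs[i] = torch.norm(z_pred - z_goal)

    best = torch.argmin(costs)
    return actions[best, 0]          # execute only the first action, then replan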
V-JEPA 2 is trained through self-supervised learning using more than 1 million hours of video and 1 million images from diverse sources, according to Meta. The model has two main components—an encoder and a predictor. The encoder processes raw video and outputs embeddings that capture useful semantic information about the observed world. The predictor takes those embeddings and additional contextual data and outputs predictions in the form of new embeddings.
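A rough way to picture that two-part design is the sketch below: an encoder turns video patches into embeddings, and a predictor fills in embeddings for content it hasn't seen, trained to match the encoder's own output on the held-out portions. Layer sizes, class names, and the loss shown are placeholders for illustration, not V-JEPA 2's actual architecture or training recipe.

```python
# Minimal sketch of a JEPA-style encoder/predictor pair; all hyperparameters
# and the masking/training details are assumptions, not V-JEPA 2's real design.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a clip of video patches to a sequence of embeddings."""
    def __init__(self, patch_dim=768, embed_dim=1024):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=4)

    def forward(self, patches):  # patches: (batch, num_patches, patch_dim)
        return self.backbone(self.proj(patches))

class Predictor(nn.Module):
    """Predicts embeddings of hidden (masked or future) content from visible context."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        self.net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
            num_layers=2)

    def forward(self, context_embeddings):
        return self.net(context_embeddings)

# Self-supervised objective: the predictor's output should match the encoder's
# embeddings of the parts of the video the predictor did not get to see.
def jepa_loss(pred_embeddings, target_embeddings):
    return nn.functional.mse_loss(pred_embeddings, target_embeddings.detach())
```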
That vast quantity of visual data helps the model learn how the world works, including how people interact with objects, how objects move, and how objects interact with one another.
V-JEPA 2 doesn’t yet measure up to human performance on certain benchmarks, such as the ability to distinguish between physically plausible and implausible scenarios. In one example cited by researchers, a ball rolls down a smooth slope—one version unobstructed, another interrupted by a ramp. In both cases, the video cuts from the start of the ball’s movement to the moment it reaches the bottom. Observers must infer whether the ball’s trajectory was plausible, based on the partial clip. Human subjects achieved near-perfect accuracy in identifying the correct path of the ball. The models, by contrast, performed “at or close to chance.”
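One common way such plausibility tests are scored, sketched below with hypothetical `encode` and `predict_next` functions, is to measure the model's "surprise" on each version of the clip: the version the model predicts more accurately is treated as the physically plausible one. This is a generic recipe for illustration, not necessarily the exact protocol Meta's researchers used.

```python
# Generic sketch of surprise-based plausibility scoring. `encode` and
# `predict_next` are hypothetical callables, not part of any released API.
import torch

def surprise(encode, predict_next, frames):
    """Average error when forecasting each frame's embedding from the previous one."""
    errors = []
    for prev_frame, next_frame in zip(frames[:-1], frames[1:]):
        z_pred = predict_next(encode(prev_frame))   # model's guess for the next state
        z_true = encode(next_frame)                 # what actually happened
        errors.append(torch.norm(z_pred - z_true).item())
    return sum(errors) / len(errors)

def pick_plausible(encode, predict_next, clip_a, clip_b):
    """The clip the model predicts more accurately is judged physically plausible."""
    a = surprise(encode, predict_next, clip_a)
    b = surprise(encode, predict_next, clip_b)
    return "clip_a" if a < b else "clip_b"
```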
“There are several areas we plan to explore further as we continue our work on world models,” Meta said. “Currently, V-JEPA 2 learns and makes predictions at a single time scale. However, many tasks require planning across multiple time scales. Think of breaking down a high-level task into smaller steps, such as loading the dishwasher or baking a cake. We want to focus on training hierarchical JEPA models that are capable of learning, reasoning and planning across multiple temporal and spatial scales. Another important direction will be multimodal JEPA models that can make predictions using a variety of senses, including vision, audio and touch. As always, we look forward to sharing more in the future and continuing the important discussions we’re having with the research community.”
Google, for its part, introduced its own world model on Dec. 4, 2024: Genie 2.
“Today we introduce Genie 2, a foundation world model capable of generating an endless variety of action-controllable, playable 3D environments for training and evaluating embodied agents. Based on a single prompt image, it can be played by a human or AI agent using keyboard and mouse inputs,” the company announced.
Google stated that Genie 2 can simulate virtual worlds, including the consequences of taking any action. Like V-JEPA 2, it was trained on a large-scale video dataset and demonstrates “various emergent capabilities at scale, such as object interactions, complex character animation, physics, and the ability to model and thus predict the behavior of other agents.”