Joint Embedding Predictive Architecture: why Yann LeCun thinks next-token prediction is a dead end, and what he proposes instead.
Large language models are genuinely impressive at tasks that can be solved through pattern matching over text — translation, summarisation, code completion. But Yann LeCun argues that this entire class of model is missing something fundamental: a model of the world.
A child who has never read a physics textbook knows that a glass will fall if pushed off a table. They know this because they have built an internal model of physical reality through experience. LeCun's claim is that text alone does not capture the continuity of the physical world, so next-token prediction over text cannot learn it.
| TASK | LLM | HUMAN | WHY THE GAP |
|---|---|---|---|
| Completing a sentence | ✓ | ✓ | Pattern matching over training data |
| Translating language | ✓ | ✓ | Statistical co-occurrence |
| Catching a thrown ball | ✗ | ✓ | Requires a physical world model |
| Planning a route | ✗ | ✓ | Needs multi-step simulation |
| Understanding cause & effect | ✗ | ✓ | Causal reasoning, not correlation |
| Learning from a few examples | ✗ | ✓ | Humans generalise from a handful of examples |
The key insight behind JEPA is to change where predictions are made. Generative models, whether image diffusion models or language models, make predictions in data space: reconstruct the pixels, predict the next token. This forces the model to spend capacity on every detail of the input, including details that are irrelevant or inherently unpredictable.
JEPA instead trains a model to predict in representation space. Given a context region of an image or video, the model must predict the abstract representation of a masked region — not what the pixels look like, but what the underlying structure is. This is closer to how humans reason: you do not mentally reconstruct every photon when you imagine what is behind a closed door.
“The main idea of JEPA is to predict the representation of the target, not the target itself. This allows the model to be abstract and to discard irrelevant information.” (Yann LeCun)
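The difference between the two prediction targets can be shown with a toy example. This is a minimal sketch, not a JEPA implementation: `encode` is a hand-written stand-in for a learned encoder (it keeps only the patch mean), and the noisy patches are synthetic data invented for the illustration.

```python
import random

def encode(patch):
    """Toy stand-in for a learned encoder: keep the mean, drop per-pixel detail."""
    return sum(patch) / len(patch)

def mse(a, b):
    """Mean squared error between two equal-length lists of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)

# Two renderings of the same underlying patch (brightness 0.5),
# each with its own unpredictable per-pixel noise.
target = [0.5 + rng.uniform(-0.2, 0.2) for _ in range(256)]
guess = [0.5 + rng.uniform(-0.2, 0.2) for _ in range(256)]

# Data space: the guess is penalised for noise it could never have predicted.
pixel_loss = mse(guess, target)

# Representation space: only the abstract summaries are compared.
repr_loss = (encode(guess) - encode(target)) ** 2

print(f"pixel-space loss:          {pixel_loss:.5f}")
print(f"representation-space loss: {repr_loss:.5f}")
```

Both patches share the same structure, so the representation-space loss is close to zero, while the pixel-space loss stays large no matter how good the guess is.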
I-JEPA (Image JEPA) is the first published implementation. It operates on a single image divided into patches — the same patch-based approach used in Vision Transformers (ViT). The training procedure uses three learned components.
The Context Encoder processes the visible patches and produces a representation. The Target Encoder is an exponential moving average (EMA) copy of the context encoder; in the published I-JEPA method it processes the full image, and its outputs at the masked patch locations serve as the prediction targets. Its weights are not updated by backpropagation, only by the EMA rule. The Predictor takes the context representation plus positional queries for the masked locations and produces a predicted representation for each of them. The entire system is trained to minimise the L2 distance between the predictor's output and the target encoder's output.
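The training procedure can be sketched in a few lines of plain Python. This is a deliberately tiny illustration under stated assumptions, not the I-JEPA code: each patch is a single number, each encoder is a single scalar weight, the predictor is a two-parameter linear map, gradients are derived by hand, and the names `train_step`, `w_ctx`, `w_tgt`, `a`, `b` and the values `lr=0.05` and `momentum=0.996` are chosen for the example. In this sketch the target encoder reads the whole image and only its outputs at the masked positions are kept as targets.

```python
def encode(weight, patches):
    """Toy encoder: one scalar weight applied to each scalar patch."""
    return [weight * x for x in patches]

def train_step(params, image, masked_idx, lr=0.05, momentum=0.996):
    n = len(image)
    masked = set(masked_idx)
    visible_idx = [i for i in range(n) if i not in masked]

    # Context encoder: sees only visible patches, pooled into one summary.
    ctx_reps = encode(params["w_ctx"], [image[i] for i in visible_idx])
    ctx = sum(ctx_reps) / len(ctx_reps)

    # Target encoder: sees the full image; keep outputs at masked positions.
    tgt_all = encode(params["w_tgt"], image)
    targets = [tgt_all[j] for j in masked_idx]

    # Predictor: context summary plus a positional query per masked patch.
    queries = [j / n for j in masked_idx]
    preds = [params["a"] * ctx + params["b"] * q for q in queries]

    # L2 loss in representation space.
    errs = [p - t for p, t in zip(preds, targets)]
    m = len(errs)
    loss = sum(e * e for e in errs) / m

    # Hand-derived gradients; targets are constants (no gradient into w_tgt).
    mean_visible = sum(image[i] for i in visible_idx) / len(visible_idx)
    grad_a = sum(2 * e * ctx for e in errs) / m
    grad_b = sum(2 * e * q for e, q in zip(errs, queries)) / m
    grad_w_ctx = sum(2 * e * params["a"] * mean_visible for e in errs) / m

    new = dict(params)
    new["a"] -= lr * grad_a
    new["b"] -= lr * grad_b
    new["w_ctx"] -= lr * grad_w_ctx
    # EMA rule: the target weight drifts slowly toward the context weight.
    new["w_tgt"] = momentum * params["w_tgt"] + (1 - momentum) * new["w_ctx"]
    return new, loss

params = {"w_ctx": 1.0, "w_tgt": 1.0, "a": 0.5, "b": 0.0}
image = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]   # six scalar "patches"
params_1, loss_1 = train_step(params, image, masked_idx=[3, 4])
params_2, loss_2 = train_step(params_1, image, masked_idx=[3, 4])
print(loss_1, loss_2)
```

A real system replaces the scalar weights with Vision Transformers and the hand-derived gradients with automatic differentiation, but the data flow is the same: gradients reach the context encoder and the predictor, while the target encoder changes only through the EMA update.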
JEPA is not just a better image pre-training method. It is LeCun's proposed foundation for autonomous agents that can plan. The end goal is a system with a JEPA-based world model at its core: an internal simulator that can predict what the world will look like after a given action, in abstract representation space.
With such a model an agent can mentally simulate many possible futures before acting — picking the sequence of actions whose predicted outcome minimises a cost function. This is closer to human deliberate thought than to the reactive pattern-matching of current LLMs.
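Planning with a world model can be sketched as a search over action sequences. The example below is an assumption-laden toy, not LeCun's proposed system: `world_model` is a hand-written function standing in for a learned predictor, the state is a single number, the cost is squared distance to a goal summed along the imagined trajectory, and the search is exhaustive.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)

def world_model(state, action):
    """Hypothetical stand-in for a learned predictor of the next abstract state."""
    return state + 0.5 * action

def cost(state, goal):
    """Squared distance between a predicted state and the goal."""
    return (state - goal) ** 2

def plan(state, goal, horizon=3):
    """Imagine every action sequence and return the cheapest one with its cost."""
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for action in seq:
            s = world_model(s, action)   # simulate, do not act
            total += cost(s, goal)       # accumulate cost along the rollout
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost

best_seq, best_cost = plan(state=0.0, goal=1.0)
print(best_seq, best_cost)   # (1.0, 1.0, 0.0) 0.25
```

Exhaustive search only works here because the action set and horizon are tiny: the number of candidate sequences grows as the number of actions raised to the horizon, so a practical planner would need a cheaper search or optimisation method.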
LeCun calls the hierarchical version of this architecture “H-JEPA”: multiple JEPA modules stacked to operate at different time scales, from motor reflexes to long-horizon planning. Each higher level takes the representations of the level below as its input and makes more abstract predictions over a longer horizon.