
JEPA — A Different Path to AI

Joint Embedding Predictive Architecture: why Yann LeCun thinks next-token prediction is a dead end, and what he proposes instead.

Large language models are genuinely impressive at tasks that can be solved through pattern matching over text — translation, summarisation, code completion. But Yann LeCun argues that this entire class of model is missing something fundamental: a model of the world.

A child who has never read a physics textbook knows that a glass will fall if pushed off a table. They know this because they have built an internal model of physical reality through experience. No amount of text describes the felt continuity of the physical world, and next-token prediction over that text cannot learn it.

TASK | WHY THE GAP (LLM vs. HUMAN)
Completing a sentence | Pattern matching over training data
Translating language | Statistical co-occurrence
Catching a thrown ball | Requires a physical world model
Planning a route | Needs multi-step simulation
Understanding cause & effect | Causal reasoning, not correlation
Learning from a few examples | Humans generalise from ~20 examples

LeCun's framing
He calls these missing capabilities the “dark matter of intelligence”: the vast bulk of cognition that is intuitive, embodied, and causal, which language barely touches.

The key insight behind JEPA is to change where predictions are made. Generative models — whether image diffusion or language models — make predictions in data space: predict the next pixel, the next token. This forces the model to capture every irrelevant detail of the world.

JEPA instead trains a model to predict in representation space. Given a context region of an image or video, the model must predict the abstract representation of a masked region — not what the pixels look like, but what the underlying structure is. This is closer to how humans reason: you do not mentally reconstruct every photon when you imagine what is behind a closed door.

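[Figure: Generative vs. JEPA. Generative: the input (image / text) passes through a model (transformer) that outputs pixels / tokens, and the loss is computed against the target pixels / tokens in data space. JEPA: the input (image / text) passes through a context encoder and a predictor that works in representation space, and the loss is computed against the output of a target encoder (stop gradient) in representation space.]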

“The main idea of JEPA is to predict the representation of the target, not the target itself. This allows the model to be abstract and to discard irrelevant information.” (Yann LeCun)
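
The difference between the two loss placements can be made concrete with a toy example. The sketch below is not JEPA itself: the `encode` function is a hand-picked average-pooling stand-in for a learned encoder, and the image, noise level, and function names are all assumptions chosen for illustration. It shows only that a prediction which gets the structure right but cannot guess per-pixel noise is penalised much more in data space than in representation space.

```python
import numpy as np

def encode(image, block=4):
    """Toy encoder (an assumption, not a learned model):
    average-pool non-overlapping block x block regions."""
    h, w = image.shape
    return image.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

def data_space_loss(pred_pixels, target_pixels):
    """Mean squared error computed directly on pixels."""
    return float(np.mean((pred_pixels - target_pixels) ** 2))

def representation_space_loss(pred_rep, target_pixels):
    """Mean squared error computed on encoded targets."""
    return float(np.mean((pred_rep - encode(target_pixels)) ** 2))

rng = np.random.default_rng(0)

# Underlying structure: a 16x16 checkerboard made of 8x8 blocks.
structure = np.kron(np.array([[0.0, 1.0], [1.0, 0.0]]), np.ones((8, 8)))

# Observed target: the structure plus unpredictable per-pixel noise.
target = structure + 0.3 * rng.standard_normal(structure.shape)

# A predictor that knows the structure but cannot know the noise.
pixel_loss = data_space_loss(structure, target)
rep_loss = representation_space_loss(encode(structure), target)

print(f"loss in data space:           {pixel_loss:.4f}")
print(f"loss in representation space: {rep_loss:.4f}")
```

In data space the noise shows up in full in the loss, so a generative model is pushed to spend capacity on it. In this toy representation space most of the noise averages out, so the loss mainly reflects whether the structure was predicted correctly.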

I-JEPA (Image JEPA) is the first published implementation. It operates on a single image divided into patches — the same patch-based approach used in Vision Transformers (ViT). The training procedure uses three learned components.

The Context Encoder processes the visible patches and produces a representation. The Target Encoder is an exponential moving average (EMA) copy of the context encoder. In I-JEPA it encodes the full image, and its outputs at the masked locations serve as the prediction targets; its weights are not updated by backpropagation, only by the EMA rule. The Predictor takes the context representation plus positional queries for the masked locations and produces a predicted representation for each one. The entire system is trained to minimise the L2 distance between the predictor's output and the target encoder's output.
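
The forward pass described above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the published implementation: the encoders and predictor are single linear maps rather than Vision Transformers, the context is mean-pooled before prediction, and every name and size (`W_context`, `W_target`, `W_pred`, `pos_embed`, `jepa_loss`, the 16-patch image) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, rep_dim = 16, 48, 32

# Toy linear stand-ins for the three components (assumptions).
W_context = 0.1 * rng.standard_normal((patch_dim, rep_dim))  # context encoder
W_target = W_context.copy()                                  # target encoder starts as a copy
W_pred = 0.1 * rng.standard_normal((rep_dim, rep_dim))       # predictor
pos_embed = 0.1 * rng.standard_normal((num_patches, rep_dim))

# One image as a sequence of flattened patches, split into context and target.
patches = rng.standard_normal((num_patches, patch_dim))
target_idx = np.array([5, 6, 9, 10])                          # masked block
context_idx = np.setdiff1d(np.arange(num_patches), target_idx)

def jepa_loss(patches, context_idx, target_idx):
    # Context representation s_x from the visible patches only.
    s_x = patches[context_idx] @ W_context + pos_embed[context_idx]

    # Target representations s_y at the masked locations.
    # In training these are treated as constants (no gradient flows here).
    s_y = (patches @ W_target)[target_idx]

    # Predictor: pooled context plus a positional query per masked location.
    queries = s_x.mean(axis=0) + pos_embed[target_idx]
    s_y_hat = queries @ W_pred

    # Average squared L2 distance between predicted and target representations.
    return float(np.mean(np.sum((s_y_hat - s_y) ** 2, axis=1)))

loss = jepa_loss(patches, context_idx, target_idx)
print(f"toy JEPA loss: {loss:.4f}")
```

In a real training loop an autodiff framework would backpropagate this loss into the context encoder and the predictor only; the target encoder would be refreshed separately by the EMA rule.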

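[Figure: I-JEPA training. The context passes through the Context Encoder (ViT backbone) to give the context representation s_x. The target passes through the Target Encoder (EMA copy, no gradient) to give the target representation s_y. The Predictor (a narrow transformer) takes s_x and a target position query and outputs the predicted representation ŝ_y, which is compared with s_y using an L2 loss.]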

Why EMA for the target encoder?
If both encoders are updated by the same gradient signal, the model can collapse: both encoders learn to output a constant regardless of input. Using a slowly drifting copy (EMA) as the target stabilises training without needing negative samples or a contrastive loss.
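
The EMA rule itself is a one-line update per parameter. The sketch below assumes parameters are stored as a dictionary of NumPy arrays and uses a momentum of 0.996; both the storage format and the momentum value are illustrative assumptions, and `ema_update` is a hypothetical helper name.

```python
import numpy as np

def ema_update(target_params, context_params, momentum=0.996):
    """Move each target parameter a small step towards the context parameter.

    new_target = momentum * old_target + (1 - momentum) * context
    """
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * context_params[name]
        for name in target_params
    }

# Toy parameters: the context encoder has moved to 1.0, the target is still at 0.0.
context_params = {"w": np.ones(3)}
target_params = {"w": np.zeros(3)}

one_step = ema_update(target_params, context_params)

# Repeating the update lets the target drift slowly towards the context weights.
drifted = target_params
for _ in range(1000):
    drifted = ema_update(drifted, context_params)

print(one_step["w"], drifted["w"])
```

Because the target moves only a fraction of the way at each step, it changes far more slowly than the context encoder, which is what makes it a stable prediction target.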

JEPA is not just a better image pre-training method. It is LeCun's proposed foundation for autonomous agents that can plan. The end goal is a system with a JEPA-based world model at its core: an internal simulator that can predict what the world will look like after a given action, in abstract representation space.

With such a model an agent can mentally simulate many possible futures before acting — picking the sequence of actions whose predicted outcome minimises a cost function. This is closer to human deliberate thought than to the reactive pattern-matching of current LLMs.
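
One simple way to realise "simulate many futures and pick the cheapest" is random-shooting planning. The sketch below is an assumption-laden illustration rather than LeCun's proposed method: `world_model` is a hand-written stand-in for a learned JEPA predictor, the state is a 2-D position instead of an abstract representation, and `cost`, `plan`, and all parameter values are hypothetical.

```python
import numpy as np

def world_model(state, action):
    """Hypothetical world model: predicts the next state from state and action.
    Here the action is simply a displacement added to the state."""
    return state + action

def cost(state, goal):
    """Squared distance between a (predicted) state and the goal."""
    return float(np.sum((state - goal) ** 2))

def plan(state, goal, horizon=5, num_candidates=256, seed=0):
    """Random shooting: sample action sequences, roll each one out in the
    world model, and keep the sequence with the lowest predicted final cost."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, state.shape[0]))
    best_cost, best_actions = np.inf, None
    for actions in candidates:
        predicted = state
        for action in actions:  # imagined rollout, nothing is executed
            predicted = world_model(predicted, action)
        candidate_cost = cost(predicted, goal)
        if candidate_cost < best_cost:
            best_cost, best_actions = candidate_cost, actions
    return best_actions, best_cost

start = np.zeros(2)
goal = np.array([2.0, -1.0])
actions, predicted_cost = plan(start, goal)

print("first action to execute:", actions[0])
print("predicted final cost:", predicted_cost)
```

An agent would typically execute only the first action, observe the new state, and plan again, which matches the repeating loop of predicting, scoring, and acting described above.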

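[Figure: Planning loop. The current state s₀ is fed to the JEPA world model, which produces a predicted state ŝ₁. A cost function scores the predicted state against the goal, the policy selects an action a₀, and a new observation arrives; the loop repeats.]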


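LeCun describes the hierarchical version of this architecture as “H-JEPA”: multiple JEPA modules stacked to operate at different time scales, from motor reflexes to long-horizon planning. Lower levels make short-term predictions over detailed representations, while each higher level builds on the representations of the level below to make more abstract, longer-term predictions.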

Where things stand
As of 2024 the video and hierarchical variants are active research. I-JEPA has been demonstrated on ImageNet-scale benchmarks. The full autonomous planning vision remains a research goal rather than a deployed system.