
How GPTs Understand Language

A visual guide to embeddings, vectors, and tokens — the core ideas powering large language models.

Computers don't read words the way people do. Instead, each word is converted into a set of numbers — called an embedding — and placed in a mathematical space. Words with similar meanings end up close together in that space, purely because they tend to appear in similar contexts across billions of sentences.

The diagram below is a simplified 2D slice of that space. Notice how “king”, “queen”, and “prince” cluster together, while “cat”, “dog”, and “bird” form a completely separate group. No one programmed these categories — the model learned them from patterns in text.

[Diagram: 2D projection of word embeddings in four clusters. Royalty: king, queen, prince, royal. Animals: cat, dog, bird, fish. Emotions: happy, joy, sad, angry. Technology: code, data, model, neural.]
Note: Real embeddings exist in hundreds or thousands of dimensions; what you see here is a simplified 2D projection.

A vector is simply a list of numbers. Every word gets assigned one — typically hundreds or thousands of numbers long. Each number captures a slightly different aspect of the word's meaning and usage patterns. The exact meaning of each dimension isn't human-readable, but the pattern is what matters.

The table below shows a simplified 6-dimension version. Notice that “king” and “queen” have nearly identical values across all dimensions, while “cat” looks completely different. “Happy” and “sad” are close on D1 to D3 but flip sign on D4 to D6, most sharply on D4: the model has learned that both words relate to emotion, but with opposite polarity.

WORD      D1      D2      D3      D4      D5      D6
king     0.82    0.90   -0.12    0.44    0.67   -0.21
queen    0.79    0.87   -0.10    0.41    0.64   -0.18
cat     -0.71   -0.64    0.82   -0.31    0.12    0.56
dog     -0.68   -0.70    0.79   -0.29    0.11    0.57
happy   -0.19    0.31    0.10    0.91   -0.41    0.14
sad     -0.17    0.26    0.08   -0.87    0.37   -0.13

Real embeddings have 768 to 3,072 dimensions.
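One common way to turn "close together" into a number is cosine similarity, which compares the direction of two vectors. The sketch below applies it to the six toy vectors from the table above. The function names `cosine_similarity` and `most_similar` are illustrative choices, not part of any particular library, and real systems would run this over far longer vectors.

```python
import math

# Toy 6-dimensional vectors copied from the table above.
EMBEDDINGS = {
    "king":  [0.82, 0.90, -0.12, 0.44, 0.67, -0.21],
    "queen": [0.79, 0.87, -0.10, 0.41, 0.64, -0.18],
    "cat":   [-0.71, -0.64, 0.82, -0.31, 0.12, 0.56],
    "dog":   [-0.68, -0.70, 0.79, -0.29, 0.11, 0.57],
    "happy": [-0.19, 0.31, 0.10, 0.91, -0.41, 0.14],
    "sad":   [-0.17, 0.26, 0.08, -0.87, 0.37, -0.13],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word):
    """Return the other word whose vector points in the most similar direction."""
    return max(
        (other for other in EMBEDDINGS if other != word),
        key=lambda other: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[other]),
    )


if __name__ == "__main__":
    for word in ("king", "cat"):
        print(word, "->", most_similar(word))
```

With these toy values, “king” pairs with “queen” and “cat” pairs with “dog”. “Happy” and “sad” come out with a negative similarity, because the sign flips on D4 to D6 outweigh the agreement on D1 to D3.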

This is why the well-known analogy works, at least approximately: king − man + woman ≈ queen. The vectors for “man” and “woman” differ mainly along a direction associated with gender, so subtracting one and adding the other shifts “king” into the region of the space where “queen” sits.
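The arithmetic can be tried on a small scale. The sketch below uses hand-picked 2-dimensional vectors that are purely hypothetical (they are not taken from any real model or from the table above): the first number loosely stands for "royalty" and the second for "gender". The helper name `analogy` is also an assumption made for illustration.

```python
import math

# Hypothetical 2-D vectors chosen by hand for illustration only.
# Dimension 0 loosely means "royalty", dimension 1 loosely means "gender".
TOY_VECTORS = {
    "king":   [0.95, 0.85],
    "queen":  [0.93, -0.90],
    "prince": [0.85, 0.80],
    "man":    [0.10, 0.90],
    "woman":  [0.12, -0.88],
    "cat":    [-0.80, 0.05],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates, which is the
    usual convention when evaluating word analogies.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


if __name__ == "__main__":
    print(analogy("king", "man", "woman"))
```

The result is only the nearest remaining word, not an exact match, which is why the formula is written with ≈ rather than =.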

Before a model can turn words into vectors, it first breaks text into tokens. A token isn't always a full word — common short words are usually a single token, but longer or unusual words get split into recognisable subword pieces.

Punctuation marks are usually their own tokens. This approach lets the model handle any word it has never seen before by combining familiar pieces, similar to how you might sound out an unfamiliar word syllable by syllable.

Sentence: The cat sat on the mat.
Tokens: [The] [cat] [sat] [on] [the] [mat] [.]

7 tokens. GPT-4's tokenizer has a vocabulary of roughly 100,000 tokens built with Byte Pair Encoding (BPE), which lets it represent any word by combining subword pieces.
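The idea of splitting text into words, punctuation, and subword pieces can be sketched in a few lines. Note the simplifications: this is a greedy longest-match splitter over a tiny hand-written vocabulary, not real BPE, which learns its pieces by repeatedly merging frequent pairs in a large corpus. `TOY_VOCAB`, `split_word`, and `tokenize` are hypothetical names used only for this example.

```python
import re

# Hypothetical toy vocabulary of known pieces. A real BPE vocabulary is
# learned from data and is vastly larger.
TOY_VOCAB = {"the", "cat", "sat", "on", "mat", "token", "iz", "ation"}


def split_word(word, vocab=TOY_VOCAB):
    """Greedily split one chunk into the longest pieces found in vocab.

    Anything not covered by the vocabulary falls back to single characters,
    so every input can be tokenized.
    """
    pieces = []
    lower = word.lower()
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(word), start, -1):
            if lower[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


def tokenize(text):
    """Split text into word and punctuation chunks, then into subword pieces."""
    tokens = []
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        tokens.extend(split_word(chunk))
    return tokens


if __name__ == "__main__":
    print(tokenize("The cat sat on the mat."))
    print(tokenize("tokenization"))
```

With this toy vocabulary the example sentence yields the same seven tokens as above, while the longer word “tokenization” is split into the pieces “token”, “iz”, and “ation”.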

Embeddings give every word a fixed vector, but the same word can mean different things in different sentences. Attention is the mechanism that resolves this. When the model processes each token, it looks at the other tokens in the sequence (in GPT-style models, the current token and those before it) and assigns a weight to each one, deciding how much that token should influence its understanding of the current position.

The result is a context-aware representation. “bank” next to “river” ends up with a very different internal representation than “bank” next to “money”, even though they start from the same embedding.

[Interactive figure: attention pattern for the sentence “The river bank was steep”.]
In practice a transformer runs many attention heads in parallel, each one free to focus on different relationships. One head might specialise in pronouns, another in verb–subject agreement. The outputs are combined, giving the model a rich, multi-angle view of every token.
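The weighting step described above is commonly implemented as scaled dot-product attention. The sketch below shows a single head in plain Python under several simplifying assumptions: the 2-dimensional token vectors are made up for illustration, the same vectors are used directly as queries, keys, and values (a real transformer first applies learned projection matrices), and no causal mask is applied.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Each output is the weighted average of all value vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(row, values)) for i in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical toy vectors for the example sentence.
TOKENS = ["The", "river", "bank", "was", "steep"]
VECTORS = [[0.1, 0.1], [1.0, 0.2], [0.8, 0.6], [0.1, 0.0], [0.9, 0.1]]

if __name__ == "__main__":
    _, weights = attention(VECTORS, VECTORS, VECTORS)
    bank_row = weights[TOKENS.index("bank")]
    for token, weight in zip(TOKENS, bank_row):
        print(f"bank -> {token}: {weight:.2f}")
```

A multi-head layer would run several copies of `attention`, each with its own projections, and combine their outputs.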

Among all the tokens in a model's vocabulary there is one special entry: <EOS> — End of Sequence. It has no visible text. Its only job is to signal that the response is complete.

At every step of generation the model produces a probability for every token in its vocabulary, including <EOS>. While the response is still mid-thought that probability stays very low. As the answer reaches a natural conclusion — a full sentence, a closing punctuation mark — the probability of <EOS> rises sharply. When the model samples that token, generation stops and nothing more is added.

[Interactive figure: P(<EOS>) after each generated token. After the first token, “The”, it is 1%.]
Why this matters: Without EOS the model would have no principled way to stop. It would continue predicting indefinitely. The token gives generation a learned stopping condition rather than a hard-coded character limit.
EOS is not added by a rule — the model learns when to predict it from training data. Short answers, long essays, code blocks — the model saw them all end with EOS and learned the pattern.
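The stopping behaviour can be illustrated with a minimal sampling loop. Everything model-related here is a stand-in: `toy_next_token_probs` is a hypothetical function that fakes a language model with a fixed script and made-up probabilities, chosen so that P(<EOS>) is low mid-sentence and high after a full stop. The `max_tokens` cap is a safety net added for this sketch, since sampling <EOS> is probabilistic.

```python
import random

EOS = "<EOS>"


def toy_next_token_probs(tokens):
    """Hypothetical stand-in for a language model.

    Returns a {token: probability} dict. P(<EOS>) is low mid-sentence
    and high right after a full stop.
    """
    script = ["The", "cat", "sat", "."]
    if tokens and tokens[-1] == ".":
        return {EOS: 0.95, "The": 0.05}
    next_word = script[len(tokens) % len(script)]
    return {next_word: 0.99, EOS: 0.01}


def generate(next_token_probs, max_tokens=50, seed=0):
    """Sample one token at a time until <EOS> is drawn or the cap is reached."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)
        choice = rng.choices(list(probs), weights=list(probs.values()))[0]
        if choice == EOS:
            break  # the model signalled that the response is complete
        tokens.append(choice)
    return tokens


if __name__ == "__main__":
    print(" ".join(generate(toy_next_token_probs)))
```

The <EOS> token itself is never appended to the output, which matches the description above: it has no visible text and only ends generation.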