01 · From Text to Probabilities

GENE 46100 — Unit 01

Haky Im, based on 3Blue1Brown’s What is a GPT

2026-04-12

Where we are

  • slides-00 — LLMs from the outside: chatbot loop, training, attention in one sentence
  • This deck — inside the transformer: how text becomes numbers, what those numbers mean, and how probabilities emerge
  • slides-02 — attention mechanism and full architecture in code

The transformer pipeline

Before diving in: here is the full flow from text to prediction; a short code sketch of these steps follows the list.

  1. Text → tokens
  2. Tokens → vectors via W_E
  3. Vectors pass through Attention → MLP blocks, repeated
  4. Final vector → scores via W_U
  5. Scores → probabilities via softmax
  6. Sample next token
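
A minimal NumPy sketch of these six steps, with toy sizes and random weights standing in for a trained model (all names and dimensions here are illustrative, not GPT-3's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; GPT-3 uses vocab_size = 50,257 and d_model = 12,288
vocab_size, d_model, n_tokens = 1_000, 64, 5

token_ids = rng.integers(0, vocab_size, size=n_tokens)  # 1. text -> tokens (tokenization assumed done)
W_E = rng.normal(size=(vocab_size, d_model))             # embedding matrix (random here, learned in reality)
x = W_E[token_ids]                                       # 2. tokens -> vectors, shape (5, 64)

# 3. attention and MLP blocks would repeatedly update x here; its shape stays (5, 64)

W_U = rng.normal(size=(vocab_size, d_model))             # unembedding matrix (random here)
logits = W_U @ x[-1]                                     # 4. last vector -> one score per vocabulary token

probs = np.exp(logits - logits.max())                    # 5. softmax turns scores ...
probs /= probs.sum()                                     #    ... into probabilities
next_token = rng.choice(vocab_size, p=probs)             # 6. sample the next token
print(next_token)
```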

Part 1: Tokens

Text breaks into chunks: tokens

Before any math, text is split into tokens — the atomic units the model works with.

Each token maps to a vector — a list of numbers. This is what the model actually computes with.

Tokens become arrays of numbers

Each token's vector starts as a pure lookup (same token = same starting vector); context changes it only later, inside the network.
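
To make "tokens" concrete, here is one off-the-shelf tokenizer, OpenAI's tiktoken with the GPT-2 byte-pair encoding (any BPE tokenizer would illustrate the same idea):

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")           # GPT-2's byte-pair-encoding vocabulary (GPT-3 uses essentially the same one)

ids = enc.encode("Transformers turn text into numbers.")
print(ids)                                    # one integer ID per token
print([enc.decode([i]) for i in ids])         # the text chunk each ID stands for
```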

The embedding matrix W_E

The first thing a transformer does: look up each token in the embedding matrix W_E.

GPT-3: vocabulary = 50,257 tokens × 12,288 dimensions → 617 million weights just in this one matrix.
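
The lookup itself is plain indexing, and the parameter count follows directly from the shape; a small check using the figures above:

```python
import numpy as np

vocab_size, d_model = 50_257, 12_288             # GPT-3's embedding-matrix shape
print(vocab_size * d_model)                      # 617,558,016 weights in W_E alone

# The lookup is row selection, shown with a tiny toy matrix (tokens stored as rows here):
W_E = np.arange(12, dtype=float).reshape(4, 3)   # 4-token vocabulary, 3-dimensional embeddings
token_id = 2
print(W_E[token_id])                             # the embedding vector for token 2: [6. 7. 8.]
```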

Part 2: Semantic Geometry

Similar meanings cluster in space

Word vectors are not random. Words with similar meanings end up close together in the high-dimensional space.

Directions encode meaning

The difference between “man” and “woman” vectors is similar to the difference between “king” and “queen.”

There is a gender direction in embedding space — not programmed in, learned from text.

Semantic directions are consistent

The direction that encodes gender holds across many word pairs: it is a stable geometric feature of the space.

Geography works too

The difference between country and capital vectors is consistent across examples.

Subtracting “Germany” from “Italy” and adding the result to “Hitler” lands near “Mussolini.” Meaning lives in geometry.
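
One way to try this yourself is with off-the-shelf word vectors; the sketch below uses pretrained GloVe embeddings via gensim as a stand-in for a transformer's embedding space (the exact neighbours depend on the vectors, so treat the output as illustrative):

```python
# Requires: pip install gensim (downloads pretrained GloVe vectors on first run)
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")      # one common set of pretrained word embeddings

# "king" - "man" + "woman": the nearest neighbours should include "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# country/capital analogy: "paris" - "france" + "italy" should land near "rome"
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```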

Part 3: Dot Products & Context

Dot product = measure of alignment

The dot product is how the model asks: how similar are these two vectors?

  • Positive → vectors point in the same direction (similar)
  • Zero → perpendicular (unrelated)
  • Negative → opposite directions (different)
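
In code, that is all the dot product is (vectors chosen to illustrate the three cases):

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])

print(np.dot(a, np.array([2.0, 4.0, 0.0])))    #  10.0 -> same direction (similar)
print(np.dot(a, np.array([-2.0, 1.0, 5.0])))   #   0.0 -> perpendicular (unrelated)
print(np.dot(a, np.array([-1.0, -2.0, 0.0])))  #  -5.0 -> opposite direction (different)
```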

Dot products detect semantic directions

The plurality direction plur = E(cats) − E(cat) captures what “plural” means geometrically.

Octopodes / octopi / octopuses score high (plural); “one” scores negative (singular).

The same principle drives attention: the dot product of a query with a key measures relevance.
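
The same experiment can be run on pretrained word vectors (again GloVe via gensim as a stand-in; the exact scores will differ from the video's, but plural forms should score higher than singular ones):

```python
# Requires: pip install gensim (downloads pretrained GloVe vectors on first run)
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

plur = wv["cats"] - wv["cat"]                   # a candidate "plurality" direction

for word in ["dogs", "houses", "dog", "house", "one"]:
    print(word, float(np.dot(wv[word], plur)))  # plural words should score higher than singular ones
```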

Context window: what the model sees

The model processes a fixed window of tokens at once.

  • GPT-3 context = 2,048 tokens
  • Modern models extend this to 100k+
  • Everything outside the window is invisible to the model
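
In practice the window is just a truncation of the token-ID sequence (toy sketch):

```python
context_length = 2_048                     # GPT-3's context window

token_ids = list(range(5_000))             # pretend: 5,000 tokens of conversation history
visible = token_ids[-context_length:]      # only the most recent 2,048 are fed to the model

print(len(visible))                        # 2048; tokens 0..2951 are invisible to the model
```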

Same word, different meaning

Initial vectors encode the word alone. As vectors flow through the network, they absorb context.

  • Same word → same starting vector
  • Context flows through the network, updating each vector
  • “Snape” near potions → shifts toward one region of meaning
  • “Snape” near betrayal → shifts toward another
  • Same starting point, different destination

Part 4: Unembedding & Output

Getting back to words: W_U

After all transformer layers, the final vector must become a probability distribution over the vocabulary.

The unembedding matrix W_U maps the final vector back to one score per vocabulary token. It has the same shape as W_E, transposed, adding another ~617 million weights.

Logits: the raw scores

The output of W_U is a vector of 50,257 raw scores, one per token in the vocabulary.

These scores are called logits — they can be any real number: negative, large, small.

They are not yet probabilities. That is softmax’s job.
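
A toy sketch of the unembedding step, with a random W_U and a made-up final vector (tokens stored as rows, and a small d_model to keep it light):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50_257, 128            # real GPT-3: d_model = 12,288

W_U = rng.normal(size=(vocab_size, d_model)) # unembedding: one row of weights per vocabulary token
final_vector = rng.normal(size=d_model)      # the last position's vector after all transformer layers

logits = W_U @ final_vector                  # one raw score per vocabulary token
print(logits.shape)                          # (50257,)
print(logits.min(), logits.max())            # arbitrary real numbers: not probabilities yet
```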

Softmax: from scores to probabilities

Raw scores (logits) can be any number. Softmax converts them into a valid probability distribution.

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

  1. \(e^{x_i}\) — forces all values positive
  2. \(\sum_j e^{x_j}\) — normalisation constant
  3. Divide → outputs sum to 1
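
A direct implementation (subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(x):
    """Turn a vector of raw scores (logits) into probabilities that sum to 1."""
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, -3.0]))
print(probs)          # every entry is positive
print(probs.sum())    # 1.0
```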

How softmax behaves

Large differences between scores → one token dominates.

Small differences → distribution spreads out.
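
Two quick cases make this concrete (the softmax definition is repeated so the snippet runs on its own):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([10.0, 1.0, 0.0])))   # large gaps -> ~[0.9998, 0.0001, 0.0000]: one token dominates
print(softmax(np.array([1.2, 1.0, 0.9])))    # small gaps -> ~[0.39, 0.32, 0.29]: spread out
```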

Temperature: dialing confidence up or down

Divide logits by temperature \(T\) before softmax:

  • \(T < 1\): sharper, model is more decisive
  • \(T = 1\): default
  • \(T > 1\): flatter, more random, more creative
  • \(T \to 0\): always picks the top token (greedy)
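
Temperature is a one-line change before softmax; a sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])

for T in [0.5, 1.0, 2.0]:
    print(T, softmax(logits / T))   # T < 1 sharpens the distribution, T > 1 flattens it
```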

The full prediction pipeline

Token → W_E → transformer layers → W_U → softmax → sample next token.

Each pass produces one token. Repeat to generate text.
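
The loop itself is short. In the sketch below the model is replaced by a random stand-in so it runs on its own; a real implementation would compute the probabilities from the token IDs:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 1_000

def next_token_probs(token_ids):
    """Stand-in for the real model: W_E -> transformer layers -> W_U -> softmax."""
    logits = rng.normal(size=VOCAB_SIZE)     # a trained model would compute these from token_ids
    e = np.exp(logits - logits.max())
    return e / e.sum()

tokens = [42, 7, 256]                        # a prompt, already tokenized (IDs are made up)

for _ in range(10):                          # generate 10 more tokens
    probs = next_token_probs(tokens)         # one forward pass -> distribution over the next token
    tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))   # sample, append, repeat

print(tokens)
```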

Part 5: The Attention Block

What attention does

We have vectors. We have dot products. Attention uses both: each token’s vector queries the other tokens’ keys to decide how much attention to pay to each of them.

Vectors update each other based on context. “model” in “a machine learning model” vs “a fashion model” — the Attention Block shifts the vector to reflect which meaning applies.
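
A preview in code, with random matrices in place of learned projections (the full query/key/value machinery, including values and the causal mask, is the subject of the next deck):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 64, 16

x = rng.normal(size=(n_tokens, d_model))         # one vector per token in the context

W_Q = rng.normal(size=(d_model, d_head))         # learned projections in a real model; random here
W_K = rng.normal(size=(d_model, d_head))

Q, K = x @ W_Q, x @ W_K                          # each token gets a query and a key
scores = Q @ K.T / np.sqrt(d_head)               # dot products: how relevant is token j to token i?

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row -> attention weights

print(weights.shape)                             # (5, 5): each token's attention over every token
print(weights.sum(axis=-1))                      # each row sums to 1
```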

Multiple heads, multiple relationships

Running many attention heads in parallel lets the model track different types of relationships simultaneously — syntactic, semantic, positional.

Each head learns a different kind of relevance: one might track subject–verb agreement, another coreference, another position.

The MLP: asking questions about each vector

After attention, each vector passes independently through a Multilayer Perceptron (MLP), which in effect asks it a long list of questions and updates it based on the answers.

This is where factual associations are thought to be stored.
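
A rough sketch of that per-token MLP, with random weights and a ReLU in place of the GELU real models typically use (the residual add at the end is part of the surrounding block):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256                       # transformers typically use d_hidden ≈ 4 × d_model

W_in, b_in = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W_out, b_out = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def mlp(v):
    """Applied to each token's vector independently; no mixing between positions."""
    h = np.maximum(0.0, v @ W_in + b_in)          # "ask questions": which hidden features activate?
    return v + (h @ W_out + b_out)                # write the answers back into the vector (residual add)

v = rng.normal(size=d_model)
print(mlp(v).shape)                               # (64,): same shape in, same shape out
```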

Key Takeaways

  • Text → tokens → integer IDs → vectors via W_E (the embedding matrix)
  • Embedding space has geometric structure: similar meanings cluster, directions encode properties
  • Dot products measure vector similarity — the fundamental operation throughout
  • Initial vectors encode the word alone; transformer layers give them context
  • W_U maps the final vector back to vocabulary scores (logits); softmax converts to probabilities
  • Temperature controls how peaked or flat the output distribution is
  • Transformer = [Attention → MLP] × N — attention communicates, MLP processes
  • Next deck: the full Q/K/V math behind attention