Seven rungs. Each adds exactly one idea to the one before it. The same loss-minimising machine, from a line all the way to a word.
Two knobs: w (slope) and b (intercept). They define a line. Drag the sliders to find the line closest to the four yellow dots.
```python
# ŷ = wx + b
# L = (1/N) Σ (ŷ − y)²
# dL/dw = (2/N) Σ (ŷ − y) · x
# dL/db = (2/N) Σ (ŷ − y)

data = [(1, 0), (2, 1), (3, 1), (4, 0)]  # (x, y) pairs

def train_line_step(w, b, lr=0.05):
    dw, db = 0.0, 0.0
    N = len(data)
    for x, y in data:
        err = (w * x + b) - y      # prediction minus target
        dw += (2 / N) * err * x
        db += (2 / N) * err
    return w - lr * dw, b - lr * db
```
For any linear regression, the optimal weights can be computed in a single formula known since Legendre (1805) and Gauss (1809):
w* = (XᵀX)⁻¹ Xᵀy
One matrix inversion, one multiplication. Exact answer, no iteration. (X carries a column of 1s next to the x values, so b comes out as part of w*.) For our 4 points it gives w = 0, b = 0.5 (the horizontal line at the mean of y) with MSE = 0.25: the same place GD plateaus, just instantly.
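As a rough check on those numbers: with one feature plus an intercept, the normal equation reduces to a 2×2 linear system that can be solved directly. The sketch below assumes the four points from the Stage 1 code. The helper names (`fit_line_closed_form`, `mse`, `gd_step`) are illustrative and not part of the page's code; `gd_step` mirrors the update in `train_line_step`.

```python
# The four (x, y) pairs used in Stage 1.
data = [(1, 0), (2, 1), (3, 1), (4, 0)]

def fit_line_closed_form(points):
    """Solve the 2x2 normal equations for y ≈ w*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    det = n * sxx - sx * sx              # determinant of XᵀX
    w = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b

def mse(w, b, points):
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)

def gd_step(w, b, points, lr=0.05):
    """One plain gradient-descent step on the MSE."""
    n = len(points)
    dw = sum((2 / n) * (w * x + b - y) * x for x, y in points)
    db = sum((2 / n) * (w * x + b - y) for x, y in points)
    return w - lr * dw, b - lr * db

w_star, b_star = fit_line_closed_form(data)

w_gd, b_gd = 1.0, 0.0                    # arbitrary starting point
for _ in range(5000):
    w_gd, b_gd = gd_step(w_gd, b_gd, data)

print("closed form:", w_star, b_star, mse(w_star, b_star, data))
print("after GD:   ", w_gd, b_gd, mse(w_gd, b_gd, data))
```

Both routes should land on the same line; the closed form gets there in one solve, while GD needs many small steps.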
So why use GD here? Two reasons:
1. It scales. The closed form needs (XᵀX)⁻¹ — an O(d³) matrix inversion. For datasets with millions of features this is infeasible. GD does O(nd) per step.
2. It generalises. Past stage 1, there is no closed form. The moment we add a sigmoid (stage 2) the loss isn't quadratic anymore and the normal equation breaks. Gradient descent is the only general-purpose tool that scales from a line all the way to a transformer.
Other algorithms exist too — SGD (mini-batch), Adam (adaptive lr per parameter), Newton's method (uses 2nd derivatives), L-BFGS, conjugate gradient. All variations on "follow the gradient downhill". We use plain GD here because it's the simplest version of the framework.
σ(z) takes any real number and squashes it into the range (0, 1). Apply it to your line from before. The line bends, but it's still monotonic — it only goes one way.
```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

# ŷ = σ(z),  z = wx + b
# σ'(z) = σ(z) · (1 − σ(z))            ← the chain-rule term
# dL/dw = (2/N) Σ (ŷ − y) · σ'(z) · x
# dL/db = (2/N) Σ (ŷ − y) · σ'(z)

def train_bend_step(w, b, lr=1.5):
    dw, db = 0.0, 0.0
    N = len(data)                # same (x, y) pairs as Stage 1
    for x, y in data:
        z = w * x + b
        yh = sigmoid(z)
        err = yh - y
        dp = yh * (1 - yh)       # σ'(z)
        dw += (2 / N) * err * dp * x
        db += (2 / N) * err * dp
    return w - lr * dw, b - lr * db
```
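One way to gain confidence in the chain-rule gradients above is to compare them with a finite-difference estimate of the same loss. This is a minimal sketch, assuming the four Stage 1 points; `loss`, `analytic_grad`, `numeric_grad`, and the test point `(w0, b0)` are names and values introduced here for illustration.

```python
from math import exp

data = [(1, 0), (2, 1), (3, 1), (4, 0)]  # same points as Stage 1

def sigmoid(z):
    return 1 / (1 + exp(-z))

def loss(w, b):
    """Mean squared error of ŷ = σ(wx + b)."""
    return sum((sigmoid(w * x + b) - y) ** 2 for x, y in data) / len(data)

def analytic_grad(w, b):
    """Gradients from the chain rule: (2/N) Σ (ŷ − y) · σ'(z) · [x or 1]."""
    n = len(data)
    dw = db = 0.0
    for x, y in data:
        yh = sigmoid(w * x + b)
        common = (2 / n) * (yh - y) * yh * (1 - yh)
        dw += common * x
        db += common
    return dw, db

def numeric_grad(w, b, eps=1e-6):
    """Central finite differences on the loss itself."""
    dw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    db = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return dw, db

w0, b0 = 0.3, -0.2                        # arbitrary test point
a_dw, a_db = analytic_grad(w0, b0)
n_dw, n_db = numeric_grad(w0, b0)
print("dw gap:", abs(a_dw - n_dw))
print("db gap:", abs(a_db - n_db))
```

If the derivation is right, both gaps should be tiny compared with the gradients themselves.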
The sigmoid saturates at both ends — for large |z| the curve flattens, and its derivative collapses toward zero. That breaks training in two ways.
1. Capped gradient. The derivative is σ'(z) = σ(z)·(1−σ(z)). Both factors live in (0, 1) and sum to 1, so their product is biggest when they're equal — at σ(z) = 0.5, which is at z = 0:
| z | σ(z) at +z / −z | σ'(z) |
|---|---|---|
| 0 | 0.500 | 0.250 ← max |
| ±2 | 0.881 / 0.119 | 0.105 |
| ±5 | 0.993 / 0.007 | 0.007 |
| ±10 | ≈ 1 / ≈ 0 | 0.00005 |
No matter what σ sees, its gradient is ≤ 0.25. That ceiling is the problem.
2. Vanishing gradient. Backprop through N layers multiplies N of those derivatives. For a 10-layer sigmoid network, the best-case product is 0.25¹⁰ ≈ 10⁻⁶. The signal arriving at the first layer is rounding-error small — early layers can't learn.
The fix: ReLU. ReLU(z) = max(0, z). For positive z, the derivative is exactly 1. No saturation, no vanishing — through 10 layers the signal is still at full strength.
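The figures in this section can be reproduced in a few lines. The sketch below counts only the activation derivatives along a path and ignores the weights, which is the same simplification the text makes. The names `sigmoid_best_case` and `relu_active_path` are introduced here for illustration.

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Saturation: the derivative shrinks quickly as |z| grows.
for z in (0, 2, 5, 10):
    print(z, round(sigmoid(z), 3), round(sigmoid_grad(z), 5))

depth = 10
sigmoid_best_case = sigmoid_grad(0.0) ** depth   # every layer sitting exactly at z = 0
relu_active_path = relu_grad(1.0) ** depth       # every layer with z > 0
print(sigmoid_best_case, relu_active_path)
```

Even in the sigmoid's best case the product is about a millionth, while an active ReLU path passes the signal through unchanged.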
Production today: large language models use smooth ReLU variants, for example GELU in GPT-2 and GPT-3 and SwiGLU in Llama. We use σ on this page because the bend is visually intuitive and the math is small enough to do by hand.
One neuron rises, the other catches up. Their difference makes a bump. This is a multi-layer perceptron — the workhorse of every modern model.
```python
from math import exp

# Two activations available; pick one.
def sigmoid(z):
    return 1 / (1 + exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

act, act_grad = sigmoid, sigmoid_grad   # or: relu, relu_grad

# ŷ = act(z₁) − act(z₂),  z_k = w_k·x + b_k
# dL/dw₁ =  (2/N) Σ (ŷ − y) · act_grad(z₁) · x
# dL/dw₂ = −(2/N) Σ (ŷ − y) · act_grad(z₂) · x   ← note minus
def train_stack_step(w1, b1, w2, b2, lr=2.5):   # use lr ≈ 0.15 for ReLU
    dw1 = db1 = dw2 = db2 = 0.0
    N = len(data)                               # same (x, y) pairs as Stage 1
    for x, y in data:
        z1, z2 = w1 * x + b1, w2 * x + b2
        a1, a2 = act(z1), act(z2)
        err = (a1 - a2) - y
        f = (2 / N) * err                       # shared factor for all four gradients
        dw1 += f * act_grad(z1) * x
        db1 += f * act_grad(z1)
        dw2 += -f * act_grad(z2) * x
        db2 += -f * act_grad(z2)
    return (w1 - lr * dw1, b1 - lr * db1,
            w2 - lr * dw2, b2 - lr * db2)
```
1. The saturation problem returns. Same math as Stage 2: σ'(z) ≤ 0.25. Stack 10 layers and the gradient is scaled by at most 0.25¹⁰ ≈ 10⁻⁶. Early layers can't hear the loss. This is the vanishing gradient.
2. Random init makes neurons die. Hit 🎲 a few times and watch — sometimes one neuron's b lands far enough into σ's flat tail that its gradient is essentially zero. The neuron never learns; your MLP collapses to a single neuron. We tightened the init range to make this rare, but at scale (with thousands of neurons) some always end up dead.
3. ReLU fixes both. ReLU(z) = max(0, z). Derivative is exactly 1 for any z > 0, regardless of magnitude. No saturation, no vanishing. Try the activation toggle above — switch to ReLU and watch the bend become a hard corner, the loss jump (defaults aren't optimal for ReLU), then hit ▶ Train and watch GD find a new fit. ReLU can actually reach loss = 0 on this data — the corners line up with the data exactly.
The catch: dying ReLU — if a neuron always sees z < 0, gradient is 0 and it never recovers. But that's one-sided; σ saturates on both ends.
Production today: large language models use smooth ReLU variants such as GELU (GPT-2, GPT-3) and SwiGLU (Llama). Same family, smoother gradients.
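The zero-loss claim for ReLU and the dying-ReLU caveat can both be checked by hand. The sketch below uses hand-picked weights that are an assumption for illustration, not the values the page's training finds: one corner at x = 1 and a second at x = 2.5. The helper `stack_loss` is likewise introduced here.

```python
data = [(1, 0), (2, 1), (3, 1), (4, 0)]  # same points as Stage 1

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def stack_loss(w1, b1, w2, b2):
    """MSE of ŷ = relu(w1·x + b1) − relu(w2·x + b2) on the four points."""
    return sum((relu(w1 * x + b1) - relu(w2 * x + b2) - y) ** 2
               for x, y in data) / len(data)

# Hand-picked (hypothetical) weights: ŷ = relu(x − 1) − relu(2x − 5).
zero_loss = stack_loss(1.0, -1.0, 2.0, -5.0)

# A dead neuron: z = −x is negative on every data point,
# so its gradient factor is 0 for all of them and GD cannot move it.
dead_grads = [relu_grad(-1.0 * x + 0.0) for x, _ in data]

print(zero_loss, dead_grads)
```

The first neuron rises from x = 1; the second switches on between x = 2 and x = 3 and pulls the curve back down through the last two points.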
Same kind of vector the MLP ate in Stage 3 — just three numbers — except we look it up in a table instead of typing it. One row of E per word. Same word in, same vector out — every time.
E is also used to score the next word (tied unembedding). ↺ to restore.
d₂ isn't drawn — but it's there, and the math uses all three.
A pure lookup can't disambiguate. mole means a different thing in each of these sentences, yet the lookup hands back the same vector for all three:
That's the limitation. Attention's job (next stage) is to compute a nudge — using the surrounding words — that moves each embedding toward what it actually means here. — Example from 3Blue1Brown.
After Stage 4 every word is a context-free lookup. Attention computes a nudge (Δe) for each one — using Q·K dot products and softmax — that moves it toward what it actually means here. The result is the refined embedding 3B1B calls e'; in our code we write it h₁ since it flows straight into the MLP next.
Our prompt is Grant Sanderson's example phrase: a fluffy blue creature. Picture creature asking:
"Are there any adjectives sitting in front of me?"
That question is its Query (Q).
fluffy and blue each hold up a Key (K) answering "yes — I'm an adjective, here." Big Q·K dot product → big attention weight α after softmax. So far that only decides how much to listen to each word.
What each word actually contributes is its Value (V). Think of it as two separate things a word carries: its Key is the label it advertises to be matched on; its Value is the payload it hands over once matched. Take each word's V, scale it by that word's weight α, and add them all up → the nudge Δe. Add Δe to creature's embedding → it now means fluffy blue creature.
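In one line: Δe = softmax(Q·Kᵀ / √d) · V. Q·Kᵀ scores each word's Key against the asker's Query; softmax turns those scores into the weights α (sum to 1); those weights mix the values V. The softmax(…) half is which words; the V half is what they carry. Their product is the nudge Δe.
This page leaves out two parts of the standard formula, the learned projections W_Q, W_K, W_V and the division by √d; both are disclosed in the note below. Every other symbol on this page is exactly this line.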
The diagrams below show this happening live, with our trained weights.
Example sentence and Q/K/V framing from Grant Sanderson's 3Blue1Brown — Attention in transformers, visually explained.
[Live table: one row per token (K), showing its Q·K score and its attention weight α.]
```python
from math import exp

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    e = [exp(x - m) for x in xs]
    total = sum(e)
    return [v / total for v in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# prompt "a fluffy blue creature" → rows of E (each is 3 numbers)
X = [E[i] for i in (0, 2, 3, 4)]

# identity W_Q = W_K = W_V = I: each embedding is its own Q, K, V
q = X[-1]                            # creature is the asker
scores = [dot(q, k) for k in X]
alpha = softmax(scores)              # weights, sum to 1

# mix the values, then add back onto the asker (residual)
a = [sum(alpha[i] * X[i][d] for i in range(len(X))) for d in range(3)]
h1 = [X[-1][d] + a[d] for d in range(3)]
```
In a real transformer each embedding is reshaped into Q, K, V by three trained matrices (W_Q, W_K, W_V) — most of attention's learned cleverness lives there. We set them to the identity, so each word's embedding plays all three roles. The mechanism is identical; only the projections are trivial.
Real models also divide the scores by √d before softmax, to stop them blowing up as dimension grows. With d = 3 it barely matters, so we skip it. Disclosed, not hidden.
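To see how little the √d scaling changes at d = 3, the same attention step can be run with and without it. The embeddings below are made-up numbers for this sketch, not the page's trained table E; everything else follows the identity-projection setup described above.

```python
from math import exp, sqrt

def softmax(xs):
    m = max(xs)
    e = [exp(x - m) for x in xs]
    total = sum(e)
    return [v / total for v in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 3-number embeddings (NOT the page's trained values).
tokens = ["a", "fluffy", "blue", "creature"]
X = [
    [0.1, 0.0, 0.1],
    [0.9, 0.2, 0.1],
    [0.8, -0.1, 0.3],
    [1.0, 0.1, 0.2],
]
d = 3

q = X[-1]                                    # creature is the asker
scores = [dot(q, k) for k in X]
alpha_plain = softmax(scores)                # what the page does
alpha_scaled = softmax([s / sqrt(d) for s in scores])  # what real models do

# Nudge Δe and refined embedding h1, using the unscaled weights.
delta_e = [sum(alpha_plain[i] * X[i][j] for i in range(len(X))) for j in range(d)]
h1 = [q[j] + delta_e[j] for j in range(d)]

for tok, s, ap, asc in zip(tokens, scores, alpha_plain, alpha_scaled):
    print(f"{tok:9s} score={s:.2f}  alpha={ap:.3f}  alpha_scaled={asc:.3f}")
print("h1 =", [round(v, 3) for v in h1])
```

Scaling flattens the weights slightly but keeps their ranking; the gap widens as d grows, because dot products of longer vectors tend to be larger in magnitude.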
The arrows show the attention weights α (softmax of Q·K). The number on each arrow is the Q·K dot product before softmax. Drag any Stage-4 slider and every arrow shifts in real time.
The weighted mix of values is added back onto x_creature to get h₁ (= 3B1B's e'_creature): creature, now knowing it's fluffy and blue. This is what the MLP eats next.
The attention output h₁ runs through a layer of ReLU neurons — the same stack of bent lines you trained earlier — and the result is added back on (residual). This block is where "what tends to follow what" actually lives.
In Stage 3 the neurons took a single number x and stacked bent lines to fit a shape. Here the same neurons take the 3-number vector h₁ and reshape it: same ReLU, same w·input + b, same residual idea; only the input got wider. Nothing new to learn.
```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    return [dot(row, v) for row in M]    # dot() as defined in Stage 5

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# h1 is the attention output from Stage 5.
# W1, b1, W2, b2 are trained weights: fit offline, baked in.
hidden = relu(add(matvec(W1, h1), b1))   # 6 ReLU neurons
delta = add(matvec(W2, hidden), b2)      # back down to 3 numbers
h2 = add(h1, delta)                      # residual: h2 = h1 + delta
```
Highlighted neurons fired (ReLU > 0); greyed ones stayed silent. Same picture as the Stage-3 architecture, just three inputs instead of one.
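The block can be traced end to end with stand-in numbers. The weights and the input `h1` below are hypothetical, chosen only so the arithmetic is easy to follow by eye; they are not the page's trained values. The list `fired` is introduced here to mirror the fired/silent distinction in the diagram.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    return [dot(row, v) for row in M]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# Hypothetical weights (NOT the trained ones): 6 neurons × 3 inputs.
W1 = [
    [ 1.0,  0.0,  0.0],
    [ 0.0,  1.0,  0.0],
    [ 0.0,  0.0,  1.0],
    [-1.0,  0.0,  0.0],
    [ 0.0, -1.0,  0.0],
    [ 0.0,  0.0, -1.0],
]
b1 = [0.0] * 6
# 3 outputs × 6 neurons: only the first three neurons feed the output here.
W2 = [
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
]
b2 = [0.0] * 3

h1 = [1.0, -2.0, 0.5]                    # stand-in for the attention output

pre = add(matvec(W1, h1), b1)            # pre-activations, one per neuron
hidden = relu(pre)
fired = [i for i, z in enumerate(pre) if z > 0]   # neurons with ReLU > 0
delta = add(matvec(W2, hidden), b2)
h2 = add(h1, delta)                      # residual connection

print("fired neurons:", fired)
print("delta:", delta)
print("h2:", h2)
```

Because of the residual, a block whose neurons all stay silent would pass h₁ through unchanged (up to the bias b2).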
Dot h₂ against every word's embedding → one logit per word. Softmax → probabilities. Temperature reshapes them. Then sample — and append. Do it again. That's generation.
```python
import random

# logit per word: dot h2 with each row of E (tied weights, U = Eᵀ)
logits = [dot(h2, E[v]) for v in range(len(VOCAB))]

def generate(logits, T=1.0, greedy=False):
    if greedy or T <= 0:
        return logits.index(max(logits))          # argmax
    probs = softmax([l / T for l in logits])      # temperature reshapes
    u, cum = random.random(), 0.0                 # inverse-CDF draw
    for i, p in enumerate(probs):
        cum += p
        if u < cum:
            return i
    return len(probs) - 1                         # fallback if rounding leaves cum < u

next_id = generate(logits, T=1.0)
prompt.append(VOCAB[next_id])   # then run the whole pipeline again
```
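What "temperature reshapes" means is easiest to see on a fixed set of logits. The three logits below are hypothetical, and `with_temperature` is a helper name introduced for this sketch; it applies the same `l / T` scaling used in `generate`.

```python
from math import exp

def softmax(xs):
    m = max(xs)
    e = [exp(x - m) for x in xs]
    total = sum(e)
    return [v / total for v in e]

def with_temperature(logits, T):
    """Divide every logit by T before softmax (T > 0)."""
    return softmax([l / T for l in logits])

logits = [2.0, 1.0, 0.0]          # hypothetical logits for three words

for T in (0.5, 1.0, 2.0):
    probs = with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

Lower T sharpens the distribution toward the top word, which is why greedy decoding behaves like the T → 0 limit; higher T flattens it and makes unlikely words more probable.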
Attention decides which words to mix. The MLP — the stack of bent lines from Stage 3 — decides what the mix means. Unembedding turns it back into words. It's the same loss-minimising machine as the line you started with, scaled up: not a different idea, just the line, stacked and made data-aware.
This toy is GPT scaled down: one head, one block, identity Q/K/V, no positional encoding, no √d scaling, no layer norm, and weights fit offline rather than trained at scale. The mechanism is the real one.
Every click of Generate runs a full attention → MLP → softmax pass on the current prompt. No cache, no precomputed lookup. Three independent checks back this up:
1. Same constants the page ships with. The weights in index.html were fit by tests/fit_transformer.py (a ~150-line numpy + scipy script) and pasted in as constants. Re-run the fit any time; the script self-asserts the canonical numbers (P(roamed) ≈ 80% etc.) before printing them.
2. JS = Python, to floating-point noise. tests/equivalence.py parses the constants AND the JS pipeline straight out of index.html, runs both implementations on 8 prompts × MLP on/off = 16 cases, and asserts all 784 intermediate values agree to 1e-9. Last run: max divergence 6.4 × 10⁻¹⁴.
3. The rollout you see is the rollout Python computes. tests/generate_rollout.py simulates the exact loop the Generate button performs — same forward pass, same sampling — and prints every step's attention pattern, top probabilities, and sampled token. In greedy mode it must match the browser exactly (no RNG); in sample mode the distributions must match (the RNG differs, but probability percentages are identical).
Run: python3 tests/generate_rollout.py. Greedy from a fluffy blue creature → roamed → the → forest → . Compare to clicking Generate with mode = greedy in the browser.
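The tolerance check described in item 2 comes down to comparing two lists of intermediate values and reporting the worst gap. The sketch below is not the repository's script: `max_divergence` and the sample values are hypothetical, and it only illustrates what "agree to 1e-9" means in practice.

```python
def max_divergence(reference, candidate):
    """Largest absolute difference between two equally long lists of floats."""
    if len(reference) != len(candidate):
        raise ValueError("value lists differ in length")
    return max(abs(r - c) for r, c in zip(reference, candidate))

# Hypothetical intermediate values from two implementations of the same math.
reference = [0.1 + 0.2, 1.0 / 3.0, 2.0 ** 0.5]
candidate = [0.3, 0.3333333333333333, 1.4142135623730951]

worst = max_divergence(reference, candidate)
TOLERANCE = 1e-9
print("max divergence:", worst, "ok" if worst < TOLERANCE else "MISMATCH")
```

A tolerance like 1e-9 sits well above floating-point noise but far below any difference a real logic bug would produce.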
[Live table: one row per word, showing its logit, P (with T), and P.]