From a line to a language model

01 — Line

A model is a function with knobs.

Two knobs: w (slope) and b (intercept). They define a line. Drag the sliders to find the line closest to the four yellow dots.

stage 1 of 7

parameters

ŷ = 0.30·x − 0.10

loss (MSE, mean squared error) = 0.385

A line can fit slope, but not bumps.

w = 0.30

b = −0.10

gradient descent · lr=0.05

# ŷ = wx + b
# L  = (1/N) Σ (ŷ − y)²
# dL/dw = (2/N) Σ (ŷ − y) · x
# dL/db = (2/N) Σ (ŷ − y)

data = [(1, 0), (2, 1), (3, 1), (4, 0)]   # (x, y) pairs

def train_line_step(w, b, lr=0.05):
    dw, db = 0.0, 0.0
    N = len(data)
    for x, y in data:
        err = (w * x + b) - y
        dw += (2 / N) * err * x
        db += (2 / N) * err
    return w - lr * dw, b - lr * db
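
A minimal sketch of what repeated steps do, assuming training simply calls the step in a loop from the slider defaults. The step count (2000) and the helper name mse are illustrative assumptions, not taken from the page.

```python
data = [(1, 0), (2, 1), (3, 1), (4, 0)]   # same four (x, y) pairs as above

def mse(w, b):
    # mean squared error of the line w*x + b on the four points
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

def train_line_step(w, b, lr=0.05):
    # one full-batch gradient step, same update rule as the function above
    N = len(data)
    dw = sum((2 / N) * ((w * x + b) - y) * x for x, y in data)
    db = sum((2 / N) * ((w * x + b) - y) for x, y in data)
    return w - lr * dw, b - lr * db

w, b = 0.30, -0.10             # the slider defaults shown above
start_loss = mse(w, b)
for _ in range(2000):          # assumed step count, generous on purpose
    w, b = train_line_step(w, b)
final_loss = mse(w, b)
print(f"w={w:.4f}  b={b:.4f}  loss {start_loss:.3f} -> {final_loss:.3f}")
```

The loss falls quickly at first and then crawls: one direction in (w, b) space is much flatter than the other, so a fixed learning rate spends most of its steps there.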

No — linear regression has a one-shot closed form.

For any linear regression, the optimal weights can be computed in a single formula known since Legendre (1805) and Gauss (1809):

w* = (XᵀX)⁻¹ Xᵀy

One matrix inversion, one multiplication. Exact answer, no iteration. For our 4 points it gives w ≈ 0, b ≈ 0.5 (the horizontal line at the mean) with MSE = 0.25 — the same place GD plateaus, just instantly.
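
As a hedged illustration: with a single feature plus an intercept, the normal equation reduces to the textbook slope and intercept formulas, so the four-point answer can be checked without any matrix code. The variable names below are illustrative.

```python
data = [(1, 0), (2, 1), (3, 1), (4, 0)]   # same four (x, y) pairs as above
N = len(data)

mean_x = sum(x for x, _ in data) / N
mean_y = sum(y for _, y in data) / N

# One-feature special case of w* = (X^T X)^-1 X^T y:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
sxx = sum((x - mean_x) ** 2 for x, _ in data)
w_star = sxy / sxx
b_star = mean_y - w_star * mean_x

mse_star = sum(((w_star * x + b_star) - y) ** 2 for x, y in data) / N
print(w_star, b_star, mse_star)
```

The covariance term cancels because the data is symmetric around x = 2.5, which is why the best line is flat.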

So why use GD here? Two reasons:

1. It scales. The closed form needs (XᵀX)⁻¹: forming XᵀX costs O(nd²) and inverting it costs O(d³). With millions of features that is infeasible. Full-batch GD costs O(nd) per step.

2. It generalises. Past stage 1, there is no closed form. The moment we add a sigmoid (stage 2) the loss isn't quadratic anymore and the normal equation no longer applies. Gradient-based optimisation is the general-purpose tool that scales from a line all the way to a transformer.

Other algorithms exist too — SGD (mini-batch), Adam (adaptive lr per parameter), Newton's method (uses 2nd derivatives), L-BFGS, conjugate gradient. All variations on "follow the gradient downhill". We use plain GD here because it's the simplest version of the framework.

architecture

the line vs the data

02 — Bend

Pipe the line through an S. You get a neuron.

σ(z) takes any real number and squashes it into the range (0, 1). Apply it to your line from before. The line bends, but it's still monotonic — it only goes one way.

stage 2 of 7

parameters

z = 1.00·x − 2.50

ŷ = σ(z) = 1 / (1 + e^−z)

σ caps the output at (0, 1). The line z keeps going, but ŷ flattens against the ceiling and floor — that flattening is the bend.

loss (MSE) = 0.308

A single S curve is still one-directional. Try to fit the bump — you can't.

w = 1.00

b = −2.50

gradient descent · lr=1.5

from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

# ŷ = σ(z),  z = wx + b
# σ'(z) = σ(z) · (1 − σ(z))   ← the chain-rule term
# dL/dw = (2/N) Σ (ŷ − y) · σ'(z) · x
# dL/db = (2/N) Σ (ŷ − y) · σ'(z)

def train_bend_step(w, b, lr=1.5):
    dw, db = 0.0, 0.0
    N = len(data)
    for x, y in data:
        z   = w * x + b
        yh  = sigmoid(z)
        err = yh - y
        dp  = yh * (1 - yh)         # σ'(z)
        dw += (2 / N) * err * dp * x
        db += (2 / N) * err * dp
    return w - lr * dw, b - lr * db
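
One way to sanity-check the chain-rule formulas above is a finite-difference comparison. This is a sketch, not part of the page's own code: the helper names (loss, analytic_grad, numeric_grad) and the step size h are assumptions.

```python
from math import exp

data = [(1, 0), (2, 1), (3, 1), (4, 0)]

def sigmoid(z):
    return 1 / (1 + exp(-z))

def loss(w, b):
    # MSE of the single sigmoid neuron on the four points
    return sum((sigmoid(w * x + b) - y) ** 2 for x, y in data) / len(data)

def analytic_grad(w, b):
    # dL/dw and dL/db from the chain-rule formulas above
    N = len(data)
    dw = db = 0.0
    for x, y in data:
        yh = sigmoid(w * x + b)
        common = (2 / N) * (yh - y) * yh * (1 - yh)   # err * sigma'(z)
        dw += common * x
        db += common
    return dw, db

def numeric_grad(w, b, h=1e-6):
    # central differences: nudge one knob at a time and watch the loss
    dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
    db = (loss(w, b + h) - loss(w, b - h)) / (2 * h)
    return dw, db

w, b = 1.00, -2.50                      # the slider defaults for this stage
a_dw, a_db = analytic_grad(w, b)
n_dw, n_db = numeric_grad(w, b)
print(f"dw analytic {a_dw:.6f} vs numeric {n_dw:.6f}")
print(f"db analytic {a_db:.6f} vs numeric {n_db:.6f}")
```

If the two columns disagree beyond the last few digits, the derivative formula (or its sign) is wrong.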

σ is a teaching choice. Production uses ReLU.

The sigmoid saturates at both ends — for large |z| the curve flattens, and its derivative collapses toward zero. That breaks training in two ways.

1. Capped gradient. The derivative is σ'(z) = σ(z)·(1−σ(z)). Both factors live in (0, 1) and sum to 1, so their product is biggest when they're equal — at σ(z) = 0.5, which is at z = 0:

z	σ(z) at +z	σ'(z)
0	0.500	0.250 ← max
±2	0.881	0.105
±5	0.993	0.007
±10	≈ 1	0.00005

No matter what σ sees, its gradient is ≤ 0.25. That ceiling is the problem.

2. Vanishing gradient. Backprop through a stack of layers multiplies one of those derivatives per layer. For a 10-layer sigmoid network, the best-case product of the activation derivatives alone (ignoring the weights) is 0.25¹⁰ ≈ 10⁻⁶. The signal arriving at the first layer is rounding-error small, so early layers can't learn.

The fix: ReLU. ReLU(z) = max(0, z). For positive z, the derivative is exactly 1. No saturation, no vanishing — through 10 layers the signal is still at full strength.
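
The arithmetic above can be reproduced in a few lines. This sketch multiplies only the activation derivatives and deliberately ignores the weight matrices, which in a real network also scale the gradient; the choice of z = 2 as a "typical" input is an assumption for illustration.

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

LAYERS = 10

# Best case for sigmoid: every layer sits exactly at the peak, z = 0.
best_sigmoid = sigmoid_grad(0.0) ** LAYERS
# A less lucky case: every layer sees z = 2 (assumed value).
typical_sigmoid = sigmoid_grad(2.0) ** LAYERS
# ReLU with the same positive input: each factor is exactly 1.
relu_chain = relu_grad(2.0) ** LAYERS

print(f"sigmoid, z=0 everywhere: {best_sigmoid:.2e}")
print(f"sigmoid, z=2 everywhere: {typical_sigmoid:.2e}")
print(f"ReLU,    z=2 everywhere: {relu_chain:.2e}")
```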

Production today: large language models use smooth ReLU variants. GPT-2 and GPT-3 use GELU; Llama uses SwiGLU. We use σ on this page because the bend is visually intuitive and the math is small enough to do by hand.

architecture

line (top) → bent (bottom)

03 — Stack

Two bends, subtracted. Now we can fit a bump.

One neuron rises, the other catches up. Their difference makes a bump. This is a tiny multi-layer perceptron (MLP), the same block that reappears inside the transformer in Stage 6.

stage 3 of 7

parameters

ŷ = σ(w₁x + b₁) − σ(w₂x + b₂)

loss (MSE) = 0.006

A single S can't make a bump. Two of them, subtracted, can.
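
A small sketch of the bump, evaluated at this stage's slider defaults (w₁ = 5, b₁ = −7.5, w₂ = 5, b₂ = −17.5) on the four data points. The function name bump is illustrative.

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def bump(x, w1=5.0, b1=-7.5, w2=5.0, b2=-17.5):
    # neuron 1 switches on around x = 1.5, neuron 2 around x = 3.5;
    # subtracting them leaves a plateau in between
    return sigmoid(w1 * x + b1) - sigmoid(w2 * x + b2)

data = [(1, 0), (2, 1), (3, 1), (4, 0)]
preds = [bump(x) for x, _ in data]
loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)

for (x, y), p in zip(data, preds):
    print(f"x={x}  target={y}  prediction={p:.3f}")
print(f"MSE = {loss:.4f}")
```

Each neuron's midpoint sits at x = −b/w, so moving b slides the edge of the bump and raising w sharpens it.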

activation:

neuron 1 (rises)

w₁ = 5.00

b₁ = −7.50

neuron 2 (catches up)

w₂ = 5.00

b₂ = −17.50

gradient descent · lr=2.5 · tip: 🎲 random init, then ▶ Train

from math import exp

# Two activations available; pick one.
def sigmoid(z): return 1 / (1 + exp(-z))
def sigmoid_grad(z):
    s = sigmoid(z); return s * (1 - s)

def relu(z):      return max(0, z)
def relu_grad(z): return 1.0 if z > 0 else 0.0

act, act_grad = sigmoid, sigmoid_grad           # or: relu, relu_grad

# ŷ = act(z₁) − act(z₂),  z_k = w_k·x + b_k
# dL/dw₁ =  (2/N) Σ (ŷ − y) · act_grad(z₁) · x
# dL/dw₂ = −(2/N) Σ (ŷ − y) · act_grad(z₂) · x   ← note minus

def train_stack_step(w1, b1, w2, b2, lr=2.5):   # use lr ≈ 0.15 for ReLU
    dw1 = db1 = dw2 = db2 = 0.0
    N = len(data)
    for x, y in data:
        z1, z2 = w1 * x + b1, w2 * x + b2
        a1, a2 = act(z1), act(z2)
        err = (a1 - a2) - y
        f = (2 / N) * err
        dw1 +=  f * act_grad(z1) * x
        db1 +=  f * act_grad(z1)
        dw2 += -f * act_grad(z2) * x
        db2 += -f * act_grad(z2)
    return (w1 - lr * dw1, b1 - lr * db1,
            w2 - lr * dw2, b2 - lr * db2)

Sigmoid works here. At scale, it breaks.

1. The saturation problem returns. Same math as Stage 2 — σ'(z) ≤ 0.25. Stack 10 layers and the gradient shrinks by 0.25¹⁰ ≈ 10⁻⁶. Early layers can't hear the loss. This is the vanishing gradient.

2. Random init makes neurons die. Hit 🎲 a few times and watch — sometimes one neuron's b lands far enough into σ's flat tail that its gradient is essentially zero. The neuron never learns; your MLP collapses to a single neuron. We tightened the init range to make this rare, but at scale (with thousands of neurons) some always end up dead.

3. ReLU fixes both. ReLU(z) = max(0, z). Derivative is exactly 1 for any z > 0, regardless of magnitude. No saturation, no vanishing. Try the activation toggle above — switch to ReLU and watch the bend become a hard corner, the loss jump (defaults aren't optimal for ReLU), then hit ▶ Train and watch GD find a new fit. ReLU can actually reach loss = 0 on this data — the corners line up with the data exactly.

The catch: dying ReLU — if a neuron always sees z < 0, gradient is 0 and it never recovers. But that's one-sided; σ saturates on both ends.
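
The claim that ReLU can reach loss = 0 on this data can be checked directly. The parameters below are one exact solution worked out by hand; they are an assumption for illustration, and gradient descent may land on a different one.

```python
data = [(1, 0), (2, 1), (3, 1), (4, 0)]

def relu(z):
    return max(0.0, z)

def relu_stack(x, w1, b1, w2, b2):
    # same shape as the Stage 3 model, with ReLU as the activation
    return relu(w1 * x + b1) - relu(w2 * x + b2)

# Hand-picked parameters (assumed, not the page's trained values):
# neuron 1 ramps up from x = 1, neuron 2 ramps twice as fast from x = 2.5.
params = dict(w1=1.0, b1=-1.0, w2=2.0, b2=-5.0)

preds = [relu_stack(x, **params) for x, _ in data]
loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
print(preds, loss)
```

The second neuron stays silent until x = 2.5, then cancels the first ramp and pulls the output back to zero at x = 4.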

Production today: large language models use smooth ReLU variants. GPT-2 and GPT-3 use GELU; Llama uses SwiGLU. Same family, smoother gradients.

architecture (MLP with 2 hidden neurons)

two S-curves and their difference

04 — Embed

Each word becomes a point in space.

Same kind of input the MLP ate in Stage 3, except it is now three numbers instead of one, and we look it up in a table instead of typing it. One row of E per word. Same word in, same vector out, every time.

stage 4 of 7

embedding matrix E · 9 words × 3 dims

Drag any slider: everything downstream recomputes live. Pushing a word's embedding far enough can hijack the next-word prediction toward that word, because E is also used to score the next word (tied unembedding). ↺ to restore.

words as points (showing d₀, d₁)

The prompt words a fluffy blue creature are highlighted. The third axis d₂ isn't drawn — but it's there, and the math uses all three.

but a lookup is context-free — same word, three meanings

A pure lookup can't disambiguate. mole means a different thing in each of these sentences, yet the lookup hands back the same vector for all three:

American shrew mole

🐀 the animal

one mole of CO₂

🧪 the unit

biopsy of the mole

🩺 the skin spot

That's the limitation. Attention's job (next stage) is to compute a nudge — using the surrounding words — that moves each embedding toward what it actually means here. — Example from 3Blue1Brown.
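
A minimal sketch of why a lookup is context-free. The table below is hypothetical: the words and the three-number vectors are made up for illustration and are not the page's embedding matrix E.

```python
# Hypothetical embedding table (values invented for this example).
E = {
    "mole":   [0.4, -0.2, 0.7],
    "shrew":  [0.9, 0.1, 0.3],
    "biopsy": [-0.5, 0.8, 0.2],
}

def embed(sentence):
    # Each known word is looked up on its own; neighbours are never consulted.
    return [E[word] for word in sentence.split() if word in E]

animal = embed("American shrew mole")
skin = embed("biopsy of the mole")

mole_in_animal = animal[-1]
mole_in_skin = skin[-1]
print(mole_in_animal == mole_in_skin)   # same row of the table both times
```

Nothing in embed ever sees the other words, so no choice of table values can make the two "mole" vectors differ. That has to come from a later step.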

05 — Attend

Nudge each word's embedding using its surroundings.

After Stage 4 every word is a context-free lookup. Attention computes a nudge (Δe) for each one — using Q·K dot products and softmax — that moves it toward what it actually means here. The result is the refined embedding 3B1B calls e'; in our code we write it h₁ since it flows straight into the MLP next.

stage 5 of 7

Q, K, V — live, on 3B1B's example

Our prompt is Grant Sanderson's example phrase: a fluffy blue creature. Picture creature asking:

"Are there any adjectives sitting in front of me?"

That question is its Query (Q). fluffy and blue each hold up a Key (K) answering "yes — I'm an adjective, here." Big Q·K dot product → big attention weight α after softmax. So far that only decides how much to listen to each word.

What each word actually contributes is its Value (V). Think of it as two separate things a word carries: its Key is the label it advertises to be matched on; its Value is the payload it hands over once matched. Take each word's V, scale it by that word's weight α, and add them all up → the nudge Δe. Add Δe to creature's embedding → it now means fluffy blue creature.

the whole stage in one line · Vaswani et al., Attention Is All You Need (2017)

Attention(Q, K, V) = softmax(QKᵀ / √d) V

Inside-out: QKᵀ scores every query against every key → softmax turns those scores into the weights α (sum to 1) → those weights mix the values V. The softmax(…) half is which words; the V half is what they carry. Their product is the nudge Δe.

Our toy drops the three learned projection matrices (so Q=K=V= the raw embedding) and drops √d — both disclosed in the note below. Every other symbol on this page is exactly this line.

The diagrams below show this happening live, with our trained weights.

Example sentence and Q/K/V framing from Grant Sanderson's 3Blue1Brown — Attention in transformers, visually explained.

embeddings · edit here or on Stage 4

live attention · query = creature

Q_creature = […]

tok (K)	Q·K	α

Σ α = 1.000

from math import exp

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    e = [exp(x - m) for x in xs]
    total = sum(e)                   # sum once, not once per element
    return [v / total for v in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# prompt "a fluffy blue creature" → rows of E (each is 3 numbers)
X = [E[i] for i in (0, 2, 3, 4)]

# identity W_Q = W_K = W_V = I: each embedding is its own Q, K, V
q = X[-1]                          # creature is the asker
scores = [dot(q, k) for k in X]
alpha  = softmax(scores)            # weights, sum to 1

# mix the values, then add back onto the asker (residual)
a  = [sum(alpha[i]*X[i][d] for i in range(len(X))) for d in range(3)]
h1 = [X[-1][d] + a[d] for d in range(3)]

Two teaching shortcuts.

In a real transformer each embedding is reshaped into Q, K, V by three trained matrices (W_Q, W_K, W_V) — most of attention's learned cleverness lives there. We set them to the identity, so each word's embedding plays all three roles. The mechanism is identical; only the projections are trivial.

Real models also divide the scores by √d before softmax, to stop them blowing up as dimension grows. With d = 3 it barely matters, so we skip it. Disclosed, not hidden.
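
To make the two shortcuts concrete, here is a sketch of the same single-query attention with both pieces put back: explicit W_Q, W_K, W_V matrices and an optional √d divisor. The function name attend_last and the embeddings in X are hypothetical, and the projections are assumed square (d × d) so the residual add still lines up.

```python
from math import exp, sqrt

def softmax(xs):
    m = max(xs)
    e = [exp(x - m) for x in xs]
    total = sum(e)
    return [v / total for v in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def attend_last(X, W_Q, W_K, W_V, scale=True):
    """Attention for the last token only, followed by the residual add."""
    d = len(X[0])
    q = matvec(W_Q, X[-1])                    # the asker's query
    keys = [matvec(W_K, x) for x in X]        # what each word advertises
    values = [matvec(W_V, x) for x in X]      # what each word hands over
    div = sqrt(d) if scale else 1.0
    alpha = softmax([dot(q, k) / div for k in keys])
    delta = [sum(a * v[j] for a, v in zip(alpha, values)) for j in range(d)]
    return alpha, [x + dx for x, dx in zip(X[-1], delta)]

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical embeddings for "a fluffy blue creature" (made-up numbers).
X = [[0.1, 0.0, 0.2], [0.9, 0.3, 0.1], [0.7, 0.5, 0.0], [0.8, 0.4, 0.6]]

alpha_toy, h1_toy = attend_last(X, I3, I3, I3, scale=False)      # the page's shortcut
alpha_scaled, h1_scaled = attend_last(X, I3, I3, I3, scale=True)  # with the sqrt(d) divisor
print([round(a, 3) for a in alpha_toy])
print([round(a, 3) for a in alpha_scaled])
```

With identity matrices and scale=False this reduces to the toy computation above. Turning the divisor on flattens the weights slightly, and swapping in trained matrices is where a real model's behaviour would come from.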

creature's question · every earlier word's answer

Arrow thickness = attention weight α (softmax of Q·K). The number on each arrow is the Q·K dot product before softmax. Drag any Stage-4 slider and every arrow shifts in real time.
Each card shows KEY · VALUE on the same box because our identity projections make them the same vector here — the labels mark the two roles the embedding plays, not two different numbers.

the nudge · Δe = Σ αᵢ · Vᵢ · then add to creature

The mix is Δe — the nudge attention computes. Add it to x_creature to get h₁ (= 3B1B's e'_creature): creature, now knowing it's fluffy and blue. This is what the MLP eats next.

06 — MLP

You already built this. Back in Stage 3.

The attention output h₁ runs through a layer of ReLU neurons — the same stack of bent lines you trained earlier — and the result is added back on (residual). This block is where "what tends to follow what" actually lives.

stage 6 of 7

the Stage-3 machine, vectorised

Stage 3 took one number x and stacked bent lines to fit a shape. Here the same neurons take the 3-number vector h₁ and reshape it: same ReLU, same w·input + b, plus the residual add you saw in Stage 5. Only the input got wider. Nothing new to learn.

h₂ = h₁ + W₂·ReLU(W₁·h₁ + b₁) + b₂

The MLP does the heavy lifting — bypass it to see.

MLP:

next word, MLP on: …

next word, bypassed: …

Without the MLP the model has only the attention-blended vector — it can't tell what the creature did. The MLP is the knowledge.

def relu(v):    return [max(0.0, x) for x in v]
def matvec(M, v): return [dot(row, v) for row in M]
def add(a, b):  return [x + y for x, y in zip(a, b)]

# h1 is the attention output from Stage 5.
# W1, b1, W2, b2 are trained weights — fit offline, baked in.
hidden = relu(add(matvec(W1, h1), b1))   # 6 ReLU neurons
delta  = add(matvec(W2, hidden), b2)     # back down to 3 numbers
h2     = add(h1, delta)                  # residual: h2 = h1 + delta

h₁ → ReLU neurons → Δ → h₂ = h₁ + Δ

Filled neurons fired (ReLU > 0); greyed ones stayed silent. Same picture as the Stage-3 architecture — just three inputs instead of one.
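
A sketch of the block with a bypass switch, to show what "MLP on" versus "bypassed" means mechanically. The weights here are hypothetical and use 2 hidden neurons instead of the page's 6 to keep the arithmetic readable; mlp_block and its bypass flag are illustrative names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(M, v):
    return [dot(row, v) for row in M]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mlp_block(h1, W1, b1, W2, b2, bypass=False):
    # Returns (h2, fired) where fired marks which hidden neurons are > 0.
    if bypass:
        return list(h1), []                   # residual path only
    hidden = relu(add(matvec(W1, h1), b1))
    delta = add(matvec(W2, hidden), b2)
    return add(h1, delta), [a > 0 for a in hidden]

# Hypothetical weights (not the page's trained constants).
W1 = [[1.0, 0.0, -1.0], [-1.0, 0.5, 0.0]]     # 2 hidden neurons x 3 inputs
b1 = [0.0, 0.1]
W2 = [[0.5, 0.0], [0.0, 2.0], [-1.0, 1.0]]    # 3 outputs x 2 hidden neurons
b2 = [0.0, 0.0, 0.1]
h1 = [0.2, 0.4, 0.6]

h2_on, fired = mlp_block(h1, W1, b1, W2, b2)
h2_off, _ = mlp_block(h1, W1, b1, W2, b2, bypass=True)
print("MLP on:  ", [round(x, 3) for x in h2_on], "fired:", fired)
print("bypassed:", h2_off)
```

Because of the residual, bypassing the MLP does not zero the vector. It passes h₁ through unchanged, which is why the bypassed model still predicts something, just without the MLP's correction.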

07 — Generate

Turn the vector into a word. Then pick one.

Dot h₂ against every word's embedding → one logit per word. Softmax → probabilities. Temperature reshapes them. Then sample — and append. Do it again. That's generation.

stage 7 of 7

controls

T (temperature) = 1.00

mode:

import random, math

# logit per word: dot h2 with each row of E (tied weights, U = Eᵀ)
logits = [dot(h2, E[v]) for v in range(len(VOCAB))]

def generate(logits, T=1.0, greedy=False):
    if greedy or T <= 0:
        return logits.index(max(logits))      # argmax
    probs = softmax([l / T for l in logits])  # temperature reshapes
    u, cum = random.random(), 0.0             # inverse-CDF draw
    for i, p in enumerate(probs):
        cum += p
        if u < cum: return i
    return len(probs) - 1

next_id = generate(logits, T=1.0)
prompt.append(VOCAB[next_id])   # then run the whole pipeline again
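
To see what "temperature reshapes" means, this sketch applies three temperatures to one fixed set of logits. The logits are hypothetical numbers chosen for illustration, not values from the page's model.

```python
from math import exp

def softmax(xs):
    m = max(xs)
    e = [exp(x - m) for x in xs]
    total = sum(e)
    return [v / total for v in e]

logits = [2.0, 1.0, 0.5]          # hypothetical scores for three words

top = {}
for T in (0.5, 1.0, 2.0):
    probs = softmax([l / T for l in logits])
    top[T] = max(probs)           # probability of the favourite word
    print(f"T={T}:", [round(p, 3) for p in probs])
```

Low T sharpens the distribution toward the highest logit, so sampling behaves more like greedy. High T flattens it, so less likely words get picked more often.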

the whole ladder, in one sentence

Attention decides which words to mix. The MLP — the stack of bent lines from Stage 3 — decides what the mix means. Unembedding turns it back into words. It's the same loss-minimising machine as the line you started with, scaled up: not a different idea, just the line, stacked and made data-aware.

This toy is GPT scaled down: one head, one block, identity Q/K/V, no positional encoding, no √d scaling, no layer norm, and weights fit offline by a small script rather than trained at scale. The mechanism is the real one.

Yes — and here's the audit trail.

Every click of Generate runs a full attention → MLP → softmax pass on the current prompt. No cache, no precomputed lookup. Three independent checks back this up:

1. Same constants the page ships with. The weights in index.html were fit by tests/fit_transformer.py (a ~150-line numpy + scipy script) and pasted in as constants. Re-run the fit any time; the script self-asserts the canonical numbers (P(roamed) ≈ 80% etc.) before printing them.

2. JS = Python, to floating-point noise. tests/equivalence.py parses the constants AND the JS pipeline straight out of index.html, runs both implementations on 8 prompts × MLP on/off = 16 cases, and asserts all 784 intermediate values agree to 1e-9. Last run: max divergence 6.4 × 10⁻¹⁴.

3. The rollout you see is the rollout Python computes. tests/generate_rollout.py simulates the exact loop the Generate button performs — same forward pass, same sampling — and prints every step's attention pattern, top probabilities, and sampled token. In greedy mode it must match the browser exactly (no RNG); in sample mode the distributions must match (the RNG differs, but probability percentages are identical).

Run: python3 tests/generate_rollout.py. Greedy from a fluffy blue creature → roamed → the → forest → . Compare to clicking Generate with mode = greedy in the browser.

logit = h₂ · E[word], then softmax(/T) → P

word	logit	P (with T)	P
