An intuitive walk through neural networks — building on what you already know about linear regression.
In linear regression, you have:

- a model: ŷ = wx + b
- two parameters to learn: the weight w and the bias b
- a cost: how far the line's predictions are from the data
Conceptually, you consider every possible line (every configuration of w and b). For each line you measure how far its predictions are from the actual points; that total error is the cost. In practice you don't try them all: you take the derivative of the cost with respect to w and b and step downhill until it can't get any smaller. That's your best line.
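Here's a minimal sketch of that loop for a plain line. The toy data (y = 2x + 1), learning rate, and iteration count are made up purely for illustration:

```python
import numpy as np

# Toy data that really is a line (y = 2x + 1); values are illustrative only
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

w, b = 0.0, 0.0   # an arbitrary starting line
lr = 0.05         # learning rate (step size)

for _ in range(5000):
    y_hat = w * X + b            # predictions of the current line
    error = y_hat - Y
    dw = np.mean(error * X)      # derivative of mean ½(ŷ−y)² w.r.t. w
    db = np.mean(error)          # ... and w.r.t. b
    w -= lr * dw                 # step downhill
    b -= lr * db

print(f"w={w:.3f}, b={b:.3f}")   # ends up close to 2.0 and 1.0
```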
A neural network does exactly the same thing, but instead of one line, it chains together multiple transformations — lines passed through bending functions (activations) — so it can fit curves, not just straight lines.
We'll use the simplest possible neural network that's more than linear regression: 1 input → 1 hidden neuron → 1 output.
We want to learn a function from this tiny dataset:
| x (input) | y (true output) |
|---|---|
| 1.0 | 0.0 |
| 2.0 | 1.0 |
| 3.0 | 1.0 |
| 4.0 | 0.0 |
This is not a straight line — it goes up then back down. A single linear regression can't fit this. But our tiny network can, because the hidden neuron adds a "bend."
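A quick way to convince yourself: fit the best possible straight line with NumPy's least-squares routine and look at the loss it is stuck with. This snippet is just a sanity check, not part of the network:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([0.0, 1.0, 1.0, 0.0])

# Best straight line in the least-squares sense
w, b = np.polyfit(X, Y, 1)
y_hat = w * X + b

print(f"best line: w={w:.3f}, b={b:.3f}")               # ≈ 0.000 and 0.500: the "best" line is flat
print(f"loss: {np.mean(0.5 * (y_hat - Y) ** 2):.3f}")   # ≈ 0.125, and no line can do better
```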
These are our starting values: w₁ = 0.5, b₁ = −0.1, w₂ = 0.3, b₂ = 0.2. Just like picking a random initial line in linear regression, we'll improve them via gradient descent.
The forward pass is just plugging numbers in — left to right — to get a prediction. Like evaluating ŷ = wx + b, but in two stages with a "bend" in between.
z₁ = w₁x + b₁: same as linear regression, multiply by a weight and add a bias.
The sigmoid, a₁ = σ(z₁) = 1 / (1 + e^(−z₁)), squashes the value into (0, 1). This nonlinearity is the key difference from linear regression: it's what lets the network learn curves.
Our prediction for x=1.0 is 0.3796. The true y is 0.0. That's wrong — which is expected with random weights!
| x | z₁ | a₁ | ŷ | true y |
|---|---|---|---|---|
| 1.0 | 0.40 | 0.5987 | 0.3796 | 0.0 |
| 2.0 | 0.90 | 0.7109 | 0.4133 | 1.0 |
| 3.0 | 1.40 | 0.8022 | 0.4407 | 1.0 |
| 4.0 | 1.90 | 0.8699 | 0.4610 | 0.0 |
```python
import numpy as np

# Parameters
w1, b1 = 0.5, -0.1
w2, b2 = 0.3, 0.2

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x):
    z1 = w1 * x + b1       # Layer 1 linear
    a1 = sigmoid(z1)       # Layer 1 activation
    y_hat = w2 * a1 + b2   # Layer 2 (output)
    return z1, a1, y_hat

# Test with x = 1.0
z1, a1, y_hat = forward(1.0)
print(f"z1={z1:.4f}, a1={a1:.4f}, ŷ={y_hat:.4f}")
# → z1=0.4000, a1=0.5987, ŷ=0.3796
```
Exactly the same idea as linear regression. Measure how wrong we are:
| x | ŷ | y | error (ŷ−y) | loss ½(ŷ−y)² |
|---|---|---|---|---|
| 1.0 | 0.3796 | 0.0 | +0.3796 | 0.0720 |
| 2.0 | 0.4133 | 1.0 | −0.5867 | 0.1721 |
| 3.0 | 0.4407 | 1.0 | −0.5593 | 0.1564 |
| 4.0 | 0.4610 | 0.0 | +0.4610 | 0.1063 |
```python
def loss(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([0.0, 1.0, 1.0, 0.0])

losses = []
for x, y in zip(X, Y):
    _, _, y_hat = forward(x)
    losses.append(loss(y_hat, y))

total_loss = np.mean(losses)
print(f"Total loss: {total_loss:.4f}")
# → Total loss: 0.1267
```
Here's the key insight: backprop is just the chain rule from calculus applied backwards through the network. We want to know: if I wiggle each weight a tiny bit, how much does the loss change?
Our computation graph: x → (× w₁ + b₁) → z₁ → σ → a₁ → (× w₂ + b₂) → ŷ → ½(ŷ − y)² → L
We go right to left, applying the chain rule one step at a time:

- ∂L/∂ŷ = ŷ − y
- ∂L/∂w₂ = ∂L/∂ŷ · a₁ and ∂L/∂b₂ = ∂L/∂ŷ
- ∂L/∂a₁ = ∂L/∂ŷ · w₂
- ∂a₁/∂z₁ = σ′(z₁) = a₁(1 − a₁), so ∂L/∂z₁ = ∂L/∂a₁ · a₁(1 − a₁)
- ∂L/∂w₁ = ∂L/∂z₁ · x and ∂L/∂b₁ = ∂L/∂z₁
```python
def backward(x, y, z1, a1, y_hat):
    # Output layer gradients
    dL_dyhat = y_hat - y        # ∂L/∂ŷ
    dL_dw2 = dL_dyhat * a1      # ∂L/∂w₂
    dL_db2 = dL_dyhat           # ∂L/∂b₂

    # Hidden layer gradients (chain rule!)
    dL_da1 = dL_dyhat * w2      # ∂L/∂a₁
    da1_dz1 = a1 * (1 - a1)     # σ'(z₁)
    dL_dz1 = dL_da1 * da1_dz1   # ∂L/∂z₁
    dL_dw1 = dL_dz1 * x         # ∂L/∂w₁
    dL_db1 = dL_dz1             # ∂L/∂b₁

    return dL_dw1, dL_db1, dL_dw2, dL_db2

# For x=1.0, y=0.0
z1, a1, y_hat = forward(1.0)
grads = backward(1.0, 0.0, z1, a1, y_hat)
print(f"dw1={grads[0]:.4f}, db1={grads[1]:.4f}")
print(f"dw2={grads[2]:.4f}, db2={grads[3]:.4f}")
# → dw1=0.0274, db1=0.0274, dw2=0.2273, db2=0.3796
```
Now we use those gradients to nudge each weight in the direction that reduces the loss. Identical to linear regression.
First, we compute gradients for all data points and average them:
| x | ∂L/∂w₁ | ∂L/∂b₁ | ∂L/∂w₂ | ∂L/∂b₂ |
|---|---|---|---|---|
| 1.0 | +0.0274 | +0.0274 | +0.2273 | +0.3796 |
| 2.0 | −0.0723 | −0.0362 | −0.4171 | −0.5867 |
| 3.0 | −0.0799 | −0.0266 | −0.4487 | −0.5593 |
| 4.0 | +0.0626 | +0.0157 | +0.4010 | +0.4610 |
| AVG | −0.0156 | −0.0049 | −0.0594 | −0.0764 |
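Then each parameter takes one small step against its averaged gradient. A minimal sketch of that single update, using the learning rate of 0.5 from the full script below:

```python
lr = 0.5  # learning rate (same value as in the full script below)

# Averaged gradients from the table above
avg_gw1, avg_gb1 = -0.0156, -0.0049
avg_gw2, avg_gb2 = -0.0594, -0.0764

# Nudge each parameter against its gradient: new = old − lr · gradient
w1 -= lr * avg_gw1   # 0.5  → ≈ 0.5078
b1 -= lr * avg_gb1   # -0.1 → ≈ -0.0976
w2 -= lr * avg_gw2   # 0.3  → ≈ 0.3297
b2 -= lr * avg_gb2   # 0.2  → ≈ 0.2382
```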
After this single update the loss decreases slightly. Repeat it for a couple of thousand epochs and the loss keeps dropping until it hits the limit of what this architecture can express (more on that limit below).
Here is the complete training loop — everything from sections 2–5 combined into runnable Python:
```python
import numpy as np

# ── Data ──
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([0.0, 1.0, 1.0, 0.0])

# ── Initialize weights (random) ──
w1, b1 = 0.5, -0.1
w2, b2 = 0.3, 0.2
lr = 0.5  # learning rate

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ── Training Loop ──
for epoch in range(2000):
    # Accumulate gradients over all data points
    gw1, gb1, gw2, gb2 = 0, 0, 0, 0
    total_loss = 0

    for x, y in zip(X, Y):
        # ── FORWARD PASS ──
        z1 = w1 * x + b1
        a1 = sigmoid(z1)
        y_hat = w2 * a1 + b2

        # ── LOSS ──
        total_loss += 0.5 * (y_hat - y) ** 2

        # ── BACKPROPAGATION ──
        dL_dyhat = y_hat - y
        gw2 += dL_dyhat * a1
        gb2 += dL_dyhat
        dL_da1 = dL_dyhat * w2
        dL_dz1 = dL_da1 * a1 * (1 - a1)
        gw1 += dL_dz1 * x
        gb1 += dL_dz1

    # ── GRADIENT DESCENT (average grads) ──
    n = len(X)
    w1 -= lr * gw1 / n
    b1 -= lr * gb1 / n
    w2 -= lr * gw2 / n
    b2 -= lr * gb2 / n

    if epoch % 500 == 0:
        print(f"Epoch {epoch:4d} | Loss: {total_loss/n:.4f}")

# ── Final Predictions ──
print("\nFinal predictions:")
for x, y in zip(X, Y):
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    y_hat = w2 * a1 + b2
    print(f"  x={x:.1f}  true={y:.1f}  pred={y_hat:.3f}")
```
Drag the sliders to change the weights and see how the network's prediction and loss change in real time. Then hit Train to watch gradient descent find the best weights.
If you train this network for thousands of epochs, the loss never reaches zero. It stalls around 0.086. That's not a training bug — it's a capacity limit. With only 1 hidden neuron, this network physically cannot represent our target function.
Look at our training data again: as x goes 1 → 2 → 3 → 4, y goes 0 → 1 → 1 → 0. It rises, then falls.
That fall matters. As x grows, y doesn't keep moving in one direction; it reverses. This is the 1D version of the famous XOR problem.
Our forward pass is ŷ = w₂ · σ(w₁x + b₁) + b₂.
The sigmoid σ is monotonic — it only ever goes up as its input grows. Multiplying by w₂ and adding b₂ is just a linear transform, which preserves monotonicity (or flips it if w₂ < 0, but it stays monotonic). So ŷ as a function of x is monotonic too. It can rise smoothly, or fall smoothly — but it can't rise and then fall.
No matter how long you train, no setting of w₁, b₁, w₂, b₂ produces a bump. The best monotonic curve the network can find settles at roughly 0.66 for x ≥ 2, splitting the difference between the 1s at x = 2, 3 and the 0 at x = 4, and simply absorbs the error there.
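If you want to convince yourself empirically, here's a quick check (the weight ranges and sample count are arbitrary): sample thousands of random settings of the four parameters and verify that the resulting curve never rises and then falls.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 200)

# Sample random settings of (w1, b1, w2, b2); every resulting curve is monotonic
for _ in range(10_000):
    w1, b1, w2, b2 = rng.normal(0, 3, size=4)
    y_hat = w2 * sigmoid(w1 * x + b1) + b2
    d = np.diff(y_hat)
    assert np.all(d >= -1e-12) or np.all(d <= 1e-12)   # never up-then-down

print("10,000 random 1-hidden-neuron networks: all monotonic in x")
```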
With two hidden neurons, we can build a bump: the output layer adds two sigmoids together, and if one gets a positive weight and the other a negative weight, the later sigmoid's rise cancels the earlier one's.
Two monotonic curves added together don't have to be monotonic. That's it. That's the whole reason hidden layers need width.
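Here's a tiny illustration with hand-picked, purely illustrative weights: one sigmoid switches on early, the other late, and subtracting them gives a bump.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0, 4.0])

# Two monotonic hidden units (hand-picked, illustrative weights)
a1 = sigmoid(4 * (x - 1.5))   # switches on around x ≈ 1.5
a2 = sigmoid(4 * (x - 3.5))   # switches on around x ≈ 3.5

# Output layer: +1 · (early sigmoid) − 1 · (late sigmoid): rises, then falls
y_hat = 1.0 * a1 - 1.0 * a2

print(np.round(y_hat, 2))   # ≈ [0.12 0.88 0.88 0.12], already close to [0, 1, 1, 0]
```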
Left: 1-neuron loss plateaus immediately while 2-neuron loss drives all the way to ~10⁻²⁹. Right: the 1-neuron model (red) saturates and misses x=4, while the 2-neuron model (green) bends back down through every point.
nn_from_scratch.py hand-picks initial weights that already roughly form a bump. In real networks you fix this with proper initialization schemes (Xavier/He), wider layers (so some neurons land in useful regions by luck), and momentum/Adam optimizers to escape flat regions.
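For a feel of what those schemes look like in code, here is a minimal NumPy sketch of Xavier and He initialization. The 1 → 2 → 1 shapes and function names are just an example, not taken from nn_from_scratch.py:

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_init(n_in, n_out):
    # Xavier/Glorot: keeps activation variance roughly constant layer to layer
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    # He: variance 2/n_in, the usual choice in front of ReLU activations
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W1 = xavier_init(1, 2)   # 1 input  → 2 hidden neurons
W2 = xavier_init(2, 1)   # 2 hidden → 1 output
print(W1, W2)
```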
Run python3 nn_from_scratch.py in this directory to reproduce both runs and the plot above.