Part 2 · Chapter 10

Intro to Neural Networks: Baby's First Brain

Perceptrons, activations, forward + backprop, conquering XOR.

Alright, we've built models that can draw lines, ask questions, and find cliques. We've been building with Lego bricks. Now it's time to build the Death Star.

Welcome to Neural Networks.

These are the models that power everything from self-driving cars to art generation to language translation. They look terrifyingly complex from the outside, like a diagram of the entire internet. But here's the secret: a neural network is just layers upon layers of the simple things we've already learned.

Let's demystify the beast and build our own tiny brain.

The Neuron (Perceptron): Not Actually Brain Surgery

The basic building block of a neural network is a neuron. (The classic version, with a hard on/off step activation, is called a perceptron.) And a single neuron is going to look incredibly familiar. With a sigmoid activation, it's basically just a logistic regression unit.

Here's what a neuron does:

It takes one or more inputs (x₁, x₂, ...).
It calculates a weighted sum of those inputs and adds a bias: $z = w_{1} x_{1} + w_{2} x_{2} + \dots + b$ (Sound familiar? It's the linear regression equation.)
It passes this result, z, through an activation function. (Sound familiar? It's what we did in logistic regression.)

That's it. A single neuron is just a simple linear model followed by a non-linear activation.
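
Here's a minimal sketch of those steps in plain Python. The input, weight, and bias values are made up purely for illustration, and sigmoid is just one possible choice of activation.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, then an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical values, chosen only so the arithmetic is easy to follow.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

output = neuron(x, w, b)  # z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
print(output)             # sigmoid(0.0) = 0.5
```

Swap `sigmoid` for a different activation and you have a different kind of neuron; the weighted-sum part never changes.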

Activation Functions: The Neuron's Mood Ring

The activation function is what gives the network its power. It decides whether the neuron "fires" and what signal it sends to the next layer. Without it, stacking layers buys you nothing: a linear function of a linear function is still linear, so the whole network would collapse into one massive linear regression model.
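
To see why activation-free layers are pointless, here's a tiny sketch with one-input, one-output "layers". The weights are hypothetical; the point is only that two stacked linear layers always equal some single linear layer.

```python
def two_linear_layers(x, w1, b1, w2, b2):
    """Two stacked layers with no activation in between."""
    hidden = w1 * x + b1
    return w2 * hidden + b2

def one_linear_layer(x, w, b):
    """A single plain linear model."""
    return w * x + b

# Hypothetical weights for the two-layer version.
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5

# Expand w2*(w1*x + b1) + b2 to get the equivalent single layer.
w = w2 * w1        # combined weight: -6.0
b = w2 * b1 + b2   # combined bias:   -2.5

for x in [-1.0, 0.0, 2.0]:
    print(x, two_linear_layers(x, w1, b1, w2, b2), one_linear_layer(x, w, b))
```

Both columns match for every input. Put a non-linear activation between the layers and that algebraic shortcut disappears.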

Sigmoid: The OG. Squishes any number into the (0, 1) range. Great for the final output layer in binary classification. Problem: the curve flattens out for large positive or negative inputs, so gradients vanish.
Tanh (Hyperbolic Tangent): Sigmoid's cooler, zero-centered sibling. Squishes numbers into (-1, 1). Better for hidden layers, but it saturates the same way and still has vanishing gradients.
ReLU (Rectified Linear Unit): The go-to default for hidden layers in modern deep learning. $f(x) = \max(0, x)$. If positive → pass through; if negative → zero. It's fast to compute, and the gradient stays alive for positive values.
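
A quick sketch of the three activations and their derivatives, using only the standard library. The helper names (`sigmoid_grad` and friends) are my own, not from any library. Printing the gradients at a few values of z makes the vanishing-gradient problem visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention assumed here: the gradient at exactly z = 0 is treated as 0.
    return 1.0 if z > 0 else 0.0

for z in [-10.0, 0.0, 10.0]:
    print(f"z={z:6.1f}  sigmoid'={sigmoid_grad(z):.2e}  "
          f"tanh'={tanh_grad(z):.2e}  relu'={relu_grad(z):.0f}")
```

Sigmoid's gradient peaks at 0.25 (at z = 0) and shrinks toward zero on both sides, and tanh does the same from a peak of 1. ReLU's gradient is exactly 1 for any positive input, no matter how large.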

The Network: Stacking Layers of Neurons

A single neuron is dumb. But what happens when we connect them? We get a network.

Input Layer: Receives the raw data.
Hidden Layers: Intermediate layers where the real "thinking" happens.
Output Layer: Produces the final prediction.

This layering is what allows the network to learn hierarchical patterns. In an image model, for example, the first layer might learn to recognize simple edges. The next layer might combine those edges to recognize shapes like eyes and noses. The final layer might combine those shapes to recognize a face.

Forward & Backpropagation: The Matrix Moment

How does a network actually learn? It's a two-step dance.

1. Forward Propagation (The Easy Part)

This is just the process of making a prediction. You feed your input data into the first layer. The neurons do their thing (weighted sum + activation) and pass their outputs to the next layer. This continues layer by layer until you get a final output. It's one giant, nested function call.
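
As a rough sketch, a forward pass is just a loop over layers. The 2-2-1 layout, the weights, and the `dense` / `forward` helper names below are all hypothetical, picked to keep the numbers easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One layer: each row of `weights` holds the weights of one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, layers):
    """Feed x through every layer; each layer's output is the next one's input."""
    activations = x
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations

# Hypothetical 2-2-1 network: two ReLU hidden neurons, one sigmoid output.
layers = [
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]

prediction = forward([1.0, 1.0], layers)
print(prediction)  # hidden layer gives [2.0, 0.0], so the output is sigmoid(2.0)
```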

2. Backpropagation (The "Magic")

After the forward pass, you have a prediction. You compare it to the true label using a loss function (like MSE or Binary Cross-Entropy) to get a single error number.
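
Both losses fit in a few lines. This is a plain-Python sketch; the `eps` clipping in the cross-entropy is a common safeguard against `log(0)` that I've assumed here, and the labels and predictions are made up.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predictions.
y_true = [1, 0]
y_pred = [0.9, 0.2]

print(mse(y_true, y_pred))                   # (0.1**2 + 0.2**2) / 2, about 0.025
print(binary_cross_entropy(y_true, y_pred))  # -(ln 0.9 + ln 0.8) / 2
```

Either way, the whole batch of predictions boils down to one number, and that number is what backpropagation will try to shrink.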

Now, we need to figure out how much each of the thousands (or millions) of weights in the network contributed to the error. Backpropagation is the algorithm that does this. It's just the chain rule from calculus, applied cleverly and efficiently.

It starts at the end, with the loss. It calculates the gradient of the loss with respect to the weights in the last layer.
Then, it moves one layer back. It uses the gradients it just calculated to figure out the gradients for the weights in the second-to-last layer.
It continues this process, propagating the error signal backwards through the network, layer by layer, until it has calculated the gradient (the "blame") for every single weight and bias.
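
The steps above can be sketched on the smallest possible network: one input, one sigmoid hidden neuron, one sigmoid output, and a squared-error loss. All parameter values and names here are hypothetical. A finite-difference check (nudge a weight, see how the loss moves) confirms that the chain-rule gradients are right.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, p):
    a1 = sigmoid(p["w1"] * x + p["b1"])   # hidden neuron
    a2 = sigmoid(p["w2"] * a1 + p["b2"])  # output neuron
    return a1, a2

def loss_fn(x, y, p):
    _, a2 = forward(x, p)
    return 0.5 * (a2 - y) ** 2

def backward(x, y, p):
    a1, a2 = forward(x, p)
    # Start at the end: gradient of the loss w.r.t. the output pre-activation.
    dz2 = (a2 - y) * a2 * (1.0 - a2)
    grads = {"w2": dz2 * a1, "b2": dz2}
    # Move one layer back, reusing dz2 instead of recomputing anything.
    da1 = dz2 * p["w2"]
    dz1 = da1 * a1 * (1.0 - a1)
    grads["w1"] = dz1 * x
    grads["b1"] = dz1
    return grads

def numerical_grad(x, y, p, name, h=1e-6):
    """Central finite difference, used only to sanity-check backward()."""
    up, down = dict(p), dict(p)
    up[name] += h
    down[name] -= h
    return (loss_fn(x, y, up) - loss_fn(x, y, down)) / (2 * h)

params = {"w1": 0.5, "b1": -0.1, "w2": -0.8, "b2": 0.3}
x, y = 1.5, 1.0
grads = backward(x, y, params)

for name in params:
    print(name, grads[name], numerical_grad(x, y, params, name))
```

The two columns should agree to several decimal places. The reuse of `dz2` when computing the first layer's gradients is the "applied efficiently" part of backprop.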

Once you have all the gradients, you just use our old friend Gradient Descent to update every weight and bias:

$\text{weight} = \text{weight} - \text{learning\_rate} \times \text{gradient}$

That's the entire training loop: Forward Pass → Calculate Loss → Backward Pass (Backpropagation) → Update Weights. Repeat for thousands of rounds (say, 10,000).
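
Here's a sketch of that loop on the simplest case I could pick: a single sigmoid neuron learning AND, which (unlike XOR) a lone neuron can handle. The dataset, the 2,000 epochs, and the 0.5 learning rate are my own assumptions for the demo.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0.0, 0.0, 0.0, 1.0]  # AND truth table

w = [0.0, 0.0]
b = 0.0
learning_rate = 0.5
losses = []

for epoch in range(2000):
    # 1. Forward pass
    preds = [sigmoid(w[0] * x1 + w[1] * x2 + b) for x1, x2 in X]

    # 2. Calculate loss (binary cross-entropy, averaged over the 4 points)
    loss = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y, preds)) / len(X)
    losses.append(loss)

    # 3. Backward pass: for sigmoid + cross-entropy, dLoss/dz is simply (p - t)
    errors = [p - t for p, t in zip(preds, y)]
    grad_w = [sum(e * x[i] for e, x in zip(errors, X)) / len(X) for i in (0, 1)]
    grad_b = sum(errors) / len(X)

    # 4. Update weights
    w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
    b = b - learning_rate * grad_b

final_preds = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for x1, x2 in X]
print(losses[0], losses[-1])
print(final_preds)
```

With only one neuron, the "backward pass" is a single line. A deeper network runs the same four steps; step 3 just has more layers to walk back through.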

From-Scratch XOR Solver: The "Aha!" Moment

Remember the XOR problem from Chapter 5 that our linear logistic regression model couldn't solve? A single straight line can't separate the XOR data.

But a neural network can. By using a hidden layer, the network can learn to draw more complex, non-linear decision boundaries. The first layer might learn two different lines, and the output layer can learn to combine the results of those lines in a non-linear way (like an AND or OR operation) to create a boundary that perfectly solves XOR.
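
To make the "two lines, then combine" idea concrete, here's a hand-wired sketch. Nothing is learned: the weights are picked by hand, and a hard step activation is assumed purely to keep the logic readable.

```python
def step(z):
    """Hard threshold: fire (1) if z is positive, otherwise stay silent (0)."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden neuron 1 draws the line x1 + x2 = 0.5 (acts like OR).
    h_or = step(x1 + x2 - 0.5)
    # Hidden neuron 2 draws the line x1 + x2 = 1.5 (acts like AND).
    h_and = step(x1 + x2 - 1.5)
    # Output neuron: fire when OR is on but AND is off.
    return step(h_or - h_and - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))
```

The output comes out as 0, 1, 1, 0. Three neurons and two straight lines are enough; the job of training is to find weights like these automatically.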

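This is the superpower of neural networks. They are universal function approximators, which is a fancy way of saying that a network with at least one hidden layer, a non-linear activation, and enough neurons can, in theory, approximate any continuous function on a bounded range of inputs as closely as you like. (The theorem only promises that suitable weights exist, not that training will find them.) This is why they are so powerful and versatile.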

Let's prove it. We'll build a tiny neural network from scratch to finally conquer the XOR problem.

Interactive demo (learning rate 0.50): watch a 2-4-1 ReLU + sigmoid net carve XOR out of nothing, the same XOR our linear model choked on in Ch. 5.

xor_nn.py
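
What follows is a minimal from-scratch sketch of such a solver, not a definitive listing. It assumes the 2-4-1 ReLU + sigmoid layout and 0.5 learning rate from the demo; the initialization range, epoch count, loss threshold, and seed-retry loop are my own choices. The retry loop is there because a ReLU net this tiny can get stuck with "dead" neurons on an unlucky start.

```python
import math
import random

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [0.0, 1.0, 1.0, 0.0]  # XOR truth table
HIDDEN = 4

def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, W1, b1, W2, b2):
    """2 inputs -> HIDDEN ReLU neurons -> 1 sigmoid output."""
    z1 = [W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j] for j in range(HIDDEN)]
    a1 = [max(0.0, z) for z in z1]
    p = sigmoid(sum(W2[j] * a1[j] for j in range(HIDDEN)) + b2)
    return z1, a1, p

def train(seed, epochs=5000, lr=0.5):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(HIDDEN)]
    b1 = [0.0] * HIDDEN
    W2 = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
    b2 = 0.0
    n = len(X)
    losses = []
    for _ in range(epochs):
        gW1 = [[0.0, 0.0] for _ in range(HIDDEN)]
        gb1 = [0.0] * HIDDEN
        gW2 = [0.0] * HIDDEN
        gb2 = 0.0
        loss = 0.0
        for x, y in zip(X, Y):
            # Forward pass and binary cross-entropy loss
            z1, a1, p = forward(x, W1, b1, W2, b2)
            pc = min(max(p, 1e-12), 1.0 - 1e-12)
            loss += -(y * math.log(pc) + (1 - y) * math.log(1 - pc)) / n
            # Backward pass: output layer first, then the hidden layer
            dz2 = (p - y) / n
            gb2 += dz2
            for j in range(HIDDEN):
                gW2[j] += dz2 * a1[j]
                dz1 = dz2 * W2[j] * (1.0 if z1[j] > 0 else 0.0)  # ReLU gradient
                gW1[j][0] += dz1 * x[0]
                gW1[j][1] += dz1 * x[1]
                gb1[j] += dz1
        losses.append(loss)
        # Gradient descent update
        for j in range(HIDDEN):
            W1[j][0] -= lr * gW1[j][0]
            W1[j][1] -= lr * gW1[j][1]
            b1[j] -= lr * gb1[j]
            W2[j] -= lr * gW2[j]
        b2 -= lr * gb2
    preds = [forward(x, W1, b1, W2, b2)[2] for x in X]
    return preds, losses

# Retry with a new seed if a run gets stuck (loss threshold is an assumption).
for seed in range(50):
    preds, losses = train(seed)
    if losses[-1] < 0.05:
        break

print(f"seed={seed}  first loss={losses[0]:.4f}  final loss={losses[-1]:.4f}")
for x, p in zip(X, preds):
    print(x, "->", round(p), f"({p:.3f})")
```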

When you run this code, you'll see the loss drop and the model learn to correctly classify all four XOR points. It's the moment you realize that by stacking simple components, you can create something with truly powerful capabilities.