
Transformers from first principles


Layers of abstraction

The power of connection

In Chapter 1, we saw that a single neuron is limited. It can only draw a straight line. In Chapter 2, we saw how to organize data into tensors and multiply them efficiently.

Now, we combine these ideas. By connecting multiple neurons into layers, we create a multi-layer perceptron (MLP), a structure capable of approximating almost any function.

The universal function approximator

There is a powerful mathematical result called the universal approximation theorem. It states that a network with at least one hidden layer, a non-linear activation function, and enough hidden neurons can approximate any continuous function as closely as you like.

Visualizing the math

Think of this like digital audio or building with LEGO bricks. A smooth sound wave is actually made of thousands of tiny, discrete "steps" or samples. Similarly, in a neural network, any complex shape (like a circle) can be built by adding up simple straight lines (neurons). If the lines are small enough and numerous enough, the result is indistinguishable from a smooth curve.
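To make the LEGO picture concrete, here is a minimal NumPy sketch (my own illustration, not one of the course demos). It rebuilds a sine wave from ReLU pieces; the choice of curve, the knot positions, and the neuron count are all arbitrary.

```python
import numpy as np

# Approximate a smooth curve (sin) with a sum of ReLU "bricks".
# Each hidden neuron contributes one straight segment starting at its knot.

def relu(z):
    return np.maximum(0.0, z)

x = np.linspace(0.0, 2 * np.pi, 500)
target = np.sin(x)

n_neurons = 20                                   # try 5, 20, 200
knots = np.linspace(0.0, 2 * np.pi, n_neurons)   # where each neuron "bends"
y_at_knots = np.sin(knots)
segment_slopes = np.diff(y_at_knots) / np.diff(knots)

# Start flat at the first knot's height, then let each neuron change the slope.
approx = np.full_like(x, y_at_knots[0])
prev_slope = 0.0
for knot, slope in zip(knots[:-1], segment_slopes):
    approx += (slope - prev_slope) * relu(x - knot)
    prev_slope = slope

print(f"max error with {n_neurons} neurons: {np.max(np.abs(approx - target)):.4f}")
# More neurons -> smaller segments -> the error shrinks toward zero.
```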

The "cut" is a boundary

When we say a neuron makes a "cut," we mean it creates a decision boundary. It's like building a fence. One neuron might say, "Everything to the left of this fence is safe," while another says, "Everything below this fence is safe." If you combine them, you create a safe "corner." Add enough fences, and you can enclose any shape.
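As a tiny illustration of the fence idea (hand-picked weights and thresholds, not taken from the course), two threshold neurons can each guard one fence, and a third can AND them into the safe corner:

```python
import numpy as np

# Neuron A: "everything left of x = 2 is safe"   -> fires when 2 - x > 0
# Neuron B: "everything below y = 1 is safe"     -> fires when 1 - y > 0
# Combining both fences encloses the safe corner x < 2 AND y < 1.

def step(z):
    return (z > 0).astype(float)   # hard threshold, the simplest "cut"

points = np.array([
    [1.0, 0.5],   # inside the corner
    [3.0, 0.5],   # right of fence A
    [1.0, 2.0],   # above fence B
])

fence_a = step(2.0 - points[:, 0])        # 1 if left of x = 2
fence_b = step(1.0 - points[:, 1])        # 1 if below y = 1
corner  = step(fence_a + fence_b - 1.5)   # AND: both fences must agree

print(corner)   # [1. 0. 0.] -> only the first point sits in the safe corner
```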

Experiment below. We are using a hidden layer to approximate an irregular shape (like a potato). Notice how each colored neuron contributes one straight edge to the final shape.

(Interactive: the network output built from the hidden neurons, with the resolution set to 3.)

A single layer can approximate any smooth convex curve by increasing the number of neurons.

With just 3 neurons, we get a triangle. With 4, a square. With 100, we would have a shape so smooth it is indistinguishable from a perfect curve. This is how "straight lines" build "curves."
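A small NumPy sketch of this claim (weights built by hand; the tangent-fence construction is my own choice for illustration): N neurons each cut along one direction, and AND-ing them encloses a regular N-sided polygon that tightens onto the unit circle as N grows.

```python
import numpy as np

# N "fence" neurons, each cutting along one direction tangent to the unit circle.
# AND-ing all fences encloses a regular N-sided polygon around the circle.

def polygon_membership(points, n_neurons):
    angles = np.linspace(0.0, 2 * np.pi, n_neurons, endpoint=False)
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)    # (N, 2)
    fences = (points @ normals.T < 1.0).astype(float)               # one cut per neuron
    # Inside the polygon only if every single fence agrees.
    return (fences.sum(axis=1) > n_neurons - 0.5).astype(float)

# Estimate the enclosed area on a grid: it shrinks toward pi (the unit circle).
xs = np.linspace(-2.0, 2.0, 401)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
cell_area = (xs[1] - xs[0]) ** 2
for n in (3, 4, 100):
    area = polygon_membership(grid, n).sum() * cell_area
    print(f"{n:3d} neurons -> area of {area:.2f}")
# 3 neurons: a triangle (about 5.2), 4: a square (about 4.0),
# 100: nearly the unit circle (about 3.14).
```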

Hierarchical thinking

Why layers? Because intelligence is hierarchical.

Think about how you recognize a face. First, you detect simple edges like lines and curves. Then, you combine those edges into shapes like eyes, noses, and mouths. Finally, you combine those shapes to identify a specific face.

In a neural network, each layer extracts "features" from the previous one. The hidden layers (those between the input and output) are where the network builds its own internal "vocabulary" of the data.

Why do we need layers? (depth vs width)

You might ask: "If one layer of neurons can make any shape, why do we need deep learning? Why not just have one massive layer?"

The answer is composition.

Mathematically, a single layer of ReLU neurons is great at making convex shapes (like the potato above). But what if you want a star or a donut? These shapes are concave or have holes.

To build these, you need multiple steps. Layer 1 acts as the "cutter," creating straight lines. Layer 2 acts as the "builder," combining lines into simple convex shapes like triangles. Layer 3 acts as the "composer," combining those simple shapes to make complex ones, for example, creating a star by learning "triangle A OR triangle B."
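Here is a minimal hand-wired sketch of those three roles (my own toy example, using hard thresholds instead of ReLU; every weight is picked by hand, nothing is learned). The first layer cuts, the second ANDs cuts into two convex bars, and the third ORs the bars into a non-convex plus sign:

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)

def in_plus_sign(x, y):
    # Layer 1 (Cutter): eight axis-aligned cuts (fences).
    cuts = step(np.array([
        x + 1, 1 - x,          # -1 < x < 1   (tall, narrow bar)
        y + 3, 3 - y,          # -3 < y < 3
        x + 3, 3 - x,          # -3 < x < 3   (wide, short bar)
        y + 1, 1 - y,          # -1 < y < 1
    ]))
    # Layer 2 (Builder): AND the relevant cuts into two convex rectangles.
    tall_bar = step(cuts[0] + cuts[1] + cuts[2] + cuts[3] - 3.5)
    wide_bar = step(cuts[4] + cuts[5] + cuts[6] + cuts[7] - 3.5)
    # Layer 3 (Composer): OR the rectangles into a non-convex plus sign.
    return step(tall_bar + wide_bar - 0.5)

print(in_plus_sign(0.0, 2.0))   # 1.0 -> inside the tall bar
print(in_plus_sign(2.0, 0.0))   # 1.0 -> inside the wide bar
print(in_plus_sign(2.0, 2.0))   # 0.0 -> in the corner notch, outside the plus
```

A single layer of these cuts could only carve out one convex region; it is the OR in the third layer that creates the notches.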

Without depth, you would need an astronomically large number of neurons to memorize every possible variation. With depth, the network learns to re-use components (like "edges" or "corners") to build efficient representations.

The math of the forward pass

When data flows through a network, it is just a sequence of matrix operations. But now that we understand the roles of each layer, we can read the equation like a story.

If $\mathbf{x}$ is our input (the raw pixels), the first step is the Cutter Layer:

$$\mathbf{h}_1 = f(\mathbf{W}_1 \cdot \mathbf{x} + \mathbf{b}_1)$$

Where:

  • $\mathbf{h}_1$: The hidden state (the detected features).
  • $\mathbf{W}_1$: The weight matrix (the filter bank).
  • $\mathbf{x}$: The input vector (the raw data).
  • $\mathbf{b}_1$: The bias vector (the threshold).
  • $f$: The activation function (e.g., ReLU).

Here, $\mathbf{W}_1$ is a bank of filters finding edges, and $f$ (ReLU) keeps only the strong signals.
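A tiny numeric example of this layer (all numbers made up) shows the filtering effect: one filter responds strongly and survives the ReLU, the other comes out negative and is zeroed.

```python
import numpy as np

# Made-up numbers: W1 acts as a small filter bank, ReLU keeps only strong signals.
x  = np.array([0.5, -1.0, 2.0])            # raw input
W1 = np.array([[ 1.0, 0.0, 1.0],           # filter 1
               [-1.0, 2.0, 0.0]])          # filter 2
b1 = np.array([-1.0, 0.5])

pre_activation = W1 @ x + b1               # [ 1.5, -2.0]
h1 = np.maximum(0.0, pre_activation)       # ReLU -> [1.5, 0.0]
print(h1)                                  # filter 2's negative response is dropped
```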

The next step is the Builder Layer. It takes the detected edges ($\mathbf{h}_1$) and combines them:

$$\mathbf{h}_2 = f(\mathbf{W}_2 \cdot \mathbf{h}_1 + \mathbf{b}_2)$$

Where:

  • $\mathbf{h}_2$: The second hidden state (combining edges into shapes).
  • $\mathbf{W}_2$: The second layer's weights.
  • $\mathbf{h}_1$: The output from the previous layer.

Finally, the output layer makes the prediction based on the built shapes. The full equation is just these steps nested together:

$$y = \underbrace{f(\mathbf{W}_2 \cdot \underbrace{f(\mathbf{W}_1 \cdot \mathbf{x} + \mathbf{b}_1)}_{\text{edges}} + \mathbf{b}_2)}_{\text{shapes}}$$

Where:

  • $y$: The final output (the prediction).
  • Nested terms: Show how the raw input $\mathbf{x}$ is transformed step-by-step into high-level features.
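To make the nested equation concrete, here is a minimal NumPy forward pass (random weights and arbitrary layer sizes; a sketch, not the course's reference implementation):

```python
import numpy as np

# The two nested steps from the equation above, as code.
rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

n_in, n_hidden, n_out = 4, 8, 1
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)

x = rng.normal(size=n_in)          # raw input
h1 = relu(W1 @ x + b1)             # Cutter Layer: detect "edges"
y = relu(W2 @ h1 + b2)             # second layer: combine edges into the prediction
print(y)
```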

Explore the step-by-step process below.

(Interactive: step through the calculation from the raw input $\mathbf{x}$, through the hidden states, to the final output $y$.)