From scratch to 97%
Desktop recommended
This chapter contains interactive visualizations and code explorers that work best on a larger screen.
We have covered the individual parts: the neuron, the matrix, the layers, and the "blame game" of backpropagation. Now, we are going to put them all together and build something real.
You are about to build a neural network using WebGPU compute shaders, the same technology that powers modern AI training.
How this works
Below you will find a series of code cells, similar to a Jupyter notebook. Each cell builds on the previous one:
- Run cells in order: Click the Run button on each cell, starting from the first.
- Watch the output: Each cell shows what happened, from tensor values to computation results.
- Explore the shaders: Click the "GPU" tabs to see the WGSL shader code that runs on the graphics card (a minimal sketch of one such shader follows this list).
- Train the network: The final cell lets you train on 300 handwritten digits and watch the loss converge.
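To give a feel for what's behind those GPU tabs, here is a minimal sketch of a compute kernel: an illustrative elementwise-add shader, not the exact WGSL from the cells below, written as a JavaScript template literal and compiled into a pipeline. It assumes a WebGPU `device`, which the cells acquire for you.

```js
// Hypothetical example: an elementwise-add kernel in WGSL, launching
// one GPU thread per tensor element. The cells below use the same
// pattern for matrix multiplies, activations, and gradients.
const shaderCode = /* wgsl */ `
  @group(0) @binding(0) var<storage, read>       a   : array<f32>;
  @group(0) @binding(1) var<storage, read>       b   : array<f32>;
  @group(0) @binding(2) var<storage, read_write> out : array<f32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id : vec3<u32>) {
    let i = id.x;
    if (i < arrayLength(&out)) {  // guard: the grid may overshoot the data
      out[i] = a[i] + b[i];
    }
  }
`;

// Compile the WGSL into a compute pipeline (assumes a WebGPU `device`):
const module = device.createShaderModule({ code: shaderCode });
const pipeline = device.createComputePipeline({
  layout: "auto",
  compute: { module, entryPoint: "main" },
});
```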
By the end, you will have trained a neural network to recognize handwritten digits.
Interactive code explorer
Browse the complete neural network implementation below. Use ⌘ + click to jump to function definitions, or ⌥ + click to find all references.
Use the Next button to step through the execution trace and see how data flows through the network during training.
```js
// ────────────────────────────────────────────────────────
// GPU TENSORS: Data lives in Video RAM (VRAM), not RAM
// ────────────────────────────────────────────────────────
//
// Why separate memory? GPUs are optimized for massive
// parallelism, not for talking to the CPU. The data pathway
// between CPU and GPU (PCIe bus) is relatively slow.
// So we want to:
//   1. Copy data to GPU once
//   2. Do ALL our compute work there
//   3. Only copy results back when absolutely necessary

// ────────────────────────────────────────────────────────
// CREATE A TENSOR FROM A JAVASCRIPT ARRAY
// This is where the CPU → GPU copy happens
// ────────────────────────────────────────────────────────
function createTensor(device, data, shape) {
  // How much GPU memory do we need?
  // Each number is a 32-bit float = 4 bytes
  const byteSize = data.length * 4;

  // ──────────────────────────────────────────────────────
  // ALLOCATE GPU MEMORY
  // ──────────────────────────────────────────────────────
  // This is like malloc() but on the GPU. We specify:
  //
  // size: How many bytes to allocate
  //
  // usage: What operations we'll perform on this buffer
  //   - STORAGE:  Can be read/written by compute shaders
  //   - COPY_SRC: Can copy data FROM this buffer (GPU→CPU)
  //   - COPY_DST: Can copy data TO this buffer (CPU→GPU)
  //
  // mappedAtCreation: Start with buffer "mapped" to CPU.
  // The driver sets up a memory region both can access.
  // While mapped, we can write to it like a regular array.
  // The driver handles the actual transfer when we unmap.
  const buffer = device.createBuffer({
    size: byteSize,
    usage: GPUBufferUsage.STORAGE |
           GPUBufferUsage.COPY_SRC |
           GPUBufferUsage.COPY_DST,
    mappedAtCreation: true,
  });

  // ──────────────────────────────────────────────────────
  // WRITE DATA TO GPU MEMORY
  // ──────────────────────────────────────────────────────
  // getMappedRange() returns an ArrayBuffer backed by GPU
  // memory. We wrap it in a Float32Array to write 32-bit
  // floats. This is a direct memory write - straight to VRAM!
  const gpuMemory = buffer.getMappedRange();
  const gpuArray = new Float32Array(gpuMemory);
  gpuArray.set(data);

  // ──────────────────────────────────────────────────────
  // UNMAP THE BUFFER
  // ──────────────────────────────────────────────────────
  // CRITICAL: We MUST unmap before the GPU can use this!
  // While mapped, the buffer is "owned" by the CPU.
  // Unmapping transfers ownership back to the GPU.
  // After this, gpuArray becomes invalid.
  buffer.unmap();

  return { device, buffer, shape };
}

// ────────────────────────────────────────────────────────
// READ TENSOR DATA BACK TO CPU
// This is the reverse: GPU → CPU copy
// ────────────────────────────────────────────────────────
async function readTensor(tensor) {
  const { device, buffer, shape } = tensor;
  const size = shape.reduce((a, b) => a * b, 1);

  // ──────────────────────────────────────────────────────
  // CREATE A STAGING BUFFER
  // ──────────────────────────────────────────────────────
  // We can't map the original buffer directly (it's busy).
  // Instead, we create a temporary "staging" buffer with
  // MAP_READ usage, copy our data into it, then map THAT.
  const stagingBuffer = device.createBuffer({
    size: size * 4,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });

  // Copy GPU → staging buffer
  const cmd = device.createCommandEncoder();
  cmd.copyBufferToBuffer(buffer, 0, stagingBuffer, 0, size * 4);
  device.queue.submit([cmd.finish()]);

  // Map and read
  await stagingBuffer.mapAsync(GPUMapMode.READ);
  const copyArray = new Float32Array(stagingBuffer.getMappedRange());
  const result = new Float32Array(copyArray);
  stagingBuffer.unmap();

  return result;
}

// ────────────────────────────────────────────────────────
// TEST: Create a 2×3 tensor
// ────────────────────────────────────────────────────────
const data = new Float32Array([1, 2, 3, 4, 5, 6]);
const tensor = createTensor(device, data, [2, 3]);

// The data is now on the GPU!
//
// IMPORTANT: The GPU buffer is just a flat array of numbers.
// The "shape" is metadata WE keep track of - it tells us
// how to interpret the flat data.
//
// shape [2, 3] means "read this as 2 rows, 3 columns":
//   Buffer: [1, 2, 3, 4, 5, 6]
//           └──row 0──┘└──row 1──┘
//   Row 0: [1, 2, 3]
//   Row 1: [4, 5, 6]
//
// To find element [row, col] in a matrix with C columns:
//   index = row * C + col
// Example: element [1, 2] = buffer[1 * 3 + 2] = buffer[5] = 6

console.log("Created tensor with shape:", tensor.shape);
```
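Putting the two functions together, here is a minimal round-trip sketch. The cell above assumes `device` already exists; in a standalone page you would acquire it yourself, as shown here (hypothetical usage, not part of the cell):

```js
// Minimal round trip: CPU → GPU → CPU. Assumes a browser with WebGPU
// support; `device` is acquired once and reused for every tensor.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

const t = createTensor(device, new Float32Array([1, 2, 3, 4, 5, 6]), [2, 3]);
const back = await readTensor(t);   // GPU → CPU copy
console.log(Array.from(back));      // [1, 2, 3, 4, 5, 6]

// Flat-index arithmetic from above: element [row, col] = row * C + col
const C = t.shape[1];               // 3 columns
console.log(back[1 * C + 2]);       // 6  (element [1, 2])
```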
Neural network training
Step through gradient descent
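As a reference for what each step of the walkthrough performs, here is a minimal CPU-side sketch of one gradient-descent update. It is illustrative only; in the actual network, the same update runs as a GPU compute pass over all weights at once.

```js
// One gradient-descent step: nudge each weight against its gradient.
//   w ← w − η · ∂L/∂w
// Illustrative CPU version of the update the trainer runs in a shader.
function gradientStep(weights, grads, learningRate = 0.01) {
  for (let i = 0; i < weights.length; i++) {
    weights[i] -= learningRate * grads[i];
  }
}

// Example: a weight with a positive gradient gets pushed down,
// one with a negative gradient gets pushed up.
const w = [0.5, -0.2];
const g = [0.3, -0.1];
gradientStep(w, g, 0.1);
console.log(w); // ≈ [0.47, -0.19]
```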