The shape of data
From scalars to structures
In the previous chapter, we dealt with single numbers (scalars). We had a weight $w$, an input $x$, and a bias $b$. But the world is rarely one-dimensional.
An image is a grid of pixels. A sentence is a sequence of words. To handle this complexity, we need to organize our numbers into structures. In AI, these structures are called tensors.
- Scalar (0D tensor): A single number, e.g., 1.0.
- Vector (1D tensor): A list of numbers, e.g., [1.0, 2.0] (the coordinates of a point).
- Matrix (2D tensor): A grid of numbers (e.g., a grayscale image).
- Tensor (nD tensor): A generic term for any dimension (e.g., a batch of RGB images is 4D: batch × height × width × color).
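Here is a minimal sketch of how each of these structures looks in code. PyTorch is used purely as an example (NumPy or TensorFlow would look almost identical), and the sizes are arbitrary:

```python
import torch

scalar = torch.tensor(1.0)             # 0D tensor: a single number
vector = torch.tensor([1.0, 2.0])      # 1D tensor: the coordinates of a point
matrix = torch.zeros(28, 28)           # 2D tensor: a 28x28 grayscale image
batch  = torch.zeros(32, 224, 224, 3)  # 4D tensor: batch x height x width x color

print(scalar.shape)  # torch.Size([])
print(vector.shape)  # torch.Size([2])
print(matrix.shape)  # torch.Size([28, 28])
print(batch.shape)   # torch.Size([32, 224, 224, 3])
```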
The shape of data
Understanding the shape of a tensor is the most critical skill in deep learning. If you try to multiply two matrices whose shapes don't line up (say, a (3, 4) matrix with a (5, 6) matrix), your code will crash.
Shapes are usually written as (batch_size, dimensions) or (batch, sequence_length, features).
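A minimal sketch of what that failure looks like in practice, using the hypothetical shapes above:

```python
import torch

a = torch.randn(3, 4)  # shape (3, 4)
b = torch.randn(5, 6)  # shape (5, 6)

# The inner dimensions (4 vs 5) don't match, so this raises a RuntimeError.
try:
    c = a @ b
except RuntimeError as e:
    print("Shape error:", e)

# With compatible shapes, (3, 4) @ (4, 6) works and produces a (3, 6) result.
c = a @ torch.randn(4, 6)
print(c.shape)  # torch.Size([3, 6])
```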
Matrix multiplication: the engine of AI
The dot product we saw earlier is just a special case of matrix multiplication. This is the operation that consumes 99% of the compute power in training LLMs.
$$y = w \cdot x$$

Where:
- $w$: The weight vector.
- $x$: The input vector.
- $\cdot$: The dot product operation (multiply corresponding elements and sum).
Mechanically, it works like this:

$$
y = W x =
\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} w_{11} x_1 + w_{12} x_2 \\ w_{21} x_1 + w_{22} x_2 \end{bmatrix}
$$

Where:
- $W$: The weight matrix (the transformation).
- $x$: The input vector.
- $w_{11} x_1 + w_{12} x_2$: The dot product of the first row and the input.
- $w_{21} x_1 + w_{22} x_2$: The dot product of the second row and the input.
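To make the mechanics concrete, here is the same computation in code (the numbers are made up):

```python
import torch

W = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])    # weight matrix, shape (2, 2)
x = torch.tensor([10.0, 100.0])   # input vector, shape (2,)

y = W @ x                         # matrix-vector product
print(y)                          # tensor([210., 430.])

# Each output element is one row's dot product with the input:
print(W[0] @ x)  # 1*10 + 2*100 = 210
print(W[1] @ x)  # 3*10 + 4*100 = 430
```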
Mental model: parallel template matching
Don't just think of it as multiplying rows and columns. Think of it as applying a bank of filters to a batch of data. Matrix A represents your inputs, where each row is a different image. Matrix B represents your weights, where each column is a different "pattern detector" (like "is there a cat?", "is there a dog?"). The result tells you how much each image matches each pattern.
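A minimal sketch of that view, with shapes and "patterns" invented purely for illustration:

```python
import torch

# 4 "images", each flattened to 3 features (one input per row)
inputs = torch.randn(4, 3)

# 2 pattern detectors, one per column ("cat?" and "dog?")
weights = torch.randn(3, 2)

# Every image is scored against every pattern in a single operation.
scores = inputs @ weights
print(scores.shape)  # torch.Size([4, 2]) -> 4 images x 2 pattern scores
```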
This is why GPUs are so fast. An NVIDIA A100 doesn't just do one math problem at a time. It uses its thousands of cores to multiply 4,096 user queries (rows) by 12,288 features (columns) simultaneously.
Instead of waiting for one calculation to finish before starting the next, it launches millions of little multiplication tasks in parallel, crushing the entire grid in a fraction of the time a CPU would take.
Broadcasting: the magic glue
Often, we want to perform operations on tensors of different shapes. For example, adding a single bias number to an entire image, or adding a "brightness" vector to a batch of photos.
Strict linear algebra says "dimensions must match!". Broadcasting is the trick that lets us break this rule.
Why do we need it?
Imagine you have a list of 1,000 item prices, and you want to apply a $1 discount to each. The naive way is to build a second list containing 1,000 copies of the "$1" value, then subtract the discount list from the price list. This wastes massive amounts of memory. The broadcasting way is to keep a single scalar $1 and tell the processor: "Subtract this one value from every item in that list."
It doesn't "change the math", it allows us to express a global rule (the discount) and apply it to local data (the prices) without manually copying data.
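A minimal sketch of the two approaches, with made-up prices standing in for the full list:

```python
import torch

prices = torch.tensor([10.0, 25.0, 7.5])  # imagine 1,000 of these

# The wasteful way: materialize a full tensor of discounts first.
discounts = torch.full_like(prices, 1.0)
sale = prices - discounts

# The broadcasting way: one scalar, applied to every element.
sale = prices - 1.0
print(sale)  # tensor([ 9.0000, 24.0000,  6.5000])
```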
Mental model: the stamp
Imagine you have a single vector [1, 2] (let's say, "add 1 to red, add 2 to blue"). You have a matrix representing 10 different pixels. Broadcasting takes your single "rule" vector and stamps it onto every single row of the matrix. It pretends the vector is repeated enough times to match the matrix shape.
Step 1: mismatched dimensions
We have a vector of shape (3,) and a matrix of shape (3, 3). Standard math says they can't be added.
Step 2: virtual expansion
Broadcasting treats the vector as if it were copied into three identical rows, giving it a virtual shape of (3, 3). No copy is actually made.
Step 3: element-wise addition
Now that the shapes match, the addition proceeds element by element, producing a (3, 3) result.
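Here is a minimal sketch of that sequence (the values are arbitrary):

```python
import torch

matrix = torch.ones(3, 3)               # shape (3, 3)
vector = torch.tensor([1.0, 2.0, 3.0])  # shape (3,)

# Broadcasting virtually repeats the vector across the 3 rows,
# then adds element by element. No (3, 3) copy of the vector is created.
result = matrix + vector
print(result)
# tensor([[2., 3., 4.],
#         [2., 3., 4.],
#         [2., 3., 4.]])
```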
This feature makes code concise and memory-efficient.
Under the hood: why is this efficient?
You might wonder: "Doesn't the GPU need the full matrix to add them up anyway? Doesn't it have to expand it in memory?"
The answer is no.
If you manually created the expanded matrix, you would allocate gigabytes of VRAM just to store duplicate numbers. Broadcasting avoids this by using a technique called strides.
Physically, the GPU holds the massive matrix (say, 10GB) and the tiny vector (1KB) separately. The magic happens during the calculation. When the kernel calculates the sum for Row 1, it reads the address of the vector (e.g., 0x100). When it moves to Row 2, instead of looking for a "copy" of the vector, it is instructed to simply read that exact same address 0x100 again.
This creates a virtual shape. It looks like a huge matrix to your mathematical operations, but in physical memory, it is just one tiny vector being read repeatedly.
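You can inspect this virtual shape directly. A minimal sketch in PyTorch (the tensors are tiny here, but the mechanism is the same):

```python
import torch

vector = torch.tensor([1.0, 2.0, 3.0])

# expand() creates a broadcast view: it looks like a (3, 3) matrix...
view = vector.expand(3, 3)
print(view.shape)    # torch.Size([3, 3])

# ...but the stride for the row dimension is 0, meaning every row
# re-reads the same addresses instead of storing a copy.
print(view.stride())  # (0, 1)

# Both tensors share the exact same underlying memory.
print(view.data_ptr() == vector.data_ptr())  # True
```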
Analogy:
Think of a factory line where workers need to attach an instruction manual to every product. The inefficient way (manual expansion) would be printing 1,000 copies of the manual and taping one to each box. The efficient way (broadcasting) is hanging one big sign on the wall that every worker looks at simultaneously while they work.
Broadcasting is the sign on the wall.
Who handles this?
Usually, the deep learning framework you use (like PyTorch or TensorFlow) handles these memory addresses for you automatically.
When you write matrix + vector, the framework generates a specific kernel, a small, highly optimized program that runs directly on the hardware accelerator. While the framework automates this for 99% of use cases, an advanced developer can write these kernels manually (using languages like CUDA, Metal, or HLSL) to squeeze out every bit of performance or implement custom mathematical operations.
Are these like shaders?
Yes, exactly. If you come from a 3D graphics background, you can think of these kernels as compute shaders. Graphics shaders run a small program for every pixel on the screen to calculate color, while AI kernels run a small program for every element in a tensor to calculate math. They run on the same hardware units and use the same parallel processing philosophy.