The embedding
From IDs to meaning
We ended the last chapter with a list of integers. "Hello world" might be [104, 2599].
But neural networks cannot work with raw IDs. If you feed these numbers directly, the network assumes that token 104 is close to token 105. In reality, token ID assignment is arbitrary. 104 might be "hello" and 105 might be "apricot". There is no semantic relationship in the numerical value.
We need a way to represent words so that their meaning is encoded in the numbers. We need embeddings.
The lookup table
An embedding layer is essentially a giant lookup table. It is a matrix where:
- The number of rows equals the vocabulary size (e.g., 50,000).
- The number of columns equals the embedding dimension (e.g., 512, 768, or 4096 and beyond in large modern models).
Each row is a vector representing one specific token. To "embed" a token, we simply use its integer ID to select the corresponding row.
x = E[i]

Where:
- x: The resulting embedding vector for token i.
- E: The giant matrix containing all token vectors.
- i: The integer ID of the token we want to look up.
This is purely an indexing operation. There is no multiplication yet.
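To make the lookup concrete, here is a minimal NumPy sketch. The sizes and the token ID are illustrative, and the matrix is random rather than learned:

```python
import numpy as np

vocab_size = 50_000   # one row per token in the vocabulary
d_model = 768         # embedding dimension (number of columns)

# The embedding matrix. In a real model these values are learned;
# here they are just random placeholders.
E = np.random.randn(vocab_size, d_model)

token_id = 104        # integer ID produced by the tokenizer
vector = E[token_id]  # pure indexing: select row 104

print(vector.shape)   # (768,)
```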
Hover over the IDs above. Notice how selecting an ID acts like a spotlight, pulling out one specific row from the matrix. That row is the vector that will flow through the rest of the neural network.
Why vectors?
Why do we need 768 or 4096 numbers to represent a single word?
Because meaning is multi-dimensional. A single number cannot capture the difference between "king" and "queen" while simultaneously capturing the difference between "royal" and "peasant."
In a high-dimensional vector space, different directions correspond to different semantic qualities.
- One direction might encode gender (King vs Queen).
- Another might encode plurality (King vs Kings).
- Another might encode abstractness (King vs Monarchy).
The famous equation King - Man + Woman ≈ Queen works because these concepts are encoded as offset vectors in this space.
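Here is a toy sketch of that arithmetic. The vectors below are hand-crafted three-dimensional stand-ins (real embeddings are learned and have hundreds or thousands of dimensions), but they show how the offset trick works:

```python
import numpy as np

# Hand-crafted toy vectors: dimensions loosely mean (royal, male, female).
# Real embeddings are learned, not designed like this.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# King - Man + Woman lands closest to Queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```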
From text to tensor
Let us walk through the full pipeline. We start with the sentence "Hello world".
1. Tokenization
The string is split and mapped to IDs.
"Hello world" [0, 1]
2. Embedding Lookup
We have an embedding matrix E. We grab row 0 and row 1: one vector per token.
3. Stacking
We stack these vectors into a sequence matrix: one row per token, one column per embedding dimension.
This matrix is the input to the Transformer. The model processes this geometry. It essentially asks: "Given the shape of these vectors, what vector should come next?"
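The same pipeline in code, as a small PyTorch sketch. The vocabulary size, embedding dimension, and token IDs are toy values; a real model would use a trained tokenizer and a much larger matrix:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10, 8                    # toy sizes for illustration
embedding = nn.Embedding(vocab_size, d_model)  # the lookup table E

# 1. Tokenization: pretend the tokenizer mapped "Hello world" to these IDs.
token_ids = torch.tensor([0, 1])

# 2 & 3. Lookup and stacking: indexing with a vector of IDs returns one
# row per ID, already stacked into a sequence matrix.
X = embedding(token_ids)

print(X.shape)  # torch.Size([2, 8]) -- (sequence length, embedding dimension)
```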
Learned, not defined
Crucially, we do not define these vectors manually.
We do not tell the model that index 0 is "hello" and should have a high value in the "greeting" dimension. We initialize the matrix with random noise.
Over the course of training on trillions of tokens, backpropagation pushes these vectors around. If "cat" and "dog" frequently appear in similar contexts (e.g., "feed the cat", "feed the dog"), the gradient updates will push their vectors closer together.
The model learns meaning by learning usage.
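A small sketch of that mechanism, using toy sizes. The only point it makes is that backpropagation reaches exactly the embedding rows that were looked up:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 4)   # starts as random noise

token_ids = torch.tensor([2, 7])  # a toy batch that uses tokens 2 and 7
vectors = embedding(token_ids)

# Stand-in loss: anything differentiable that depends on the vectors.
loss = vectors.sum()
loss.backward()

# Only the rows that were actually looked up receive a gradient.
# Over billions of such steps, usage sculpts the vectors.
print(embedding.weight.grad[2])   # non-zero
print(embedding.weight.grad[3])   # all zeros: token 3 was not used
```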
The return trip (Unembedding)
You might be wondering: "If we scramble these vectors during training, how do we ever get text back out? And what stops two words from ending up with the same vector?"
Great questions.
1. How do we decode? We don't "decode" a vector like unzipping a file. We perform a similarity search.
When the model produces a final output vector h, we multiply it by the entire embedding matrix (transposed). This calculates the dot product between h and every single word in the vocabulary simultaneously.

logits = h · Eᵀ

Where:
- h: The final vector produced by the model for the current position.
- Eᵀ: The transposed embedding matrix E (used here as the "unembedding" matrix).
This gives us 50,000 scores. The word with the highest score is the winner. It's like a police lineup: the model draws a sketch (the vector), and we compare that sketch to every suspect (token) to find the closest match.
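In code, the lineup is a single matrix multiplication. This sketch reuses the embedding matrix as the unembedding matrix (weight tying); some models learn a separate output matrix instead:

```python
import numpy as np

vocab_size, d_model = 50_000, 768
E = np.random.randn(vocab_size, d_model)  # embedding matrix (random stand-in)

h = np.random.randn(d_model)              # final vector from the model

logits = E @ h                # same as h @ E.T: one score per vocabulary token
next_token_id = int(np.argmax(logits))    # the highest-scoring token wins

print(logits.shape)           # (50000,)
print(next_token_id)
```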
2. Why no duplicates? You might worry that "cat" and "dog" could accidentally end up with the same vector.
Mathematically, high-dimensional space is incredibly vast. In 4096 dimensions, it is extremely difficult to "accidentally" land in the same spot.
But more importantly, the loss function actively prevents it. If "cat" and "dog" had identical vectors, the model would treat them as the exact same input. It would try to predict "meow" and "bark" for the same vector, creating massive errors. Backpropagation would immediately force the vectors apart to resolve the conflict. Distinct meanings must have distinct vectors to minimize loss.
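You can check the "vastness" claim numerically. Two random vectors in 4096 dimensions are almost always nearly orthogonal, which is why accidental collisions are not a practical worry:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096

a = rng.standard_normal(d)
b = rng.standard_normal(d)

# Cosine similarity of independent random vectors concentrates around 0
# as the dimension grows (roughly within +/- 1/sqrt(d) of it).
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cos), 4))   # close to 0: nearly orthogonal
```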
A note on what we are really learning
The following is a personal reflection and steps outside the rigor of this course.
There is something unsettling about this approach. We are training models on text: on the surface representation of thought, not thought itself.
When you think "I am hungry," there is a rich internal experience: a physical sensation, memories of your last meal, visual imagery of food. Text compresses all that into three words.
We are training models on these shadows of thought. We hope that by learning the statistical patterns of the shadows, the model can reconstruct the object that cast them, the underlying logic and reasoning of intelligence.
Amazingly, it seems to work. But it remains a proxy. We are building a map of the world from the descriptions of travelers, without ever stepping outside ourselves.