See also: LLMs, NLP

At a high level, we can think of a transformer model as taking an input sequence of $T$ tokens and predicting the next token, i.e. the token at index $T$. They generally excel at sequence-to-sequence modeling tasks.

Most implementations of transformers are autoregressive, meaning that they predict future values (the token at index $t$) from past values (the tokens at indices $0$ to $t-1$).

Inference

Mainly derived from Brendan Bycroft’s amazing LLM visualization

Embedding

The smallest unit of understanding for a transformer is a token. This is usually a common sequence of characters like ‘at’ or ‘qu’.

The collection of all the tokens the model understands is its vocabulary. The vocabulary maps the token to its index:

  • Token A: index 0
  • Token B: index 1
  • Token C: index 2

This lookup table is usually trained or determined independently of the LLM itself. Most people use statistical methods based on the data to figure out a good lookup table. One common example of this is BPE.

The first step of a transformer is turning the input text into the appropriate index in the vocabulary table.

Then, we use the token index to select the associated column in the token embedding matrix (e.g. the 3rd token index corresponds to the 3rd column of the token embedding matrix). The values of the token embedding matrix are vectors which we call the token embeddings. The token embedding matrix has size $C \times n_\text{vocab}$, where $C$ is the dimensionality of this embedding and $n_\text{vocab}$ is the vocabulary size.

Then, based on the index of the token in the input, we use it to select the appropriate column of the position embedding matrix. The dimensionality of this is also $C$. We need position embeddings because, unlike RNNs and LSTMs which operate sequentially, transformers operate over the whole input sequence at once, so they would otherwise lose information related to token order.

Token embeddings are learned during training whereas positional encodings can either be fixed or learned. As both embeddings have the same dimensionality, we simply perform an element-wise addition to get the input embedding.

Running this for all $T$ input tokens gives us the input embedding matrix of size $C \times T$. This corresponds to a $C$-dimensional column vector for each token.
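
A rough NumPy sketch of the embedding step, with made-up toy dimensions (the variable names here are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

n_vocab, C, T = 1000, 48, 6   # toy vocabulary size, embedding dimension, sequence length

# Learned parameters (random here, purely for illustration)
token_embedding = rng.normal(size=(C, n_vocab))    # one column per vocabulary entry
position_embedding = rng.normal(size=(C, T))       # one column per position

token_indices = np.array([5, 17, 3, 901, 42, 7])   # output of the tokenizer, length T

# Select each token's embedding column, then add the column for its position
input_embedding = token_embedding[:, token_indices] + position_embedding
print(input_embedding.shape)   # (48, 6), i.e. C x T
```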

Layer Norm

Normalization is an important step in deep neural networks: it helps improve the stability of the model during training.

We do this for each column of the input embedding matrix separately. The goal is to make the average value in the column equal to 0 and the standard deviation equal to 1. To do this, we find both of these quantities (mean $\mu$ and standard deviation $\sigma$) for the column, subtract the mean, and divide by the standard deviation. Finally, we multiply by a learned weight $\gamma$ and add a learned bias $\beta$.

That is, for each column $x$ of the input embedding matrix, we compute

$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

We add an additional small $\epsilon$ to prevent dividing by zero. This produces the layer norm matrix of size $C \times T$.
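
A minimal sketch of this per-column normalization in NumPy (the function and parameter names are my own):

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize each column of X (shape C x T) to zero mean and unit variance,
    then scale by the learned gamma and shift by the learned beta (both shape C)."""
    mu = X.mean(axis=0, keepdims=True)      # per-column mean, shape (1, T)
    var = X.var(axis=0, keepdims=True)      # per-column variance, shape (1, T)
    X_hat = (X - mu) / np.sqrt(var + eps)   # eps prevents division by zero
    return gamma[:, None] * X_hat + beta[:, None]
```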

Transformer Block

As is common in deep learning, it’s hard to say exactly what each of these layers is doing, but we have some general ideas: the earlier layers tend to focus on learning lower-level features and patterns, while the later layers learn to recognize and understand higher-level abstractions and relationships.

In the context of NLP, the lower layers might learn grammar, syntax, and simple word associations, while the higher layers might capture more complex semantic relationships, discourse structures, and context-dependent meaning.

Self-attention

The first step is to produce three vectors for each of the $T$ columns from the normalized input embedding matrix. These vectors are the:

  • $Q$: query vector
  • $K$: key vector
  • $V$: value vector

$A$ is the dimensionality of the $Q$/$K$/$V$ vectors (it’s convention to set $A = C / n_\text{heads}$, where $n_\text{heads}$ is the number of attention heads). For each of $Q$, $K$, and $V$, we have associated learned values for the bias and the weights.

To compute $Q$, for example, we do $Q = W_Q X + b_Q$ (where $W_Q$ are the weights of the query matrix and $b_Q$ are the biases of the query matrix). Note that this matrix-vector addition isn’t normally mathematically valid, as we are adding a matrix of size $A \times T$ to a vector of size $A \times 1$, but we treat the bias as an $A \times T$ matrix where each column is the original vector.

We can think of each self-attention block as a graph with $T$ nodes. Then,

  • $K$ corresponds roughly to ‘what do I have’
  • $Q$ corresponds roughly to ‘what am I looking for’
  • $V$ corresponds roughly to ‘what information do I share with others’

We can think of ‘attention’ as some node A asking some node B for information:

  1. We compute $Q^\top K$ (the dot product of every query with every key) to get the $T \times T$ self-attention matrix, and then divide by $\sqrt{A}$.
  2. Then, we normalize the self-attention matrix with softmax, which scales the scores into probabilities so that each row adds up to 1.
  3. We finally multiply the normalized self-attention matrix with $V$ to get our attention output of size $A \times T$.

The main goal of self-attention is that each column wants to find relevant information from other columns and extract their values, and it does so by comparing its query vector to the keys of those other columns. We also add the restriction that it can only look in the past (i.e. causal self-attention).

This self-attention step is run $n_\text{heads}$ times in parallel (multi-headed self-attention). To combine the outputs of the attention heads, we simply stack them on top of each other to get the attention output of size $C \times T$.
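
Below is a minimal single-head sketch in NumPy, using the column-per-token convention above (the weight shapes and names are illustrative). A multi-headed version would run this $n_\text{heads}$ times with separate weights and stack the $A \times T$ outputs into a single $C \times T$ matrix.

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v, b_q, b_k, b_v):
    """One attention head over X (shape C x T, one column per token).
    W_* have shape (A, C) and b_* have shape (A,)."""
    A = W_q.shape[0]
    T = X.shape[1]

    Q = W_q @ X + b_q[:, None]   # (A, T): 'what am I looking for'
    K = W_k @ X + b_k[:, None]   # (A, T): 'what do I have'
    V = W_v @ X + b_v[:, None]   # (A, T): 'what do I share'

    scores = (Q.T @ K) / np.sqrt(A)   # (T, T): scores[i, j] = q_i . k_j / sqrt(A)

    # Causal mask: token i may only attend to tokens j <= i
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)

    # Softmax over each row so the attention weights for token i sum to 1
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return V @ weights.T   # (A, T) attention output
```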

Projection

Finally, we perform the projection to get the output of the layer. This is a simple matrix-vector multiplication on a per-column basis, with a bias added.

Instead of passing this output directly to the next phase, we add it element-wise to the input embedding. This process is called the residual connection or residual pathway.
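
A sketch of the projection plus residual step, assuming the stacked attention output and the block’s input both have shape $C \times T$ (names are illustrative):

```python
import numpy as np

def project_with_residual(attn_output, W_proj, b_proj, block_input):
    """Project the stacked attention output (C x T) with learned weights (C x C)
    and bias (C,), then add the block's input back as the residual connection."""
    projected = W_proj @ attn_output + b_proj[:, None]
    return projected + block_input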

MLP

Like with self-attention, we perform a layer normalization before the vectors enter the MLP.

Each MLP block has (with the conventional hidden width of $4C$):

  • weights $W_1$ (a $4C \times C$ matrix)
  • bias $b_1$ (a $4C \times 1$ column vector)
  • projection weights $W_2$ (a $C \times 4C$ matrix)
  • projection bias $b_2$ (a $C \times 1$ column vector)

Each column vector $x$ from the layer-normed attention residual then, individually, goes through:

  1. $h_1 = W_1 x + b_1$ to produce a $4C \times 1$ column vector
  2. $h_2 = \text{GELU}(h_1)$, applied element-wise
  3. $W_2 h_2 + b_2$ to produce a $C \times 1$ column vector

(Figure: example of the GELU activation function.)

This is assembled into the MLP result of size $C \times T$. Like in the self-attention + projection section, we add the result of the MLP to its input element-wise to produce the MLP residual.
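
Putting those steps together, a sketch of the MLP block in NumPy, including the common tanh approximation of GELU (the $4C$ hidden width and the names are assumptions, following the usual GPT-2-style convention):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, applied element-wise
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(X, W1, b1, W2, b2):
    """Feed-forward block applied to each column of X (C x T) independently.
    W1: (4C, C), b1: (4C,), W2: (C, 4C), b2: (C,)."""
    hidden = gelu(W1 @ X + b1[:, None])   # expand to (4C, T)
    out = W2 @ hidden + b2[:, None]       # project back to (C, T)
    return out + X                        # MLP residual
```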

This marks the end of the transformer block and the output is ready to be passed to the next block.

Output

Finally, at the end of all the transformer blocks, we perform one final softmax, which helps convert the output into probabilities.

Multi-class Probabilities

See also: multi-class classification

The softmax function allows us to map real numbers to probabilities.

The ‘harder’ alternative to softmax is the argmax function, which simply finds the maximum value, sets it to 1.0, and assigns 0.0 to all other values.

In contrast, the softmax operation serves as a “softer” version of that. Due to the exponentiation involved in softmax, the largest value is emphasized and pushed towards 1.0, while still maintaining a probability distribution over all input values. This allows for a more nuanced representation that captures not only the most likely option but also the relative likelihood of other options.
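
A small sketch contrasting the two on toy logits:

```python
import numpy as np

def softmax(x):
    x = x - x.max()   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def hardmax(x):
    # the 'hard' alternative: all probability mass on the largest value
    out = np.zeros_like(x, dtype=float)
    out[np.argmax(x)] = 1.0
    return out

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))   # ~[0.66, 0.24, 0.10] -- largest value emphasized, others kept
print(hardmax(logits))   # [1. 0. 0.]
```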


We then take this output and do a final matrix multiply with another set of learned weights called the language modelling head weights (LM weights), which is an $n_\text{vocab} \times C$ matrix.

This produces the logits of size $n_\text{vocab} \times T$. The name “logits” comes from “log-odds,” i.e., the logarithm of the odds of each token. Finally, we softmax this again to exponentiate the log-odds to normal odds/probabilities.

Now, for each column, we have a probability the model assigns to each token in the vocabulary. Then, we can ‘decode’ the final probability back into a token. For example, if we’ve supplied six tokens into the model, we’ll use the output probabilities of the 6th column.

We do this by “sampling from the distribution.” That is, we randomly choose a token, weighted by its probability. For example, a token with a probability of 0.9 will be chosen 90% of the time.

We can also control the “smoothness” of the distribution by using a temperature parameter. A higher temperature will make the distribution more uniform, and a lower temperature will make it more concentrated on the highest probability tokens.
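
A sketch of temperature-scaled sampling from one column of logits (illustrative, not any particular library’s API):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token index from a column of logits. Higher temperature flattens
    the distribution; lower temperature concentrates it on the top tokens."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()           # numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return rng.choice(len(probs), p=probs)   # sample, weighted by probability
```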