Given inputs in $\mathbb{R}^d$ denoted by a matrix $Q \in \mathbb{R}^{n\times d}$ and a database containing $m$ samples in $\mathbb{R}^d$ denoted by a matrix $X \in \mathbb{R}^{m\times d}$, we define:

$$\mathrm{Attention}(Q, X) = \mathrm{softmax}\left(\frac{QX^T}{\sqrt{d}}\right) X \in \mathbb{R}^{n\times d}.$$

Now self-attention is simply obtained with $Q = X$ (so that $m = n$) and learned linear transformations of the queries, keys and values. In summary, a self-attention layer takes as input any tensor of the form $X \in \mathbb{R}^{n\times d}$ (for any $n$), has parameters:

$$W_q \in \mathbb{R}^{d\times k}, \quad W_k \in \mathbb{R}^{d\times k}, \quad W_v \in \mathbb{R}^{d\times d},$$

and produces $Y \in \mathbb{R}^{n\times d}$ (with the same $n$ and $d$ as for the input). Here $d$ is the dimension of the input and $k$ is a hyper-parameter of the self-attention layer:

$$y_j = \sum_{i=1}^n \mathrm{softmax}_i\left(\frac{x_j^T W_q W_k^T x_i}{\sqrt{k}}\right) W_v^T x_i,$$

with the convention that $x_i$ (resp. $y_j$) is the $i$-th column of $X^T$ (resp. the $j$-th column of $Y^T$). Note that the notation $\mathrm{softmax}_i$ might be a bit confusing. Recall that $\mathrm{softmax}$ always takes as input a vector and returns a (normalized) vector; here the normalization is over the index $i$. In practice, most of the time we are dealing with batches, so that the $\mathrm{softmax}$ function takes as input a matrix (or tensor) and we need to normalize according to the right axis! Named tensor notation (see below) deals with this notational issue. I also find the interpretation given below helpful:
Mental model for self-attention: self-attention can be interpreted as taking an expectation,

$$y_j = \mathbb{E}_{i\sim p_j}\left[v_i\right] \quad \text{with} \quad p_j(i) = \frac{\exp\left(q_j^T k_i/\sqrt{k}\right)}{\sum_{i'=1}^n \exp\left(q_j^T k_{i'}/\sqrt{k}\right)},$$

where the mappings $q_j = W_q^T x_j$, $k_i = W_k^T x_i$ and $v_i = W_v^T x_i$ represent the query, key and value.
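To make the shapes and the softmax axis explicit, here is a minimal PyTorch sketch of the self-attention equation above (the names `self_attention`, `W_q`, `W_k`, `W_v` and the dimensions `n`, `d`, `k` follow the notation of this section; this is an illustrative sketch, not the notebook's implementation):

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_q, W_k, W_v):
    # X: (n, d) inputs; W_q, W_k: (d, k); W_v: (d, d)
    Q = X @ W_q                                  # queries, shape (n, k)
    K = X @ W_k                                  # keys,    shape (n, k)
    V = X @ W_v                                  # values,  shape (n, d)
    scores = Q @ K.T / K.shape[-1] ** 0.5        # (n, n): entry (j, i) is q_j . k_i / sqrt(k)
    A = F.softmax(scores, dim=-1)                # normalize over i, i.e. the right axis!
    return A @ V                                 # Y: (n, d), with y_j = sum_i A[j, i] v_i

n, d, k = 5, 16, 8
X = torch.randn(n, d)
W_q, W_k, W_v = torch.randn(d, k), torch.randn(d, k), torch.randn(d, d)
Y = self_attention(X, W_q, W_k, W_v)             # same shape (n, d) as the input
```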
Multi-head attention combines several such operations in parallel: the output is the concatenation of the results along the feature dimension, to which one more linear transformation is applied.
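As a sketch of the multi-head case, reusing the `self_attention` function above (here each head keeps its own $W_q, W_k, W_v$ with the dimensions of this section, and `W_o` is the final linear transformation; these names are mine):

```python
def multi_head_attention(X, heads, W_o):
    # heads: list of (W_q, W_k, W_v) triples, one per head; W_o: (num_heads * d, d)
    outs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return torch.cat(outs, dim=-1) @ W_o         # concatenate along features, then mix
```

In practice each head usually projects to a smaller value dimension so that the concatenation has size $d$; the sketch simply keeps the per-head shapes of this section.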
To finish the description of a transformer block, we need to define the last two layers: Layer Norm and Feed Forward Network.
The Layer Norm used in the transformer block is particularly simple as it acts on vectors and standardizes them as follows: for $x \in \mathbb{R}^d$, we define

$$\mathrm{mean}(x) = \frac{1}{d}\sum_{i=1}^d x_i \quad \text{and} \quad \mathrm{std}(x)^2 = \frac{1}{d}\sum_{i=1}^d \left(x_i - \mathrm{mean}(x)\right)^2,$$

and then the Layer Norm has two parameters $\gamma, \beta \in \mathbb{R}^d$ and

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)} + \beta,$$

where we used the natural broadcasting rule for subtracting the mean and dividing by the std, and $\odot$ is component-wise multiplication.
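In code, this is only a few lines (a minimal sketch; the small `eps` added to the std for numerical stability is a standard implementation detail, not part of the formula above):

```python
def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (..., d); gamma, beta: (d,) are the learnable scale and shift
    mean = x.mean(dim=-1, keepdim=True)                      # mean over the feature axis
    std = x.std(dim=-1, keepdim=True, unbiased=False)        # std over the feature axis
    return gamma * (x - mean) / (std + eps) + beta           # component-wise scale and shift
```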
A Feed Forward Network is an MLP acting on vectors: for $x \in \mathbb{R}^d$ (seen as a row vector), we define

$$\mathrm{FFN}(x) = \max\left(0, xW_1 + b_1\right)W_2 + b_2,$$

where $W_1 \in \mathbb{R}^{d\times h}$, $b_1 \in \mathbb{R}^h$, $W_2 \in \mathbb{R}^{h\times d}$ and $b_2 \in \mathbb{R}^d$.
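The corresponding sketch, with `h` the hidden dimension (the same formula is applied independently to each token thanks to broadcasting):

```python
def ffn(x, W_1, b_1, W_2, b_2):
    # x: (..., d); W_1: (d, h); b_1: (h,); W_2: (h, d); b_2: (d,)
    return torch.relu(x @ W_1 + b_1) @ W_2 + b_2             # max(0, x W_1 + b_1) W_2 + b_2
```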
Each of these layers is applied on each of the inputs given to the transformer block as depicted below:
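In code, such a block could look like the following sketch, which assumes the post-LN arrangement of the original Transformer paper (a residual connection around each sub-layer, followed by Layer Norm) and reuses the `multi_head_attention`, `layer_norm` and `ffn` functions above; the exact wiring may differ from the figure:

```python
def transformer_block(X, attn_params, ln1, ln2, ffn_params):
    # attn_params = (heads, W_o); ln1, ln2 = (gamma, beta); ffn_params = (W_1, b_1, W_2, b_2)
    h = layer_norm(X + multi_head_attention(X, *attn_params), *ln1)   # attention sub-layer
    return layer_norm(h + ffn(h, *ffn_params), *ln2)                  # feed-forward sub-layer
```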
Note that this block is equivariant: if we permute the inputs, then the outputs are permuted with the same permutation. As a result, the order of the inputs is irrelevant to the transformer block; in particular, this order cannot be used by the model. The important notion of positional encoding allows us to take order into account: it is a deterministic, unique encoding of each time step that is added to the input tokens.
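One standard choice of such a deterministic encoding is the sinusoidal positional encoding of the original Transformer paper; a minimal sketch (assuming an even feature dimension $d$):

```python
import math
import torch

def positional_encoding(n, d):
    # returns a (n, d) matrix whose row t is the encoding of time step t (d assumed even)
    position = torch.arange(n, dtype=torch.float32).unsqueeze(1)                   # (n, 1)
    freqs = torch.exp(-torch.arange(0, d, 2, dtype=torch.float32) * math.log(10000.0) / d)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(position * freqs)    # even features
    pe[:, 1::2] = torch.cos(position * freqs)    # odd features
    return pe

# the encoding is simply added to the input tokens: X = X + positional_encoding(n, d)
```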
Have a look at Brendan Bycroft’s beautifully crafted interactive explanation of the Transformer architecture:
In Transformers using Named Tensor Notation, we derive the formal equations for the Transformer block using named tensor notation.
Now is the time to have fun building a simple transformer block and to think like transformers (open in colab).