
Transformer Explained

This article assumes you have basic knowledge of the Transformer model from Google for NLP and image tasks. It explains two core concepts in the Transformer: multi-head self-attention and positional encoding.

Suppose you want to translate an English sentence into Spanish. The source sentence is "I play soccer". It has 3 words: "I", "play" and "soccer".

$seq\_len = 3$, which is the number of words.

$d_{model} = 512$, which is the predefined embedding size of each word.

$depth = 64$, which comes from $512 / 8$ (8 is the number of attention heads).

So every word is represented by a vector with shape $(1, 512)$. For example, the word "play" can be represented by a vector

$[0.12, 0.234, 0.34, 0.654, …]$
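In practice such vectors come from a trainable embedding layer. Here is a minimal sketch using tf.keras.layers.Embedding (the vocab_size and the token ids are made up for illustration):

import tensorflow as tf

vocab_size, d_model = 10000, 512
embedding = tf.keras.layers.Embedding(vocab_size, d_model)  # trainable lookup table

token_ids = tf.constant([[1, 2, 3]])  # "I play soccer" as made-up token ids
X = embedding(token_ids)              # (1, 3, 512): one (1, 512) vector per word
print(X.shape)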

Where do $q, k, v$ come from?

For the above example, the entire sentence can be represented by a matrix $X$ with shape (3, 512), where each row represents a word.

The $q$, $k$ and $v$ are all calculated from $X$:

$q = np.dot(X, W^q)$

$k = np.dot(X, W^k)$

$v = np.dot(X, W^v)$

where $W^q$, $W^k$ and $W^v$ are all trainable matrices with shape (embedding_size, depth) -> (512, 64), randomly initialized and then learned during model training.

Generally $q$, $k$ and $v$ have the same shape of (3, 64), i.e. (seq_len, depth). However, they do not need to have the same shape, as long as q.shape[1] == k.shape[1] and k.shape[0] == v.shape[0].

For example,

$q.shape = (M, N)$

$k.shape = (P, N)$

$v.shape = (P, Q)$
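Here is a minimal NumPy sketch of how $q$, $k$ and $v$ are produced from $X$ (shapes follow the example above; the random values and weights are only for illustration, since in a real model the $W$ matrices are learned):

import numpy as np

seq_len, d_model, depth = 3, 512, 64

X = np.random.rand(seq_len, d_model)      # the embedded sentence, one row per word

W_q = np.random.rand(d_model, depth)      # trainable projection matrices
W_k = np.random.rand(d_model, depth)      # (here just randomly initialized)
W_v = np.random.rand(d_model, depth)

q = np.dot(X, W_q)
k = np.dot(X, W_k)
v = np.dot(X, W_v)
print(q.shape, k.shape, v.shape)          # (3, 64) (3, 64) (3, 64)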

Every word has its own $q$, $k$ and $v$. So for the above example, we have ($q_1$, $k_1$, $v_1$) for the word "I", ($q_2$, $k_2$, $v_2$) for "play" and ($q_3$, $k_3$, $v_3$) for "soccer".

For the word "I":

  1. “Attention”

$q_1k_1 = np.dot(q_1, k_1.T)$ # shape (M, P)

$q_1k_2 = np.dot(q_1, k_2.T)$ # shape (M, P)

$q_1k_3 = np.dot(q_1, k_3.T)$ # shape (M, P)

  2. "Scaled"

$q_1k_1 = \dfrac{q_1k_1}{\sqrt{depth}}$ # shape (M, P)

$q_1k_2 = \dfrac{q_1k_2}{\sqrt{depth}}$ # shape (M, P)

$q_1k_3 = \dfrac{q_1k_3}{\sqrt{depth}}$ # shape (M, P)

  3. "Softmax"

($q_1k_1$, $q_1k_2$, $q_1k_3$) = softmax($q_1k_1$, $q_1k_2$, $q_1k_3$, axis=-1)

  4. "Value"

$q_1v_1$ = np.dot($q_1k_1$, $v_1$) # shape (M, Q)

$q_1v_2$ = np.dot($q_1k_2$, $v_2$) # shape (M, Q)

$q_1v_3$ = np.dot($q_1k_3$, $v_3$) # shape (M, Q)

  5. "Sum"

$z_1 = q_1v_1 + q_1v_2 + q_1v_3$ # shape (M, Q); stacking $z_1, z_2, z_3$ gives (seq_len, depth)

For the word "play", repeat the above steps with $q_2$, and we get

$z_2 = q_2v_1 + q_2v_2 + q_2v_3 $

For the word "soccer", repeat the above steps with $q_3$, and we get

$z_3 = q_3v_1 + q_3v_2 + q_3v_3 $
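The five steps above can be written compactly for all three words at once. Below is a small NumPy sketch of the arithmetic only (random values, single head, no mask), not the full multi-head implementation:

import numpy as np

seq_len, depth = 3, 64
q = np.random.rand(seq_len, depth)        # rows: q_1, q_2, q_3
k = np.random.rand(seq_len, depth)        # rows: k_1, k_2, k_3
v = np.random.rand(seq_len, depth)        # rows: v_1, v_2, v_3

scores = np.dot(q, k.T)                   # 1. "Attention": every q_i against every k_j
scores = scores / np.sqrt(depth)          # 2. "Scaled"
weights = np.exp(scores)                  # 3. "Softmax" over each row
weights = weights / weights.sum(axis=-1, keepdims=True)
z = np.dot(weights, v)                    # 4. "Value" and 5. "Sum" in one matmul
print(z.shape)                            # (3, 64); row i is z_i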

Mask

A mask is used to prevent a word from paying attention to the words that follow it during decoding while training. For example:

Translate “I play soccer” from English into Spanish “juego al fútbol”

The sentence "juego al fútbol" is the prediction (decoding) result, which is generated word by word. For the word "al", it can only attend to its previous word "juego" (and itself, self-attention!), because all words after "al" have not yet been generated at real prediction time. Every word in the output sentence can only attend to the words before it; in other words, it cannot "look ahead" (hence the name look-ahead mask).

  • "juego" can only attend to itself, "juego".
  • "al" can attend to "juego" and itself, "al".
  • "fútbol" can attend to "juego", "al" and itself, "fútbol".

However, during training, "al" can actually see "fútbol", because the target is provided as a complete sentence. To prevent a word from attending to the words that follow it, a mask is defined as a matrix of 0s and 1s with shape (M, P), where generally M = P = seq_len:

import tensorflow as tf

def create_look_ahead_mask(seq_len):
    # keep the lower triangle (band_part), then invert: 1s mark the masked (future) positions
    mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    return mask

For seq_len = 3, the mask is

$$ \begin{bmatrix} 0. & 1. & 1.\\ 0. & 0. & 1.\\ 0. & 0. & 0. \end{bmatrix} $$

Assume that after step 2 ("Scaled"), the scaled score matrix $q_1k_1$ looks like this (random values here):

$$ \begin{bmatrix} .23 & .12 & .87\\ .33 & .43 & .81\\ .29 & .77 & .63 \end{bmatrix} $$

Now we apply the mask with this equation

$q_1k_1$ += mask * -1e9

Then the resulting $q_1k_1$ will be (approximately)

$$ \begin{bmatrix} .23 & -10^9 & -10^9\\ .33 & .43 & -10^9\\ .29 & .77 & .63 \end{bmatrix} $$

At step 3 we do softmax($q_1k_1$, axis=-1), and the result will be something like

$$ \begin{bmatrix} .98 & 0.01 & 0.01\\ .45 & .54 & 0.01\\ .21 & .55 & .24 \end{bmatrix} $$

At step 4, we do np.dot($q_1k_1$, $v_1$).

First, assuming $v_1$ is like:

[[0.9298136 , 0.29145515, 0.01303697, 0.17378509, 0.4567542, 0.6651808 , 0.55947316, 0.95787144, 0.32361996, 0.4348917 ],
 [0.43136656, 0.45245242, 0.89111626, 0.2706523 , 0.02862096, 0.14584577, 0.86006594, 0.09985936, 0.8554418 , 0.5693686 ],
 [0.6668292 , 0.48258436, 0.95228314, 0.28719485, 0.47296453, 0.5131457 , 0.12083328, 0.7032275 , 0.04693735, 0.01807821]]

Each row in $v_1$ is the value vector of a word, and the attention weights decide how much of each word's value to keep: "how much you should pay attention to me".

$$ \begin{bmatrix} .98 & 0.01 & 0.01\\ .45 & .54 & 0.01\\ .21 & .55 & .24 \end{bmatrix} \times \begin{bmatrix} v_1 \end{bmatrix} $$

We can observe that:

  • For the 1st word "juego", its value (the 1st row of $v_1$) is kept (0.98x) while the values of the subsequent words "al" and "fútbol" are discarded (0.01x). So "juego" only attends to itself.

  • The 2nd word "al" keeps the value of the 1st word "juego" by 0.45x, keeps its own value by 0.54x, and discards the value of the 3rd word (0.01x).

  • The 3rd word "fútbol" keeps all values, with 0.21x, 0.55x and 0.24x respectively.
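As a quick check, here is a small NumPy sketch that applies the mask, the softmax and the weighted sum to the example score matrix above. (The softmax numbers in the text are rounded illustrations, so the exact output differs slightly, but the masked positions do come out as ~0.)

import numpy as np

scores = np.array([[.23, .12, .87],       # the scaled scores from the example
                   [.33, .43, .81],
                   [.29, .77, .63]])

mask = np.triu(np.ones((3, 3)), k=1)      # 1s mark the future positions to hide

masked = scores + mask * -1e9             # push masked scores towards -infinity
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))               # row 0 keeps only "juego", row 1 keeps "juego"+"al", row 2 keeps all three

v1 = np.array([[0.9298136 , 0.29145515, 0.01303697],   # only the first 3 columns of the
               [0.43136656, 0.45245242, 0.89111626],   # example v_1, for brevity
               [0.6668292 , 0.48258436, 0.95228314]])
print(np.dot(weights, v1))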

the code

def scaled_dot_product_attention(q, k, v, mask=None):
    """
    Args:
        q { matrix } - (seq_len, depth)
        k { matrix } - (seq_len, depth)
        v { matrix } - (seq_len, depth)
        mask { matrix } - (seq_len, seq_len)
    Returns:
        output: (seq_len, depth)
        attn_weights: (seq_len, seq_len)
        
    Example:
        q = (1, 3)
        k = (4, 3)
        v = (4, 2)
        --->
        qk = (1, 4)
        attn_weight = (1, 4)
        output = (1, 2)
    """
    dk = tf.cast(tf.shape(k)[-1], tf.float32)  # 64 -> 64.0
    qk = tf.matmul(q, k, transpose_b=True)  # (seq_len, depth) * t((seq_len, depth)) = (seq_len, seq_len)
    scaled_qk = qk / tf.math.sqrt(dk)
    if mask is not None:  # masked values are 1, 1 * -1e9 is very small 
        scaled_qk += mask * -1e9  # a very small value will result in ~0 in softmax
    attn_weights = tf.nn.softmax(scaled_qk, axis=-1)  # (seq_len, seq_len)
    output = tf.matmul(attn_weights, v)  # (seq_len, depth)
    return output, attn_weights
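A quick usage sketch, reusing the two functions above (random inputs, just to see the shapes and that the look-ahead mask zeroes out the future positions):

seq_len, depth = 3, 64
q = tf.random.uniform((seq_len, depth))
k = tf.random.uniform((seq_len, depth))
v = tf.random.uniform((seq_len, depth))

mask = create_look_ahead_mask(seq_len)
output, attn_weights = scaled_dot_product_attention(q, k, v, mask)
print(output.shape)           # (3, 64)
print(attn_weights.numpy())   # upper triangle (future positions) is ~0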

Personally, I think the two major components that are often hard to understand are:

  1. multi-head self attention
  2. positional encoding

Now let us go through positional encoding.

Unlike an RNN, which naturally knows the order of the words in a sentence, the attention model has no knowledge of word order. Every word attends to itself and all the other words in the sentence, like a fully connected network.

The solution is to add word-order information into every word's embedding, so that (hopefully, and indeed) the training process will capture that information and understand the order of words. To do so, we need a positional encoding matrix with the exact shape of the raw input, (seq_len, embed_size), so that the two matrices can be added together element-wise.

Here is an example:

Suppose we have the sentence "I play soccer" and each word is represented by a vector of size 4.

I      -> [0.23, 0.38, 0.98, 0.11]
play   -> [0.99, 0.14, 0.34, 0.28]
soccer -> [0.73, 0.03, 0.92, 0.91]

Somehow we create a positional encoding matrix:

I      -> [0.13, 0.22, 0.19, 0.21]
play   -> [1.23, 0.24, 0.14, 0.29]
soccer -> [3.73, 0.13, 0.02, 0.10]

Adding the two matrices, we get:

[0.36, 0.6 , 1.17, 0.32]
[2.22, 0.38, 0.48, 0.57]
[4.46, 0.16, 0.94, 1.01]
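A one-line check of that element-wise addition with NumPy, using the example numbers above:

import numpy as np

embeddings = np.array([[0.23, 0.38, 0.98, 0.11],
                       [0.99, 0.14, 0.34, 0.28],
                       [0.73, 0.03, 0.92, 0.91]])
pos_encoding = np.array([[0.13, 0.22, 0.19, 0.21],
                         [1.23, 0.24, 0.14, 0.29],
                         [3.73, 0.13, 0.02, 0.10]])
print(embeddings + pos_encoding)   # matches the summed matrix above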

This is the input to the model. Now the question is: how do we get the positional encoding matrix?

positional encoding matrix

Before we start, we need to understand four basic trigonometric identities:

$$ \sin (A+B) = \sin A * \cos B + \cos A * \sin B $$

$$ \sin (A-B) = \sin A * \cos B - \cos A * \sin B $$

$$ \cos (A+B) = \cos A * \cos B - \sin A * \sin B $$

$$ \cos (A-B) = \cos A * \cos B + \sin A * \sin B $$

It means

$ \sin (2) = \sin (1) * \cos (1) + \cos (1) * \sin (1)$

or

$ \sin (0.123) = \sin (0.1) * \cos (0.023) + \cos (0.1) * \sin (0.023)$.

import numpy as np
a, b = 0.1, 0.023
x = np.sin(a+b)
y = np.sin(a)*np.cos(b) + np.cos(a)*np.sin(b)
np.testing.assert_almost_equal(x, y, decimal=10)

The formula for calculating the positional encoding is as follows:

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$

where $i = 0, 1, 2, \ldots, \frac{d_{model}}{2} - 1$ and $pos = 0, 1, 2, \ldots, seq\_len - 1$.
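Putting the formula into code, here is a minimal NumPy sketch that builds the whole positional encoding matrix at once (the function name positional_encoding is just for this illustration):

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])    # even indices get sin
    angles[:, 1::2] = np.cos(angles[:, 1::2])    # odd indices get cos
    return angles                                # (seq_len, d_model)

print(positional_encoding(3, 10).round(3))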

Let us work through the example sentence "I play soccer". To simplify the problem, assume $d_{model} = 10$ (each word is represented by a vector of size 10). For example, the word "I" has pos = 0, and its position vector starts as an empty vector of size 10. We need to fill the vector with positional information, element by element.

word “I”

The first word, "I", has pos = 0.

1. Initialize

I -> [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]

2. i = 0

$PE_{(pos, 2i)} = PE_{(0, 0)} = \sin(0) = 0$

$PE_{(pos, 2i+1)} = PE_{(0, 1)} = \cos(0) = 1$

So we fill the 1st and 2nd elements:

I -> [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]

3. i = 1

$PE_{(pos, 2i)} = PE_{(0, 2)} = \sin(0) = 0$

$PE_{(pos, 2i+1)} = PE_{(0, 3)} = \cos(0) = 1$

So we fill the 3rd and 4th elements:

I -> [0., 1., 0., 1., 0., 0., 0., 0., 0., 0.]

4. Repeat the above steps for i = 2, 3 and 4:

I -> [0., 1., 0., 1., 0., 1., 0., 1., 0., 1.]

word “play”

The second word, "play", has pos = 1.

1. Initialize

play -> [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]

2. i = 0

$PE_{(pos, 2i)} = PE_{(1, 0)} = \sin\left(\frac{1}{10000^{0}}\right) = \sin(1) = 0.8414709848078965$

$PE_{(pos, 2i+1)} = PE_{(1, 1)} = \cos(1) = 0.5403023058681398$

So we fill the 1st and 2nd elements:

play -> [0.8414709848078965, 0.5403023058681398, 0., 0., 0., 0., 0., 0., 0., 0.]

3. i = 1

$PE_{(pos, 2i)} = PE_{(1, 2)} = \sin\left(\frac{1}{10000^{2/10}}\right) = \sin(0.158489319) = 0.1578266401303058$

$PE_{(pos, 2i+1)} = PE_{(1, 3)} = \cos(0.158489319) = 0.987466835729271$

So we fill the 3rd and 4th elements:

play -> [0.8414709848078965, 0.5403023058681398, 0.1578266401303058, 0.987466835729271, 0., 0., 0., 0., 0., 0.]

4. Repeat the above steps for i = 2, 3 and 4.

We do not need to show the full vector here; do it yourself if you are interested, or check it with the short sketch below.
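A minimal NumPy check of the full vector for pos = 1 and $d_{model} = 10$ (the first four values should match the numbers above):

import numpy as np

d_model, pos = 10, 1
i = np.arange(d_model)
angles = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))   # sin on even indices, cos on odd
print(pe)
# first four values: 0.84147098 0.54030231 0.15782664 0.98746684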

5. Here comes the interesting part: what is the vector for the 3rd word, "soccer"?

It turns out we do not need to calculate $PE_{(2, 2i)}$ and $PE_{(2, 2i+1)}$ directly, because we can get them from $PE_{(1, 2i)}$ and $PE_{(1, 2i+1)}$.

Remember

$$ \sin (A+B) = \sin A * \cos B + \cos A * \sin B $$

$$ \cos (A+B) = \cos A * \cos B - \sin A * \sin B $$

So when pos = 2 and i = 0,

$\sin(2)$ = $\sin(1+1)$ = $\sin 1 * \cos 1 + \cos 1 * \sin 1$ = play[0] * play[1] + play[1] * play[0]

$\cos(2)$ = $\cos(1+1)$ = $\cos 1 * \cos 1 - \sin 1 * \sin 1$ = play[1] * play[1] - play[0] * play[0]

In conclusion, soccer[0] and soccer[1] can be calculated from play[0] and play[1], soccer[2] and soccer[3] can be calculated from play[2] and play[3], and so on.

In other words, the words' relative positions can be easily calculated. See below for sample code:

import numpy as np
d_model = 20
base = 1 / np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :]//2)) / np.float32(d_model))
print(base)

[[1.00000000e+00 1.00000000e+00 3.98107171e-01 3.98107171e-01
  1.58489319e-01 1.58489319e-01 6.30957344e-02 6.30957344e-02
  2.51188643e-02 2.51188643e-02 1.00000000e-02 1.00000000e-02
  3.98107171e-03 3.98107171e-03 1.58489319e-03 1.58489319e-03
  6.30957344e-04 6.30957344e-04 2.51188643e-04 2.51188643e-04]]

# Now we apply sin() on the even-indexed elements and cos() on the odd-indexed elements for position 0
x0 = base * 0
x0[:, 0::2] = np.sin( x0[:, 0::2])
x0[:, 1::2] = np.cos( x0[:, 1::2])
print(x0)

[[0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]]

# Now we apply sin() on the even-indexed elements and cos() on the odd-indexed elements for position 1
x1 = base * 1
x1[:, 0::2] = np.sin(x1[:, 0::2])
x1[:, 1::2] = np.cos(x1[:, 1::2])
print(x1)

[[8.41470985e-01 5.40302306e-01 3.87674234e-01 9.21796446e-01
  1.57826640e-01 9.87466836e-01 6.30538780e-02 9.98010124e-01
  2.51162229e-02 9.99684538e-01 9.99983333e-03 9.99950000e-01
  3.98106119e-03 9.99992076e-01 1.58489253e-03 9.99998744e-01
  6.30957303e-04 9.99999801e-01 2.51188641e-04 9.99999968e-01]]

# Now we apply sin() on the even-indexed elements and cos() on the odd-indexed elements for position 2
x2 = base * 2
x2[:, 0::2] = np.sin(x2[:, 0::2])
x2[:, 1::2] = np.cos(x2[:, 1::2])
print(x2)

[[ 9.09297427e-01 -4.16146837e-01  7.14713463e-01  6.99417376e-01
   3.11697146e-01  9.50181503e-01  1.25856817e-01  9.92048417e-01
   5.02165994e-02  9.98738351e-01  1.99986667e-02  9.99800007e-01
   7.96205928e-03  9.99968302e-01  3.16978108e-03  9.99994976e-01
   1.26191435e-03  9.99999204e-01  5.02377265e-04  9.99999874e-01]]

OK, that is enough code. Let us do something simple: just focus on the first two values from x1 and x2.

import numpy as np
a, b = x1[0][0], x1[0][1]  # a = sin(x), b = cos(x)
c, d = x2[0][0], x2[0][1]  # c = sin(x*2), d = cos(x*2)

# we can calculate any values of x2 from x1 easily
np.testing.assert_almost_equal(a*b + a*b, c, decimal=10)  # sin(2x) = sin(x+x) = sin(x)cos(x) + cos(x)sin(x)
np.testing.assert_almost_equal(b*b - a*a, d, decimal=10)  # cos(2x) = cos(x+x) = cos(x)cos(x) - sin(x)sin(x)
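As one more illustrative check, the same double-angle relations hold for every sin/cos pair in the row, so the whole of x2 can be derived from x1 (this reuses x1 and x2 from the code above):

sin1, cos1 = x1[0, 0::2], x1[0, 1::2]   # sin and cos halves of position 1
sin2, cos2 = x2[0, 0::2], x2[0, 1::2]   # sin and cos halves of position 2
np.testing.assert_almost_equal(2 * sin1 * cos1, sin2, decimal=10)            # sin(2x) = 2*sin(x)*cos(x)
np.testing.assert_almost_equal(cos1 * cos1 - sin1 * sin1, cos2, decimal=10)  # cos(2x) = cos^2(x) - sin^2(x)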