Transformer, LLM, NLP

All you need to know about Positional encodings in Transformers

In RNNs and LSTMs, the words are fed in sequence, and hence the model understands the order of words. But the recurrence in an LSTM requires more and more sequential operations as the length of the sentence increases. In a transformer, we process all the words in parallel, which helps in decreasing the training time, but the model no longer sees the order of words. To keep the order of words in mind, the concept of positional encodings is introduced. It is a kind of encoding that denotes the position of each word. In simple words, we add the positional encodings to our existing word embeddings, and that gives us the final pre-processed embeddings which are used in the encoder part.

Note:- Before beginning this blog post, I highly recommend visiting my earlier blog post on an overview of transformers.

Different Techniques of Positional embeddings:-

1. Positional embedding = index of the word

In this case, if the length of the sentence is 30, then corresponding to each word, its index number (0, 1, 2, …, 29) can be used as its positional embedding, as shown in the below image:-

Positional embedding and word embedding being added up to give final embedding (Image by Author)

We could use this way of encoding, but the problem is that as the sentence length increases, the large positional embedding values dominate the original word embedding and hence distort it. So we discard this method for our natural language processing task.
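
To see the problem concretely, here is a minimal NumPy sketch (my own illustration, not code from the post; the embedding size and values are made up):

```python
import numpy as np

# Word embeddings usually contain small values, e.g. roughly in [-1, 1].
np.random.seed(0)
sentence_length, d_model = 30, 4
word_embeddings = np.random.uniform(-1, 1, size=(sentence_length, d_model))

# Index-based positional "embedding": just the word index, broadcast across all dimensions.
positions = np.arange(sentence_length).reshape(-1, 1)   # 0, 1, ..., 29
final_embeddings = word_embeddings + positions          # add the position to every dimension

# For later words the position term (here up to 29) swamps the word embedding.
print(final_embeddings[0])
print(final_embeddings[29])
```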

2. Positional embedding = fraction of sentence length

If we instead express the position as a fraction of the sentence length, i.e. pos/N where N = number of words, the values will be limited between 0 and 1. The only loophole here is that when we compare two sentences of different lengths, the positional embedding value at a particular index would be different. In general, the positional embedding should have the same value at a particular index for sentences of different lengths, or it will distort the understanding of the model. So we discard this method for our natural language processing task as well, and go for the frequency-based method for positional encoding as described in the original paper “Attention is all you need”.
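
A quick sketch of this loophole (again my own illustration, not the author's code):

```python
import numpy as np

# Position expressed as a fraction of the sentence length: pos / N.
def fractional_positions(n_words: int) -> np.ndarray:
    return np.arange(n_words) / n_words

# The same index gets a different value once the sentence length changes.
print(fractional_positions(10)[3])   # 0.3 for the 4th word of a 10-word sentence
print(fractional_positions(30)[3])   # 0.1 for the 4th word of a 30-word sentence
```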

3. Frequency-based Positional embeddings

The authors of the paper came up with a unique idea of using wave frequencies to capture positional information.

Sine function (Image by author)

For the first position embedding,

pos=0

d = size of the positional embedding; it should be equal to the dimension of the existing word embedding

i = index of each positional embedding dimension; it also determines the frequency (i = 0 being the highest frequency)

Positional embeddings (Image by author)

In the first sine curve diagram (where i = 4), we have plotted the sine curve for different values of position, where position denotes the position of the word. As the height of the sine curve depends on the position along the x-axis, we can use the height to encode word positions. Since the curve height varies within a fixed range and does not depend on text length, this method helps overcome the limitations discussed previously. Please check this awesome video to know more.

The values lie between -1 and 1, and as the length increases, the positional encoding values remain the same. But in the smooth sine curve from below (where i = 4), we see that the distance on the y-axis between word position 0 and word position 5 is very small. To overcome this, we increase the frequency (frequency = number of cycles completed in 1 second). If we do that, then in the first sine curve from above (where i = 0), we see that the distance between position 0 and position 5 is clearly visible.

The authors used a combination of sine and cosine functions to get these embeddings.
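
Concretely, the encoding from the paper applies the sine to the even embedding indices and the cosine to the odd ones:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))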

Wave frequency (Image by author)

Let’s code this
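
The original code is shown as an image below; for readers who want something copy-pasteable, here is a minimal NumPy sketch of the same idea (my own code, with variable names that may differ from the screenshot):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings of shape (max_len, d_model)."""
    pe = np.zeros((max_len, d_model))
    positions = np.arange(max_len).reshape(-1, 1)                     # pos = 0 .. max_len - 1
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # 10000^(2i/d)
    pe[:, 0::2] = np.sin(positions / div_term)                        # even dimensions -> sine
    pe[:, 1::2] = np.cos(positions / div_term)                        # odd dimensions  -> cosine
    return pe

pe = positional_encoding(max_len=10, d_model=6)
print(pe.round(3))
```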

Code to generate positional embedding (Image by Author)

Output preview is:-

Above code output (Image by Author)

Here we can see that the 1st and 2nd word positions are near each other, so their cosine similarity is high, while the 1st and 9th word positions are far apart, hence their cosine similarity is low.
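
As a quick, self-contained check of that observation (illustrative dimensions, not the author's exact output):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len).reshape(-1, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

pe = positional_encoding(10, 6)
print(cosine_similarity(pe[0], pe[1]))   # neighbouring positions -> higher similarity (~0.85 here)
print(cosine_similarity(pe[0], pe[8]))   # distant positions      -> lower similarity (~0.60 here)
```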

So that’s it on positional encodings. If you liked it, feel free to share it with your friends. Until then,


By Ashis
