Transformers: [You just need Attention]

Natural language processing (NLP) is a branch of machine learning that deals with text analytics. It is concerned with the interaction between human language and computers. Many NLP techniques have been developed over the years, and the current state of the art is the Transformer.

This is the first in a series of blog posts on the Transformer architecture. Transformers have proved revolutionary in the field of natural language processing. Since their introduction, they have largely displaced earlier NLP architectures such as the recurrent neural network (RNN), the convolutional neural network (CNN), and long short-term memory (LSTM). The Transformer has an encoder-decoder structure built around an attention mechanism, and it gives state-of-the-art results on many tasks.

What are we going to learn in this post?

This blog post focuses on the encoder. All of its components are discussed in detail, both theoretically and practically (code notebook included).
Encoder structure
Key components of an encoder, or how the data flows through an encoder
Input, converted to embeddings
Positional encodings
Multi-head attention layer (self-attention mechanism)
Residual network
Add and norm layer
Feed-forward layer

Overview of Architecture

Different parts of transformers as per the flow

This blog post focuses on the left part of the above image, which consists of:

Input pre-processing
Encoder

1. Input Pre-processing

Let’s zoom into the input pre-processing part of the image, as shown below.

Take, for example, the sentence “I went home”. Let’s follow its flow when it is passed into the encoder as input.

Step 1:- Word Embeddings

Computers don’t understand English words directly, so the words need to be converted into a numeric form. Each word of the sentence is mapped to a corresponding vector of a constant size (dimension). These vectors start out as random numbers and are learned during training; they are known as embedding vectors. The resulting matrix has the shape [sentence_length, dimension].

In the above example, we have 3 words, and each word has an associated embedding vector of length 4. Hence, shape = (3, 4), i.e. [sentence_length, dimension]. In natural language processing, the numbers in a word embedding carry information about the linguistic features of the word. If we plot the word embeddings of two words, say “grey” and “white”, they will lie close to each other in n-dimensional space because both are colors. Word embeddings are initially random numbers; they get updated during training, and the embeddings of related words, such as these two colors, move closer together.
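To make the shapes concrete, here is a minimal NumPy sketch of the embedding lookup for “I went home”. The three-word vocabulary and the names vocab and embedding_table are made up for illustration; a real model uses a much larger vocabulary and learns the table during training.

```python
import numpy as np

# Hypothetical toy vocabulary (an assumption for this sketch only).
vocab = {"I": 0, "went": 1, "home": 2}
d_model = 4  # embedding dimension used in the example above

rng = np.random.default_rng(0)
# One row per vocabulary entry, randomly initialised.
# During training these rows would be updated; here they stay random.
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = "I went home".split()
token_ids = [vocab[word] for word in sentence]

# Looking up an embedding is just selecting rows of the table.
embeddings = embedding_table[token_ids]

print(embeddings.shape)  # (3, 4) -> [sentence_length, dimension]
```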

Step 2:- Positional Encodings

In an RNN or LSTM, the words are fed in one at a time, so the model inherently knows the order of the words. The downside is that this recurrence needs more and more sequential operations as the sentence gets longer. In a Transformer, all the words are passed in parallel, so the order of the words has to be supplied separately. For this, the concept of positional encodings is introduced: an encoding that denotes the position of each word. In simple words, we add the positional encodings to our existing word embeddings, and that gives the final pre-processed embeddings to be used in the encoder part. The concept of positional encoding is important, and I’ll do a separate blog post on it and link it over here.

Note:- The positional encoding must have the same size as the existing word embedding; that is what allows the two to be added element-wise.
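As a rough sketch of how this addition can look, the snippet below uses the sinusoidal encoding from the original Transformer paper. The post does not say which encoding it uses, so treat that choice as an assumption; the function name is also made up, and the code assumes an even dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even columns use sine
    pe[:, 1::2] = np.cos(angles)  # odd columns use cosine
    return pe

rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(3, 4))  # stand-in for "I went home"

pe = sinusoidal_positional_encoding(seq_len=3, d_model=4)

# Same shape on both sides, so the two simply add element-wise.
encoder_input = word_embeddings + pe

print(pe[0])                # [0. 1. 0. 1.] for position 0
print(encoder_input.shape)  # (3, 4)
```

Because the encoding depends only on the position and the dimension, it can be computed once and reused for every sentence of the same length.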

2. Encoder Stack:-

There can be N encoders stacked on top of each other (N = 6 in the original paper), but their internal structure is the same. The embeddings that we discussed earlier pass through 4 units:-

Multi-head attention
Add and norm layer
Feed-forward layer
Add and norm layer

1. Multi-Head Attention

The attention mechanism is a way of creating a representation for each word in which the word is compared with every other word in the same sentence, including itself. This comparison is done with a dot product between learned projections of the words (queries and keys), and the resulting scores are turned into weights with a softmax. In the paper, 8 attention heads run in parallel, hence the name multi-head attention. The output embeddings contain contextual information and show how each word is related to the other words in the sentence. This also helps in handling ambiguity, as shown below: here “it” is more related to “Elephant” than to “banana”.

Representation of words in Self-attention mechanism

The multi-head attention mechanism deserves a more detailed discussion, so I’ll do a separate blog post on it and link it over here.
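Until then, here is a small NumPy sketch of multi-head self-attention using scaled dot-product attention, as described in the original paper. It is only an illustration under some assumptions: the weights are random instead of trained, there is no masking or dropout, and it uses 2 heads on a 4-dimensional embedding (the paper uses 8 heads on 512 dimensions). The function and variable names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Dot product of every word with every word, per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads back into (seq_len, d_model) and mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 4, 2
x = rng.normal(size=(seq_len, d_model))  # stand-in for the pre-processed input
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, attention = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(output.shape)     # (3, 4): same shape as the input
print(attention.shape)  # (2, 3, 3): one 3x3 word-to-word table per head
```

Each row of an attention table says how much one word looks at every word in the sentence, which is the kind of relation shown in the picture above.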

2. Add and norm layer

The output of the multi-head attention layer is added to that layer's input embedding through a residual (skip) connection. The purpose of the residual connection is to make sure that no vital original information is lost as the embedding passes through the layer. The sum is then normalized (the paper uses layer normalization).
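A minimal sketch of an add and norm step could look like the following. It assumes layer normalization over the embedding dimension and leaves out the learnable scale and shift that real implementations usually have; the function names are made up for this post.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each word vector to zero mean and unit variance.
    # Learnable scale and shift are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(sublayer_input, sublayer_output):
    # Residual connection first, normalisation second.
    return layer_norm(sublayer_input + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                 # input to the attention layer
attention_output = rng.normal(size=(3, 4))  # stand-in for the attention output

normalized = add_and_norm(x, attention_output)
print(normalized.shape)  # (3, 4)
```

Since the input is added back before normalizing, the layer only has to learn a correction on top of what it received.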

3. Feedforward layer

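The normalized embedding is then passed through a fully connected feed-forward network, which is applied to each word position separately and identically. In the paper, it consists of two linear layers with a ReLU activation in between.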
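Here is a rough sketch of such a layer, assuming the two-linear-layers-with-ReLU design from the original paper. The sizes (4 and 16) are toy values chosen for this post, where the paper uses 512 and 2048, and the weights are random rather than trained.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)  # first linear layer + ReLU
    return hidden @ w2 + b2                # second linear layer

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16  # toy sizes for illustration
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))  # stand-in for the normalized embedding
ffn_output = feed_forward(x, w1, b1, w2, b2)
print(ffn_output.shape)  # (3, 4)

# The same weights are applied to every word independently.
only_second_word = feed_forward(x[1:2], w1, b1, w2, b2)
print(np.allclose(only_second_word[0], ffn_output[1]))  # True
```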

4. Add and norm layer

The output of the feed-forward layer is added to that layer's own input (the normalized output of the first add and norm layer) through another residual connection. As before, the purpose of the residual connection is to make sure that no vital information is lost along the way. The sum is then normalized.

The above four steps make up one encoder and are repeated multiple times (Nx as shown in the figure), which gives us contextual embeddings.
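To see the four steps working together, here is a compact sketch that stacks several encoders. It makes a few simplifying assumptions: one attention head instead of eight, no biases, no learnable normalization parameters, and random instead of trained weights. The helper names (init_layer, encoder_layer) are made up for this post.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_layer(rng, d_model, d_ff):
    # Each encoder gets its own (here random, untrained) weights.
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model),
        "w_v": (d_model, d_model),
        "w1": (d_model, d_ff), "w2": (d_ff, d_model),
    }
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}

def encoder_layer(x, p):
    # 1. Attention (a single head, to keep the sketch short)
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    attended = softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v
    # 2. Add and norm
    x = layer_norm(x + attended)
    # 3. Feed-forward
    ffn_output = np.maximum(0.0, x @ p["w1"]) @ p["w2"]
    # 4. Add and norm
    return layer_norm(x + ffn_output)

rng = np.random.default_rng(0)
d_model, d_ff, num_layers = 4, 16, 6  # 6 encoders, as in the original paper
layers = [init_layer(rng, d_model, d_ff) for _ in range(num_layers)]

x = rng.normal(size=(3, d_model))  # stand-in for the pre-processed "I went home"
for params in layers:
    x = encoder_layer(x, params)

print(x.shape)  # (3, 4): the shape never changes, only the content
```

Because every encoder returns the same shape it receives, any number of them can be chained one after the other.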

Pros

Transformers provide state-of-the-art results on many NLP tasks. Even a basic pre-trained model gives good results.

Cons

The architecture is very large. It has millions of parameters to train (roughly 65 million in the base model of the original paper) and needs a lot of computational power for training.

The next post will cover the decoder part. Until then, bye!
