Transformer (machine learning model)

The Transformer is a deep learning model introduced in 2017, used primarily in the field of natural language processing (NLP).[1] Like recurrent neural networks (RNNs), Transformers are designed to handle ordered sequences of data, such as natural language, for tasks such as machine translation and text summarization. Unlike RNNs, however, Transformers do not require that the sequence be processed in order: if the data is natural language, the Transformer does not need to process the beginning of a sentence before it processes the end. As a result, the Transformer allows for much more parallelization than RNNs during training.[1]

Since their introduction, Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as the long short-term memory (LSTM) in many cases. Because the Transformer architecture allows more parallelization during training, it has enabled training on much more data than was previously possible. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer), which are trained on huge amounts of general language data before being released, and can then be fine-tuned on specific language tasks.[2][3]

Background

Before the introduction of Transformers, most state-of-the-art natural language processing systems relied on gated recurrent neural networks (RNNs), such as LSTMs, with added attention mechanisms. The Transformer built upon these attention technologies without using an RNN structure, highlighting the fact that the attention mechanisms alone, without recurrent sequential processing, are powerful enough to achieve the performance of RNNs with attention.

Gated RNNs process tokens sequentially, maintaining an internal state vector that contains a latent representation of the data seen after every token. To process the $n$th token, the model combines the internal state representing the sentence up to token $n-1$ with the information of the new token to create a new latent state that represents the sentence up to token $n$. Theoretically, the information from one token can propagate arbitrarily far down the sequence, if at every point the internal state continues to encode information about the token. In practice, however, this mechanism is imperfect: due in part to the vanishing gradient problem, the model's latent state at the end of a long sentence often does not contain precise, extractable information about early tokens.
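The sequential dependence described above can be sketched as a simple loop. This is only an illustration under simplifying assumptions: it uses a plain (ungated) recurrent update with a scalar state, and the function name `run_rnn` and the weights `w_h` and `w_x` are hypothetical. A real gated RNN such as an LSTM adds gating on top of this pattern.

```python
import math

def run_rnn(tokens, w_h=0.5, w_x=1.0):
    """Process tokens strictly in order and return the state after each one.

    Hypothetical toy model: scalar state, fixed weights, tanh nonlinearity.
    """
    h = 0.0  # internal state before any token has been seen
    states = []
    for x in tokens:
        # The state for token n can only be computed once the state for
        # token n-1 is known, so this loop cannot be parallelized over tokens.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

# One informative token followed by three uninformative ones.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
```

In this toy setting the trace of the first token in the state shrinks at every later step, which loosely mirrors the loss of information about early tokens described above.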

This problem was addressed by the introduction of attention mechanisms. Attention mechanisms let a model directly look at, and draw from, the latent state at any earlier point in the sentence. The attention layer can access all previous latent states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens. A clear example of the utility of attention is in machine translation. In an English-to-French machine translation system, the first word of the French output most probably depends heavily on the beginning of the English input. However, in a classic encoder-decoder LSTM model, in order to produce the first word of the French output the model is only given the state vector of the last English word. Theoretically, this vector can encode information about the whole English sentence, giving the model all necessary knowledge, but in practice this information is often not well preserved. If we introduce an attention mechanism, the model can instead learn to attend to the latent states of early English tokens when producing the beginning of the French output, giving it a much more precise image of what it is translating.
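The idea of weighing earlier latent states by relevance can be sketched as follows. The sketch assumes that relevance is measured by a dot product between a query vector and each stored state. That is one possible choice made here for brevity, not necessarily the measure used in any particular RNN system, and the names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Return (weights, context) for a query over earlier latent states."""
    # Assumed relevance measure: dot product between query and each state.
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    # The context vector is the relevance-weighted sum of the states.
    context = [sum(w * state[d] for w, state in zip(weights, states))
               for d in range(dim)]
    return weights, context

# Three stored latent states; the query resembles the first one most.
states = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, context = attend([2.0, 0.0], states)
```

Here the first state receives the largest weight even though it is the oldest, so the context vector draws mainly on it rather than on the most recent state.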

When added to RNNs, attention mechanisms led to large gains in performance. The introduction of the Transformer brought to light the fact that attention mechanisms were powerful in themselves, and that sequential recurrent processing of data was not necessary for achieving the performance gains of RNNs with attention. The Transformer uses an attention mechanism without being an RNN, processing all tokens at the same time and calculating attention weights between them. The fact that Transformers do not rely on sequential processing, and lend themselves very easily to parallelization, allows Transformers to be trained more efficiently on larger amounts of data.

Architecture

The Transformer consists of two main components: a set of encoders chained together and a set of decoders chained together. Each encoder processes its input vectors to generate encodings, which contain information about which parts of the input are relevant to each other, and passes these encodings to the next encoder as its input. Each decoder does the opposite, taking all the encodings and using their incorporated contextual information to generate an output sequence.[4] To achieve this, each encoder and decoder makes use of an attention mechanism, which, for each input, weighs the relevance of every input and draws information from them accordingly when producing the output.[5] Each decoder also has an additional attention mechanism which draws information from the outputs of previous decoders, before the decoder draws information from the encodings. Both the encoders and decoders have a final feed-forward neural network for additional processing of the outputs, and also contain residual connections and layer normalization steps.[5]

Scaled Dot-Product Attention

The basic building blocks of the Transformer are scaled dot-product attention units. When a sentence is passed into a Transformer model, attention weights are calculated between every pair of tokens simultaneously. For every token in context, the attention unit produces an embedding that contains information not only about the token itself, but also about other relevant tokens, combined in proportion to their attention weights.

Concretely, for each attention unit the Transformer model learns three weight matrices: the query weights $W_{Q}$, the key weights $W_{K}$, and the value weights $W_{V}$. For each token $i$, the input word embedding $x_{i}$ is multiplied with each of the three weight matrices to produce a query vector $q_{i}=x_{i}W_{Q}$, a key vector $k_{i}=x_{i}W_{K}$, and a value vector $v_{i}=x_{i}W_{V}$. Attention weights are calculated using the query and key vectors: the attention weight $a_{ij}$ from token $i$ to token $j$ is the dot product between $q_{i}$ and $k_{j}$. The attention weights are divided by the square root of the dimension of the key vectors, ${\sqrt {d_{k}}}$, which stabilizes gradients during training, and passed through a softmax which normalizes the weights to sum to $1$. The fact that $W_{Q}$ and $W_{K}$ are different matrices allows attention to be non-symmetric: if token $i$ attends to token $j$ (i.e. $q_{i}\cdot k_{j}$ is large), this does not necessarily mean that token $j$ will attend to token $i$ (i.e. $q_{j}\cdot k_{i}$ is large). The output of the attention unit for token $i$ is the weighted sum of the value vectors of all tokens, weighted by $a_{ij}$, the attention from $i$ to each token.
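Written out per token, the computation described above can be restated as follows. Here $a_{ij}$ denotes the weight after scaling and softmax normalization, and $n$ (the number of tokens) and $o_{i}$ (the output for token $i$) are symbols introduced only for this restatement.

${\begin{aligned}a_{ij}&={\frac {\exp \left(q_{i}\cdot k_{j}/{\sqrt {d_{k}}}\right)}{\sum _{m=1}^{n}\exp \left(q_{i}\cdot k_{m}/{\sqrt {d_{k}}}\right)}},&o_{i}&=\sum _{j=1}^{n}a_{ij}v_{j}\end{aligned}}$

Because the numerator uses $q_{i}\cdot k_{j}$ while $a_{ji}$ uses $q_{j}\cdot k_{i}$, the two weights are in general different, which is the non-symmetry noted above.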

The attention calculation for all tokens can be expressed as one large matrix calculation, which is useful for training because optimized matrix operations are fast to compute. The matrices $Q$, $K$ and $V$ are defined as the matrices whose $i$th rows are the vectors $q_{i}$, $k_{i}$, and $v_{i}$ respectively.

${\begin{aligned}{\text{Attention}}(Q,K,V)={\text{softmax}}\left({\frac {QK^{\mathrm {T} }}{\sqrt {d_{k}}}}\right)V\end{aligned}}$
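A minimal sketch of this formula in plain Python, using nested lists as matrices, is given below. The helper names (`matmul`, `transpose`, `softmax_rows`, `attention`) are hypothetical, and a practical implementation would use an optimized linear algebra library rather than Python loops.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row independently."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])  # dimension of the key vectors
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, V), weights

# Two tokens with 2-dimensional queries, keys and values (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and row $i$ of `output` is the correspondingly weighted sum of the rows of $V$.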

Multi-head Attention

One set of $\left(W_{Q},W_{K},W_{V}\right)$ matrices is called an attention head, and each layer in a Transformer model has multiple attention heads. While one attention head attends to the tokens that are relevant to each token, with multiple attention heads the model can learn to do this for different definitions of "relevance". Research has shown that many attention heads in Transformers encode relevance relations that are transparent to humans. For example, there are attention heads that, for every token, attend mostly to the next word, and attention heads that mainly attend from verbs to their direct objects.[6] Since Transformer models have multiple attention heads, they can capture many levels and types of relevance relations, from surface-level to semantic. The outputs of the individual heads in a multi-head attention layer are concatenated and passed into the feed-forward neural network layers.
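The concatenation of several heads can be sketched as below. This is a simplified illustration with hypothetical helper names and made-up weights: each head is a `(W_Q, W_K, W_V)` triple, and the learned output projection that the original paper [1] applies after concatenation is omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for a single head."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row])
               for row in matmul(Q, K_T)]
    return matmul(weights, V)

def multi_head(X, heads):
    """Run every head on X and concatenate their outputs per token."""
    per_head = [head_output(X, *head) for head in heads]
    return [[value for out in per_head for value in out[i]]
            for i in range(len(X))]

# Two tokens, two heads; each head looks at a different input dimension.
X = [[1.0, 0.0], [0.0, 1.0]]
head_a = ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]])
head_b = ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]])
combined = multi_head(X, [head_a, head_b])
```

Each head produces a 1-dimensional output per token here, so the concatenated result has two values per token, one from each head.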

Encoder

Each encoder consists of two major components: a self-attention mechanism and a feed-forward neural network. The self-attention mechanism takes in a set of input encodings from the previous encoder and weighs their relevance to each other to generate a set of output encodings. The feed-forward neural network then further processes each output encoding individually. These output encodings are then passed to the next encoder as its input; the encodings produced by the final encoder are passed to the decoders.
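The statement that the feed-forward network processes each encoding individually can be illustrated with a short sketch. It assumes a two-layer network with a ReLU nonlinearity, as in the original paper [1]; the function names and the parameter values are hypothetical, and residual connections and layer normalization are left out.

```python
def feed_forward(x, W1, b1, W2, b2):
    """Two-layer network applied to one encoding: max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def apply_position_wise(encodings, params):
    """Apply the same network to every position, independently of the others."""
    return [feed_forward(x, *params) for x in encodings]

# Made-up parameters: 2 inputs -> 2 hidden units -> 1 output.
params = (
    [[1.0, -1.0], [0.0, 1.0]],  # W1
    [0.0, 0.0],                 # b1
    [[1.0], [1.0]],             # W2
    [0.5],                      # b2
)
result = apply_position_wise([[1.0, 2.0], [3.0, 0.0]], params)
```

Because the same parameters are applied to each position separately, reordering the input encodings simply reorders the outputs; any mixing of information between positions happens in the attention step.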

The first encoder takes positional information and embeddings of the input sequence as its input, rather than encodings. The positional information is necessary for the Transformer to make use of the order of the sequence, because no other part of the Transformer is sensitive to token order.[1]
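One way to supply positional information is the fixed sinusoidal encoding proposed in the original paper [1], which is added to the token embeddings. The sketch below assumes that scheme; the function names are hypothetical, and learned position embeddings are a common alternative.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal table: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional encoding to each token embedding element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)]
            for emb, pos in zip(embeddings, pe)]

# The same (all-zero) embedding at two positions gives two different inputs.
inputs = add_positions([[0.0] * 4, [0.0] * 4])
```

With this addition, two occurrences of the same token at different positions no longer look identical to the attention layers.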

Decoder

Each decoder consists of three major components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant information from the encodings generated by the encoders.[1][5]

Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. Since the Transformer should not use the current or future output to predict an output, the output sequence must be partially masked to prevent this reverse information flow.[1] The last decoder is followed by a final linear transformation and softmax layer, to produce the output probabilities over the vocabulary.
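The masking step can be sketched as a softmax restricted to earlier positions. The sketch assumes the common arrangement in which the decoder input is the output sequence shifted one step to the right, so that letting position $i$ attend to positions $j\leq i$ of that shifted input exposes only outputs that come before the one being predicted. The function name `masked_softmax` is hypothetical.

```python
import math

def masked_softmax(scores):
    """Row i may only attend to positions j <= i; later positions get weight 0.

    This has the same effect as setting the masked scores to minus infinity
    before an ordinary softmax.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # positions that are allowed to be seen
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights

# With equal raw scores, each row spreads its weight over the visible prefix.
weights = masked_softmax([[0.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0]])
```

The resulting weight matrix is lower triangular, so no position draws information from a later one.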

Training

Transformers typically undergo semi-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a much larger dataset than fine-tuning, due to the restricted availability of labeled training data. Tasks for pretraining and fine-tuning commonly include:

Applications

The Transformer finds most of its applications in the field of natural language processing (NLP), for example the tasks of machine translation and time series prediction.[8] Many pretrained models such as GPT-2, BERT, XLNet, and RoBERTa demonstrate the ability of Transformers to perform a wide variety of such NLP-related tasks, and have the potential to find real-world applications.[2][3][9] These may include:

References

[1] Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017-06-12). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
[2] "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog. Retrieved 2019-08-25.
[3] "Better Language Models and Their Implications". OpenAI. 2019-02-14. Retrieved 2019-08-25.
[4] "Sequence Modeling with Neural Networks (Part 2): Attention Models". Indico. 2016-04-18. Retrieved 2019-10-15.
[5] Alammar, Jay. "The Illustrated Transformer". jalammar.github.io. Retrieved 2019-10-15.
[6] Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (August 2019). "What Does BERT Look at? An Analysis of BERT's Attention". Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics: 276–286. doi:10.18653/v1/W19-4828.
[7] Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics: 353–355. arXiv:1804.07461. Bibcode:2018arXiv180407461W. doi:10.18653/v1/w18-5446.
[8] Allard, Maxime (2019-07-01). "What is a Transformer?". Medium. Retrieved 2019-10-21.
[9] Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2019-06-19). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". OCLC 1106350082.
[10] Monsters, Data (2017-09-26). "10 Applications of Artificial Neural Networks in Natural Language Processing". Medium. Retrieved 2019-10-21.
[11] Rives, Alexander; Goyal, Siddharth; Meier, Joshua; Guo, Demi; Ott, Myle; Zitnick, C. Lawrence; Ma, Jerry; Fergus, Rob (2019). "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences". doi:10.1101/622803.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
