Self-Attention Mechanisms in Natural Language Processing

Sep 12, 2018

Over the last few years, Attention Mechanisms have found broad application in all kinds of natural language processing (NLP) tasks based on deep learning.

As research into Attention Mechanisms deepened, the types of “attention” proposed by researchers multiplied. In June 2017, Google’s machine translation team published a paper on arXiv entitled “Attention is All You Need,” which attracted wide interest in the industry. Self-attention mechanisms became a hot topic in neural network attention research and proved useful in a wide variety of tasks.

In this article, we will explore various forms of Attention Mechanisms in NLP and their associated applications. We will draw on Dr. Zhang Junlin’s blog post “Attention Mechanisms in Deep Learning” (2017 v.) and Su Jianlin’s article “Understanding Attention is All You Need,” along with the attached code examples.

Background

In their earliest days, Attention Mechanisms were used primarily in the field of computer vision, beginning in about the 1990s. However, they didn’t become popular until the Google DeepMind team published the paper “Recurrent Models of Visual Attention” in 2014. In that paper, they applied Attention Mechanisms to an RNN model for image classification.

Later, researchers experimented with Attention Mechanisms for machine translation tasks. In the paper “Neural Machine Translation by Jointly Learning to Align and Translate,” Dzmitry Bahdanau et al. used this method to perform translation and alignment jointly. Their work was the first to apply Attention Mechanisms to the field of Natural Language Processing. The paper caught on, and Attention Mechanisms then became common in NLP tasks based on neural networks such as RNNs and CNNs.

In 2017, Google’s machine translation team used self-attention mechanisms heavily to learn text representations in the paper “Attention is All You Need.” Self-attention has since become a hot research topic, and its use is being explored in all kinds of NLP tasks.

The image below displays the general trend of Attention Mechanism research:

In essence, Attention Mechanisms imitate human visual attention. When a person looks at something, they typically do not scan the entire scene end to end; rather, they focus on a specific portion according to their needs. When a person notices that the object they want to pay attention to typically appears in a particular part of a scene, they learn to look at that part in the future and tend to focus their attention on that area.

Next, I will discuss the attention computation method typically used in NLP, drawing in large part on the images found in Dr. Zhang Junlin’s “Attention Mechanisms in Deep Learning” (2017 v.).

In essence, Attention can be described as a mapping from a query to a series of key-value pairs, as in the image below:

Calculating attention takes three main steps. First, we take the query and each key and compute the similarity between the two to obtain a weight. Frequently used similarity functions include the dot product, concatenation, and a perceptron (a small feed-forward network). The second step is typically to normalize these weights with a softmax function. Finally, we compute the weighted sum of the corresponding values using these weights to obtain the final Attention.

In current NLP work, the key and value are frequently the same, therefore key=value.
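
The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming the dot product as the similarity function; the function name `attention` and the toy vectors are made up for the example and do not come from any of the cited works.

```python
import math

def attention(query, keys, values):
    # Step 1: similarity between the query and each key (dot product here).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: normalize the scores into weights with a (stable) softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: weighted sum of the values.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy data (assumed for illustration); key = value, as is common in NLP.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = keys
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
print(weights)
print(output)
```

The first and third keys have the same dot product with the query, so they receive the same weight, and the second key receives a smaller one.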

Attention Is All You Need

The paper “Attention is All You Need” was posted to arXiv in 2017 by the Google machine translation team and was later published at NIPS 2017.

The major points of this paper are:

Unlike the popular machine translation techniques of the past, which combined an RNN with the Seq2Seq framework, the paper replaces the RNN with the Attention Mechanism to construct the entire model.
The paper proposes the Multi-Head Attention mechanism and uses Multi-Head Self-Attention heavily in the encoder and decoder.
On the WMT2014 English-German and English-French tasks, this system achieved impressive results and can be trained faster than popular models.

The structure of the model is shown in the image below. The model consists of an encoder and a decoder. The encoder is a stack of N blocks, each formed by a Multi-Head Attention sub-layer and a feed-forward neural network sub-layer. The blocks in the decoder are similar to those in the encoder, except that each decoder block has one more Multi-Head Attention layer.

To better optimize the deep network, every sub-layer is wrapped in a residual connection followed by layer normalization (Add & Norm).
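
As a rough sketch of what Add & Norm does to a single position vector, the snippet below adds the sub-layer output back to its input and then normalizes the result. It is a simplification under stated assumptions: the learnable gain and bias of layer normalization are left out, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize one vector to zero mean and (approximately) unit variance.
    # The learnable gain and bias are omitted in this sketch.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    # Residual connection: add the sub-layer output to its own input.
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sub-layer.
    return [0.5 * v for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```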

Before we get into Multi-Head Attention, we should first look at the concept of Scaled Dot-Product Attention proposed in the paper.

Compared to the standard form of Attention described in the background above, Scaled Dot-Product Attention uses a scaled dot product to calculate similarity. The difference is an extra scaling step based on the dimension of K: the dot product is divided by the square root of the key dimension, which prevents the inner product from becoming too large.
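
Written out, the Scaled Dot-Product Attention defined in the paper is the following, where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the keys:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Dividing by \sqrt{d_k} is the adjustment described above: without it, large dot products push the softmax into regions where its gradients are very small.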

The Multi-Head Attention structure is outlined in the image below. The query, key, and value first go through a linear transformation and then enter Scaled Dot-Product Attention. This attention is calculated h times, which is why it is called Multi-Head: each computation is one head. The W parameters of the linear transformations of Q, K, and V are different for each head. The results of the h Scaled Dot-Product Attention computations are then concatenated, and the value obtained after one further linear transformation is the result of Multi-Head Attention.

We can see that what distinguishes the Multi-Head Attention proposed by Google is that the calculation is performed h times instead of once. The paper mentions that this allows the model to learn relevant information in different representation subspaces, and it later uses Attention visualizations as verification.
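
The following is a small sketch of Multi-Head Attention that uses plain Python lists to stay dependency-free. It is an illustration under assumptions, not the paper's implementation: the projection matrices are random rather than learned, the sizes (`seq_len`, `d_model`, `h`) are arbitrary toy values, and helper names such as `matmul` and `random_matrix` are invented for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q k^T / sqrt(d_k)) v for matrices stored as lists of rows."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(q, k, v, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) projection triples, one per head."""
    head_outputs = [
        scaled_dot_product_attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the h head outputs position by position.
    concat = [[x for head in head_outputs for x in head[i]] for i in range(len(q))]
    # One further linear transformation gives the final result.
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption for the demo)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, h = 4, 8, 2      # toy sizes, chosen only for illustration
d_head = d_model // h
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_head, d_model, rng)

# Passing the same matrix as query, key, and value makes this self-attention.
output = multi_head_attention(x, x, x, heads, w_o)
print(len(output), len(output[0]))  # 4 8
```

Each head works in a smaller `d_head`-dimensional space, so the total cost stays close to that of a single full-width head while the heads are free to attend to different things.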

Applying Attention throughout the Entire Model

As shown in the image below, Multi-Head Attention is first used as the connection between the encoder and decoder. The key and value are both the layer output of the encoder (key=value here), and the query is the output of the Multi-Head Attention in the decoder.

This is the same as the Attention used in mainstream machine translation models: encoder-decoder Attention is used to align the translation. In addition, Multi-Head Self-Attention is used within both the encoder and the decoder to learn representations of the text.

Self-Attention means K=V=Q. If the input is, for example, a sentence, then each word in the sentence attends to every word in that sentence. The goal is to learn the dependencies between the words in the sentence and use that information to capture its internal structure.

As for the reasons behind using Self-Attention, the paper brings up three main points (per-layer computational complexity, parallelizability, and long-range dependency learning) and compares Self-Attention with RNN and CNN on each of them.

We can see that if the input sequence length n is smaller than the representation dimension d, then Self-Attention is advantageous in terms of per-layer time complexity: a Self-Attention layer costs O(n^2 · d), while a recurrent layer costs O(n · d^2).

When n is larger, the authors suggest restricted Self-Attention, where a word does not attend to every other word; instead, it attends only to a neighborhood of r words.
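
To make the comparison concrete, the sketch below plugs numbers into the leading-order per-layer costs listed in the paper's complexity table. Constants are dropped, and the sequence lengths, the dimension, the kernel width `k`, and the neighborhood size `r` are arbitrary values assumed for illustration.

```python
def per_layer_complexity(n, d, k=3, r=None):
    """Leading-order per-layer operation counts (constants dropped).

    n: sequence length, d: representation dimension,
    k: convolution kernel width, r: neighborhood size for restricted self-attention.
    """
    costs = {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }
    if r is not None:
        costs["self_attention_restricted"] = r * n * d
    return costs

# Short sentence: n is smaller than d, so self-attention is the cheapest.
short_seq = per_layer_complexity(n=50, d=512)
# Long document: n is much larger than d, so full self-attention is the most expensive
# and restricting attention to r neighbors brings the cost back down.
long_seq = per_layer_complexity(n=5000, d=512, r=100)

for name, costs in (("short", short_seq), ("long", long_seq)):
    print(name, costs)
```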

Regarding parallelism, neither Multi-Head Attention nor CNN depends on the computation of the previous time step, so both parallelize better than RNN.

Regarding long-range dependencies, since in Self-Attention each word attends to all words, the maximum path length between any two words is 1 no matter how far apart they are, so the model can capture long-range dependency relationships.

Finally, let’s take a look at the experimental results on the WMT2014 English-German and English-French machine translation tasks, where the model outperformed previous systems and trained faster than other models.

We can see from the hyperparameter experiments that the number of heads h in Multi-Head Attention cannot be too small, and that quality also declines if it is too large. Overall, larger models are better than smaller ones, and dropout helps with overfitting.

The authors also applied this model to a syntactic parsing task (English constituency parsing), where it performed well.

Last, let’s take a look at the Attention visualizations (here different colors represent the results of different Attention heads; the darker the color, the larger the Attention value). We can see that Self-Attention is capable of learning the long-range dependency within the phrase “making…more difficult.”

Comparing two heads with one head, we can see that a single head is only capable of learning the dependency between “its” and “law.” Two heads, however, found not only the dependency between “its” and “law” but also the one between “its” and “application.” Multiple heads can learn related information from different representation subspaces.

Self-Attention in NLP

Deep Semantic Role Labeling with Self-Attention

Paper: https://www.paperweekly.site/papers/1786

Source code: https://github.com/XMUNLP/Tagger

This paper comes from Tan et al. at Xiamen University and was published at AAAI 2018. They applied Self-Attention to Semantic Role Labeling (SRL) tasks with impressive results.

In this paper, the authors treat SRL as a sequence labeling problem and use BIO tags for labeling. They then propose a Deep Attentional Neural Network for the labeling task. The structure is as follows:

Each network block combines an RNN/CNN/FNN sub-layer with a Self-Attention sub-layer. Finally, softmax is used for label classification in sequence labeling.

This model achieved strong results on both the CoNLL-2005 and CoNLL-2012 SRL datasets. In sequence labeling, labels depend on one another. For example, label I should appear after label B and not after label O.
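
The label dependency mentioned above can be expressed as a simple validity check. This is only an illustrative sketch: the function name `is_valid_bio` and the role names such as `ARG0` and `V` are examples chosen here, not taken from the paper or its source code.

```python
def is_valid_bio(tags):
    """Return True if every I-X tag directly follows B-X or I-X of the same role."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            role = tag[2:]
            # An I tag may only continue a span opened by B (or extended by I).
            if previous not in (f"B-{role}", f"I-{role}"):
                return False
        previous = tag
    return True

print(is_valid_bio(["B-ARG0", "I-ARG0", "O", "B-V"]))  # True
print(is_valid_bio(["O", "I-ARG0"]))                   # False: I follows O
print(is_valid_bio(["B-ARG0", "I-ARG1"]))              # False: role changes mid-span
```

A softmax classifier that labels each position independently has no built-in guarantee that its output passes such a check, which is why other models add a CRF or constrained decoding on top.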

The currently favored sequence labeling model is BiLSTM-CRF, which uses a CRF for global label optimization. In the comparative experiments, the models of He et al. and Zhou and Xu used constrained decoding and a CRF, respectively, to handle this issue.

We can see that this paper uses only Self-Attention. The authors believe that the Attention at the top layer of the model is capable of learning the implicit dependencies between the tags.

Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction

Paper: https://www.paperweekly.site/papers/1787

Authors: Patrick Verga / Emma Strubell / Andrew McCallum

This paper by Andrew McCallum’s team applies Self-Attention to relation extraction in the biomedical domain and has been accepted at NAACL 2018. The authors propose a document-level biological relation extraction model. The paper contains a good deal of additional work that interested readers are welcome to explore in more detail.

Here we will cover only the portion where they applied Self-Attention. The model outlined in the paper is shown in the image below. The authors used the Transformer proposed by Google, which contains Self-Attention, for representation learning on the input text. Their Transformer differs slightly from the original in that they used a CNN with a window of 5 to replace the original FNN.

Let’s focus on the experimental results concerning Attention. The model showed excellent results on a Chemical Disease Relations dataset. After removing the Self-Attention layer we can see a large drop in results, and using a CNN with a window of 5 instead of the original FNN performs noticeably better on this dataset.

Summary

As a final summary, Self-Attention can be seen as a special case of general Attention. In Self-Attention, Q=K=V, and every unit in a sequence attends to all units in that sequence.

The Multi-Head Attention method proposed by Google computes attention multiple times to capture relevant information from different subspaces. What makes Self-Attention unique is that it ignores the distance between words and directly computes dependency relationships, making it capable of learning the internal structure of a sentence. It is also relatively simple and can be computed in parallel.

Reference:

https://www.alibabacloud.com/blog/self-attention-mechanisms-in-natural-language-processing_593968?spm=a2c4.11999894.0.0