A Quick Dive into Deep Learning: From Neural Cells to BERT

By Shi En, Feng Yin, and Tiao Can.


As a milestone in the natural language processing field, Bidirectional Encoder Representations from Transformers (BERT) did not appear out of nowhere. Rather, the development of this complex model followed a long line of development for deep learning and neural network models.

In this article, written by Shi En, Feng Yin, and Tiao Can from the dialog algorithm team at Ant Financial, we will look at the evolution of some of the major deep learning models, from the very simplest to the most complex, that we have come to know and use today.

That is, from a simple neural cell to one of the most complex models used today, the Bidirectional Encoder Representations from Transformers (BERT) model, this article discusses how deep learning for natural language processing has evolved and where natural language processing may be heading based on industry trends. We hope that, after reading this article, you will have gained a deeper understanding of deep learning.

But before we get into how neural networks evolved to the complex algorithms and models we see today, let’s first discuss what a neural network actually is.

Generally speaking, a neural network consists of an input layer, one or more hidden layers, and an output layer. The input layer receives the features, and the output layer produces the predictions. A neural network is designed to fit a function f that maps features to predictions. During training, the network parameters are adjusted to reduce the difference between the prediction and the actual label, so that the network gradually approaches the ideal function f.
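The training idea above can be sketched in a few lines. This is only an illustration, assuming a single linear neuron, a mean squared error loss, and plain gradient descent; the target function `f`, the learning rate, and all names are hypothetical choices rather than anything prescribed by this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "ideal function" f that the network should approximate.
def f(x):
    return 3.0 * x + 1.0

x = rng.uniform(-1.0, 1.0, size=(64, 1))   # features
y = f(x)                                   # actual labels

w, b = 0.0, 0.0    # network parameters, starting from zero
lr = 0.3           # learning rate (assumed value)

for step in range(200):
    pred = w * x + b                    # prediction
    err = pred - y                      # difference between prediction and label
    loss = float(np.mean(err ** 2))     # mean squared error
    # Gradients of the loss with respect to w and b.
    grad_w = float(np.mean(2 * err * x))
    grad_b = float(np.mean(2 * err))
    # Move the parameters a small step against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b
```

After the loop, `w` and `b` are close to the slope and intercept of `f`. Real networks follow the same pattern, with many more parameters and gradients computed by backpropagation.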

The Neural Cell

Figure 1. The Neural Cell

Shallow Neural Network (SNN)

Figure 2. A Shallow Neural Network.

Deep Learning Network (Multilayer Perceptron)

Figure 3. A Deep Learning Network

Generally speaking, a sufficiently wide network can fit any function. However, a multilayer perceptron can fit the same function with fewer parameters, because the neural cells in its deeper layers can build more complex feature representations than those of a shallow neural network.

The networks shown in Figures 2 and 3 are called fully connected networks, because each neural cell in a hidden layer is connected to the outputs of all neural cells in the previous layer. By contrast, in some networks a neural cell is connected only to the outputs of some neural cells in the previous layer, for example, the convolutional neural network described below.
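As a rough sketch of what "fully connected" means in practice, the forward pass of a small multilayer perceptron is shown here. The layer sizes, the ReLU activation, and the random weights are assumptions made for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense(x, weight, bias):
    # Every output unit is a weighted sum of ALL inputs plus a bias.
    return x @ weight + bias

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # batch of 4 samples, 8 features each

# Two hidden layers and one output layer (sizes are arbitrary).
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
w3, b3 = rng.normal(size=(16, 3)), np.zeros(3)

h1 = relu(dense(x, w1, b1))            # first hidden layer
h2 = relu(dense(h1, w2, b2))           # second hidden layer
logits = dense(h2, w3, b3)             # output layer, shape (4, 3)
```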

Convolutional Neural Network (CNN)

The same convolution core (kernel) slides over the input from left to right and from top to bottom. At each position it is multiplied element-wise with the input patch and the products are summed, which produces outputs of different strengths. Intuitively, a convolution core is sensitive to some data distributions in the raw data and not to others. If the convolution core is viewed as a certain pattern, data that conforms to this pattern produces a strong output, and data that does not conform produces a weak output or no output at all.

A convolution core is a pattern extractor, and multiple convolution cores are multiple pattern extractors. Using multiple feature extractors to extract and convert the features of raw data constitutes a layer of convolution.
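To make the "pattern extractor" idea concrete, here is a minimal sketch of a single convolution core sliding over a tiny image. The kernel values and the image are made up for illustration, and the loop-based implementation (stride 1, no padding) is chosen for readability rather than speed.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):            # top to bottom
        for j in range(out_w):        # left to right
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A kernel that responds to a vertical edge (dark on the left, bright on the right).
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

image = np.array([[0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])

response = conv2d(image, kernel)
```

Here `response` is 2 in the column where the edge sits and 0 everywhere else: the patch that matches the pattern gives a strong output, and the flat regions give none.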

Because of GPU memory limits, AlexNet, the example above, splits the model across two GPUs. In essence, the convolution layer is used for feature extraction, the max pooling layer keeps the strongest features and reduces the number of parameters, and the fully connected layer involves all high-level features in the final classification decision.

Recurrent Neural Network (RNN)

In a recurrent neural network (RNN), x1, x2, x3, ..., xt are the inputs at different time steps, whereas the three matrices V, U, and W are shared across all time steps. The RNN also maintains its own status (hidden state), which changes with each input. Different inputs, or the same input at different times, affect the status differently. Generally speaking, the status determines the final output.
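The recurrence can be sketched as follows. The roles assigned to the matrices are an assumption based on the common convention (U maps the input to the status, W maps the previous status to the new status, V maps the status to the output), and the tanh activation and the sizes are illustrative choices.

```python
import numpy as np

def rnn_forward(xs, U, W, V, h0):
    """Run a vanilla RNN over a sequence; return all outputs and the final status."""
    h = h0
    outputs = []
    for x_t in xs:                       # x1, x2, ..., xt in time order
        h = np.tanh(U @ x_t + W @ h)     # new status from the input and the old status
        outputs.append(V @ h)            # the output is determined by the status
    return np.stack(outputs), h

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, steps = 3, 5, 2, 4
U = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
V = rng.normal(size=(output_dim, hidden_dim)) * 0.1
xs = rng.normal(size=(steps, input_dim))

ys, h_last = rnn_forward(xs, U, W, V, np.zeros(hidden_dim))
```

Note that the same `U`, `W`, and `V` are reused at every step; only the status `h` changes as the sequence is read.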

Intuitively, the RNN can be understood as a neural network (the action) that can approximate any function. Together, the action and the RNN’s own historical memory make the RNN, in theory, as powerful as a Turing machine.

Long Short-Term Memory (LSTM)

The LSTM cell contains a forget gate (element-wise product, which determines what is removed from the status), an input update gate (element-wise addition, which determines what is added to the status), and an output gate (element-wise product, which determines what part of the status is output). Although LSTM looks complex, in reality it comes down to a few matrix operations.
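A single LSTM step can be sketched as below, following the standard formulation. Packing the four gate weight matrices into one matrix `W` is an implementation choice made here for brevity, and the sizes are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, input_dim + hidden)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b    # one matrix multiplication
    f = sigmoid(z[0 * hidden:1 * hidden])      # forget gate
    i = sigmoid(z[1 * hidden:2 * hidden])      # input gate
    g = np.tanh(z[2 * hidden:3 * hidden])      # candidate values
    o = sigmoid(z[3 * hidden:4 * hidden])      # output gate
    c = f * c_prev + i * g                     # forget old information, add new
    h = o * np.tanh(c)                         # gated output of the status
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden = 3, 4
W = rng.normal(size=(4 * hidden, input_dim + hidden)) * 0.1
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, input_dim)):      # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)
```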

To simplify the computation, a variant of the LSTM called the gated recurrent unit (GRU) was proposed, as shown below:

Convolutional Neural Networks for Text Classification

TextCNN uses one-dimensional convolution to obtain feature representations of the n-grams in a sentence. TextCNN is excellent at extracting shallow features of text. It is widely used for short text, such as in search or dialog, and is often the first choice for intent classification. For long text, however, TextCNN can extract features only within the filter window. Therefore, it has limited long-distance modeling capabilities and is not sensitive to word order.

Convolution Core (Filter) and the N-Gram Feature


The max-pooling operation is performed on the feature map c, and the maximum value max{c} is used as the feature extracted by the filter. You can capture the most important features by selecting the maximum value of each feature map.

Each filter (convolution core) generates one feature. A TextCNN network contains many convolution cores of different window sizes, for example the commonly used filter sizes ∈ {3, 4, 5} with 100 feature maps for each filter size.
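The feature extraction described above can be sketched as follows, using the window sizes 3, 4, and 5 and 100 feature maps per size. The embedding dimension, the ReLU activation, the omission of bias terms, and the random weights are assumptions for illustration.

```python
import numpy as np

def textcnn_features(embeddings, filters):
    """embeddings: (n_words, dim). filters: {window: array of shape (n_maps, window, dim)}."""
    n_words = embeddings.shape[0]
    pooled = []
    for window, weight in sorted(filters.items()):
        # Slide the window over the sentence: one n-gram per position.
        windows = np.stack([embeddings[i:i + window]
                            for i in range(n_words - window + 1)])
        # Feature maps c with shape (positions, n_maps), followed by ReLU.
        c = np.maximum(np.einsum("pwd,mwd->pm", windows, weight), 0.0)
        pooled.append(c.max(axis=0))     # max pooling: keep max{c} per feature map
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
sentence = rng.normal(size=(10, 50))     # 10 words, 50-dimensional embeddings
filters = {w: rng.normal(size=(100, w, 50)) * 0.1 for w in (3, 4, 5)}

features = textcnn_features(sentence, filters)   # 3 window sizes x 100 maps
```

The resulting 300-dimensional vector would then typically be fed to a fully connected layer for classification.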

Enhanced Sequential Inference Model (ESIM)

Image source: Enhanced Long Short-Term Memory (LSTM) for Natural Language Inference

As shown in the figure, the enhanced sequential inference model is on the left part of the figure. The overall network structure is clear. The entire path consists of three steps:

  • Step 1: encoding layer. In this step, the pre-trained embedding of each token is passed through a bidirectional long short-term memory (Bi-LSTM) layer to obtain a new encoding. The purpose is to learn the context of each token through LSTM.
  • Step 2: local inference layer. This step computes attention between the two sentences. Applying inter-sentence attention to the results of step 1 yields a new vector representation for each token. Next, the difference and the element-wise product between the vectors before and after attention are calculated. This operation further extracts local inference information by comparing the representations before and after attention, and captures inference relationships such as contradiction.
  • Step 3: combined inference & prediction layer. The enhanced results are again passed through a Bi-LSTM, and then average pooling and max pooling are performed separately and their results are concatenated. Finally, a fully connected layer with Softmax predicts the class probabilities.
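Steps 2 and 3 can be sketched roughly as follows, based on the formulation in the ESIM paper. The Bi-LSTM layers are omitted, so `a` and `b` stand in for already-encoded sentences, and pooling is applied directly to the enhanced vectors; the function names and sizes are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_inference(a, b):
    """a: (len_a, d) and b: (len_b, d) are the encoded sentences."""
    scores = a @ b.T                              # (len_a, len_b) similarity scores
    a_aligned = softmax(scores, axis=1) @ b       # each token of a summarized from b
    b_aligned = softmax(scores, axis=0).T @ a     # each token of b summarized from a
    # Enhancement: original, aligned, difference, and element-wise product.
    m_a = np.concatenate([a, a_aligned, a - a_aligned, a * a_aligned], axis=1)
    m_b = np.concatenate([b, b_aligned, b - b_aligned, b * b_aligned], axis=1)
    return m_a, m_b

def pool(v):
    # Average pooling and max pooling over the tokens, concatenated.
    return np.concatenate([v.mean(axis=0), v.max(axis=0)])

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))    # sentence A: 5 tokens
b = rng.normal(size=(7, 8))    # sentence B: 7 tokens

m_a, m_b = local_inference(a, b)
v = np.concatenate([pool(m_a), pool(m_b)])   # fixed-length vector for the classifier
```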

Embeddings from Language Models (ELMo)

  1. It can reflect the complex features of semantics and syntax.
  2. It can accurately produce appropriate semantics for different contexts.

Traditional Word2vec has only one fixed embedding for each word and does not generate embeddings that carry context information. Therefore, Word2vec cannot disambiguate polysemous words based on the context. In ELMo, the representation of each word is computed by a multi-layer Long Short-Term Memory (LSTM) network in combination with the context. Because LSTM is designed to capture context information, ELMo can take more context into account and handles the polysemy problem better than Word2vec.

The network structure of ELMo pre-training is similar to that of a traditional language model, except that the nonlinear layer in the middle is replaced by LSTM. The LSTM network is used to better extract the information of each word in its current context, and it adds both the forward and the backward context information.


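Given a sequence of N tokens $(t_1, t_2, \ldots, t_N)$, the forward language model predicts the kth word $t_k$ based on the previous (k-1) words $(t_1, \ldots, t_{k-1})$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

At position k, each LSTM layer outputs the context-dependent vector expression $\overrightarrow{h}_{k,j}^{LM}$, j = 1, 2, …, L. The output $\overrightarrow{h}_{k,L}^{LM}$ at the top LSTM layer uses cross entropy loss to predict the next position $t_{k+1}$.

The backward language model reverses the sequence and uses subsequent words to predict previous words:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$$

Similar to the forward language model, for the given sequence $(t_{k+1}, \ldots, t_N)$, the hidden layer output $\overleftarrow{h}_{k,j}^{LM}$ at layer j is obtained from a deep LSTM network with L layers.

The two-way language model splices the forward language model and the backward language model together and maximizes the joint log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)$$

$\Theta_x$ is a parameter of the sequential word vector layer, and $\Theta_s$ is a cross entropy layer parameter. Both parameters are shared between the two directions in the training process.

The combination of the embedded language models uses the internal information of the multi-layer LSTM layer to compute the center word and uses the two-way language model at layer L to obtain (2L + 1) expression sets:

$$R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \}$$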
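In the ELMo paper, a downstream task collapses these per-layer representations into a single vector per token by taking a softmax-weighted sum of the layers and scaling it by a task-specific factor. The sketch below illustrates that combination; it assumes the forward and backward vectors of each layer have already been concatenated, so (2L + 1) vectors become (L + 1) layers, and the function and variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combine_layers(layer_outputs, scalar_weights, gamma=1.0):
    """layer_outputs: (L + 1, n_tokens, dim); layer 0 is the word embedding layer."""
    s = softmax(scalar_weights)                             # one normalized weight per layer
    return gamma * np.tensordot(s, layer_outputs, axes=1)   # (n_tokens, dim)

rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 6, 8))     # L = 2 LSTM layers plus the embedding layer

# With equal weights, the result is simply the average of the layers.
elmo = combine_layers(layers, np.zeros(3))
```

In practice the weights and `gamma` would be learned together with the downstream task, which lets each task decide how much to draw from each layer.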


The higher-level LSTM vectors of biLMs capture the semantic information of words, whereas the lower-level LSTM vectors of biLMs capture the syntax information of words. This hierarchical effect of the deep model makes it possible to apply one set of word vectors to different tasks, because each task requires a different mix of information. In addition, the number of LSTM layers should not be too large, because a multi-layer LSTM network is difficult to train and tends to overfit. The following figure shows the experiment result of a multi-layer LSTM for text classification. As the number of LSTM layers increases, the model performance first improves and then degrades.

The Transformer

Attention was previously used in many natural language processing tasks to locate key tokens or features. For example, an attention layer can be added at the end of a text classification model to improve performance. Transformer originated from the Attention mechanism and completely abandons the traditional recurrent neural networks. Its network structure is based entirely on the Attention mechanism, and a Transformer is built by stacking transformer layers. In the original Transformer experiments, a total of 12 layers, namely 6 encoder layers and 6 decoder layers, were built, and a record-high BLEU score was achieved in machine translation.

The following figure shows the entire flow. In this flow, N is assumed to be 2, whereas the actual Transformer uses N = 6.

  • Encoder phase: Take the input “Thinking Machines”, add the Positional Encoding vector to the corresponding word vector, and perform Self-Attention over all positions. Then, perform Add & Norm, which consists of two steps: add the residual connection, and then apply layer normalization. Next, apply the feed-forward fully connected network to each position, followed by another Add & Norm, to obtain the output of one Encoder Layer. Stack this layer twice, and finally pass the Encoder output to the Encoder-Decoder Attention layer of the Decoder.
  • Decoder phase: First, apply the Masked Self-Attention Layer to the Decoder input. Then, perform Encoder-Decoder Attention between the output of the Encoder phase and the output of the first sub-layer of the Decoder. Finally, connect to the FFN, stack the Decoder layer twice, and connect to a fully connected layer and Softmax to output the word with the highest probability for the current position.
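The encoder phase can be sketched as follows. This is a simplified illustration, not a faithful reimplementation: it assumes single-head attention, layer normalization without learned scale and shift, and random weights; the two-token input stands in for “Thinking Machines”, and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def positional_encoding(n, d_model):
    # Sinusoidal encoding: sin on even dimensions, cos on odd dimensions.
    pos = np.arange(n)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (n, n): all pairs of positions
    return weights @ v

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, then Add & Norm (residual + layer normalization).
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # Sub-layer 2: position-wise feed-forward network, then Add & Norm.
    ffn = np.maximum(x @ p["W1"] + p["b1"], 0.0) @ p["W2"] + p["b2"]
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
n, d_model, d_ff = 2, 8, 32            # 2 tokens; sizes are arbitrary

def init_params():
    return {
        "Wq": rng.normal(size=(d_model, d_model)) * 0.1,
        "Wk": rng.normal(size=(d_model, d_model)) * 0.1,
        "Wv": rng.normal(size=(d_model, d_model)) * 0.1,
        "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
        "b1": np.zeros(d_ff),
        "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
        "b2": np.zeros(d_model),
    }

x = rng.normal(size=(n, d_model)) + positional_encoding(n, d_model)
for params in [init_params() for _ in range(2)]:   # N = 2 stacked encoder layers
    x = encoder_layer(x, params)
```

Because the attention weights are computed for all pairs of positions in one matrix product, every token is processed at the same time and every token can see every other token directly.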

The structure of Transformer is easy to understand, but it contains many components, such as Multi-Head Attention, Feed Forward, Layer Norm, Positional Encoding, and so on.

Advantages of Transformer:

  • Parallel computing, which improves the training speed. This is a remarkable breakthrough compared to LSTM. During LSTM training, the computation of the current step depends on the hidden state of the previous step. This is a sequential process, and each computation can be performed only after the previous computation is completed, which limits the parallel capabilities of the model. Transformer does not use the LSTM structure. The computation at each position in the Attention mechanism depends only on the output of the previous layer rather than on the result for the previous word. Therefore, all words can be processed in parallel during training, which increases the training speed.
  • Global dependencies captured in one step. Information is lost in the process of sequential computing. Although gate mechanisms such as those in LSTM relieve the long-term dependency problem to some extent, LSTM remains weak at handling extremely long-term dependencies. Transformer uses the Attention mechanism to reduce the distance between any two positions in the sequence to 1. This is very effective for solving the tough long-term dependency problem in natural language processing.

Comparison of the CNN, RNN, and Self-Attention:

  • Convolutional Neural Network (CNN): can see only local regions, and is suitable for images. To abstract higher-level information from an image, a convolutional neural network needs only local regions of the features from the layer below. In terms of text, convolutional neural networks are suitable for extracting local features, and are therefore more suitable for short texts.
  • Recurrent Neural Network (RNN): can theoretically see the entire history, and is suitable for text. However, the RNN has the vanishing gradient problem.
  • Self-Attention: does not have the vanishing gradient problem of the RNN. Compared with convolutional neural networks, self-attention is more suitable for long texts, because it can see more distant information. Convolutional neural networks can see distant information only after several layers are stacked. In addition, convolutional neural networks require many layers to complete abstraction, whereas Self-Attention can complete abstraction at the bottom layer, which is a clear advantage over convolutional neural networks.

Bidirectional Encoder Representations from Transformers (BERT)

The following figure shows the network structure of BERT, which is built from the same layers as the Transformer encoder. Assume that the dimension of the embedding vector is d and that the input sequence contains n tokens. The input of a layer in the BERT model is then an n × d matrix, and its output is also an n × d matrix. Therefore, N BERT layers can easily be connected in series. The large BERT model uses Transformer blocks with N = 24 layers.

Objective Functions

1. Masked Language Model

The masked language model (MLM) is used to train deep two-way language representations. BERT takes a very direct approach: it masks some words in a sentence and lets the encoder predict what these words are. The following figure shows the specific procedure. First, mask some words with a small probability. Then, use the language model to predict these words based on the context.

The specific training method of BERT is to randomly select 15% of the words as prediction targets. Of the selected words:

  • 80% are replaced by the mask token.
  • 10% are replaced by a random word.
  • The remaining 10% are left unchanged.

Only 15% of the words are selected because of the performance overhead; training a two-way encoder is in any case slower than training a one-way encoder. The 80%/20% split exists because the mask token appears only during pre-training. When the model is fine-tuned for a specific task, such as classification, the input sequence is not masked, so always using the mask token would create a gap between pre-training and fine-tuning. Replacing 10% of the selected words with a random word and leaving another 10% unchanged means that the encoder does not know which words it will be asked to predict or which words have been replaced. Therefore, the encoder is forced to learn a representation vector for every token.
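The masking procedure can be sketched as follows. This is an illustration under assumptions: the function name, the toy vocabulary, and the decision to also exclude [SEP] from masking are not taken from this article, and real implementations work on token IDs rather than strings.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):      # special tokens are never masked
            continue
        if rng.random() >= select_prob:      # about 85% of the words are left alone
            continue
        labels[i] = token                    # the model must predict the original word
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"          # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab) # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return corrupted, labels

rng = random.Random(0)
vocab = ["my", "dog", "is", "cute", "he", "likes", "playing"]
tokens = ["[CLS]"] + ["my", "dog", "is", "cute"] * 10 + ["[SEP]"]

corrupted, labels = mask_tokens(tokens, vocab, rng)
```

The loss would be computed only at the positions where `labels` is not `None`.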

2. Next Sentence Prediction

A binary classification task is used during pre-training to learn the relationship between sentences: given two sentences, the model predicts whether the second one is the next sentence after the first.

The detailed training method is as follows. The ratio of positive samples to negative samples is 1:1. In a positive sample, for given sentences A and B, B is the next sentence in the actual context of A. In a negative sample, B is a sentence selected randomly from the corpus. Two special tokens, [CLS] and [SEP], are used to concatenate the two sentences. This task outputs a prediction at the [CLS] position.
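Building one training pair for this task can be sketched as below. The corpus layout (a list of documents, each a list of tokenized sentences) and the function name are assumptions for illustration; note that this simple sampler does not check whether a randomly drawn sentence happens to be the true next sentence.

```python
import random

def make_nsp_example(documents, rng):
    """documents: list of documents, each a list of sentences (lists of tokens)."""
    doc = rng.choice([d for d in documents if len(d) >= 2])
    idx = rng.randrange(len(doc) - 1)
    sent_a = doc[idx]
    if rng.random() < 0.5:
        sent_b, is_next = doc[idx + 1], 1      # positive: the actual next sentence
    else:
        other = rng.choice(documents)          # negative: a random sentence from the corpus
        sent_b, is_next = rng.choice(other), 0
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return tokens, is_next

docs = [
    [["the", "cat", "sat"], ["it", "purred"], ["then", "it", "slept"]],
    [["stocks", "rose"], ["bonds", "fell"]],
]
rng = random.Random(0)
tokens, is_next = make_nsp_example(docs, rng)
```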

Input representations

During pre-training, [CLS] does not participate in masking. Therefore, this position attends to all positions of the entire sequence, and the output at the [CLS] position is sufficient to represent the information of the entire sentence, similar to a global feature. The embedding corresponding to a word token focuses more on the semantics, syntax, and context representation of that token, similar to a local feature.

  • Position embeddings: The position encoding of Transformer is constructed directly from sin and cos functions. In BERT, by contrast, position embeddings are embedding vectors learned by the model, and they support sequences of up to 512 positions.
  • Segment embeddings: In the pre-training sentence pair prediction task, question-and-answer tasks, and similar matching tasks, it is necessary to distinguish the first sentence from the second. The sentence pair is input as a single sequence, with the two sentences separated by the special marker [SEP]. Sentence A Embedding is added to each token of the first sentence, and Sentence B Embedding is added to each token of the second sentence. In the experiment, EA = 1 and EB = 0.
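Putting the pieces together, the input representation of each token is the sum of its token, segment, and position embeddings. The sketch below assumes a toy vocabulary, a small hidden size, and randomly initialized tables in place of learned ones; which segment ID denotes which sentence is arbitrary here, since only the distinction matters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"[CLS]": 0, "[SEP]": 1, "my": 2, "dog": 3, "is": 4,
         "cute": 5, "he": 6, "likes": 7, "playing": 8}
hidden, max_positions = 16, 512

# Three lookup tables; in a real model these are learned during pre-training.
token_table = rng.normal(size=(len(vocab), hidden)) * 0.02
segment_table = rng.normal(size=(2, hidden)) * 0.02
position_table = rng.normal(size=(max_positions, hidden)) * 0.02

SEGMENT_A, SEGMENT_B = 0, 1     # arbitrary IDs for the two sentences

tokens = ["[CLS]", "my", "dog", "is", "cute", "[SEP]",
          "he", "likes", "playing", "[SEP]"]
segment_ids = [SEGMENT_A] * 6 + [SEGMENT_B] * 4
token_ids = [vocab[t] for t in tokens]
positions = np.arange(len(tokens))

# Input representation: element-wise sum of the three embeddings.
inputs = (token_table[token_ids]
          + segment_table[segment_ids]
          + position_table[positions])
```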


The main contributions of BERT are as follows:

  • Effectiveness of pre-training: In this respect, BERT has changed the rules of the game. Language representations pre-trained on massive unlabeled data, combined with a simple network model and a small amount of fine-tuning data, achieved far better experimental results than complex and ingeniously designed network structures.
  • Network depth: Word vector representations obtained from DNN-based language models, such as the neural network language model (NNLM) and the continuous bag-of-words (CBOW) model, have been very successful in natural language processing. The BERT pre-training network, by contrast, is based on the Transformer Encoder, which can be very deep.
  • Two-way language model: Before BERT, the main limitation of ELMo and GPT was that the standard language model is one-way. GPT uses the Decoder structure of Transformer and considers only the preceding context. ELMo’s left-to-right language model and right-to-left language model are actually trained independently. Sharing embeddings and concatenating the LSTM outputs of the two directions does not truly represent the full context, so ELMo is essentially still one-way. In addition, multi-layer LSTM is difficult to train.
  • Objective function: Compared with a language model task that only predicts the next word, training a language model that carries more information requires the model to complete more complex tasks. BERT mainly completes cloze (fill-in-the-blank) and sentence pair prediction tasks. That is, there are two losses: Masked Language Model and Next Sentence Prediction.


We also measured the time consumption of BERT during online deployment. The results on the stress test data are given for reference. For our question and answer queries:

  1. The number of BERT layers is linearly related to the time consumption. However, when the number of attention heads increases, the time consumption does not increase significantly.
  2. Intent understanding for short text queries relies more on shallow grammatical and semantic features. Thus, the number of BERT layers has less influence on the precision and recall of the model.
  3. The number of heads in multi-head attention determines how many perspectives can be used to understand a query. In our experiments, reducing the number of heads had a slightly greater impact on precision and recall than reducing the number of layers, but the time consumed was not significantly reduced.

In the image field, AlexNet opened the door to deep learning, and ResNet is a milestone of deep learning in the image field.

With the rise of Transformer and BERT, networks in natural language processing have also grown to 12 or 24 layers and have achieved state-of-the-art results. BERT has proved that in natural language processing, deep networks perform better than shallow networks.

In the natural language field, Transformer opened the door to deep networks, and BERT has become a milestone in natural language processing.
