
22 NLP Fundamentals: Explaining Language Models and Attention Mechanisms - Part 2 #

Hello, I’m Fang Yuan.

In the previous lesson, we explored several classic problems in NLP tasks and the methods used to tackle them. We found that although these methods have their own advantages, some of them fail to effectively capture the sequential and semantic relationships between words and phrases in a text. In other words, they cannot quantitatively represent and differentiate the importance of different parts of the language content.

So, is there a method that can turn the generation and analysis of language into a mathematical computation, for example one based on probabilities and vectors? The answer is yes, and that is what we are going to discuss in this lesson: language modeling.

But how can we differentiate the importance of different parts of language? I will explain it to you from the perspective of the most popular attention mechanism in deep learning.

Language Model #

A language model is a mathematical abstraction of language based on objective linguistic facts; in essence, it establishes a correspondence between language and mathematical quantities. Many NLP tasks come down to the same question: for a given concept or expression, which candidate result is the most likely?

Let’s look at two examples.

First, consider the translation task: “今天天气很好” (Today’s weather is very good). Possible translations are: res1 = “Today is a fine day.” res2 = “Today is a good day.” We want to determine which translation, res1 or res2, has a higher probability, P(res1) or P(res2).

Another example is a question-answering system: “我什么时候才能成为亿万富翁” (When can I become a billionaire). Possible answers are: ans1 = “白日做梦去吧” (Daydreaming). ans2 = “红烧肉得加点冰糖” (Braised pork should be added with some rock sugar). The answer returned should be the one that is most relevant to the question itself, in this case, the first answer.

For the above examples, it is natural for us to think of using probabilistic methods to establish a language model. This type of model is called a statistical language model.

Statistical Language Model #

The principle of a statistical language model is to calculate the probability that a sentence is a natural language sentence. Over the years, experts and scholars have constructed numerous language models, among which the most classic one is the n-gram language model based on the Markov assumption, which is also widely adopted.

Next, let’s start with a simple abstract example to understand what a statistical language model is.

Given a sentence S = w1, w2, w3, …, wn, the probability of generating this sentence is:

\[p(S)=p(w_{1}, w_{2}, w_{3}, \ldots, w_{n})\]

Then, according to the chain rule, we can further derive:

\[p(S)=p(w_{1})\, p(w_{2} \mid w_{1})\, p(w_{3} \mid w_{1}, w_{2}) \cdots p(w_{n} \mid w_{1}, w_{2}, \ldots, w_{n-1})\]

This p(S) is the statistical language model we want.

Now a question arises. You will notice that going from p(w1, w2, w3, …, wn) to p(w1)p(w2|w1)p(w3|w1,w2)…p(wn|w1,w2,…,wn-1) is just a step-by-step decomposition of the same probability. It does not solve a fundamental problem: the sparsity of data in the corpus. Many of the conditional probabilities in the formula never occur in the statistics and therefore become 0, and the number of parameters to estimate is enormous.

What can we do? Let’s take a look at this sentence: “我们本节课将会介绍统计语言模型及其定义” (In this class, we will introduce statistical language models and their definitions). Who defines the word “其” (“their”)? It is defined by “语言模型” (“language models”) before it. So we find that for a word in the text, its probability of occurrence is largely determined by the preceding one or more words, and this is the Markov assumption.

With the Markov assumption, we can further simplify p(wn|w1,w2,…,wn-1) in the previous formula. The degree of simplification depends on how many preceding words you think a word is determined by. If it is determined only by the preceding word, then it becomes p(wn|wn-1), which is called a bigram. If it is determined by the preceding two words, then it becomes p(wn|wn-2,wn-1), which is called a trigram.

Of course, if you think that the occurrence of a word is determined only by itself and is independent of other words, it becomes the simplest form: p(wn), which is called a unigram.

Now we know that the core of a statistical language model based on the Markov assumption lies in conditional probabilities estimated from counts. To compute the generation probability of a sentence, we only need to estimate, from co-occurrence counts, the conditional probability of each word given its preceding n-1 words, and then multiply these probabilities together to obtain the final result.
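To make this counting process concrete, below is a minimal sketch of a count-based bigram model in Python. The tiny corpus, the &lt;s&gt;/&lt;/s&gt; boundary tokens, and the helper functions are illustrative assumptions, not a standard API.

```python
from collections import Counter

# A toy corpus; in practice this would be a large collection of sentences.
corpus = [
    ["today", "is", "a", "fine", "day"],
    ["today", "is", "a", "good", "day"],
    ["today", "the", "weather", "is", "fine"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]          # add sentence boundary markers
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))   # count adjacent word pairs

def bigram_prob(prev, word):
    """Estimate p(word | prev) from co-occurrence counts."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """p(S) as a product of bigram conditional probabilities."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob(["today", "is", "a", "fine", "day"]))   # a small non-zero probability
print(sentence_prob(["today", "is", "a", "rainy", "day"]))  # 0.0: "rainy" never appears in the corpus
```

The last line also illustrates the data-sparsity problem mentioned above: a single unseen word pair drives the whole product to zero.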

Neural Network Language Model #

The n-gram model reduces the number of parameters to some extent, but if n is large or the corpus is small, the problem of data sparsity still cannot be solved well.

This is similar to feeding the text of “Water Margin” into the model for statistical training and then asking about the relationship between Lin Chong and Pan Jinlian. The question is hard to answer because the n-gram statistical model can hardly find any text in which the two co-occur. Statistical methods cannot solve this sparsity problem. So what do we do? This is where the neural network language model shines.

In essence, the neural network language model also models language using n-grams, but a neural network does not learn by counting statistics; instead, it learns through continuous updates of its parameters.

So how is it done? First, we need to define a vector space. Suppose this space has one hundred dimensions, which means that we can represent each word with a one-hundred-dimensional vector, such as V(“China”) = [0.2821289, 0.171265, 0.12378123, …, 0.172364].

In this way, we can evaluate the relationship between any two words using distance calculation. For example, if we use cosine distance to calculate the distance between the words “China” and “Beijing”, it is highly likely to be closer than the distance between “China” and “watermelon”.

What are the benefits of doing this? First, the distance between words can be used as a measure of their similarity. Second, the vector space involves many mathematical calculations, such as the classic equation V(“king”) - V(“queen”) = V(“man”) - V(“woman”), which provides more semantic associations between words.
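As a rough illustration of both points, here is a small NumPy sketch; the three-dimensional vectors are invented purely for demonstration, so only the calculations themselves are meaningful.

```python
import numpy as np

# Toy 3-dimensional word vectors, made up for illustration only.
vectors = {
    "china":      np.array([0.9, 0.8, 0.1]),
    "beijing":    np.array([0.8, 0.9, 0.2]),
    "watermelon": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["china"], vectors["beijing"]))     # close to 1
print(cosine_similarity(vectors["china"], vectors["watermelon"]))  # noticeably smaller

# With real trained vectors, analogy arithmetic such as
# V("king") - V("queen") ≈ V("man") - V("woman") works in the same vector space.
```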

In addition to the dimensionality, to determine the vector space we also need to decide how many “points”, that is, how many words, the space contains. Generally, we keep the words that appear more than a certain number of times in the corpus, and the number of remaining words is the number of points in the space.

Let’s take a look at how this is done in practice. We only need to create an M*N matrix, where M represents the number of words and N represents the dimensionality of the words. We randomly initialize each value in this matrix, which is called the word vector matrix.
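A minimal sketch of such a word vector matrix, assuming an example vocabulary of M = 5,000 words and N = 100 dimensions (these numbers and the tiny word-to-index mapping are made up for illustration):

```python
import numpy as np

M, N = 5000, 100                                    # vocabulary size and vector dimensionality (example values)
rng = np.random.default_rng(seed=42)
word_vectors = rng.normal(scale=0.1, size=(M, N))   # the randomly initialized word vector matrix

word_to_index = {"china": 0, "beijing": 1}          # a tiny slice of the word-to-row mapping
v_china = word_vectors[word_to_index["china"]]      # looking up a word means reading its row
print(v_china.shape)                                # (100,)
```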

Since it is randomly initialized, it means that this vector space cannot be used as our language model. So we need to find a way to let this matrix learn the content. As shown in the figure below:

Earlier, we said that the neural network language model also uses n-grams to model language. Assuming the length of our n-gram is n, we look up the vectors of the corresponding n-1 words in the word vector matrix, pass them through several neural network layers (including activation functions), and map these n-1 word vectors to the conditional probability distribution over the next word. Through parameter updates, the model learns both the word vector parameters and the parameters that map a context to the conditional probability of the next word.

In short, we use n-1 words to predict the nth word, compute the loss between the prediction and the true word, and update the parameters accordingly. By continuously updating the word vector matrix in this way, we obtain a language model. This type of neural network language model is called a feed-forward neural network language model.
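To make this training loop concrete, here is a minimal sketch of a feed-forward neural network language model in PyTorch. The vocabulary size, context length, layer dimensions, and the single random toy batch are all example assumptions; a real model would be trained on a large corpus.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, context_size=2, hidden_dim=128):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)       # the word vector matrix
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)             # scores over the whole vocabulary

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the preceding n-1 words
        embeds = self.embeddings(context_ids).flatten(start_dim=1)  # concatenate the n-1 word vectors
        hidden = torch.tanh(self.hidden(embeds))
        return self.output(hidden)                                  # logits for the n-th word

vocab_size = 5000                                                   # example vocabulary size
model = FeedForwardLM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy training step: random (n-1)-word contexts and their true next words.
contexts = torch.randint(0, vocab_size, (32, 2))
targets = torch.randint(0, vocab_size, (32,))
loss = loss_fn(model(contexts), targets)             # compare the prediction with the true word
loss.backward()
optimizer.step()                                     # updates the layers and the word vector matrix together
print(loss.item())
```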

In addition to feed-forward neural network language models, there are also LSTM-based language models. In the next lesson, we will use an LSTM to complete a sentiment analysis project, walk through the training process of an LSTM-based language model in more detail, and make use of the word vector matrix mentioned earlier.

Now, let’s compare the differences between statistical language models and neural network language models. The essence of the statistical language model is based on the statistical frequency of word co-occurrences, while the neural network language model assigns a position in the vector space to each word as a representation and calculates their dependencies in a high-dimensional continuous space. Relatively speaking, the representation and non-linear mapping of neural networks are more suitable for modeling natural language.

Attention Mechanism #

If you are attentive, you will notice that the neural network language model introduced earlier overlooks one point: in a sentence composed of n words, different words are not equally important. Therefore, we need to let the model “pay attention” to the relatively more important words. This approach is called the attention mechanism, often referred to simply as Attention.

Since it is a mechanism rather than an algorithm, attention is a way of thinking about how to construct a network. The most classic paper on the attention mechanism is the well-known “Attention Is All You Need”; if you are interested, you can read it on your own.

Because our column focuses on hands-on practice and emphasizes the practical application of various machine learning theories, we will not delve too much into the mathematical principles inside it, but we still need to know how it works.

Let’s start with an example: “I went to KFC at noon today and ate three hamburgers.” In this sentence, you naturally pay more attention to the words “I,” “KFC,” “three,” and “hamburgers,” but did you even notice the word “went”?

In fact, this is exactly what the Attention mechanism does: finding the most important content. It assigns different attention or weights to different positions in the input (or intermediate layers) of the network, and then through learning, the network gradually knows which are the key points and which are the content that can be ignored.

In the previous neural network language model, the vector of a specific word is fixed, but now it is different because of the attention mechanism. For the same word, its vector representation is different in different contexts.

The following figure is an example of combining the Attention mechanism with an RNN. The red box in the figure shows the unfolded RNN. We can see that the vectors of the input words (“I”, “love”, “Geek”, …) are passed along in the direction of the green arrow. Each word has a hidden state h, which is the blue box above the input node. During this process, every state carries the same weight; the weights do not vary in size.

The blue box is the part added by the Attention mechanism, and each α is the weight of a state h. These weights are normalized with a softmax, and all states h are then weighted and summed to obtain the final output C. This C gives the subsequent RNN steps a richer basis for their computations.

Image: the Attention mechanism combined with an RNN

You see, the principle of this attention mechanism is actually very simple, but also clever. Just by adding very few parameters, the model can figure out who is important and who is less important on its own. Now let’s take a look at the abstracted Attention, as shown in the following figure, which you must have seen in many introductions to attention.

Image: the abstracted Attention structure (query, key, and value)

Here, the inputs are the query (Q), the key (K), and the value (V), and the output is the attention value. Compared with the figure where Attention is combined with an RNN: the query is the state Zt-1 passed in from the previous time step, that is, the encoding output of the previous time step; the keys are the hidden states h; and the values are also the hidden states h (h1, h2, …, hn). The model computes a weight by matching Q against K and then combines it with V to obtain the output, which amounts to computing how well the current output matches each of the inputs. The formula is as follows:

\[\operatorname{Attention}(Q, K, V)=\operatorname{softmax}(\operatorname{sim}(Q, K)) V\]
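As a small NumPy sketch of this formula, the example below uses a plain dot product as the sim(Q, K) function (one common choice among several); the query and the hidden states are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(seed=0)
d = 8                                # hidden size (example value)
query = rng.normal(size=(d,))        # Q: e.g. the state passed in from the previous time step
keys = rng.normal(size=(5, d))       # K: the hidden states h1..h5
values = keys                        # V: here the same hidden states, as in the RNN example

scores = keys @ query                # sim(Q, K): one matching score per input position
weights = softmax(scores)            # the attention weights, summing to 1
context = weights @ values           # weighted sum of the values: the attention output

print(weights.round(3))              # shows which positions the model "pays attention" to
print(context.shape)                 # (8,)
```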

Currently, there are two main types of attention: soft attention and hard attention. Hard attention selects only one or a few specific positions to attend to, while soft attention assigns continuous weights to all positions; because soft attention is differentiable and easier to train, it is the more commonly used of the two.

As an application-level user, understanding the basic principles of Attention is enough to put it to use, because there are already many pre-trained models based on Attention that can be used directly. If you want to know more, you can discuss it further with me in the comments section and the discussion group.

Summary #

Congratulations on completing today’s learning tasks.

Today, we learned about the basic principles of language models and how attention mechanisms empower models to better understand and capture the key content of texts.

With language models, we can convert language into a computable form, establishing more direct connections between words. With attention mechanisms, we can guide the model to focus on important content, thus improving the effectiveness of language models. If you are interested in the mathematical principles behind attention mechanisms and want to delve deeper into this topic, I recommend that you read the paper “Attention Is All You Need”.

So far, we have covered the foundational issues and important concepts in NLP tasks over the past two lessons. In the upcoming lessons, we will move on to the hands-on practical phase.

First, we will work on a sentiment analysis project based on LSTM. Through this project, you will learn about the construction methods of language models and be able to implement a model with sentiment analysis capabilities. Afterward, we will use the popular BERT model to build a highly effective text classification model. Stay tuned!

Practice for each lesson #

What is the appropriate length of word vectors? Is longer better?

Feel free to interact with me in the comments section, and I also recommend that you share this lesson with more colleagues and friends.