
21 NLP Fundamentals Explaining Principles of Natural Language Processing and Common Algorithms - Part 1 #

Hello, I’m Fang Yuan.

In the previous lessons, we studied methods for image classification and image segmentation and worked through hands-on projects. Having completed that part, I believe you now have a solid understanding of deep learning algorithms for images.

However, in real projects there is another major problem type besides image algorithms: problems involving text and language. The algorithms and processing methods that enable programs to understand human language are collectively referred to as natural language processing (NLP).

In this lesson, we will first learn about the principles and common algorithms of natural language processing. Through this part of the study, you will be able to easily come up with your own solutions when encountering common NLP problems in the future. Don’t worry if the content of principles and algorithms seems too theoretical. I will combine my own experience from the perspective of practical applications to help you establish an overall understanding of NLP.

NLP Applications are Everywhere #

The field of NLP has a wide range of research areas, all of which are related to linguistics. Generally speaking, the more common directions in linguistics include: stemming, lemmatization, word segmentation, part-of-speech tagging, named entity recognition, semantic disambiguation, syntactic analysis, coreference resolution, discourse analysis, and so on.

At this point, you may feel that these topics sound a bit too academic and specialized, since they involve studying the structure, composition, and even the nature of language itself. Indeed, these are applications of NLP within linguistics, and they can feel more research-oriented.

In fact, NLP also covers many topics that focus on the “processing” and “application” side. Common examples include machine translation, text classification, question-answering systems, knowledge graphs, information retrieval, and so on.

Image

Let me give you an example, and you will see how important natural language processing is. We use search engines every day. When you open a web page and type the keywords you want to look up into the search box, the search engine’s backend kicks off a whole pipeline of complex algorithms. It includes several important steps, which we can examine with an example.

In the search engine input box, enter the text “亚洲的动wu”. The displayed contents are shown in the following image. Don’t be fooled by the simple search action; the search system has to do a lot of work.

Image

First, the search engine needs to parse the content (query) you entered, which involves the previously mentioned word segmentation, named entity recognition, semantic disambiguation, and other aspects. It also involves query correction because you entered pinyin instead of Chinese characters, so it needs to be rewritten in the correct form.

After a series of algorithms, the system recognizes that your request is to find search results related to animals, and these results should be limited to those that live in Asia.

Then, the system starts searching for relevant entities in the database (or a storage cluster), which involves information retrieval, knowledge graphs, and other content related to entity queries and filtering based on constraints.

Finally, careful students who compare the search results will notice that sometimes search engines not only provide strictly matched search results, but also provide some extended results such as advertisements, news, videos, etc. Moreover, many search engines personalize the extended search results page, meaning they provide recommendations based on user behavior, making our search results richer and the experience better.

Is that all? No, far from it. Because the process mentioned earlier is only a part of the work required for a search initiated by a single user. There is actually much more work involved, starting with the preparation stage before the user begins using the search engine.

To build a search engine, it is necessary to parse the stored content, which includes discourse understanding, text processing, image recognition, audio-video algorithms, and other steps. Features need to be extracted for each webpage (content), and retrieval libraries and knowledge bases need to be built. This work is very extensive and involves many aspects.

Therefore, applications of NLP have truly penetrated into all aspects of the Internet business, and mastering NLP algorithms will make us more competitive. Next, we will discuss some important content related to the “applications” aspect of Chinese NLP scenarios.

Several Important Contents of NLP #

To enable programs to understand text content, we need to address several fundamental and important topics, namely word segmentation, text representation, and keyword extraction.

Word Segmentation #

The biggest difference between Chinese and English is that English text consists of individual words separated by spaces, whereas Chinese words are not separated by anything other than punctuation marks. This makes it harder for a program to understand the text, hence the need for word segmentation.

Although deep learning has reduced the dependency on word segmentation, and character (token) level representation of text can be achieved through techniques such as Word Embedding, the importance of word segmentation has not diminished. Word and phrase level text representation still has many practical applications.

Since this column is focused on practical application and getting you up to speed quickly, I will not dive into the details of various word segmentation algorithms, but rather concentrate on helping you understand their characteristics and teaching you how to implement word segmentation using the corresponding toolkits.

Currently, there are many open-source or free NLP word segmentation tools available online, such as jieba, HanLP, and THULAC. Companies like Tencent, Baidu, and Alibaba also offer corresponding commercial tools.

To keep costs down, today we will use the free jieba word segmentation tool as our example. You can download it from its project page. Installing it is very simple; just use pip:

pip install jieba

The usage of jieba is also very convenient, let me demonstrate:

import jieba
text = "极客时间棒呆啦"
# jieba.cut produces generator-style results
seg = jieba.cut(text)  
print(' '.join(seg)) 

# Get: 极客 时间 棒呆 啦

In addition to word segmentation, jieba also provides part-of-speech (pos) tagging results:

import jieba.posseg as posseg
text = "一天不看极客时间我就浑身难受"
# Results in the form pair('word', 'pos')
seg = posseg.cut(text)  
print([se for se in seg]) 
# Get [pair('一天', 'm'), pair('不', 'd'), pair('看', 'v'), pair('极客', 'n'), pair('时间', 'n'), pair('我', 'r'), pair('就', 'd'), pair('浑身', 'n'), pair('难受', 'v')]

Isn’t it simple? Now that we have completed word segmentation, we can move on to represent the text.

Text Representation Methods #

Before deep learning became widely adopted, traditional machine learning algorithms used a variety of text representations suited to their own characteristics.

The most classic one is one-hot encoding. In this method, assuming there are N words (or characters) in total, we assign each word a unique index ID. Any word can then be represented with an N-dimensional list (vector): we simply set the position corresponding to the word’s index ID to 1 and all other positions to 0.

Let me give you an example to help you understand better. Let’s say our dictionary size is 10,000, and the index ID of the word “geek” is 666. Then we need to create a vector of length 10,000 and set the position 666 to 1 and the rest to 0. Like this:

image
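To make this concrete, here is a minimal sketch of one-hot encoding in Python; the vocabulary size, the word, and its index are simply the made-up values from the example above, not anything produced by a real tool.

import numpy as np

# Hypothetical vocabulary size and word-to-index mapping (from the example above)
vocab_size = 10000
word_to_id = {"极客": 666}

def one_hot(word, vocab_size, word_to_id):
    # Return a vocab_size-dimensional vector with a 1 at the word's index and 0 elsewhere
    vec = np.zeros(vocab_size, dtype=np.int8)
    vec[word_to_id[word]] = 1
    return vec

vec = one_hot("极客", vocab_size, word_to_id)
print(vec.shape, vec[666])  # (10000,) 1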

At this point, you will notice that in UTF-8 encoding, there are over 20,000 Chinese characters, and the number of words is even more astronomical. If we use the character-based representation, each character would require a vector of over 20,000 dimensions. If we consider a 10,000-word article, the data size would be enormous.

To further compress the data size, we can represent all the words in an article using a single vector. For example, in the previous example, we still create a 10,000-dimensional vector and set the corresponding positions of all the words that appear in the article to 1, and the rest to 0.

image

This way, the data size is significantly reduced. Are there any other methods to further reduce space usage? Yes, there are. One example is the count-based representation.

In this method, we use the form v = {index1: count1, index2: count2, ..., indexN: countN} to record each word’s index ID and its number of occurrences. For example, for the phrase “geek time”, we only need a dictionary with two key-value pairs: {3: 1, 665: 1}.
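As a minimal sketch, this is how such a count-based representation could be built, assuming we already have a word-to-index dictionary (the mapping below is made up to match the example above):

from collections import Counter

# Hypothetical word-to-index mapping, matching the "geek time" example above
word_to_id = {"极客": 3, "时间": 665}

def count_representation(tokens, word_to_id):
    # Map each token to its index ID and count its occurrences: {index: count}
    counts = Counter(word_to_id[tok] for tok in tokens if tok in word_to_id)
    return dict(counts)

print(count_representation(["极客", "时间"], word_to_id))
# {3: 1, 665: 1}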

This representation method is widely adopted in SVM, tree models, and other algorithms because it can significantly compress space usage and is easy to generate. However, you will notice that the methods mentioned earlier cannot express the word order information.

Let’s take an example. The phrases “I like you” and “You like me” have completely different meanings, but after word segmentation, the methods above give them exactly the same representation. If unrequited love gets mistaken for mutual love, wouldn’t that be bitter?
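You can check this loss of word order with a quick experiment on the original Chinese phrases. Here we simply count characters; word-level counts after segmentation behave the same way:

from collections import Counter

# "我喜欢你" (I like you) vs. "你喜欢我" (You like me): different meanings,
# but identical bags of tokens once order is discarded
print(Counter("我喜欢你") == Counter("你喜欢我"))  # True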

Fortunately, the adoption of deep learning has promoted the development of word embedding. In practice, we almost always use word embedding for text representation. However, as mentioned earlier, this doesn’t mean traditional text representation methods are outdated. They still have a significant role in small-scale and lightweight text processing scenarios.

As for the word embedding part in text representation, we will discuss it in more detail in future lessons. It is also one of the core components of deep learning in NLP.

Let’s return to the traditional text representation methods mentioned earlier. Word order aside, there is another important NLP problem we need to address on top of them: keyword extraction.

Keyword Extraction #

As the name suggests, keywords are words that express the central content of a text. Keyword extraction plays a crucial role in applications such as search systems and recommendation systems. It is a branch of text data mining, so it is also a fundamental step in areas such as summarization, text classification/clustering, and more.

Keyword extraction methods can be broadly classified into supervised and unsupervised methods. In general, we use more unsupervised methods because they do not require manually annotated corpora. They only need to analyze the position, frequency, dependency relationships, and other information of words in the text to recognize and extract keywords.

Unsupervised methods can be divided into three main types: methods based on statistical features, methods based on graph models, and methods based on topic models. Let’s take a closer look at each of them.

Methods Based on Statistical Features #

The most classic method in this category is TF-IDF (term frequency-inverse document frequency). Its core idea is straightforward: the more frequently a word appears in a document, the more important it is to that document; but the more documents across the corpus it appears in, the less important it is.

What does this mean? For example, let’s say we have 10 articles, including 2 finance articles, 5 technology articles, and 3 entertainment articles. For the word “stocks,” it will likely appear more frequently in finance articles but less frequently in entertainment and technology articles. This means that “stocks” can better “differentiate” the categories of the articles, and therefore, it is more important.

In TF-IDF, term frequency (TF) is the frequency with which a word occurs in a document. Inverse document frequency (IDF) is obtained by dividing the total number of documents by the number of documents that contain the word; in most cases, the logarithm of this ratio is then taken.

$$\mathrm{TF\text{-}IDF}(w) = \mathrm{TF}(w) \times \mathrm{IDF}(w), \qquad \mathrm{TF}(w) = \frac{n_w}{n}, \qquad \mathrm{IDF}(w) = \log\frac{N}{N_w + 1}$$

where $n_w$ is the number of times $w$ appears in the document, $n$ is the total number of words in the document, $N$ is the total number of documents, and $N_w$ is the number of documents containing $w$.

Try to think about why we add 1 to the denominator: it is to avoid the denominator being zero. After obtaining TF and IDF, we multiply them to get TF-IDF. As TF-IDF shows, the hallmark of statistics-based methods is that they mathematically analyze the frequency and distribution of words to uncover patterns and importance weights, and then use these as the basis for keyword extraction.
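To make the formula concrete, here is a minimal sketch that computes TF-IDF by hand over a toy corpus, using the +1 smoothing in the IDF denominator described above; the tiny corpus and its tokenization are made up for illustration.

import math

# Toy corpus: each document is a list of tokens (made up for illustration)
docs = [
    ["股票", "上涨", "股票", "下跌"],
    ["手机", "芯片", "发布"],
    ["电影", "上映", "票房"],
]

def tf(word, doc):
    # Term frequency: occurrences of the word divided by the document length
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency with +1 smoothing in the denominator
    n_docs_with_word = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (n_docs_with_word + 1))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("股票", docs[0], docs))  # TF = 0.5, IDF = log(3 / 2)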

Similar to word segmentation, there are currently many integrated toolkits for keyword extraction, such as NLTK (Natural Language Toolkit). NLTK is a very famous natural language processing toolkit and a commonly used Python library in the field of NLP research. We can still use the “pip install nltk” command to install it.

Using NLTK to calculate TF-IDF is very simple. The code is as follows:

from nltk import word_tokenize
from nltk import TextCollection

# If NLTK's tokenizer data is missing, download it first, e.g. nltk.download('punkt')
sents = ['i like jike', 'i want to eat apple', 'i like lady gaga']
# First tokenize the sentences
sents = [word_tokenize(sent) for sent in sents]

# Build the corpus
corpus = TextCollection(sents)

# Calculate the TF of "like" within the first sentence
tf = corpus.tf('like', sents[0])

# Calculate the IDF of "like" across the whole corpus
idf = corpus.idf('like')

# Calculate the TF-IDF of "like" with respect to the first sentence
tf_idf = corpus.tf_idf('like', sents[0])

You can run this code and see what value tf_idf comes out to.

Keyword Extraction based on Graph Model #

The statistics-based method above relies on counting word frequencies. However, there are other approaches, such as keyword extraction based on graph models.

In this method, first, we need to construct a graph structure for the text to represent the language word network. Then, we analyze the language network and look for words or phrases that play an important role, which are the keywords.

The most classic method in this category is the TextRank algorithm, which is derived from the classic web-ranking algorithm PageRank. You can refer to the Wikipedia entry on PageRank for more background. Once you understand PageRank, you will see that its core idea comes down to two points (illustrated by the sketch after the list below):

  • If a webpage has many other webpages linking to it, it means that this webpage is relatively important, and the PageRank value will be relatively high.

  • If a webpage with a high PageRank value links to another webpage, the PageRank value of the linked webpage will be correspondingly increased.
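Below is a minimal power-iteration sketch of these two points on a tiny, made-up link graph; the graph, damping factor, and iteration count are illustrative assumptions, not part of any library.

# Tiny directed link graph: page -> pages it links to (made up for illustration)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85  # standard damping factor used in PageRank
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start from uniform ranks

for _ in range(50):  # power iteration until the ranks stabilize
    new_rank = {}
    for p in pages:
        # A page's rank grows with the ranks of the pages linking to it,
        # each divided by the number of that page's outgoing links
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))
# "C" ranks highest: it has the most incoming links, including one from high-ranked "A"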

With that background, TextRank is easy to understand. The differences between TextRank and PageRank are:

  • Replace webpages with sentences.

  • The similarity between any two sentences can be calculated using a concept similar to the probability of webpage transition, but there are some differences. TextRank replaces equal transition probabilities in PageRank with normalized sentence similarities. Therefore, in TextRank, the transition probabilities of all nodes will not be exactly equal.

  • Use a matrix to store the similarity scores, similar to the matrix M in PageRank.

The basic process of TextRank is shown in the following figure:

Image

It looks quite complicated, but it’s okay. Just as mentioned earlier, jieba also provides integrated algorithms for this. In jieba, you can use the following function for extraction:

jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False)

Here, the “sentence” is the text to be processed, “topK” is used to select the top K most important keywords. These two parameters are usually enough.
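As a quick usage sketch (the sample sentence is made up; note that jieba.analyse must be imported explicitly):

import jieba.analyse

text = "自然语言处理是人工智能领域中的一个重要方向,关键词提取是其中的一项基础任务"
# Extract the top 5 keywords together with their TextRank weights
for word, weight in jieba.analyse.textrank(text, topK=5, withWeight=True):
    print(word, weight)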

Keyword Extraction based on Topic Model #

The last method for keyword extraction is based on the topic model.

The topic model sounds more advanced, but it is also a statistical model. It “discovers” the abstract “topics” appearing in a document collection and uses them to explore the hidden semantic structure in the text.

LDA (Latent Dirichlet Allocation) is the most typical topic model algorithm. You can easily find material on LDA online, so I won’t go into the details here. In this lesson, we will use a ready-made package, Gensim, to work with this model. The code is also very simple. Let’s take a look together:

from gensim import corpora, models
import jieba.posseg as jp
import jieba

input_content = [line.strip() for line in open('input.txt', 'r')]
# As usual, first tokenize the words
words_list = []
for text in input_content:
  words = [w.word for w in jp.cut(text)]
  words_list.append(words)

# Build the text statistics, traverse all the texts, assign a sequence ID to each unique word, and collect the number of times the word appears
dictionary = corpora.Dictionary(words_list)

# Build the corpus, convert the dictionary into a bag of words.
# The corpus is a list of vectors, and the number of vectors is the number of documents. You can output it to see its internal structure.
corpus = [dictionary.doc2bow(words) for words in words_list]

# Train the LDA model
lda_model = models.ldamodel.LdaModel(corpus=corpus, num_topics=8, id2word=dictionary, passes=10)

In the training process, “num_topics” represents the number of generated topics. “id2word” is the dictionary that maps IDs to strings. “passes” is similar to the epoch in deep learning, representing the number of times the model traverses the corpus.
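Once training finishes, you will usually want to inspect the learned topics and apply the model to new text. Here is a minimal sketch that continues from the lda_model and dictionary above; the sample sentence is made up for illustration.

# Print each learned topic as a weighted mixture of words
for topic_id, topic in lda_model.print_topics(num_topics=8, num_words=5):
    print(topic_id, topic)

# Infer the topic distribution of a new piece of text
new_text = "极客时间上有很多自然语言处理的课程"
new_bow = dictionary.doc2bow([w.word for w in jp.cut(new_text)])
print(lda_model.get_document_topics(new_bow))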

Summary #

In this lesson, I’ve taken you through the application scenarios of natural language processing (NLP) and three classic fundamental NLP problems.

The three major classic NLP problems are word segmentation, text representation, and keyword extraction. Because these three problems are so classic and fundamental, there are now plenty of integrated tools available for us to use directly.

However, as I mentioned before, having tools doesn’t mean we don’t need to understand their internal principles. Learning should not only focus on knowing the facts, but also understanding the reasons behind them. This way, when we encounter problems in actual work, we can solve them with ease.

If you’ve been paying attention, you may have noticed that in today’s lesson we used different toolkits for different problems: jieba for word segmentation and TextRank, NLTK for TF-IDF, and Gensim for LDA. I hope that when you have time after class, you will explore the detailed usage and additional functionality of these three tools, because they are truly powerful.

In the section on text representation methods, we left a small hint: word embedding. With the increasingly widespread use of deep learning, there are more and more algorithms and tools for building word embeddings. In upcoming lessons, I will walk through how word embeddings are trained, generated, and used through hands-on development with BERT.

Practice for Each Lesson #

What are the drawbacks of TF-IDF? Try summarizing them by walking through its calculation process.

Looking forward to interacting with you in the comment section. I also recommend sharing this lesson with colleagues and friends who are interested in NLP, so that you can learn and improve together.