043 Document Understanding Step One - Document Classification #

Over the past few weeks, this column has covered classic Information Retrieval techniques and machine-learning-based Learning to Rank algorithms, and we also spent some time on the core techniques of Query Understanding, a crucial component of search. Last week, we discussed in detail how to evaluate a search system from both online and offline perspectives.

This week, our discussion will shift to another important component of search: Document Understanding. This involves extracting various features from documents to help retrieval algorithms find more relevant documents.

The most basic step in document understanding is Document Classification, which means determining which category of information a document expresses. Today, I will talk to you about some basic concepts and techniques of document classification, so that you can get a basic picture of how the area has developed and been studied.

Types of Document Classification #

If we consider document classification as a supervised learning task, then the following types of document classification are commonly used in various applications.

The first category is binary classification, also known as binary document classification, which aims to categorize documents into two different classes. For example, classifying documents as “business” or “non-business”.

The second category is multiclass classification, which involves determining which of several different classes a document belongs to. For example, categorizing documents as “art,” “business,” “computing,” or “sports”.

Multiclass classification can be further divided into three subcategories.

The first subcategory is “multiclass-single-label-hard classification”, where each document can only be assigned a unique label in a multiclass classification problem, and all classes are mutually exclusive.

The second subcategory is “multiclass-multilabel-hard classification”, where each document can be assigned to several classes at once, but each assignment is a definite yes-or-no decision rather than a probability.

The final subcategory is “multiclass-soft classification”, where each document is assigned probabilities to indicate its membership to multiple classes.
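The relationship between the three subcategories can be sketched in a few lines of Python. This is only an illustration: the probability values, the function names, and the 0.3 threshold are assumptions chosen for the example, not part of any particular system.

```python
from typing import Dict, List


def single_label_hard(probs: Dict[str, float]) -> str:
    """Single-label hard decision: keep only the most probable class."""
    return max(probs, key=probs.get)


def multilabel_hard(probs: Dict[str, float], threshold: float = 0.3) -> List[str]:
    """Multilabel hard decision: a yes/no answer per class via a threshold."""
    return sorted(c for c, p in probs.items() if p >= threshold)


# Hypothetical soft classification output for one document.
soft = {"art": 0.05, "business": 0.55, "computing": 0.35, "sports": 0.05}

print(single_label_hard(soft))  # business
print(multilabel_hard(soft))    # ['business', 'computing']
```

In this sketch the soft output is the most general form, and both hard variants are derived from it by discarding information.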

Apart from these distinctions, the set of classes itself can be treated either as a flat list or as an organized structure. When documents are classified into a hierarchy of classes, the task is referred to as “hierarchical classification”. In this case, a document assigned to a node belongs to every category on the path from the root node down to that node. Generally, higher-level nodes are more abstract than lower-level nodes.
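As a small illustration of hierarchical membership, the sketch below stores a made-up category tree as a child-to-parent mapping and lists every category a document belongs to once it has been assigned to one node. The taxonomy and the name `ancestors_path` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy: each category maps to its parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "computing": "root",
    "sports": "root",
    "software": "computing",
    "hardware": "computing",
}


def ancestors_path(category: str) -> List[str]:
    """Return the categories from the root down to the given node."""
    path: List[str] = []
    node: Optional[str] = category
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


print(ancestors_path("software"))  # ['root', 'computing', 'software']
```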

Classic Features of Document Classification #

After understanding the basic types of document classification, let’s discuss the classic features used in document classification.

The first thing that comes to mind is using the original text of the document. The most direct textual feature is each individual word, whether English or Chinese. This representation, which discards word order entirely, is called the “bag-of-words model”.

According to many practitioners’ reports, the “bag-of-words model” is still a very effective feature representation in practice, even though it ignores word order. In the “bag-of-words model”, each word can be weighted using TF-IDF or the language models introduced previously. If you need a refresher on TF-IDF and language models, I recommend reviewing the earlier content.
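A minimal sketch of a TF-IDF weighted bag-of-words representation follows. It assumes one common variant (term frequency as a relative count, inverse document frequency as `log(N / df)`) and whitespace tokenization; the function names and the sample documents are made up for illustration, and real systems often use smoothed variants.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Map each document to a {word: tf * idf} dictionary (word order is lost)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


docs = [
    "stocks rise as markets rally",
    "markets fall as stocks slide",
    "team wins the final match",
]
vecs = tfidf_vectors(docs)
# "rise" occurs in one document, "stocks" in two, so "rise" gets the larger weight.
print(vecs[0]["rise"] > vecs[0]["stocks"])  # True
```

Because each vector is keyed by word only, two documents containing the same words in a different order receive identical representations.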

In addition to the “bag-of-words model”, there are other attempts to preserve some or all of the word order.

For example, the “n-gram” method we discussed before is a very effective way to preserve partial word order in the text representation. However, the biggest problem with n-grams is that they greatly enlarge the feature space and reduce the observed frequency of each n-gram tuple, leading to sparsity problems.
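The sparsity effect is easy to see on a toy sentence. The sketch below (the helper name `ngrams` and the sample sentence are illustrative) extracts n-grams for n = 1, 2, 3 and reports how many distinct features appear and how often the most frequent one is observed.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token tuples in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat the cat slept".split()

for n in (1, 2, 3):
    counts = Counter(ngrams(tokens, n))
    top_count = counts.most_common(1)[0][1]
    print(n, len(counts), top_count)
# 1 6 3
# 2 7 2
# 3 7 1
```

As n grows, the number of distinct features rises while each feature is seen fewer times, which is the sparsity problem described above.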

In addition to n-grams, in recent years, with the rise of deep learning, a newer approach is to use “recurrent neural networks” (RNN) to model sequences, which in this case are sequences of words or sentences. Many studies have reported that this approach performs significantly better than the “bag-of-words model”.

In addition to the original text of the document, the document’s layout and formatting are also important. Some fields are obviously informative, such as the title of a document. Some documents have structures such as “chapters” and “paragraphs”, and the headings of these sections are a strong guide to the main content of the document. Therefore, modeling the different “fields” (sometimes called “domains”) of a document may have a significant impact on document classification.
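One simple way to model fields, sketched below under the assumption that a document is already split into named fields, is to prefix every token with the field it came from, so that a word in the title and the same word in the body become separate features that a classifier can weight differently. The function name and the sample document are hypothetical.

```python
from collections import Counter
from typing import Dict


def field_features(doc: Dict[str, str]) -> Counter:
    """Build bag-of-words counts where each feature is '<field>:<token>'."""
    feats: Counter = Counter()
    for field, text in doc.items():
        for token in text.lower().split():
            feats[f"{field}:{token}"] += 1
    return feats


doc = {
    "title": "Quarterly earnings report",
    "body": "The report covers earnings and revenue for the quarter",
}
feats = field_features(doc)
print(feats["title:report"], feats["body:report"], feats["body:the"])  # 1 1 2
```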

Furthermore, for certain special documents, considering only the basic text information may not be sufficient. For example, the original HTML representation of a modern webpage may differ significantly from the visual effects presented in a browser. Therefore, for webpages, we may also need to extract features using the visual effects presented in a browser.

An isolated document carries limited information on its own. On the Internet, however, many documents are not isolated: webpages are connected to one another through hyperlinks, and the pages linked to the current webpage may provide additional information about it.

Among these surrounding pages, one type deserves special mention: pages that link to the target webpage we want to classify. The text of such a link (its anchor text) often describes some trait of the target webpage, and even the text surrounding the link can be informative.

For example, suppose the current webpage is the homepage of Microsoft Corporation. The page itself may contain little descriptive text because it consists mostly of images, but surrounding pages may link to it with anchor text such as “Official Website of Microsoft Corporation”. From these anchor texts we learn that the page is about “Microsoft Corporation”, and if we also know that Microsoft Corporation is a software company, classifying the webpage becomes relatively easy.
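Collecting anchor text per target page can be sketched as a simple grouping step. The link records, URLs, and function name below are hypothetical placeholders; a real crawler would supply its own link data, and the grouped text would then feed into the same feature pipeline as the page’s own text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each link is (source_url, target_url, anchor_text); all values are made up.
Link = Tuple[str, str, str]


def anchor_text_by_target(links: List[Link]) -> Dict[str, List[str]]:
    """Group the anchor texts of incoming links by the page they point to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for _source, target, anchor in links:
        grouped[target].append(anchor.lower())
    return dict(grouped)


links = [
    ("https://news.example/a", "https://corp.example/", "Official Website of Example Corp"),
    ("https://blog.example/b", "https://corp.example/", "Example Corp software downloads"),
    ("https://blog.example/b", "https://other.example/", "a sports club"),
]
anchors = anchor_text_by_target(links)
print(anchors["https://corp.example/"])
```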

Based on this idea, we can try to use more information from surrounding documents. However, it is worth noting that the information carried by surrounding documents also contains a lot of “noise”. Various research efforts have looked at how to extract the more valuable information from surrounding documents, which I will not elaborate on here.

Algorithms for Document Classification #

Based on the different types of document classification that we just discussed, we can directly use known and familiar supervised learning algorithms and models.

For a simple binary document classification problem, “Logistic Regression”, “Support Vector Machines” (SVM), and “Naïve Bayes Classifier” are all capable of handling the task. And for multi-class classification problems, which are also standard supervised learning settings, the algorithms and models mentioned earlier can be modified to address them.
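To make one of these models concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing, written from scratch for illustration. The function names and the four training sentences are invented; words never seen in training are simply skipped, which is one possible design choice among several.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train_naive_bayes(examples: List[Tuple[str, str]]):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs: Counter = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text: str) -> str:
    """Return the class with the highest log prior plus smoothed log likelihood."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen during training
                score += math.log(
                    (word_counts[label][token] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes([
    ("stocks and markets rally", "business"),
    ("company reports quarterly earnings", "business"),
    ("team wins the final match", "sports"),
    ("player scores in the match", "sports"),
])
print(predict(model, "markets and earnings"))  # business
print(predict(model, "the final match"))       # sports
```

The same code handles more than two classes without modification, since it just compares one score per class.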

In recent years, deep learning has made significant advancements in various fields. In the domain of document classification, different deep learning models have also demonstrated certain advantages.

It is important to note that not all classification algorithms natively produce probabilistic outputs. This means that modeling a “multiclass-soft classification” problem poses some challenges for them. Support Vector Machines are one such case: by default, a Support Vector Machine does not output the probability that a data instance belongs to each class.

Therefore, some additional technique is needed. In practical applications, we often use a method called “Platt Scaling.” Simply put, it treats the output score of the Support Vector Machine as a new feature and fits a logistic regression model on it, which maps the score to a probability.
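The idea can be sketched as a one-feature logistic regression fitted on classifier scores. The code below is a simplified illustration, not Platt's full procedure: it uses plain gradient descent, the sign convention `sigmoid(a * score + b)`, and unmodified 0/1 targets, whereas the original method also regularizes the targets. The scores and labels are made-up held-out data standing in for SVM decision values.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores: List[float], labels: List[int],
              lr: float = 0.1, epochs: int = 2000) -> Tuple[float, float]:
    """Fit p(y=1 | score) = sigmoid(a * score + b) by minimizing log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrated_probability(score: float, a: float, b: float) -> float:
    """Convert a raw decision score into a probability with the fitted sigmoid."""
    return sigmoid(a * score + b)


# Hypothetical decision scores (signed distances to the margin) and true labels.
scores = [-2.1, -1.3, -0.4, 0.2, 0.3, 1.1, 1.8, 2.5]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
print(calibrated_probability(2.5, a, b) > calibrated_probability(-2.1, a, b))  # True
```

In practice the sigmoid should be fitted on data that was not used to train the underlying classifier, otherwise the calibrated probabilities tend to be overconfident.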

In addition to using basic supervised learning approaches for document classification, another method is to exploit the relationships among documents, known as “Relational Learning.” Relational Learning aims to improve the effectiveness of document classification by considering the relationships between documents. Many methods in this area rely on the idea that “similar pages are likely to belong to the same class.”

If we are dealing with a “hierarchical classification” scenario, similar pages are likely to be closer in the hierarchy. “Similarity” can be defined based on textual information or the positions in a “graph” formed by documents.

For example, the subpages of a company’s website may differ considerably in their text, but because they all belong to the same company, they are closely connected in the larger network of pages and together describe that company. A classifier that takes these relationships into account is therefore likely to place them in the same class.
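A very simple instance of this idea is to let a page inherit the majority label of the pages it is linked with. The sketch below is only one naive way to use link structure, with a made-up graph, made-up labels, and a hypothetical function name; published relational learning methods are considerably more elaborate.

```python
from collections import Counter
from typing import Dict, List, Optional


def classify_by_neighbors(page: str,
                          graph: Dict[str, List[str]],
                          known_labels: Dict[str, str]) -> Optional[str]:
    """Return the most common label among the page's labeled neighbors, if any."""
    votes = Counter(
        known_labels[neighbor]
        for neighbor in graph.get(page, [])
        if neighbor in known_labels
    )
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical link graph: page -> pages it is connected to.
graph = {
    "corp/careers": ["corp/home", "corp/products", "news/article"],
    "lonely/page": [],
}
known_labels = {
    "corp/home": "software",
    "corp/products": "software",
    "news/article": "news",
}

print(classify_by_neighbors("corp/careers", graph, known_labels))  # software
print(classify_by_neighbors("lonely/page", graph, known_labels))   # None
```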

Conclusion #

Today I talked to you about another crucial aspect of modern search technology: the document classification problem in document understanding. As you can see, there is quite a lot to understand about document classification.

Let’s review the key points together: First, I briefly introduced the main types of document classification, including binary classification, multiclass classification, and hierarchical classification. Second, I explained in detail the various features that can be used in document classification, such as the text of the document, the formatting of the document, and the related documents surrounding it. Third, I explained how to use supervised learning and other algorithmic tools to accomplish the task of document classification.

Finally, I leave you with a question to ponder: If a document contains both images and text, how should we organize these features and then incorporate them into our classifier for learning?