
044 Key Steps in Document Understanding - Document Clustering #

On Monday, we shared one of the most fundamental steps in document understanding, which is document classification. The focus there was on determining the category of information expressed in different documents. Today, I will discuss another important component of document understanding: document clustering.

Types of Document Clustering #

As with document classification, let's first look at the types of document clustering. Generally speaking, document clustering is a typical example of unsupervised learning.

Let’s start with an intuitive classification method. If the documents are divided into several clusters that are “mutually unrelated”, it is called “Flat Clustering”. If there is a certain structural relationship between these clusters, it is called “Hierarchical Clustering”.

In “Flat Clustering”, the term “mutually unrelated” means that the clusters into which the documents are divided do not overlap with each other. The characteristic of “Hierarchical Clustering” is that it seeks to find relationships between clusters in order to organize these documents into a hierarchical structure. In this hierarchical structure, the content represented by the root node is usually more abstract, while the content represented by the leaf nodes is more specific.

It is worth noting that whether we use "Flat Clustering" or "Hierarchical Clustering", the biggest difference from document classification is that the clusters and their relationships are not predefined; in other words, the developers are not aware of the existence of these clusters in advance. From this perspective, clustering is indeed a more challenging task than classification, because without predefined labels the difficulty lies in how to measure the quality of the clusters.

In addition to the distinction between “Flat Clustering” and “Hierarchical Clustering”, there is another similar distinction in clustering methods, which is the difference between “Hard Assignment” and “Soft Assignment”.

As the names suggest, under "Hard Assignment" each document, whether in flat or hierarchical clustering, is deterministically assigned to one cluster (or one group of clusters). "Soft Assignment", on the other hand, learns a distribution of documents over clusters, so that every assignment holds with some probability.
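As an illustration (not taken from the lecture), the two assignment styles can be sketched as simple data structures: a hard assignment maps each document to a single cluster index, while a soft assignment keeps a probability distribution over clusters. The document names and probabilities below are invented for illustration.

```python
# Hard assignment: each document maps to exactly one cluster index.
hard_assignment = {"doc1": 0, "doc2": 1, "doc3": 0}

# Soft assignment: each document gets a probability distribution over
# clusters; the probabilities for one document sum to 1.
soft_assignment = {
    "doc1": [0.9, 0.1],
    "doc2": [0.2, 0.8],
    "doc3": [0.55, 0.45],
}

# A hard assignment can always be recovered from a soft one by taking
# the most probable cluster for each document.
hardened = {doc: max(range(len(probs)), key=probs.__getitem__)
            for doc, probs in soft_assignment.items()}
```

Note that the reverse does not hold: a hard assignment throws away the uncertainty information that a soft assignment preserves.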

Applications of Document Clustering #

Why do we emphasize document clustering in the context of search systems?

Firstly, document clustering can help with document retrieval and ranking. Many documents can be grouped together because they are "similar" in some way, and similar documents are likely to satisfy the same "information need". In fact, in "language models" and other probabilistic retrieval models, predicting the relevance of a document often requires drawing additional information from a group of similar documents.

For example, in a "language model" we need to estimate the relevance of a document to a query keyword. A single document provides limited data, so a common strategy is to supplement it with information from the entire collection. If we already have document clusters, we can naturally use a document's cluster to supply this extra information instead of the entire collection.
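One common way to realize this idea, sketched below under my own assumptions (the mixing weight `lam` and the toy counts are invented; the lecture does not prescribe a specific smoothing formula), is to interpolate a document's unigram model with its cluster's unigram model:

```python
def unigram_prob(word, counts):
    """Maximum-likelihood unigram probability from a word-count dict."""
    total = sum(counts.values())
    return counts.get(word, 0) / total if total else 0.0

def smoothed_prob(word, doc_counts, cluster_counts, lam=0.7):
    """Interpolate the document model with the cluster model.

    `lam` is a hypothetical mixing weight; in practice it is tuned."""
    return (lam * unigram_prob(word, doc_counts)
            + (1 - lam) * unigram_prob(word, cluster_counts))

doc = {"search": 2, "ranking": 1}
cluster = {"search": 10, "ranking": 5, "query": 5}

# "query" never occurs in the document itself, but the cluster model
# still gives it non-zero probability.
p = smoothed_prob("query", doc, cluster)
```

The cluster thus acts as a middle ground between the single document (too sparse) and the whole collection (too generic).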

Secondly, document clustering can help organize search results. In the simplest presentation, where all results are displayed in a flat list, users may be overwhelmed by hundreds or thousands of results. Many search engines therefore improve the user experience by imposing some structure on these results.

Of course, we could use the "document classification" method mentioned earlier to organize the returned results by category, making it clear which results belong to each category. The advantage of document clustering over document classification is that clustering can better reflect the intrinsic relationships between documents, rather than forcing them into predefined categories.

Document clustering is not only a powerful tool for displaying search results, but it can also help researchers browse a collection of documents without needing too many prior assumptions. With the help of “hierarchical clustering”, researchers can easily analyze a collection of documents based on the relationships between the clusters. Using document clustering to browse a collection of documents is often an effective step in discovering problems and proceeding to the next steps.

Basic Model of Document Clustering #

The most basic method for document clustering is the “K-means algorithm”.

First, a basic step is to represent documents as "feature vectors". For this we can use the methods discussed on Monday, such as the basic "bag of words" model, which discards word order entirely. In the "bag of words" model, each word can be weighted using TF-IDF or one of the language models we introduced before. There are also two other approaches, "N-grams" and recurrent neural networks (RNNs), which you can review in Monday's content.
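As a rough sketch of this step (using raw term frequency and a logarithmic IDF; many weighting variants exist, and real systems typically rely on a library such as scikit-learn), bag-of-words TF-IDF vectors might be built like this:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF weighted bag-of-words dicts.

    Uses raw term frequency and idf = log(N / df); this is just one
    common choice among many TF-IDF variants."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [["search", "engine", "search"],
        ["search", "ranking"],
        ["neural", "ranking"]]
vectors = tfidf_vectors(docs)
# "search" occurs in two of the three documents, so each occurrence
# carries less weight than "engine", which occurs in only one.
```

Each document then becomes a sparse vector over the vocabulary, ready to be fed to a clustering algorithm.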

After representing documents as "feature vectors", clustering can begin. The basic idea of the K-means algorithm is as follows. Given a dataset, K-means attempts to divide all samples into K clusters. The clusters are mutually exclusive, meaning each sample is assigned to exactly one cluster. The K-means algorithm optimizes an objective function: minimizing the squared error between each sample and the center of its assigned cluster.

Here, the assigned cluster is the one to which the current sample belongs, and the cluster center is the mean of all samples assigned to that cluster. Obviously, as samples are reassigned to different clusters, the cluster centers change accordingly. In simple terms, the objective function of the K-means algorithm ensures that the samples within each cluster stay tightly grouped around that cluster's mean vector: the smaller the value of the objective function, the higher the similarity between samples within each cluster.
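In symbols, the objective described above is usually written as follows (this is the standard textbook formulation, not a formula quoted from the lecture): for clusters $C_1, \dots, C_K$ with mean vectors $\mu_i$,

```latex
J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```

Minimizing $J$ (or equivalently the average squared error, which differs only by a constant factor) encourages every sample to lie close to the mean of its own cluster.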

As with the familiar linear regression and logistic regression models, the objective function itself only characterizes what an optimal cluster assignment looks like; it does not tell us how to obtain one. In practice, directly minimizing this objective function is not easy, and finding its exact optimum is in general an NP-hard problem.

Fortunately, greedy algorithms can generally find good approximate solutions. Below, I will introduce an algorithm that approximately minimizes this objective function through iterative optimization.

First, we initialize the mean vectors; a simple method is to randomly select K points as the initial cluster means. Then, we assign each sample in turn to the cluster whose mean vector is closest to it. After all samples have been assigned, we update the mean vectors. This completes one iteration, and the algorithm repeats these steps over multiple iterations. Once the cluster assignments no longer change between iterations, the current assignment is returned as the result.
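The iterative procedure above can be sketched in a few lines of Python (a minimal illustration on dense vectors; the random initialization, squared Euclidean distance, and stopping test are common implementation choices rather than details fixed by the algorithm description):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Basic K-means on lists of floats; returns (assignment, means)."""
    rng = random.Random(seed)
    # Initialization: randomly pick k distinct points as initial means.
    means = rng.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point goes to the nearest mean
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2
                for a, b in zip(p, means[j])))
            for p in points
        ]
        # Stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each mean becomes the average of its members.
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:
                means[j] = [sum(dim) / len(members)
                            for dim in zip(*members)]
    return assignment, means

# Two well-separated toy groups should end up in different clusters.
assignment, means = kmeans(
    [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]], k=2)
```

Because the result depends on the random initialization, practical implementations usually run the algorithm several times and keep the solution with the lowest objective value.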

Challenges in Document Clustering #

In the final part of today’s presentation, I would like to discuss some challenges in document clustering.

First of all, judging the quality of a clustering and comparing different algorithms has always been a common problem for clustering models, and for unsupervised machine learning in general. Some evaluation methods measure the similarity of data within clusters, on the assumption that data within a cluster should be more similar than data across clusters. However, such internal measures do not necessarily reflect the true quality of a clustering.

Secondly, clustering algorithms have a parameter that is extremely difficult to determine: the number of clusters. For a given dataset, this parameter is usually impossible to know in advance. When the number of clusters is too small, the K-means algorithm may fail to describe the dataset comprehensively; when it is too large, the data may be fragmented into too many pieces. Determining this parameter is therefore a fundamental challenge in clustering research.
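To make the second challenge concrete, here is a small sketch (an invented example, not from the lecture) of the quantity commonly plotted when choosing the number of clusters: the within-cluster sum of squared errors, i.e. the K-means objective. It always decreases as K grows, so one typically looks for an "elbow" where the improvement levels off rather than simply minimizing it.

```python
def within_cluster_sse(points, assignment, k):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for j in range(k):
        members = [p for p, c in zip(points, assignment) if c == j]
        if not members:
            continue
        mean = [sum(dim) / len(members) for dim in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, mean))
                     for p in members)
    return total

points = [[0.0], [0.2], [5.0], [5.2]]
one_cluster = within_cluster_sse(points, [0, 0, 0, 0], 1)
two_clusters = within_cluster_sse(points, [0, 0, 1, 1], 2)
# Splitting the two separated groups reduces the objective sharply;
# further splits would yield only marginal gains, hinting that K = 2
# is the "elbow" for this toy dataset.
```

Heuristics like this help, but they do not remove the fundamental difficulty: there is no single objective criterion for the "right" number of clusters.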

Summary #

Today, I talked to you about the problem of document clustering in document understanding. Let’s review the key points together: firstly, a brief introduction to the types of document clustering. Secondly, a detailed introduction to the application scenarios of document clustering. Thirdly, an explanation of the basic document clustering algorithm, K-means. Fourthly, a brief mention of some challenges of document clustering.

Finally, I’ll leave you with a question to ponder: after obtaining the results of document clustering, can these results be used in other tasks? If so, how can they be utilized?