045 Important Cases in Document Understanding - Multimodal Document Modeling #

This week, we are focusing on an important component of search systems: document understanding. On Monday, we discussed the most basic step of document understanding, document classification, whose main goal is to determine the category of information expressed in different documents. Then, on Wednesday, we covered another important component of document understanding: the basic concepts and techniques of document clustering. Today, I will share with you an important special case of document understanding: multimodal document modeling.

Multimodal Data #

First, let’s find out what multimodal data is.

Multimodal data refers to data that can be expressed in multiple modes. These different modes collectively describe different aspects of the same data point.

For example, consider a photo of President Trump giving a speech at the White House in Washington. The photo itself depicts the scene, which is one mode. The accompanying text stating that this is Trump’s speech at the White House is another mode. The two modes complement each other, and both describe the same scene. It is clear that modeling multiple data modes is a very important topic in the era of multimedia and social media.

When it comes to documents, a mix of text and images is very common. News websites, for example, typically contain a large amount of content that combines text and images. In some special cases, the mix of text and images is asymmetrical. For example, many short documents on social media platforms such as Instagram, Pinterest, and even Twitter contain only images, or images with very limited text. In these cases, text and images become important complementary sources of information.

Furthermore, on e-commerce websites, product images are becoming an increasingly important information channel. Users often rely on images to decide whether to purchase a product, and it is rare to find product information on e-commerce websites that consists of text descriptions alone. For document search, therefore, understanding both image and text information is a core technical challenge.

So, what are the difficulties in modeling multimodal data?

Different modes of data have different characteristics. The challenge of multimodal data modeling lies in effectively exploiting their respective features so that they best serve a specific task, such as classification or clustering.

Basics of multimodal data modeling #

So, how do we model multiple modes of data?

The key idea of multimodal data modeling is data representation. We need to think about how to learn the representation of text and images, and how to connect the representations of text and images.

One direct approach is to use various text features that we are familiar with, and then extract image features using image-related techniques to represent the images. After obtaining the representations of text and images, we simply concatenate the two different feature vectors to obtain a “joint representation”.

For example, let’s say we have learned a 1000-dimensional text feature vector and a 500-dimensional image feature vector. In this case, the joint feature vector would be 1500-dimensional.
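As a minimal sketch of this concatenation idea, assuming the text and image feature vectors have already been extracted (the dimensions below just mirror the example above):

```python
import numpy as np

# Stand-ins for feature extractors we assume already exist; here we just
# simulate their outputs with random vectors of the example dimensions.
text_features = np.random.rand(1000)   # e.g., bag-of-words or TF-IDF features
image_features = np.random.rand(500)   # e.g., color histograms or CNN features

# Simply concatenate the two feature vectors into one joint representation.
joint_features = np.concatenate([text_features, image_features])
print(joint_features.shape)  # (1500,)
```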

A relatively modern approach is to use two different neural networks to represent text and images separately. After the neural networks learn “hidden units” to express image and text information, we then combine these “hidden units” to form the “joint hidden units” of the entire document.
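A rough sketch of this idea, using two small PyTorch encoders whose hidden units are combined into joint hidden units; all layer sizes and names here are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class JointDocumentEncoder(nn.Module):
    """Two separate encoders learn hidden units for text and images,
    which are then combined into joint hidden units for the document."""
    def __init__(self, text_dim=1000, image_dim=500, hidden_dim=128):
        super().__init__()
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        # Fuse the two sets of hidden units into one joint hidden layer.
        self.fusion = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())

    def forward(self, text_x, image_x):
        h_text = self.text_encoder(text_x)     # hidden units for the text
        h_image = self.image_encoder(image_x)  # hidden units for the image
        return self.fusion(torch.cat([h_text, h_image], dim=-1))  # joint hidden units

encoder = JointDocumentEncoder()
joint = encoder(torch.randn(4, 1000), torch.randn(4, 500))
print(joint.shape)  # torch.Size([4, 128])
```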

Another approach is to keep the representations of different modalities separate, instead of merging them. In the case of text and images, this means maintaining the representations or feature vectors of text and images separately, and maintaining the connection between these two representations through some relationship.

One hypothesis is that although different data modalities look different on the surface, as the final presentations of images and text do, they describe the same core content. Therefore, the intrinsic representations of these data modalities are likely to be similar.

Applying this hypothesis here, we assume that the representations of text and images are similar, and this “similarity” can be described using a similarity function, such as the “cosine similarity function”.
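Here is a small sketch of this separate-representation idea: two encoders are kept apart, and their outputs are connected only through cosine similarity. The encoders, dimensions, and the simple loss below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two separate encoders; their representations are never merged.
text_encoder = nn.Linear(1000, 128)
image_encoder = nn.Linear(500, 128)

text_x = torch.randn(4, 1000)   # a batch of text feature vectors
image_x = torch.randn(4, 500)   # the matching image feature vectors

z_text = text_encoder(text_x)
z_image = image_encoder(image_x)

# Connect the two representations through a cosine similarity function:
# matched text/image pairs should score close to 1.
similarity = F.cosine_similarity(z_text, z_image, dim=-1)

# One possible training signal: push the similarity of true pairs toward 1.
loss = (1.0 - similarity).mean()
print(similarity.shape, loss.item())
```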

With these two approaches in mind, a mixed approach naturally emerges. Its basic idea is as follows: different modes of data are assumed to be different surface forms of some shared underlying representation, so a unified underlying representation is needed. However, using only one representation for different data sources is not flexible enough. Therefore, the mixed approach also keeps two different representations, one for text and one for images.

Specifically, the mixed approach is as follows: first, we learn a unified joint representation from the raw data of text and images. Then, we assume that the representations of text and images are “developed” or “generated” from this joint representation. Clearly, in this architecture, we must learn both the joint representation and the separate representations of the two modes that are derived from the joint representation.

It is worth noting that each of these steps, whether from the raw data to the joint representation or from the joint representation to the separate representations, can itself be a fairly simple model, typically a multilayer neural network.
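The sketch below illustrates one way such a mixed architecture could be wired up: a small multilayer network maps the raw text and image inputs to a joint representation, and two further layers generate the modality-specific representations from it. All sizes and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixedMultimodalModel(nn.Module):
    """Mixed approach: raw text and image features are mapped to one joint
    representation, from which the two separate representations are generated."""
    def __init__(self, text_dim=1000, image_dim=500, joint_dim=128, modality_dim=64):
        super().__init__()
        # Raw inputs -> shared joint representation (a small multilayer network).
        self.to_joint = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256), nn.ReLU(),
            nn.Linear(256, joint_dim), nn.ReLU(),
        )
        # Joint representation -> separate per-modality representations.
        self.to_text = nn.Linear(joint_dim, modality_dim)
        self.to_image = nn.Linear(joint_dim, modality_dim)

    def forward(self, text_x, image_x):
        joint = self.to_joint(torch.cat([text_x, image_x], dim=-1))
        return joint, self.to_text(joint), self.to_image(joint)

model = MixedMultimodalModel()
joint, text_repr, image_repr = model(torch.randn(4, 1000), torch.randn(4, 500))
print(joint.shape, text_repr.shape, image_repr.shape)
```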

It is also worth mentioning that, whether we are learning a joint representation or separate representations, the initial inputs for text and images, or even their initial representations, do not have to be learned “end-to-end” from the current data. In fact, using word embedding vectors pre-trained on other datasets as the text input is a very popular and efficient practice.
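For instance, a text input layer can be initialized from embeddings pre-trained elsewhere rather than learned from scratch. The sketch below simulates the pre-trained matrix with random numbers, so the vocabulary and sizes are assumptions for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn

# Pretend this (vocab_size x embedding_dim) matrix was trained on another,
# much larger corpus and loaded from disk.
vocab_size, embedding_dim = 10000, 300
pretrained = np.random.rand(vocab_size, embedding_dim).astype("float32")

# Use the pre-trained vectors as the text input layer instead of learning
# the embeddings end-to-end on the current data.
embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained), freeze=True  # keep the vectors fixed
)

token_ids = torch.tensor([[12, 54, 7, 991]])  # one toy document as token ids
text_input = embedding(token_ids)             # shape: (1, 4, 300)
print(text_input.shape)
```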

Once we have the data representations, the next step is naturally to use them for downstream tasks. Let’s take document classification as an example. After obtaining the joint representation, the next step is to use this new representation as the feature vector of the entire document and learn a classifier for the classification task. For the independent data representations, the typical approach is to learn a classifier for each separate representation. In this way, we have two independent classifiers, one for the text information and one for the image information.

After obtaining these two classifiers, we then learn a third classifier that uses the classification results of the previous two classifiers as the new features. In other words, at this point, the classification results have become the new features for the third classifier. Clearly, this process requires training multiple different classifiers, adding complexity to the entire workflow.
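Here is a compact sketch of that two-stage setup with scikit-learn, where the per-modality classifiers’ predicted probabilities become the features of a third classifier; the toy data and the choice of logistic regression are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: separate text and image feature matrices sharing the same labels.
rng = np.random.default_rng(0)
X_text = rng.random((200, 1000))
X_image = rng.random((200, 500))
y = rng.integers(0, 2, size=200)

# Step 1: one classifier per modality, trained independently.
clf_text = LogisticRegression(max_iter=1000).fit(X_text, y)
clf_image = LogisticRegression(max_iter=1000).fit(X_image, y)

# Step 2: the two classifiers' outputs become features for a third classifier.
# (In practice these would come from held-out predictions to avoid leakage.)
meta_features = np.hstack([
    clf_text.predict_proba(X_text),
    clf_image.predict_proba(X_image),
])
clf_meta = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(clf_meta.predict(meta_features[:5]))
```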

Other Applications of Multimodal Data Modeling #

In addition to the learning representation and classifier construction that I mentioned earlier, there are several other challenging tasks related to multimodal data.

In cases where we have both text and images, we often need to convert or “translate” between these two modes. For example, given an image, how can we generate an accurate textual description of the image? Or, given text, how can we find or even generate an accurate image? Of course, such “translation” is not only limited to text and images but also widely applicable to other data modalities, such as between text and speech, speech and images, and so on.

Building on this translation task, another level of complexity is to “align” text and images, among other types of information. For example, given a set of images, we can generate textual descriptions based on the variations observed in the images.

Another application is visual question answering (VQA), where questions are answered with the help of both images and text. Obviously, to answer such questions effectively, we need to model the image and the textual information simultaneously.

Regardless of whether it is translation or visual question answering, these tasks heavily rely on sequence models, such as RNN or LSTM, which have been extensively used in recent years thanks to the advancements in deep learning.
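As a very rough illustration of how a sequence model can sit on top of an image representation, the sketch below conditions an LSTM decoder on an image feature vector to score caption tokens. The vocabulary, dimensions, and class names are all illustrative assumptions, not a full captioning system.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """An image feature vector initializes an LSTM that scores the next
    caption token at each step (all sizes here are illustrative)."""
    def __init__(self, image_dim=500, vocab_size=10000, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.image_to_hidden = nn.Linear(image_dim, hidden_dim)  # "encoder" side
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, caption_tokens):
        # Use the image representation as the LSTM's initial hidden state.
        h0 = self.image_to_hidden(image_features).unsqueeze(0)  # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(caption_tokens), (h0, c0))
        return self.to_vocab(out)  # per-step scores over the vocabulary

decoder = CaptionDecoder()
logits = decoder(torch.randn(2, 500), torch.randint(0, 10000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 10000])
```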

Summary #

Today I discussed the problem of multimodal data modeling in document understanding. You can see that this is a very hot field, and understanding multimodal data is an important issue in modern data processing.

Let’s review the key points: first, a brief introduction to what multimodal data is; second, a detailed look at the basic ideas of multimodal data modeling, including how to obtain document representations, what joint representations and independent representations are, and how to build the corresponding classifiers; third, a brief mention of other multimodal data modeling tasks and the deep learning trends they rely on.

Finally, I’ll leave you with a question to think about: multimodal modeling brings richer features, but will a classifier trained with these richer features always perform better than one trained on a single data source?