
030 ACL 2018 Paper Review - What is End-to-End Semantic Hashing #

Today, let’s take a look at a paper nominated for the Best Paper Award at this year’s ACL conference. The title of the paper is “NASH: Toward End-to-End Neural Architecture for Generative Semantic Hashing.”

Let’s start with a brief introduction to the authors of the paper, focusing on three individuals.

The first author, Dinghan Shen, is a PhD student in the Department of Computer Science at Duke University. He has published multiple papers in the field of natural language processing and machine learning, and has interned at NEC Labs and Microsoft Research.

Co-first author Qinliang Su is currently an associate professor at the School of Data Science and Computer Science at Sun Yat-sen University. He obtained his PhD from the University of Hong Kong and has conducted postdoctoral research at Duke University.

Lawrence Carin, one of the authors, is a professor at Duke University. Carin is an authority in the field of machine learning and is also Dinghan Shen’s advisor.

Major Contributions of the Paper #

In many applications, we often need to find the most similar or nearest documents based on a given document representation and a document library. This is often referred to as “similarity search” or “nearest-neighbor search,” and it has a wide range of applications in recommendation systems, information retrieval, image retrieval, and other fields.

“Semantic hashing” is considered an important and effective method for similarity search. Essentially, semantic hashing represents documents as discrete, binary vectors that preserve the similarity relationships between documents in the original space. Because the hash codes carry this semantic structure, the process is often described as hashing with semantics, which is the origin of the name “semantic hashing.”

Once we transform documents into the semantic hashing space, the calculation of document similarity becomes the computation of the distance between discrete vectors using the “Hamming distance.” In today’s computer architectures, the calculation of Hamming distance between millions of documents can be done in just a few milliseconds. Therefore, one advantage of semantic hashing is its efficiency in computation while preserving the semantic information of the original space.
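To make this concrete, here is a minimal Python sketch of a Hamming-distance lookup (the 64-bit codes below are made up for illustration): comparing two codes is just an XOR followed by a bit count, which is why the search is so cheap.

```python
import numpy as np

# Hypothetical 64-bit hash codes for three documents, stored as unsigned integers.
codes = np.array([0x3F29A4C801D5E7B2, 0x3F29A4C801D5E7B0, 0x9C0712FFEE34A051],
                 dtype=np.uint64)

def hamming_distances(query, database):
    """Hamming distance from one code to every code in the database: XOR, then count bits."""
    xor = np.bitwise_xor(database, query)
    bits = np.unpackbits(xor.view(np.uint8).reshape(len(database), -1), axis=1)
    return bits.sum(axis=1)

print(hamming_distances(codes[0], codes))  # [0, 1, ...]: identical code, near-duplicate, distant code
```

Modern CPUs also have dedicated population-count instructions, which is part of why scanning millions of such codes takes only milliseconds.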

So, does semantic hashing, with such apparent advantages, have any limitations?

Although there has been considerable research on generating hashes for textual data, existing methods suffer from some obvious problems, the most significant one being that most of these methods require two stages.

There are two main approaches to these methods.

The first approach requires us to first learn the binary hash of documents in an unsupervised manner. Then, we need to train L binary classifiers to predict the L binary bits of the hash value. This step involves supervised learning.

The second approach involves learning continuous representation vectors for documents and then discretizing the continuous values during the testing phase.

Obviously, regardless of which approach is taken, these two-stage methods yield suboptimal results, because the optimization of the two stages is disconnected. Moreover, converting continuous representation vectors into binary codes usually relies on heuristic rules, which can lose semantic information.
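As an illustration of the kind of heuristic used in the second approach, here is a small sketch that binarizes continuous embeddings by thresholding each dimension at its median. This is just one common rule (the exact rule varies by method), and it shows how information about how far a value sat from the threshold is simply thrown away.

```python
import numpy as np

# Hypothetical continuous embeddings from a first-stage model: 4 documents, 8 dimensions.
embeddings = np.random.randn(4, 8)

# A common heuristic: threshold each dimension at its median across documents,
# so roughly half of the documents get a 1 in every bit position.
thresholds = np.median(embeddings, axis=0)
binary_codes = (embeddings > thresholds).astype(np.uint8)

print(binary_codes)  # 0/1 codes; any information below/above the threshold is collapsed
```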

Based on these issues, this paper proposes an “end-to-end” training process for semantic hashing. The authors state that this work is the first to learn the hash codes in a single stage. Additionally, the authors leverage the recent Neural Variational Inference (NVI) framework to learn binary document codes, achieving good results in both unsupervised and supervised settings.

Another contribution of this paper is establishing a connection between the proposed method and Rate Distortion Theory. Based on this connection, the authors demonstrate how to “inject” data-dependent noise into the model’s training process to achieve better performance.

Core Method of the Paper #

The authors first view generating a “semantic hash” from a document as an encoding and decoding process. The binary hash vector is treated as a latent variable that represents the document. In other words, the hash vector is a set of latent variables generated from the document’s features (such as TF-IDF values), and mapping the document’s feature vector to the hash vector is the encoding step.

In previous models, much attention has been given to the encoding process, while the decoding process has rarely been directly modeled. The decoding process refers to the process of converting the generated hash vector back into the document’s feature vector. In other words, we want to be able to regenerate the original data from the hash vector.

Modeling both the encoding and decoding processes of the original data and the intermediate latent variables is a standard approach in current generative neural network models. Here, encoding and decoding each have their own neural networks to express the corresponding conditional probability distributions.

Specifically, the original data X first passes through a multi-layer perceptron and is transformed into a binary intermediate variable Z. This Z is exactly the hash vector we want to obtain. The proposed model then has a second part: reconstructing X from Z, that is, using the hash vector to rebuild the data. Naturally, we want the reconstructed X to be as close to the original X as possible, in other words, to minimize the distance between them.
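Here is a rough PyTorch-style sketch of this encode–binarize–decode structure. The layer sizes, code length, and the straight-through binarization trick are my own illustrative choices, standing in for the paper’s NVI-based formulation rather than reproducing it, and the MSE loss stands in for the paper’s actual reconstruction likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashEncoderDecoder(nn.Module):
    """Encode -> binarize -> decode; sizes and code length are illustrative."""
    def __init__(self, vocab_size=10000, hidden=500, bits=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU(),
                                     nn.Linear(hidden, bits))
        self.decoder = nn.Sequential(nn.Linear(bits, hidden), nn.ReLU(),
                                     nn.Linear(hidden, vocab_size))

    def forward(self, x):
        probs = torch.sigmoid(self.encoder(x))   # per-bit "on" probabilities
        z_hard = (probs > 0.5).float()           # the binary hash code Z
        # Straight-through trick: use hard bits in the forward pass,
        # but let gradients flow back through the sigmoid probabilities.
        z = z_hard + probs - probs.detach()
        x_recon = self.decoder(z)                # reconstruction of the TF-IDF features
        return z_hard, x_recon

model = HashEncoderDecoder()
x = torch.rand(2, 10000)                         # stand-in for two TF-IDF document vectors
codes, recon = model(x)
loss = F.mse_loss(recon, x)                      # distance between X and its reconstruction
```

The straight-through step is one common way to keep the code strictly binary in the forward pass while still letting gradients reach the encoder.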

The authors found that learning a binary encoding from data is a typical “lossy source coding” problem in information theory. Therefore, “semantic hashing” can also be seen as a “rate distortion tradeoff” problem.

What does that mean? It means we want to encode the information at a low rate, while also hoping that the data reconstructed from the code stays as close as possible to the original data. These two goals pull against each other, a case of “you can’t have your cake and eat it too,” so they must be balanced to achieve the best result.
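In its textbook form (this is the generic formulation, not necessarily the paper’s exact notation), the tradeoff is usually written as a single objective that weighs distortion against rate:

```latex
% Generic rate-distortion Lagrangian: a smaller beta favors faithful reconstruction,
% a larger beta favors a shorter (lower-rate) code.
\mathcal{L} \;=\; \underbrace{\mathbb{E}\big[\, d(x, \hat{x}) \,\big]}_{\text{distortion}}
\;+\; \beta \, \underbrace{R(z)}_{\text{rate}}
```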

By formulating the model’s objective function as a rate-distortion tradeoff, the authors realized that the variance of the Gaussian in the conditional distribution from the code to the reconstructed data is what controls this balance. This variance therefore needs to be adjusted for different documents to achieve the best encoding under the rate-distortion tradeoff. Rather than optimizing the variance directly, the authors add some random noise around a fixed variance value, which yields good results in their experiments.
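As a rough sketch of this “fixed variance plus random perturbation” idea (my own simplification; the paper’s exact noise scheme may differ), the variance used in the Gaussian reconstruction term could be perturbed per document like this:

```python
import torch

def noisy_variance(batch_size, base_log_var=0.0, noise_scale=0.1):
    """Fixed base variance plus a random per-document perturbation (values are illustrative)."""
    noise = noise_scale * torch.randn(batch_size)   # one perturbation per document
    return torch.exp(base_log_var + noise)          # exponentiate to keep the variance positive

sigma2 = noisy_variance(batch_size=8)
# In a Gaussian reconstruction term, each document's squared error would be weighted by
# 1 / sigma2, so perturbing sigma2 shifts the rate-distortion balance per document.
```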

Experimental Results of the Paper #

The authors conducted experiments using three datasets, all of which were first transformed into TF-IDF format. The proposed method was compared with five other baseline methods.

Overall, even without the random noise, the proposed method performed much better than the five baselines, and adding the random noise improved performance further. Additionally, the authors demonstrated that the learned binary hash codes do preserve semantic information: documents from the same text category have very similar hash codes, which means, as mentioned before, their Hamming distances are small.

Conclusion #

Today I have shared with you a paper nominated for the Best Paper Award at ACL 2018. With that, our discussion of ACL 2018 comes to a close.

Let’s review the key points: First, the paper addresses the shortcomings of existing semantic hash generation processes by proposing an “end-to-end” training process for semantic hashing. Second, the core method views generating a semantic hash from a document as an encoding and decoding process, and further shows that “semantic hashing” can be seen as a “rate distortion tradeoff” problem. Third, the paper achieved good experimental results.

Finally, I leave you with a question to ponder: In practice, what obstacles stand in the way of using semantic hashing? For example, when applying semantic hashing in a recommendation system, what would be the greatest challenge?