029 ACL 2018 Paper Review - What is Presupposition Triggering in Dialogue and How to Detect It #

Today, I will share with you the second of the best papers from ACL 2018, titled “Let’s do it ‘again’: A First Computational Approach to Detecting Adverbial Presupposition Triggers.”

The authors of this paper are all from the School of Computer Science at McGill University in Canada. The first three authors are students who contributed equally and are listed as co-first authors. Their advisor, Assistant Professor Jackie Chi Kit Cheung, is the last author. Cheung received his PhD from the University of Toronto in 2014, has interned twice at Microsoft Research, and has long been active in natural language processing research.

Main Contributions of the Paper #

The background of this paper is pragmatics, a branch of linguistics that overlaps heavily with semiotics and studies how context shapes and contributes to meaning. Pragmatics covers speech act theory, conversational implicature, discourse in interaction, and the analysis of human language behavior from philosophical, sociological, linguistic, and anthropological perspectives.

Pragmatics also analyzes the cultural norms and conventions that govern language behaviors such as greeting, responding, and persuading. In cross-cultural communication, sociolinguistic knowledge and practical skill are essential for language learners who want to avoid misunderstandings caused by differing language norms.

In pragmatics, “presuppositions” are the assumptions and background knowledge that participants in a conversation take for granted as shared, and they are pervasive in discourse. In this paper, the authors call the expressions that signal presuppositions “presupposition triggers”; these include certain verbs, adverbs, and other phrases. To clarify these concepts, the authors provide an example.

Let’s assume we have two sentences:

  1. John is going to the restaurant again.
  2. John has been to the restaurant.

The first sentence is only felicitous if the second one holds. Specifically, the presupposition trigger “again” relies on the second sentence being true; the first sentence can only be understood against the backdrop of the second. It is worth noting that even when we negate the first sentence, as in “John is not going to the restaurant again,” it still requires the second sentence to hold. In other words, the presupposition triggered here survives negation.

The main contribution of this paper is the detection of adverbial presupposition triggers, including words such as “again,” “also,” and “still.” Prior to this work, there was no published research on detecting this class of words. Being able to detect such presupposition triggers is useful in applications like text summarization and dialogue systems.

To better study this task, the authors also built two new datasets for trigger classification and detection, derived from the well-known Penn Treebank and English Gigaword corpora. Finally, the authors designed an attention-based recurrent neural network (RNN) model to detect presupposition triggers, achieving good results.

Core Methods of the Paper #

Now let’s discuss some details of this paper.

First, let’s see how the datasets were generated. Each data point is a triple consisting of a label (positive or negative), the words of the text, and the part-of-speech (POS) tags corresponding to those words.

A positive data point contains a trigger word, while a negative one does not. Additionally, since we are detecting adverbial trigger words, we also need to know the word, typically a verb, that the adverb depends on. The authors refer to this word as the adverb’s “governor.”

The authors first scan a document for trigger words. When a trigger word is found, they locate its governor and extract the 50 words before the governor, together with all the words from the governor to the end of the sentence. These words form a positive data point. After collecting all the positive data points, the authors use the governors to construct negative data points: they look for sentences that contain the same governor but none of the trigger words, and these sentences become the negative data points.
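To make this extraction step concrete, here is a minimal sketch in Python. It assumes the text has already been tokenized, POS-tagged, and parsed so that each adverb’s governor is known; the trigger list, function names, and the simple list-based input format are illustrative rather than the authors’ actual pipeline.

```python
# Illustrative set of adverbial triggers; the paper's exact inventory may differ.
TRIGGERS = {"again", "also", "still", "too", "yet"}

def positive_sample(doc_tokens, doc_tags, governor_idx, sentence_end, context=50):
    """One positive data point: (label, words, POS tags).

    Keeps up to `context` tokens before the governor plus all tokens from the
    governor to the end of its sentence, as described above.
    """
    start = max(0, governor_idx - context)
    return ("positive", doc_tokens[start:sentence_end], doc_tags[start:sentence_end])

def negative_sample(sent_tokens, sent_tags, governor):
    """A candidate negative data point: a sentence containing the same governor
    but none of the trigger adverbs. Returns None if the sentence does not qualify.
    """
    if governor in sent_tokens and TRIGGERS.isdisjoint(sent_tokens):
        return ("negative", sent_tokens, sent_tags)
    return None
```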

Now, let’s look at the components of the model the authors propose. From a high-level perspective, the basic architecture for identifying trigger words is a bidirectional LSTM, with an “attention” mechanism on top that weights the LSTM’s intermediate states differently depending on the situation.

Specifically, the input of the entire model consists of two parts.

The first part is the conversion of the words in the text into word embeddings. As we have seen repeatedly, this is an essential step when applying deep learning models to natural language processing: it turns discrete symbols into continuous vectors.

The second part is the input of the POS tags corresponding to these words. Unlike words, POS tags are still expressed with discrete features.

The continuous word embeddings and the discrete POS tag representations are then combined as the input to the bidirectional LSTM. The purpose of using a bidirectional LSTM here is to model the order of the input. As mentioned earlier, the trigger word and the governor it depends on are clearly related to the surrounding words in the sentence, so a bidirectional LSTM can capture this structure and produce useful intermediate states.
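As a rough illustration, here is what such an encoder might look like in PyTorch. This is a minimal sketch under my own assumptions (embedding sizes, one-hot POS features, and concatenation of the two inputs), not the authors’ released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriggerEncoder(nn.Module):
    """Word embeddings plus one-hot POS features, fed to a bidirectional LSTM."""

    def __init__(self, vocab_size, num_pos_tags, word_dim=300, hidden_dim=128):
        super().__init__()
        self.num_pos_tags = num_pos_tags
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.bilstm = nn.LSTM(word_dim + num_pos_tags, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, pos_ids):
        # word_ids, pos_ids: (batch, seq_len) integer tensors
        words = self.word_emb(word_ids)                                  # continuous word vectors
        pos = F.one_hot(pos_ids, num_classes=self.num_pos_tags).float()  # discrete POS features
        states, _ = self.bilstm(torch.cat([words, pos], dim=-1))
        return states                                                    # (batch, seq_len, 2 * hidden_dim)
```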

The next step is to transform these intermediate states into the final classification result. Here, the authors propose a “weighted pooling network” and combine it with the “attention” mechanism to perform this transformation.

In this step, the authors essentially apply the pooling operation commonly used by convolutional neural networks (CNNs) in computer vision to the document. Specifically, they stack all the intermediate states produced by the LSTM into a matrix and multiply that matrix by its own transpose, obtaining a new matrix that resembles a correlation matrix. This new matrix captures the pairwise relationships between all of the LSTM-transformed intermediate states of the current sentence.

The final classification result is then obtained by extracting information from this matrix. Extracting it requires weights, and this mechanism of assigning weights according to the situation is exactly the “attention” mechanism. The extracted information then passes through fully connected layers to produce the standard classification output.
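Below is one plausible way to realize this weighted pooling and attention step on top of the encoder sketch above. It follows the description in this section, but the exact formulation (how attention scores are derived from the similarity matrix, the classifier sizes) is my assumption rather than the paper’s precise design.

```python
# (Reuses the torch / nn imports from the encoder sketch above.)

class WeightedPoolingClassifier(nn.Module):
    """Pool the stacked LSTM states H via the similarity matrix H @ H^T,
    weight the time steps with attention, and classify with fully connected layers."""

    def __init__(self, state_dim, num_classes=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, state_dim // 2),
            nn.ReLU(),
            nn.Linear(state_dim // 2, num_classes),
        )

    def forward(self, states):
        # states: (batch, seq_len, state_dim) -- the stacked biLSTM states H
        sim = torch.bmm(states, states.transpose(1, 2))               # H @ H^T: pairwise relations (batch, T, T)
        scores = sim.sum(dim=-1)                                      # how strongly each step relates to the rest
        weights = torch.softmax(scores, dim=-1)                       # attention weights over the time steps
        pooled = torch.bmm(weights.unsqueeze(1), states).squeeze(1)   # weighted pooling -> (batch, state_dim)
        return self.fc(pooled)                                        # class logits ("positive" vs "negative")
```

Chaining the two sketches, encoder followed by classifier, gives an end-to-end model that could be trained with a standard cross-entropy loss on the positive and negative data points described earlier.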

Experimental Results of the Paper #

The authors conducted experiments on the two new datasets mentioned above and compared their model against a series of baselines, including a simple logistic regression, a simplified variant that still uses the bidirectional LSTM structure, and a model that uses a CNN to extract information.

On both datasets, the proposed method outperforms the logistic regression and CNN baselines by around 10% to 20%. Its advantage over the simplified LSTM model is smaller, but still statistically significant.

Summary #

Today I have told you about another of the best papers from ACL 2018.

Let’s recap the key points: First, the background of this paper is pragmatics, and its core contribution is the detection of adverbial presupposition triggers. Second, the core method is a bidirectional LSTM base architecture combined with an “attention” mechanism that assigns weights according to the situation. Third, the paper constructs two new datasets and achieves good experimental results on them.

Finally, I leave you with a question to reflect on: Can we use unidirectional LSTM instead of bidirectional LSTM in this paper?