
23 Sentiment Analysis: How to Perform Sentiment Analysis with LSTM #

Hello, I’m Fang Yuan.

Welcome to this lesson on sentiment analysis. Today we are going to talk about text sentiment analysis in machine learning. Text sentiment analysis, also known as opinion mining, opinion extraction, or topic analysis, may sound a bit abstract if we only talk about concepts. Let’s look at an application from our daily life together, and you will understand it easily.

For example, when we choose a product on a shopping website, we usually read the product reviews to see whether there are any negative ones. These reviews express various emotions and attitudes, such as happiness, anger, sadness, praise, criticism, and so on. In the same way, a computer can automatically determine whether a review expresses positive, neutral, or negative sentiment based on the review text. This technique is called sentiment analysis.

If you look more closely, you will find that there are also tags displayed alongside the positive and negative reviews, such as “appropriate volume,” “fast connection speed,” “good after-sales service,” etc. These tags are the topics or opinions automatically extracted by the computer from the review text.

The rapid development of sentiment analysis is due to the rise of social media. Since the early 2000s, sentiment analysis has grown into one of the most active research areas in natural language processing (NLP). It has been widely used in personalized recommendations, business decisions, public opinion monitoring, and other fields.

In today’s class, we will complete a sentiment analysis project and analyze movie review texts together.

Data Preparation #

We currently have a batch of movie review data (IMDB dataset) in our hands, where the reviews are divided into two categories: positive reviews and negative reviews. We need to train a sentiment analysis model to classify the movie review texts.

This problem is essentially a text classification problem, where the research object is the text of movie reviews, and we need to perform binary classification on the text. Now let’s take a look at the training data.

IMDB (Internet Movie Database) is an online movie database that contains 50,000 highly polarized movie reviews. The dataset is divided into a training set and a test set, each containing 25,000 reviews. Both the training set and the test set consist of 50% positive reviews and 50% negative reviews.

How to Read the Dataset Using Torchtext #

We can use the Torchtext toolkit to read the dataset.

Torchtext is a toolkit that contains common text processing tools and common natural language datasets. We can think of it in a similar way to the Torchvision package we have learned before, except that Torchvision is used to handle images, while Torchtext is used to handle text.

Installing Torchtext is also very simple. We can use pip to install it, as shown below:

pip install torchtext

Torchtext includes the IMDB dataset that we want to use, as well as functionalities such as reading corpora, word-to-word vector conversion, word-to-index conversion, and building corresponding iterators, which can meet our text processing needs.

What’s even more convenient is that Torchtext has already included some commonly used text processing datasets in torchtext.datasets. Similar to Torchvision, the datasets will be automatically downloaded, extracted, and parsed when used.

Taking IMDB as an example, we can use the following code to read the dataset:

# Read the IMDB dataset
import torchtext
train_iter = torchtext.datasets.IMDB(root='./data', split='train')
next(train_iter)

The torchtext.datasets.IMDB function has two parameters:

  • root: a string that specifies the location where you want to read the target dataset. If the dataset does not exist, it will be automatically downloaded.
  • split: a string or tuple that indicates the type of dataset to return, whether it is the training set, test set, or validation set. The default value is ('train', 'test').

The return value of the torchtext.datasets.IMDB function is an iterator. Here we read the training set from the IMDB dataset, which contains 25,000 data points, and store it in the variable train_iter.
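
For example, passing a tuple to split returns several splits in one call. Below is a brief illustration; the exact behavior depends on the torchtext version, so treat it as a sketch:

# Read both the training and test splits at once
import torchtext
train_iter, test_iter = torchtext.datasets.IMDB(root='./data', split=('train', 'test'))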

The result of running the program is shown in the figure below. We can see that by using the next() function, we can retrieve one data instance from the iterator train_iter. Each line consists of a sentiment classification and the corresponding review text. “neg” represents negative reviews and “pos” represents positive reviews.

[Figure: one record returned by next(train_iter), showing the sentiment label ("neg" or "pos") followed by the review text]

Data Processing Pipelines #

After retrieving the review texts and sentiment labels from the dataset, we need to convert them into vectors that can be read by computers. The general process of text processing involves tokenization, followed by converting words into IDs based on a vocabulary.

Torchtext provides us with basic text processing tools, including a tokenizer and a vocabulary. We can create a tokenizer and a vocabulary using the following two functions.

The get_tokenizer function is used to create a tokenizer. By feeding the text to the corresponding tokenizer, the tokenizer can tokenize the text according to the rules of different tokenization functions. For example, for English, the tokenizer simply tokenizes the text based on spaces and punctuation marks.

The build_vocab_from_iterator function helps us build a vocabulary from the iterator of the training dataset. After building the vocabulary, we can input the tokenized texts and get the ID of each word.

The code for creating a tokenizer and building a vocabulary is as follows. First, we need to create a tokenizer tokenizer that can process English texts, and then we build a vocabulary vocab based on the training set iterator train_iter of the IMDB dataset.

# Create a tokenizer
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
print(tokenizer('here is an example!'))
'''
Output: ['here', 'is', 'an', 'example', '!']
'''

# Build a vocabulary
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = torchtext.vocab.build_vocab_from_iterator(yield_tokens(train_iter), specials=["<pad>", "<unk>"])
vocab.set_default_index(vocab["<unk>"])

print(vocab(tokenizer('here is an example <pad> <pad>')))
'''
Output: [131, 9, 40, 464, 0, 0]
'''

In the process of building the vocabulary, the yield_tokens function tokenizes each record in the training dataset one by one. In addition, when building the vocabulary, you can add custom special tokens through the specials parameter.

In the above code, we defined two special tokens: "<pad>" and "<unk>", which represent the padding placeholder and out-of-vocabulary words, respectively. Since every movie review has a different length, we cannot directly batch the reviews into a matrix. Therefore, we need to fix the length by truncating long texts or padding short ones with the placeholder.
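
To make the truncation and padding idea concrete, here is a minimal sketch with a hypothetical helper pad_or_truncate; the token IDs and the max_len value are made up for illustration, and 0 is assumed to be the ID of "<pad>" (which holds when "<pad>" is the first entry of specials, as above):

# Illustrative only: fix a token-ID list to a given length
def pad_or_truncate(ids, max_len=8, pad_id=0):
    ids = ids[:max_len]                            # truncate if too long
    return ids + [pad_id] * (max_len - len(ids))   # pad if too short

print(pad_or_truncate([131, 9, 40, 464]))
'''
Output: [131, 9, 40, 464, 0, 0, 0, 0]
'''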

To facilitate later use, we use tokenizer and vocabulary to establish data processing pipelines. The text pipeline takes a piece of text as input and returns the tokenized ID list. The label pipeline converts sentiment classification into numbers, i.e., “neg” is converted to 0 and “pos” is converted to 1.

The specific code is shown below.

# Data processing pipelines
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: 1 if x == 'pos' else 0

print(text_pipeline('here is an example'))
'''
Output: [131, 9, 40, 464]
'''
print(label_pipeline('neg'))
'''
Output: 0
'''

Based on the output of the examples, I believe you can easily understand how to use the text pipeline and label pipeline.

Generating Training Data #

With the data processing pipelines in place, the next step is to generate the training data, which requires handling variable-length text. When the ID lists produced by the text pipeline are converted into tensors for the model, the resulting tensors have different lengths because the sentences vary in length, so they cannot be stacked directly into a matrix.

To solve this, we need to set a maximum sentence length. For example, if the maximum length is 256 tokens, sentences longer than 256 tokens are truncated, and sentences shorter than 256 tokens are padded. Here, we use the "<pad>" token for padding.

All the operations mentioned above can be done in the collate_batch function.

What is the purpose of the collate_batch function? It is responsible for preprocessing a batch of samples extracted by the DataLoader, including generating tensors for the text, generating tensors for the labels, generating tensors for the lengths of the sentences, and performing truncation and padding operations on the texts as mentioned above. Therefore, by passing the collate_batch function to DataLoader through the collate_fn parameter, we can handle variable-length data.

The definition of the collate_batch function and the code for generating training and validation DataLoaders are as follows:

# Generating training data
import torch
import torchtext
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    max_length = 256
    pad = text_pipeline('<pad>')
    label_list, text_list, length_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = text_pipeline(_text)[:max_length]
        length_list.append(len(processed_text))
        text_list.append((processed_text+pad*max_length)[:max_length])
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.tensor(text_list, dtype=torch.int64)
    length_list = torch.tensor(length_list, dtype=torch.int64)
    return label_list.to(device), text_list.to(device), length_list.to(device)

train_iter = torchtext.datasets.IMDB(root='./data', split='train')
train_dataset = to_map_style_dataset(train_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])
train_dataloader = DataLoader(split_train_, batch_size=8, shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=8, shuffle=False, collate_fn=collate_batch)

Let’s go through the steps of this code together. There are five steps in total.

  1. Use torchtext to read the IMDB training dataset and get the training data iterator.
  2. Use the to_map_style_dataset function to convert the iterator to a Dataset type.
  3. Use the random_split function to split the Dataset, with 95% as the training set and 5% as the validation set.
  4. Generate the DataLoader for the training set.
  5. Generate the DataLoader for the validation set.
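
Before moving on, it can be helpful to sanity-check one batch from the training DataLoader. Here is a minimal sketch, assuming the batch_size of 8 and max_length of 256 used above:

# Peek at one batch produced by collate_batch
labels, texts, lengths = next(iter(train_dataloader))
print(labels.shape)    # torch.Size([8]): one label per review
print(texts.shape)     # torch.Size([8, 256]): padded/truncated token IDs
print(lengths.shape)   # torch.Size([8]): original lengths before padding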

With this, the data preparation part is complete, and now we can proceed to the construction of the network model.

Model Construction #

Previously, we have learned about convolutional neural networks (CNNs). CNNs take fixed-size matrices as inputs (e.g., images) and output fixed-size vectors (e.g., probabilities of different classes), making them suitable for image classification, object detection, image segmentation, and more.

However, besides images, there is a lot of information that is not of fixed size or length, such as audio, video, and text. To handle such sequence-related data, we need to use sequential models. One common type of sequential model is the Recurrent Neural Network (RNN), which we will be using to process textual data in today’s case.

However, RNNs themselves have issues with vanishing gradients or exploding gradients during backpropagation. The Long Short-Term Memory (LSTM) network improves upon the RNN structure by incorporating clever gating mechanisms to combine short-term and long-term memory, addressing the problems of vanishing and exploding gradients to some extent.
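
For reference, the standard LSTM cell uses an input gate, a forget gate, and an output gate to decide what to write into, keep in, and read out of its cell state. A compact summary of the update equations in textbook notation (not the exact variable names used inside PyTorch) is:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The cell state c_t carries the long-term memory and the hidden state h_t the short-term output; because the forget gate controls how much of c_{t-1} is preserved at each step, gradients can flow through long sequences more stably than in a plain RNN.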

We will use the LSTM network for sentiment classification. The definition of the model is as follows:

# Model Definition
class LSTM(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, 
                 dropout_rate, pad_index=0):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim, n_layers, bidirectional=bidirectional,
                                  dropout=dropout_rate, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = torch.nn.Dropout(dropout_rate)
    
    def forward(self, ids, length):
        embedded = self.dropout(self.embedding(ids))
        # the lengths passed to pack_padded_sequence must be a 1D int64 tensor on the CPU
        packed_embedded = torch.nn.utils.rnn.pack_padded_sequence(embedded, length.cpu(), batch_first=True,
                                                                  enforce_sorted=False)
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        output, output_length = torch.nn.utils.rnn.pad_packed_sequence(packed_output)
        if self.lstm.bidirectional:
            hidden = self.dropout(torch.cat([hidden[-1], hidden[-2]], dim=-1))
        else:
            hidden = self.dropout(hidden[-1])
        prediction = self.fc(hidden)
        return prediction

The network model structure consists of an Embedding layer that receives a tensor of text IDs, followed by an LSTM layer, and finally a fully connected classification layer. The bidirectional parameter is set to True for a bidirectional LSTM network, and False for a unidirectional LSTM network.

The structure diagram of the network model is illustrated below:

[Figure: model structure: Embedding layer -> (bidirectional) LSTM layer -> fully connected classification layer]
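
To see how tensors flow through this structure, here is a minimal sketch with small, made-up dimensions (the real hyperparameters are set in the next section):

# A toy forward pass to check tensor shapes; all dimensions here are illustrative
toy_model = LSTM(vocab_size=1000, embedding_dim=16, hidden_dim=32, output_dim=2,
                 n_layers=2, bidirectional=True, dropout_rate=0.5)
ids = torch.randint(1, 1000, (4, 10))    # a batch of 4 sequences, 10 token IDs each
length = torch.tensor([10, 8, 7, 5])     # actual lengths before padding
print(toy_model(ids, length).shape)      # torch.Size([4, 2]): one score per class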

Model Training and Evaluation #

Once we have defined the structure of the neural network model, we can proceed with training the model. First, we need to instantiate the network model. The parameters and code for instantiation are as follows:

# Instantiate the model
vocab_size = len(vocab)
embedding_dim = 300
hidden_dim = 300
output_dim = 2
n_layers = 2
bidirectional = True
dropout_rate = 0.5

model = LSTM(vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout_rate)
model = model.to(device)

Since the sentiment polarity of the data is divided into two categories, we set the value of output_dim to 2.

Next, we define the loss function and optimization method, with the following code. This has been covered in previous lessons, so we won’t repeat it here.

# Loss function and optimization method
lr = 5e-4
criterion = torch.nn.CrossEntropyLoss()
criterion = criterion.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)

The training and evaluation functions, in which the loss is calculated for every batch, are defined as follows:

import tqdm
import sys
import numpy as np

def train(dataloader, model, criterion, optimizer, device):
    model.train()
    epoch_losses = []
    epoch_accs = []
    for batch in tqdm.tqdm(dataloader, desc='training...', file=sys.stdout):
        (label, ids, length) = batch
        label = label.to(device)
        ids = ids.to(device)
        length = length.to(device)
        prediction = model(ids, length)
        loss = criterion(prediction, label) # calculating loss
        accuracy = get_accuracy(prediction, label)
        # update gradients
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())
    return epoch_losses, epoch_accs

def evaluate(dataloader, model, criterion, device):
    model.eval()
    epoch_losses = []
    epoch_accs = []
    with torch.no_grad():
        for batch in tqdm.tqdm(dataloader, desc='evaluating...', file=sys.stdout):
            (label, ids, length) = batch
            label = label.to(device)
            ids = ids.to(device)
            length = length.to(device)
            prediction = model(ids, length)
            loss = criterion(prediction, label) # calculating loss
            accuracy = get_accuracy(prediction, label)
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())
    return epoch_losses, epoch_accs

As seen from the code, the loss calculation during training and evaluation is defined in the train function and the evaluate function, respectively. The main difference is that the training process updates the gradients and model parameters, while the evaluation process does not; it only computes the loss and accuracy.

For evaluation, we use accuracy (ACC) as the evaluation metric. The code for calculating accuracy is as follows:

def get_accuracy(prediction, label):
    batch_size, _ = prediction.shape
    predicted_classes = prediction.argmax(dim=-1)
    correct_predictions = predicted_classes.eq(label).sum()
    accuracy = correct_predictions / batch_size
    return accuracy
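
As a quick illustration of how this metric behaves, here is a tiny made-up example:

# 2 samples, 2 classes: the predicted classes are [1, 0], the labels are [1, 1]
prediction = torch.tensor([[0.1, 0.9], [0.8, 0.2]])
label = torch.tensor([1, 1])
print(get_accuracy(prediction, label))
'''
Output: tensor(0.5000)
'''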

Finally, the specific code for the training process is as follows. This includes calculating loss and accuracy, saving the losses list, and saving the best model.

n_epochs = 10
best_valid_loss = float('inf')

train_losses = []
train_accs = []
valid_losses = []
valid_accs = []

for epoch in range(n_epochs):
    train_loss, train_acc = train(train_dataloader, model, criterion, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_dataloader, model, criterion, device)
    train_losses.extend(train_loss)
    train_accs.extend(train_acc)
    valid_losses.extend(valid_loss)
    valid_accs.extend(valid_acc)    
    epoch_train_loss = np.mean(train_loss)
    epoch_train_acc = np.mean(train_acc)
    epoch_valid_loss = np.mean(valid_loss)
    epoch_valid_acc = np.mean(valid_acc)    
    if epoch_valid_loss < best_valid_loss:
        best_valid_loss = epoch_valid_loss
        torch.save(model.state_dict(), 'lstm.pt')
    print(f'epoch: {epoch+1}')
    print(f'train_loss: {epoch_train_loss:.3f}, train_acc: {epoch_train_acc:.3f}')
    print(f'valid_loss: {epoch_valid_loss:.3f}, valid_acc: {epoch_valid_acc:.3f}')

By saving the train_losses list, we can plot the loss curve during training or use the visualization tools described in Lesson 15 to monitor the training process.
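
For example, a minimal plotting sketch with matplotlib (assuming matplotlib is installed) could look like this:

# Plot the per-batch training loss recorded during training
import matplotlib.pyplot as plt

plt.plot(train_losses, label='train loss')
plt.xlabel('batch')
plt.ylabel('loss')
plt.legend()
plt.savefig('train_loss.png')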

With this, we have finished a complete sentiment analysis project. From data loading to model construction and training, I have walked you through it step by step. I hope you can use this as a template to solve your own problems independently.

Summary #

Congratulations on completing today’s learning task. Today we have completed a practical project of sentiment analysis, which is like an initial exploration of natural language processing tasks. Let’s review the key points we learned today.

In the data preparation phase, we can use Torchtext, the text processing toolkit provided by PyTorch. Mastering Torchtext is not difficult: we can compare it with Torchvision, which we studied in detail before. If anything is unclear, refer to the documentation, and I believe you will soon be able to apply it flexibly.

When building the model, we should choose the appropriate neural network based on the specific problem. Convolutional neural networks are often used to handle prediction problems with images as inputs; recurrent neural networks are often used to handle variable-length, sequence-related data. LSTM, compared to RNN, can better solve the problems of vanishing gradients and exploding gradients.

In the following courses, we will also explain two major natural language processing tasks: text classification and text summarization, which include discriminative models and generative models respectively. I believe that by then you will have a deeper understanding of text processing.

Practice for each lesson #

Using the model trained in today’s lesson, write a function predict_sentiment that takes a sentence as input and outputs the sentiment category of that sentence together with its probability.

For example:

text = "This film is terrible!"
predict_sentiment(text, model, tokenizer, vocab, device)
'''
Output: ('neg', 0.8874172568321228)
'''

Feel free to leave a message in the comments to interact with me. I also recommend sharing what you have learned today with more friends and learning and progressing together.