25 Summarization: How to Quickly Implement Automatic Summary Generation #

Hello, I am Fang Yuan.

When we open a news app or website, we are often drawn in by headlines like “Shocking to a Billion People” or “You Must Read This to Save Your Life!” But when we click through, the actual content turns out to be something else entirely. At this point you may wonder whether there is a tool that can extract the key information from an article for us, so that we are no longer misled by clickbait titles. In fact, such a tool is not difficult to build: it can be implemented with automatic text summarization technology.

Automatic text summarization appears in many aspects of our lives: it is used for hot-news aggregation, news recommendation, voice broadcasting, app message push, intelligent writing, and other scenarios. The natural language processing task we are going to discuss today is automatic text summarization.

Problem Background #

Automatic summarization technology refers to automatically extracting a few sentences to summarize the main idea of an entire article, allowing users to understand the intended meaning of the original text by reading the summary.

Extractive vs. Abstractive #

There are two solutions for automatic summarization: extractive and abstractive. The extractive approach involves extracting key sentences from the original text and combining them to create a summary. The other approach, which we will focus on in this lesson, is abstractive summarization. It requires the computer to read the original text and, based on understanding the content, use concise and coherent language to express the main points of the text, including words and sentences that may not have appeared in the original text.

Currently, extractive summarization techniques are relatively mature, but the quality and fluency of the resulting summaries are often not ideal. With advances in deep learning, the quality and fluency of abstractive summarization have improved greatly. However, it still struggles with problems such as long source texts and selecting the right content, so there remains a significant gap compared with human-written summaries.

Language can be expressed in various ways, so machine-generated summaries may not be identical to human-generated summaries. How can we evaluate the quality of automatic summarization? This is where evaluation metrics come into play.

Evaluation Metrics #

The effectiveness of automatic summarization is typically evaluated using ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

The ROUGE evaluation method was inspired by automatic evaluation methods for machine translation and considers the degree of N-gram co-occurrence. The method works as follows: first, multiple experts write manual summaries, forming a reference summary set; then, the automatically generated summary is compared with the reference summaries, and its quality is evaluated by counting the number of overlapping units (n-grams, word sequences, or word pairs). Comparing against summaries from multiple experts improves the stability and robustness of the evaluation.

ROUGE includes the following 4 evaluation metrics:

  1. ROUGE-N, based on n-gram co-occurrence statistics.
  2. ROUGE-L, based on the longest common subsequence (LCS).
  3. ROUGE-S, based on skip-bigram (ordered word pair) co-occurrence statistics.
  4. ROUGE-W, a weighted version of ROUGE-L that favors consecutive matches.
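
To get an intuitive feel for these metrics, here is a minimal sketch that computes ROUGE scores with the third-party rouge-score package (installed via pip install rouge-score); this package is only an assumption for illustration and is not part of the toolchain used later in this lesson.

from rouge_score import rouge_scorer

reference = "the cat was found under the bed"
candidate = "the cat was under the bed"

# Compare a candidate summary against a reference with ROUGE-1, ROUGE-2 and ROUGE-L
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(name, round(result.precision, 3), round(result.recall, 3), round(result.fmeasure, 3))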

Now that we are familiar with the types of automatic summarization and evaluation metrics, let’s learn about a model used for automatic summarization called BART. It has a similar name to BERT, which we discussed in the previous lesson. Let’s first examine its characteristics.

Analysis of BART Principles and Characteristics #

BART stands for Bidirectional and Auto-Regressive Transformers. It is a pre-training model proposed by Facebook AI in 2019 that combines a bidirectional Transformer encoder with an auto-regressive Transformer decoder, achieving state-of-the-art results on text generation tasks. You can find the related paper, “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”, through this link.

We are already familiar with the Transformer introduced in the paper “Attention Is All You Need”. The left half of the Transformer is the Encoder and the right half is the Decoder; their structures are shown in figure (a) and (b) below.

[Figure: (a) the Transformer Encoder and (b) the Decoder]

The Encoder performs self-attention over the input text and produces a contextual representation (embedding) for each token in the sentence. The most classic Encoder-only architecture is BERT, which we learned about in the previous lesson. However, a standalone Encoder is not well suited to text generation tasks.

The input and output of the Decoder are shifted by one position to simulate the inability of the model to see future words during text generation. This approach is called Auto-Regressive. Models based on the Decoder structure, such as GPT, are usually used for text generation tasks, but cannot learn bidirectional contextual information.
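
To make the one-position shift concrete, here is a toy illustration (the tokens and special symbols are made up for the example):

# Toy illustration of the shift used in auto-regressive decoding.
# "<s>" and "</s>" stand for the start- and end-of-sequence tokens.
target        = ["I", "love", "NLP", "</s>"]
decoder_input = ["<s>", "I", "love", "NLP"]  # the target shifted right by one position
# At step t the decoder sees decoder_input[:t + 1] and is trained to predict target[t],
# so it can never peek at the word it is about to generate.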

The BART model is a sequence-to-sequence structure that combines the Encoder and Decoder. Its main structure is shown in the following figure.

[Figure: the overall structure of the BART model]

The structure of the BART model looks similar to the Transformer; the main difference lies in BART’s pre-training stage. First, the original text is corrupted with several types of noise on the Encoder side (such as token masking, token deletion, text infilling, sentence permutation, and document rotation), and then the Decoder is used to reconstruct the original text.

Since BART itself is built on the foundation of sequence-to-sequence and undergoes pre-training, it is naturally suitable for sequence generation tasks, such as question answering, text summarization, and machine translation. While making progress in generation tasks, it can also achieve good performance in some text comprehension tasks.

Next, we will enter the practical stage and use BART to implement automatic abstractive summarization.

Quick Summary Generation #

Here we will still use the Hugging Face Transformers toolkit. The specific installation process was introduced in the previous lesson.

The Transformers toolkit provides a pipeline API for quickly using automatic summarization models. The pipeline aggregates text preprocessing steps and a trained automatic summarization model. With the help of Transformers’ pipeline, we only need a few lines of code to quickly generate text summaries.

Below is an example of using the pipeline to generate a summary, with the code as follows.

from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney's Office by Immigration and Customs Enforcement and the Department of Homeland Security's
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

print(summarizer(ARTICLE, max_length=130, min_length=30))
'''
Output:
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]
'''

The function of the code on line 3 is to construct a pipeline for automatic summarization. The pipeline automatically downloads and caches the trained automatic summarization model. This automatic summarization model is trained on the BART model using the CNN/Daily Mail dataset.

The code from line 5 to 22 is the original text of the article to be summarized. The code on line 24 automatically generates a summary based on the original text. The parameters max_length and min_length limit the maximum and minimum length of the summary. The output result is shown in the code comment above.
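
As a small variation, the pipeline also accepts a model argument, so you can point it at a specific checkpoint instead of relying on the default; here is a short sketch using the publicly released facebook/bart-large-cnn checkpoint.

# Explicitly choose the summarization checkpoint instead of the pipeline default
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
print(summarizer(ARTICLE, max_length=130, min_length=30))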

If you don’t want to use the pre-trained models provided by Transformers and instead want to use your own model or any other model, it’s also simple. The specific code is as follows.

from transformers import BartTokenizer, BartForConditionalGeneration

# Load the BART model and tokenizer fine-tuned on CNN/Daily Mail
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# Tokenize the article (ARTICLE is the text defined in the previous example)
inputs = tokenizer([ARTICLE], max_length=1024, truncation=True, return_tensors='pt')

# Generate the summary with beam search
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=130, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

The process consists of four steps; let’s look at each one.

  • The first step is to instantiate a BART model and tokenizer object. BartForConditionalGeneration is the class of the BART model used for summarization, and BartTokenizer is the tokenizer for BART. Both of them have a from_pretrained() method that can load pre-trained models.

The from_pretrained() function requires a string as the parameter, which can be the path to a local model or the name of a model uploaded to the Hugging Face model hub.

Here, "facebook/bart-large-cnn" is the BART model fine-tuned by Facebook on the CNN/Daily Mail dataset. You can find the specific details of the model here.

Next is the second step, tokenizing the original text. We can use the tokenizer object tokenizer to tokenize the original text ARTICLE and get the word id tensor. return_tensors='pt' indicates that the return value is a PyTorch tensor.

The third step is to generate the summary using the generate() method. The parameter max_length limits the maximum length of the generated summary, num_beams sets the beam width for beam search, and early_stopping indicates whether generation may stop early once enough complete candidates have been found. The output of the generate() method is the word ids of the summary.

The final step is to decode the final summary text using the tokenizer. Using the tokenizer.decode() function, we convert the word ids back to text. The parameter skip_special_tokens indicates whether to remove special tokens such as <s>, </s>, etc.

Fine-tuning BART #

Now let’s take a look at how to fine-tune the BART model with your own dataset.

Model Loading #

The model loading part is the same as before, so I won’t repeat it too much. Here we will use the from_pretrained() function of the BartForConditionalGeneration class to load a BART model.

The code for model loading is as follows. Here we will continue fine-tuning on the summary model trained by Facebook.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

Data Preparation #

Next, let’s prepare the data. Let’s first review the ways we learned to read text datasets before. In Lesson 6, we learned to use the native Dataset class in PyTorch to read datasets; in Lesson 23, we learned to use the torchtext.datasets tool to read datasets. Today, we will learn another new data reading tool: the Datasets library.

The Datasets library is also developed by the Hugging Face team and aims to provide easy access to and sharing of datasets. The official documentation can be found here. Interested students can go take a look to learn more.

The installation of the Datasets library is also very simple. It can be installed using pip:

pip install datasets

or using conda:

conda install -c huggingface -c conda-forge datasets

The Datasets library also includes common datasets and provides us with pre-packaged dataset loading operations. Let’s take a look at an example of loading training data from the IMDB dataset (covered in Lesson 23):

import datasets
train_dataset = datasets.load_dataset("imdb", split="train")
print(train_dataset.column_names)
'''
Output:
['label', 'text']
'''

We can use the load_dataset() function to load the dataset, with the parameter being the name of the dataset or the local file path, and the split parameter used to specify loading the training set, test set, or validation set.
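
The split parameter also supports slicing syntax, which is convenient for quick experiments; here is a small sketch, again using the IMDB dataset:

# Load only the first 10% of the IMDB training split and inspect one example
small_train = datasets.load_dataset("imdb", split="train[:10%]")
print(len(small_train))
print(small_train[0]["text"][:100])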

We can also load a dataset from multiple local CSV files:

data_files = {"train": "train.csv", "test": "test.csv"}
dataset = datasets.load_dataset("csv", data_files=data_files)  # "csv" tells Datasets to read local CSV files
print(dataset)
'''
Example output: (actual output may vary)
DatasetDict({
    train: Dataset({
        features: ['idx', 'text', 'summary'],
        num_rows: 3668
    })
    test: Dataset({
        features: ['idx', 'text', 'summary'],
        num_rows: 1725
    })
})
'''

Simply specify the file paths to be loaded for the training set, test set, or validation set using the data_files parameter. We can use the map() function to perform some preprocessing operations on the dataset. Here’s an example:

def add_prefix(example):
    example['text'] = 'My sentence: ' + example['text']
    return example
updated_dataset = dataset.map(add_prefix)
print(updated_dataset['train']['text'][:4])
'''
Example output (the exact sentences depend on your dataset):
['My sentence: Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.',
'My sentence: Yucaipa owned Dominick\'s before selling the chain to Safeway in 1998 for $2.5 billion.',
'My sentence: They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.',
'My sentence: Around 0335 GMT, Tab shares were up 19 cents, or 4.4%, at A$4.56, having earlier set a record high of A$4.57.']
'''

We first define an add_prefix() function that adds the prefix “My sentence: ” to the “text” field of an example. Then we call the dataset’s map() method to apply this function to every example, which produces the prefixed “text” field shown in the output above.

Next, let’s take a look at how to preprocess a custom dataset for fine-tuning the BART model. The specific code is as follows:

from transformers.modeling_bart import shift_tokens_right
# Note: in recent versions of Transformers this helper lives in
# transformers.models.bart.modeling_bart and also takes a
# decoder_start_token_id argument; adjust the import to your version.

dataset = ...  # The dataset object, which should contain the "text" and "summary" fields, and include the training set and validation set

def convert_to_features(example_batch):
    # Tokenize the source articles and the reference summaries as PyTorch tensors
    input_encodings = tokenizer.batch_encode_plus(example_batch['text'], pad_to_max_length=True, max_length=1024, truncation=True, return_tensors='pt')
    target_encodings = tokenizer.batch_encode_plus(example_batch['summary'], pad_to_max_length=True, max_length=1024, truncation=True, return_tensors='pt')

    labels = target_encodings['input_ids']
    # Shift the summary ids one position to the right to build the decoder input
    decoder_input_ids = shift_tokens_right(labels, model.config.pad_token_id)
    # Padding positions are set to -100 so that they are ignored by the loss
    labels[labels[:, :] == model.config.pad_token_id] = -100

    encodings = {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'decoder_input_ids': decoder_input_ids,
        'labels': labels,
    }

    return encodings

dataset = dataset.map(convert_to_features, batched=True)
columns = ['input_ids', 'labels', 'decoder_input_ids', 'attention_mask']
dataset.set_format(type='torch', columns=columns)

First, the custom dataset needs to be loaded. You should note that this dataset needs to contain both the original text and summary fields, and include the training set and validation set. The method for loading the dataset can be the load_dataset() function we just mentioned.

Since the loaded data needs to undergo a series of preprocessing operations, such as tokenization, before being fed into the model, we need to define a function convert_to_features() to process the original text and summary.

The main operation in the convert_to_features() function is to call the tokenizer to convert the text into token ids. It is worth noting the call to the shift_tokens_right() function, which implements the Auto-Regressive setup we mentioned earlier: it shifts the decoder’s input one position to the right.

Then we need to call the dataset.map() function to perform preprocessing operations on the dataset, with the parameter batched=True indicating support for batched operations.

Finally, the set_format() function selects the data fields required for training and makes the dataset return them as PyTorch tensors. At this point, the data preparation work is complete.
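
As a quick sanity check (the field names follow the preprocessing above), indexing the formatted dataset now returns PyTorch tensors:

sample = dataset['train'][0]
print(type(sample['input_ids']))  # <class 'torch.Tensor'>
print(sample['input_ids'].shape)  # torch.Size([1024])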

Model Training #

After all the preparation work is done, let’s take a look at the model training part. The Transformers toolkit has already encapsulated the Seq2SeqTrainer class for training text generation models, so we don’t need to define the loss function and optimization method ourselves.

The specific training code is as follows.

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir='./models/bart-summarizer',  # Model output directory
    num_train_epochs=1,  # Number of training epochs
    per_device_train_batch_size=1,  # Batch size during training
    per_device_eval_batch_size=1,  # Batch size during evaluation
    warmup_steps=500,  # Number of warm-up steps for the learning rate scheduler
    weight_decay=0.01,  # Weight decay (regularization) coefficient
    logging_dir='./logs',  # Log directory
)

trainer = Seq2SeqTrainer(
    model=model,  # The model to be trained
    args=training_args,  # Training related arguments
    train_dataset=dataset['train'],  # Training dataset
    eval_dataset=dataset['validation']  # Validation dataset
)

trainer.train()

First, we define the training arguments: all training-related parameters are specified through the Seq2SeqTrainingArguments class. Then we initialize a Seq2SeqTrainer object, passing in the model, the training arguments, and the training and validation datasets. Finally, we call the train() method to start training.
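
After training finishes, you will typically want to save the fine-tuned model and try it out. Here is a minimal sketch; the save path is just an example, and ARTICLE refers to the text from the earlier pipeline example.

# Save the fine-tuned weights and configuration to disk
trainer.save_model('./models/bart-summarizer')

# Reload the fine-tuned model and generate a summary with it
model = BartForConditionalGeneration.from_pretrained('./models/bart-summarizer')
inputs = tokenizer([ARTICLE], max_length=1024, truncation=True, return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=130, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))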

Summary #

Congratulations on completing today’s learning task and, with it, all of the learning content of this PyTorch course.

In this lesson, we first learned about the principles and features of the BART model. This model is a very practical pre-trained model that can help us generate text summaries. Then, with the help of examples, we learned how to quickly build an automatic text summarization project using PyTorch, including using the pipeline of Transformers to quickly generate text summaries and fine-tuning the BART model.

Because the BART model has an auto-regressive Transformer (Decoder) structure, it can be used not only for summary generation but also for other text generation tasks, such as machine translation and dialogue generation. With a basic understanding of its principles and of the fine-tuning process, you can easily use BART to accomplish text generation tasks. We look forward to seeing you apply this knowledge and conduct more experiments.

Through the practical projects we’ve discussed and implemented, we have explored two image projects and three natural language processing projects. By now, you should have a clear understanding of how to build your own deep learning network using PyTorch. When solving practical problems, we need to start with the principles, choose the appropriate model, and remember that PyTorch is just a tool to assist us in implementing the network we need.

In addition to automatic summarization, the common approach for the other four projects is to convert the problem into a classification problem. We don’t need to go into detail about image and text classification. Image segmentation is essentially determining which class a pixel belongs to, and sentiment analysis is determining whether a piece of text is positive or negative. Automatic summarization, on the other hand, is a generation task, typically handled with a sequence-to-sequence model.

These are the model-building ideas I hope you have grasped through a series of practical training.

Thought Exercise #

Since the introduction of BERT in 2018, it has achieved great success, and the academic community has subsequently proposed various related models, such as the BART model we learned today. Please look up other models in the BERT series and read the relevant papers to learn about their principles and characteristics.

Feel free to interact with me in the comments section and share this lesson with more colleagues and friends to learn and progress together.