08 Torchvision Other Interesting Features

Hello, I am Fang Yuan.

In the previous lessons, we have learned about Torchvision’s data loading and commonly used image transformation methods. In fact, besides providing us with pre-built datasets, Torchvision also offers various classic network architectures and pretrained models for deep learning. By directly instantiating these models, we can easily train or use them.

We can utilize these pretrained models to implement tasks such as image classification, object detection, video classification, and more.

Today, let’s learn how to instantiate classic network models and explore other interesting features in Torchvision.

Common Network Models #

Various classic network structures and pre-trained models in Torchvision are stored in the torchvision.models module. Now let’s take a look at what the torchvision.models module specifically provides and how to use these functionalities.

The `torchvision.models` module #

The torchvision.models module contains the definitions of common network models, which can solve four major classes of problems: image classification, image segmentation, object detection, and video classification. The schematic diagrams of image classification, object detection, and image segmentation are shown in the following figure.

Image classification assigns an entire picture to a single class, for example, labeling the first picture on the left of the figure as a cat. Object detection locates the objects in a picture and recognizes the category of each one. In the middle picture, we need to find the positions of the cat, duck, and dog and report the category of each object.

The example on the right side of the figure shows segmentation. Segmentation classifies every pixel in an image, and the per-pixel categories are then used to partition the image into regions.
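
To make the difference between "one label per image" and "one label per pixel" concrete, here is a minimal sketch that uses random scores instead of a real model. The class count and the tiny image size are assumptions chosen only to keep the shapes easy to read.

```python
import torch

torch.manual_seed(0)
num_classes = 3        # hypothetical classes: cat, duck, dog
height, width = 4, 6   # a tiny "image" so the result is easy to inspect

# Classification: one score vector per image -> one label per image
cls_logits = torch.randn(1, num_classes)
image_label = cls_logits.argmax(dim=1)

# Segmentation: one score vector per pixel -> one label per pixel
seg_logits = torch.randn(1, num_classes, height, width)
pixel_labels = seg_logits.argmax(dim=1)

print(image_label.shape)   # torch.Size([1])
print(pixel_labels.shape)  # torch.Size([1, 4, 6])
```

Object detection sits between the two: its output is a list of boxes, each with its own class label.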

In the earlier versions of Torchvision, the torchvision.models module only included a subset of network models for image classification, such as AlexNet, VGG series, ResNet series, and Inception series. Just have a general impression of these networks for now, as I will explain their specific characteristics in detail later in the image classification section.

Now, with the continuous development of deep learning and the broader application of artificial intelligence, the models packaged in the torchvision.models module have become more diverse. For example, the current version of Torchvision (v0.10.0) adds networks for semantic segmentation, object detection, and video classification. The image classification category has also grown to include GoogLeNet as well as the ShuffleNet and MobileNet series, which are designed for mobile devices. These models allow us to stand on the shoulders of giants to see the world.

Instantiate a GoogLeNet network #

Instantiating a network model class gives us a network model object. The class can be a structure we define ourselves or one implemented from the paper of a classic model. In fact, if you implement a class faithfully from the paper of a classic model and instantiate it, you get the same architecture as when you instantiate that network from Torchvision.

Let’s take the GoogLeNet network as an example to explain how to instantiate a network using the torchvision.models module.

GoogLeNet is a deep neural network model based on the Inception module, developed by Google. Don’t underestimate this model - GoogLeNet won the ImageNet competition in 2014 and can utilize computing resources more efficiently compared to previous structures such as AlexNet and VGG.

GoogLeNet is also known as Inception V1 and has been improved in the following two years, resulting in multiple versions such as Inception V2 and Inception V3.

We can create a GoogLeNet model with randomly initialized weights by using the following code:

import torchvision.models as models
googlenet = models.googlenet()

At this point, the GoogLeNet model is merely an instantiated network structure, and its parameters are randomly initialized. It needs to be trained before it can be used for prediction. Aside from predefined network structures, the torchvision.models module also provides us with pre-trained models that we can import and use directly. The code for importing a pre-trained model is as follows:

import torchvision.models as models
googlenet = models.googlenet(pretrained=True)

As you can see, passing pretrained=True during instantiation gives us a pre-trained model. Torchvision takes care of downloading and loading the weights, which makes it very convenient to use. The image classification models in the torchvision.models module are pre-trained on the ImageNet dataset, while the detection, segmentation, and video models are pre-trained on datasets suited to those tasks (for example, COCO for object detection). The weights are fetched through PyTorch's download utilities (torch.utils.model_zoo, whose loading function is now provided by torch.hub). Note that in Torchvision 0.13 and later, the pretrained flag is deprecated in favor of a weights argument.

If you haven’t loaded a network with pre-trained parameters before, the parameters of the model will be downloaded to the cache directory when you instantiate a pre-trained model for the first time. The location of this cache directory can be changed with the TORCH_HOME environment variable (the older TORCH_MODEL_ZOO variable is deprecated). Of course, you can also download the pre-trained weights yourself and copy them to that path.

The following image shows the result of running the instantiation code mentioned above. You can see that the parameters of the GoogLeNet model have been downloaded to the cache directory /root/.cache/torch.
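
If you are unsure where the weights ended up on your own machine, the following sketch asks PyTorch for its hub directory. It assumes a PyTorch release that provides `torch.hub.get_dir()`; downloaded weight files are stored in the `checkpoints` subfolder of that directory.

```python
import os
import torch

hub_dir = torch.hub.get_dir()  # by default ~/.cache/torch/hub, or $TORCH_HOME/hub
checkpoint_dir = os.path.join(hub_dir, "checkpoints")

print("hub directory:       ", hub_dir)
print("checkpoint directory:", checkpoint_dir)

# List any weight files that have already been downloaded
if os.path.isdir(checkpoint_dir):
    for file_name in sorted(os.listdir(checkpoint_dir)):
        print(" -", file_name)
```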

The torchvision.models module also includes Inception V3 and other common network structures. To instantiate a different model, you only need to change the name of the function you call, for example models.inception_v3. The Torchvision documentation lists all the models that can be instantiated from the torchvision.models module.

Model Fine-tuning #

After completing the previous work, you may wonder what instantiating a pre-trained network is useful for. Besides using it directly for prediction, you can also fine-tune the network on top of it.

So what is “fine-tuning”?

For example, suppose your boss assigns you a task on image classification, and the dataset consists of pictures of dogs. You need to classify the images based on the breed of the dog, such as Golden Retriever, Corgi, Border Collie, etc.

The problem is that there are many dog breeds in the dataset, but the data is limited. You find that training a model for image classification from scratch performs poorly and is prone to overfitting. How can this problem be solved? Then you think of using transfer learning, which means using a pre-trained model trained on the ImageNet dataset to achieve your goal.

For example, with the GoogLeNet model we instantiated earlier, all you need to do is to train the final classification layer of the network using your own dataset. By doing this, you can obtain a model for image classification that can distinguish different dog breeds. This is the so-called “fine-tuning” method.

Model fine-tuning refers to training a model on a relatively general and broad dataset to obtain a set of parameters, and then using this pre-trained network and parameters to train on your own task and dataset. Training with a pre-trained model usually yields better results, is easier to converge, and is faster than training with a randomly initialized model. It can achieve satisfactory results even on small datasets.

So why is model fine-tuning so effective? This is because we believe that two models that perform image classification tasks have some similarities in their network parameters. Therefore, transferring the well-trained model parameters to another model is also effective. Even if the two models do not perform exactly the same task, we can still achieve good results by fine-tuning the pre-trained parameters.

The ImageNet dataset has 1000 categories, but the number of dog breeds is far less than 1000. Therefore, after loading the pre-trained model, you need to make some adjustments to the model or data based on your specific problem. Usually, this involves adjusting the number of output categories.

Suppose there are 10 dog breeds in total. We then need to change the number of output classes of the GoogLeNet model to 10. The code to modify the pre-trained model is as follows:

import torch
import torchvision.models as models

# Load the pre-trained model
googlenet = models.googlenet(pretrained=True)

# Read the number of input features of the classification layer
fc_in_features = googlenet.fc.in_features
print("fc_in_features:", fc_in_features)

# Check the number of output features (classes) of the classification layer
fc_out_features = googlenet.fc.out_features
print("fc_out_features:", fc_out_features)

# Replace the classification layer so that the model outputs 10 classes (torch.nn.Linear will be introduced in detail in the section on the principles of image classification)
googlenet.fc = torch.nn.Linear(fc_in_features, 10)
'''
Output:
fc_in_features: 1024
fc_out_features: 1000
'''

First, load the pre-trained model, then read the number of input features of its classification layer, and finally replace that layer with a new one that has 10 outputs. The printed results show that the original classification layer had 1000 outputs, one per ImageNet class.

Other Common Functions #

Previously, in torchvision.transforms, we learned many functions related to image processing. Torchvision also provides some other commonly used functions, such as make_grid and save_image. Let’s take a look at what they can do.

make_grid #

The purpose of make_grid is to combine multiple images into a grid. Its definition is as follows.

torchvision.utils.make_grid(tensor, nrow=8, padding=2)

The meaning of the corresponding parameters in the definition is as follows:

tensor: A Tensor or a list. If the input is a Tensor, its shape should be (B x C x H x W); if the input is a list, the elements in the list should be images of the same size.
nrow: The number of images per row in the grid, default is 8.
padding: The width of the border between sub-images, default is 2 pixels.

The make_grid function is mainly used to display image results from datasets or model outputs. Let’s take the MNIST dataset as an example and combine what we have learned about loading the dataset and image transformation to see the effect of the make_grid function.

The following code uses the make_grid function to display 32 images from the MNIST test set.

import torchvision
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

# Load the MNIST dataset
mnist_dataset = datasets.MNIST(root='./data',
                               train=False,
                               transform=transforms.ToTensor(),
                               target_transform=None,
                               download=True)
# Take a tensor of 32 images
tensor_dataloader = DataLoader(dataset=mnist_dataset,
                               batch_size=32)
data_iter = iter(tensor_dataloader)
img_tensor, label_tensor = next(data_iter)
print(img_tensor.shape)
'''
Output: torch.Size([32, 1, 28, 28])
'''
# Combine 32 images into a grid
grid_tensor = torchvision.utils.make_grid(img_tensor, nrow=8, padding=2)
grid_img = transforms.ToPILImage()(grid_tensor)
# display() is available in Jupyter/IPython; in a plain script, use grid_img.show() instead
display(grid_img)

Combining the code, we can see that the program first loads the MNIST test set using torchvision.datasets, then uses the iterator of the DataLoader class to get a tensor of 32 images at a time, and finally uses the make_grid function to combine the 32 images into one image. The 32 images from the MNIST test set are shown in the following image.

save_image #

Generally, when saving images from model outputs, we need to convert the Tensor data to image type before saving, which can be cumbersome. Torchvision provides the save_image function, which can directly save Tensor as an image. Even if the Tensor data is on CUDA, it will be automatically moved to CPU for saving.

The definition of the save_image function is as follows.

torchvision.utils.save_image(tensor, fp, **kwargs)

These parameters are also easy to understand:

tensor: A Tensor or a list of image Tensors. save_image passes the input through make_grid before saving, so a single image (C x H x W) is saved unchanged, while a batch (B x C x H x W) or a list of images is first arranged into a grid.
fp: The file name (or file object) to save the image to.
**kwargs: Keyword arguments forwarded to make_grid, such as nrow and padding, as described earlier.

Continuing the previous example, let’s directly save the combined image of 32 images. The code is as follows.

# Save the grid tensor (a single 3 x H x W image) as an image file
torchvision.utils.save_image(grid_tensor, 'grid.jpg')

# Save the batch of 32 images; save_image calls make_grid internally with nrow=5
torchvision.utils.save_image(img_tensor, 'grid2.jpg', nrow=5, padding=2)

When the input is a tensor holding a single image, as grid_tensor is here, it is saved directly. The saved image is shown below.

When the input is a batch tensor or a list of images, save_image calls make_grid first, and the make_grid parameters are passed as extra keyword arguments to save_image. In the code, we set nrow=5. The saved image is shown below. Each row contains 5 digits, and the unused positions in the last row are left blank.

Summary #

Congratulations on completing this lesson. With this, we have finished learning all the content of Torchvision.

The focus of today’s lesson was on the use of the torchvision.models module, including how to instantiate a network and how to fine-tune a model.

The torchvision.models module provides us with various classic network structures and pre-trained models in deep learning. We can not only instantiate a randomly initialized network model, but also instantiate a pre-trained network model.

Model fine-tuning allows us to quickly train models on our own small datasets and achieve satisfactory results. However, we need to make some modifications to the pre-trained model or data based on the specific problem. You can flexibly adjust the number of output classes or the size of input images.

In addition to model fine-tuning, I also discussed two interesting functions in Torchvision, make_grid and save_image. I demonstrated them by combining the content we previously learned about reading datasets and image transformations. I believe that using Torchvision tools together with PyTorch will help you achieve more with less effort.

Practice for Each Lesson #

Please use the torchvision.models module to instantiate a VGG16 network.

Feel free to leave a message in the comment section to discuss with me, and I encourage you to share this lesson with colleagues and friends.

I’m Fang Yuan, see you in the next lesson!