
14 Building Networks: One-Stop Implementation of Model Construction and Training #

Hello, I am Fang Yuan.

In the previous lessons, we spent a lot of time on the data pipeline and studied optimization methods, loss functions, and convolution. These topics may still feel a bit scattered, but in fact, almost without noticing, we have already covered the essential ingredients of model training.

In today’s lesson, we will have a mid-term practice session, which is a good opportunity to test our learning progress. I will guide you to use PyTorch to build and train your own model.

Here is how the lesson is arranged. First, I will explain the essential building block for constructing networks, the nn.Module class: how to build a network of your own and train it. In other words, we will understand how networks like VGG and Inception are built and trained. Then, we will explore how to leverage the pretrained models in Torchvision to train our own models.

Building Your Own Model #

Let’s get straight to the point and use PyTorch to build and train a linear regression model to fit the trend distribution in the training set.

Let’s start by randomly generating the training set X and the corresponding labels Y. The specific code is as follows:

import numpy as np
import random
from matplotlib import pyplot as plt

w = 2
b = 3
xlim = [-10, 10]
x_train = np.random.randint(low=xlim[0], high=xlim[1], size=30)

y_train = [w * x + b + random.randint(0,2) for x in x_train]

plt.plot(x_train, y_train, 'bo')
plt.show()  # needed when running as a script; in a notebook the plot renders inline

Plotting the data generated by the code above as a scatter plot gives the figure below:

[Figure: scatter plot of the generated training data]

If you are familiar with regression, you will recognize our model: \(y = wx + b\). Here, x and y correspond to x_train and y_train in the code above, and w and b are the parameters we want to learn.

Alright, let’s see how to build this model. Let’s look at the code first and then explain it in detail.

import torch
from torch import nn

class LinearModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.weight = nn.Parameter(torch.randn(1))
    self.bias = nn.Parameter(torch.randn(1))

  def forward(self, input):
    return (input * self.weight) + self.bias

Through the example of this linear regression model, we can introduce several important points about building a network.

  1. The model must inherit from the nn.Module class.

  2. Override the __init__() method. Layers with parameters to be learned are usually placed in the constructor: for example, the weight and bias in our model, as well as the convolutional layers we studied earlier. In __init__() above, we use nn.Parameter(), which wraps a tensor as a trainable parameter of the nn.Module.

  3. The forward() method must be overridden. As the name suggests, it defines how the model computes its output, that is, the forward propagation. In our example, it produces the final result y = weight * x + bias. Operations without learnable parameters, such as activation functions and Dropout, are generally applied here. (Note that a BatchNorm layer does carry learnable parameters, so as a module it is normally created in __init__().) See the sketch after this list.
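To make this division concrete, here is a minimal sketch of my own (not part of the lesson's regression example). The convolution, which has learnable weights, is created in __init__(), while the parameter-free activation and dropout are applied in forward() through torch.nn.functional:

import torch
from torch import nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        # layers with learnable parameters are created in the constructor
        self.conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

    def forward(self, x):
        # parameter-free operations can be applied directly in forward()
        x = F.relu(self.conv(x))
        x = F.dropout(x, p=0.5, training=self.training)
        return x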

The nn.Module Module #

nn.Module is the base class for all neural network modules. When we design a network structure ourselves, we need to inherit from this class. In fact, the models in Torchvision are also built by inheriting from nn.Module.

It is worth noting that a module instance is callable; calling it executes the forward function, that is, the forward propagation.

Let’s get an intuitive feel for this with code. In the example below, we create an instance of LinearModel called model; calling model(x) is then equivalent to calling the forward method of LinearModel.

model = LinearModel()
x = torch.tensor(3)
y = model(x)

As discussed in the previous lessons, a model computes gradients through forward and backward propagation and then updates its parameters. By now, I suspect very few people would want to write the backpropagation and gradient-update code by hand.

This is where PyTorch shines: during training, its automatic differentiation mechanism performs these tedious calculations for you automatically.

In addition to the above, one thing to pay attention to in the __init__() method is that you must call the parent class's constructor, which is this line of code:

super().__init__()

This is because nn.Module’s __init__() initializes a number of ordered dictionaries and sets that store intermediate state during model training. If we skip this initialization, the model will report the following error:

AttributeError: cannot assign parameters before Module.__init__() call
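You can reproduce the error yourself. Here is an illustrative example of my own: a module that skips the parent constructor and then tries to register a parameter.

import torch
from torch import nn

class BrokenModel(nn.Module):
    def __init__(self):
        # Deliberately NOT calling super().__init__() here
        self.weight = nn.Parameter(torch.randn(1))  # raises the AttributeError above

BrokenModel()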

Model Training #

After our model is defined, it still has to be trained. To train it, we need a loss function and an optimization method. This was covered in the previous lessons (if it feels unfamiliar, review lessons 11 to 13), so here we can go straight to the code.

Here, we choose the MSE loss function and the SGD optimization method.

model = LinearModel()
# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-2, momentum=0.9)

# Convert the training data to float tensors once, outside the loop
x_train_tensor = torch.from_numpy(x_train).float()
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)

for _ in range(1000):
    output = model(x_train_tensor)
    loss = nn.MSELoss()(output, y_train_tensor)
    model.zero_grad()   # clear gradients from the previous step
    loss.backward()     # backpropagate to compute the gradients
    optimizer.step()    # update the parameters

After 1000 epochs of training, we can print the weights and biases of the model to see what they are.

For the trainable parameters of a model, we can use named_parameters() to view them. Please see the code below.

for parameter in model.named_parameters():
  print(parameter)
# Output:
# ('weight', Parameter containing:
# tensor([2.0071], requires_grad=True))
# ('bias', Parameter containing:
# tensor([3.1690], requires_grad=True))

We can see that the weight is 2.0071 and the bias is 3.1690. If you go back and compare them with the w and b we used to generate the training data, you will find they are very close.
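As a quick sanity check of my own (not in the original lesson), you can also feed a new x value through the trained model and compare the prediction with the underlying function 2x + 3:

x_new = torch.tensor([4.0])
print(model(x_new))  # should be close to 2 * 4 + 3 = 11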

As we said before, after inheriting from nn.Module we can define our own network model. A Module can also be used as a component of another Module and nested inside a larger network. For example, suppose we want to design a network like this:

[Figure: a network built from many repeated 3x3 and 2x2 convolution blocks]

Looking at the picture, it is easy to see that this network contains many repeated structures, namely the 3x3 and 2x2 convolutions. Following our earlier explanation, we would need to define each convolution in __init__() and then wire them up in forward(). For example, in pseudocode:

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1_1 = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding='same')
        self.conv1_2 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=2, padding='same')
        ...
        self.conv_m_1 = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding='same')
        self.conv_m_2 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=2, padding='same')
        ...
        self.conv_n_1 = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding='same')
        self.conv_n_2 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=2, padding='same')

    def forward(self, input):
        x = self.conv1_1(input)
        x = self.conv1_2(x)
        ...
        x = self.conv_m_1(x)
        x = self.conv_m_2(x)
        ...
        x = self.conv_n_1(x)
        x = self.conv_n_2(x)
        ...
        return x

In fact, this repeated structure can be pulled out into a separate module and then called directly from our model. You can refer to the following code for a concrete implementation:

class CustomLayer(nn.Module):
    def __init__(self, input_channels, output_channels):
        super().__init__()
        self.conv1_1 = nn.Conv2d(in_channels=input_channels, out_channels=3, kernel_size=3, padding='same')
        self.conv1_2 = nn.Conv2d(in_channels=3, out_channels=output_channels, kernel_size=2, padding='same')
        
    def forward(self, input):
        x = self.conv1_1(input)
        x = self.conv1_2(x)
        return x

Then, the CustomModel will look like this:

class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = CustomLayer(1, 1)
        ...
        self.layerm = CustomLayer(1, 1)
        ...
        self.layern = CustomLayer(1, 1)
      
    def forward(self, input):
        x = self.layer1(input)
        ...
        x = self.layerm(x)
        ...
        x = self.layern(x)
        ...
        return x

Those familiar with deep learning will have heard of residual blocks, Inception blocks, and other such building blocks. If you haven’t, that’s okay; I will explain them in the image classification lessons. For now, you just need to know that blocks like these can be implemented with exactly the code structure shown above, as the sketch below illustrates.
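For instance, here is a bare-bones residual block of my own. It is a sketch of the idea rather than a faithful ResNet block, and the fixed channel count is just an illustration:

import torch
from torch import nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                          # keep the input for the shortcut connection
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + identity)      # add the shortcut, then activate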

Model Saving and Loading #

The purpose of training a model is to serve other applications, and that brings us to model saving and loading.

There are two ways to save and load models. A PyTorch model file usually has the extension .pt or .pth, but the extension itself carries no special meaning. Now, let’s continue with our regression model to understand saving and loading.

Way 1: Saving only the trained parameters #

The first way is to save only the trained parameters. When loading the model, you need to construct the network structure in your code first and then assign the saved parameters to it.

The code for saving only parameters is as follows:

torch.save(model.state_dict(), './linear_model.pth')

The first parameter is the model’s state_dict, and the second parameter is the path to save the model.

The state_dict in the code is a dictionary that is automatically generated after the model is defined, which stores the trainable parameters of the model. We can print the state_dict of our linear regression model as follows:

model.state_dict()
Output: OrderedDict([('weight', tensor([2.0071])), ('bias', tensor([3.1690]))])

The way to load the model is as follows:

# Define the network structure first
linear_model = LinearModel()
# Load the saved parameters
linear_model.load_state_dict(torch.load('./linear_model.pth'))
linear_model.eval()
for parameter in linear_model.named_parameters():
    print(parameter)
Output:
('weight', Parameter containing:
tensor([2.0071], requires_grad=True))
('bias', Parameter containing:
tensor([3.1690], requires_grad=True))

Pay attention to the linear_model.eval() call here: some layers (such as Dropout and BatchNorm) behave differently during training and evaluation. When entering the evaluation phase, you need to call eval() to put the model into evaluation mode. Evaluation here means not only validating the model but also the state the model runs in when deployed.
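To see this concretely, here is a small check of my own: a Dropout layer zeroes activations in training mode but passes everything through unchanged in evaluation mode.

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: roughly half the entries are zeroed, the rest scaled by 1/(1-p)
print(drop(x))

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # all ones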

Way 2: Saving the network structure and parameters together #

Compared to the first way, with this method you don’t need to rebuild the network structure when loading the model. (The file is pickled together with the model class, so the class definition, LinearModel in our case, must still be importable when you load it.) The specific code is as follows:

# Save the entire model
torch.save(model, './linear_model_with_arc.pth')
# Load the model, no need to create the network again
linear_model_2 = torch.load('./linear_model_with_arc.pth')
linear_model_2.eval()
for parameter in linear_model_2.named_parameters():
    print(parameter)
# Output:
('weight', Parameter containing:
tensor([2.0071], requires_grad=True))
('bias', Parameter containing:
tensor([3.1690], requires_grad=True))

If this prints values consistent with the parameters of the saved model, the model has been loaded correctly.

Training with Models in Torchvision #

As we mentioned earlier, Torchvision provides some pre-trained network structures that we can directly use. However, we haven’t talked about how to train them on our own dataset. Today, let’s take a look at how to use these network structures to train our own models on our own data.

Fine-tuning Again #

Actually, the main function of the models provided by Torchvision is to serve as pre-trained models during training, to accelerate the convergence of our models. This is what we call fine-tuning.

The most crucial step in fine-tuning, as we mentioned before, is adjusting the number of outputs of the last fully connected layer. Torchvision only reproduces the various network structures rather than wrapping them behind a unified interface, so different networks require different modifications to their fully connected layers.

But don’t worry, this modification is not complicated. You just need to print out the network structure to know how to modify it. Let’s take AlexNet as an example to try fine-tuning.

We actually touched on fine-tuning when discussing Torchvision earlier. Back then, we froze the parameters of the entire network and trained only the last fully connected layer. Today I will introduce another way of fine-tuning: modify the fully connected layer and train the entire network, using the pre-trained model’s parameters as the initialization rather than training from scratch. This method is more commonly used.

Next, let’s see how to fine-tune using models in Torchvision.

First, import the model. The code is as follows:

import torchvision.models as models
alexnet = models.alexnet(pretrained=True)

If your network cannot reach the download server, this step may be slow. In that case, you can download the weights file manually from the URL printed in the console output, and then load it using the parameter-loading method we covered today. The code is as follows:

import torchvision.models as models
alexnet = models.alexnet()
alexnet.load_state_dict(torch.load('./model/alexnet-owt-4df8aa71.pth'))

To verify that the loading succeeded, let the model make a prediction on a test image of a dog (dog.jpg):

[Figure: the dog photo used for the prediction test]

The code is as follows:

from PIL import Image
import torchvision
import torchvision.transforms as transforms

im = Image.open('dog.jpg')

# Note: RandomResizedCrop is a random augmentation; for deterministic inference,
# transforms.Resize followed by transforms.CenterCrop would be more typical.
transform = transforms.Compose([
    transforms.RandomResizedCrop((224,224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

input_tensor = transform(im).unsqueeze(0)
alexnet(input_tensor).argmax()
Output: 263
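If you want to map the predicted index to a human-readable class name in code, here is a sketch. It assumes you have saved the ImageNet class list, one name per line, to a local file with the hypothetical name imagenet_classes.txt:

# hypothetical local copy of the 1000 ImageNet class names, one per line
with open('imagenet_classes.txt') as f:
    classes = [line.strip() for line in f]
print(classes[263])  # Pembroke, Pembroke Welsh corgi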

Looking up the result in the ImageNet class labels, index 263 corresponds to Pembroke, which proves the model was loaded successfully.

There are two key points to pay attention to in this process. First, all the pre-trained image classification models in Torchvision were trained on ImageNet. The input therefore needs to be a 3-channel tensor with shape (B, 3, H, W), where B is the batch size, normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].

Second, in principle most classical convolutional neural networks end in fully connected layers (known as perceptrons in machine learning), which would fix the input size of the network. The models in Torchvision, however, can accept inputs of varying sizes.

This is because Torchvision has adapted the models: some adopt global (adaptive) average pooling, and some replace the last fully connected layers with convolutions. Both approaches allow the network to accept inputs of any size above a minimum. You may not fully understand this point yet; don’t worry, it will become clearer after we cover image classification theory.
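As a concrete illustration of my own, adaptive average pooling reduces feature maps of different spatial sizes to one fixed size, so the fully connected layers that follow always see the same input width:

import torch
from torch import nn

pool = nn.AdaptiveAvgPool2d(output_size=(6, 6))
a = pool(torch.randn(1, 256, 13, 13))   # one input resolution...
b = pool(torch.randn(1, 256, 20, 20))   # ...and a larger one
print(a.shape, b.shape)  # both are torch.Size([1, 256, 6, 6])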

Now let’s get back to fine-tuning. As mentioned earlier, training an AlexNet model requires three-channel data. So, here I use the CIFAR-10 dataset as an example.

The CIFAR-10 dataset consists of 60,000 images divided into 10 categories, with each category containing 6,000 images. Each image is a 32x32 RGB image. Out of the 60,000 images, 50,000 are used as the training set and 10,000 are used as the test set.

It can be said that the CIFAR-10 dataset is very close to real-world project data, as real-world data is usually RGB three-channel data, just like CIFAR-10.

We can use the make_grid method we discussed earlier to visualize the CIFAR-10 dataset. Here is the code:

from torch.utils.data import DataLoader

cifar10_dataset = torchvision.datasets.CIFAR10(root='./data',
                                               train=False,
                                               transform=transforms.ToTensor(),
                                               target_transform=None,
                                               download=True)
# Take a batch of 32 images as a tensor
tensor_dataloader = DataLoader(dataset=cifar10_dataset,
                               batch_size=32)
data_iter = iter(tensor_dataloader)
img_tensor, label_tensor = next(data_iter)  # data_iter.next() was removed in newer PyTorch versions
print(img_tensor.shape)
grid_tensor = torchvision.utils.make_grid(img_tensor, nrow=16, padding=2)
grid_img = transforms.ToPILImage()(grid_tensor)
display(grid_img)  # display() works in Jupyter; in a plain script, use grid_img.show()

Please note that in the above code, for the purpose of visualizing the images, I only applied transforms.ToTensor(). The result is shown below:

[Figure: a grid of CIFAR-10 sample images]

Keep in mind that the images in this dataset are all 32x32, so what you see is the original resolution. The small image size does not affect what we are learning here.

Next, let’s modify the fully connected layer. We can simply print the network structure with the following code:

print(alexnet)
Output:
AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

As you can see, the final fully connected layer takes 4096 input features and produces 1000 outputs. We need to replace it with a fully connected layer that outputs 10 units (CIFAR-10 has 10 classes). The code is as follows:

# Extract input parameters of the classification layer
fc_in_features = alexnet.classifier[6].in_features

# Modify the output classification number of the pretrained model
alexnet.classifier[6] = torch.nn.Linear(fc_in_features, 10)
print(alexnet)
Output:
AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=10, bias=True)
  )
)

As you can see, the output is now 10 units.

Next, let’s train our own model on CIFAR-10, using AlexNet as the pre-trained starting point. First, we need to read the data. The code is as follows:

transform = transforms.Compose([
    transforms.RandomResizedCrop((224,224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
# Note: train=False loads the 10,000-image test split, as in the visualization above;
# for a full training run you would use train=True to get the 50,000-image training split
cifar10_dataset = torchvision.datasets.CIFAR10(root='./data',
                                              train=False,
                                              transform=transform,
                                              target_transform=None,
                                              download=True)
dataloader = DataLoader(dataset=cifar10_dataset,
                        batch_size=32,
                        shuffle=True,
                        num_workers=2)

Note that I have changed the transform to resize the images to 224x224, the minimum input size recommended for the Torchvision models. The model is the modified AlexNet, and the rest of the training is the same as what we discussed before. First, define the optimizer. The code is as follows:

optimizer = torch.optim.SGD(alexnet.parameters(), lr=1e-4, weight_decay=1e-2, momentum=0.9)

Then start training the model. The code should be familiar to you:

# Train for 3 epochs
for epoch in range(3):
    for item in dataloader:
        output = alexnet(item[0])
        target = item[1]
        # Use cross entropy loss function
        loss = nn.CrossEntropyLoss()(output, target)
        print('Epoch {}, Loss {}'.format(epoch + 1 , loss))
        # The meaning of the code below has been explained in our previous article
        alexnet.zero_grad()
        loss.backward()
        optimizer.step()

In this fine-tuning method, all parameters need to be retrained.

For the first method (freezing all parameters and training only the last fully connected layer): after loading the pre-trained model, set all parameters to be untrainable, then replace the final fully connected layer. During training, only that new layer is updated; everything else stays the same. The code is shown below:

alexnet = models.alexnet()
alexnet.load_state_dict(torch.load('./model/alexnet-owt-4df8aa71.pth'))
# Freeze all parameters of the pre-trained network
for param in alexnet.parameters():
    param.requires_grad = False
# Replace the classification layer; the new layer's parameters have requires_grad=True
# by default, so it is the only part of the network that will be trained
alexnet.classifier[6] = torch.nn.Linear(fc_in_features, 10)

That’s it! Now you can try it out yourself.

Summary #

Today’s content mainly revolves around how to build a neural network model on our own. We introduced the nn.Module module and some of its associated methods.

With the ideas I shared in this lesson, whenever you have an idea, you will be able to quickly build a model and put it through training and validation.

In fact, in practical development we rarely build networks from scratch. Most of the time we use pre-built classic networks directly, such as those in Torchvision. And when you encounter a model that is not already packaged in PyTorch, what you learned today will help you build directly on others’ work and train your own models.

Finally, drawing on my own learning and research experience, here are some pointers for students who want to dig deeper into deep learning. So far we have only covered the convolutional layer, but a network contains many other layers, such as Dropout, pooling layers, BN layers, and activation functions. Dropout, pooling, and activation functions are relatively easy to understand, while the BN layer is slightly more involved.

Also, observant readers may have noticed that part of the printed AlexNet structure was built with nn.Sequential. nn.Sequential is a quick way to assemble networks, and with what you gained from this lesson you will find it very simple; I recommend taking a look. A small sketch follows.
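As a minimal sketch of my own, the two-convolution CustomLayer from earlier could be expressed with a single nn.Sequential. Calling the Sequential runs its layers in order, with no explicit forward() needed:

import torch
from torch import nn

block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, padding='same'),
    nn.Conv2d(in_channels=3, out_channels=1, kernel_size=2, padding='same'),
)
out = block(torch.rand(1, 1, 28, 28))  # 'same' padding preserves the spatial size: (1, 1, 28, 28)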

Practice for each lesson #

Please build your own convolutional neural network on the CIFAR-10 dataset and train an image classification model. Since you haven’t learned the principles of image classification yet, I have written the network structure for you. You just need to complete the data loading, the loss function (cross-entropy loss), and the optimization method (SGD).

import torch
import torch.nn as nn
import torch.optim as optim

class MyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3)
        # With 224x224 inputs, conv1 (kernel 3, no padding) outputs 222x222 feature maps,
        # so resize the CIFAR-10 images to 224x224 in your transform
        self.fc = nn.Linear(16 * 222 * 222, 10)

    def forward(self, input):
        x = self.conv1(input)
        # Flatten the feature map before entering the fully connected layer
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x

# Load CIFAR-10 data here

# Create an instance of the network
model = MyCNN()

# Define loss function
criterion = nn.CrossEntropyLoss()

# Define optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Training loop
for epoch in range(num_epochs):  # set num_epochs yourself, e.g. 3
    # Get data batches from the data loader
    
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    
    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Testing/validation loop

# Save the trained model
torch.save(model.state_dict(), 'my_model.pth')

Feel free to leave comments and discuss with me in the message area. If this lesson is helpful to you, I also recommend sharing it with more colleagues and friends to learn and progress together.