Answers and Explanations for the Thinking Questions #

Hello, I’m the editor, Yuxin. With the Spring Festival approaching, I would like to wish you a happy New Year in advance.

Although it has been quite some time since the column was last updated, Teacher Fangyuan still finds time at work to visit it and read the students’ latest messages, and he has replied to most of the questions in the comments section.

Besides the first batch of students who have been keeping up with the updates, it is also delightful to see more new friends joining and learning through this column. To give everyone enough time to think and investigate on their own, we have chosen to release all the reference answers in one go, as an “extra meal” supplement.

Here, I would like to remind you to first think and practice on your own before looking at the answers. There are hyperlinks for each lesson to facilitate your review.

Lesson 2 #

Question: Continuing from the earlier exercise in which users rated the games, can you calculate each user’s average score across the three games?

Answer:

>>> interest_score.mean(axis=1)
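
For context, here is a minimal sketch (the array contents are made up), assuming interest_score stores one row of ratings per user and one column per game:

import numpy as np

# Hypothetical ratings: one row per user, one column per game
interest_score = np.array([[5, 3, 4],
                           [2, 4, 4],
                           [5, 5, 1]])
# The row-wise mean is each user's average score across the three games
print(interest_score.mean(axis=1))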

Lesson 3 #

Question: Given an array scores with shape (256, 256, 2), where the elements at corresponding positions in scores[:, :, 0] and scores[:, :, 1] sum to 1, generate an array mask from scores. The requirement is: if the value in channel 0 of scores is greater than the value in channel 1, the corresponding position in mask should be 0; otherwise it should be 1.

Here is example code for constructing such a scores array:

import numpy as np

scores = np.random.rand(256, 256, 2)
scores[:, :, 1] = 1 - scores[:, :, 0]

Answer:

mask = np.argmax(scores, axis=2)
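
As an optional sanity check (not part of the original answer), mask should have shape (256, 256) and equal 1 exactly where channel 1 holds the larger value:

assert mask.shape == (256, 256)
assert np.all((mask == 1) == (scores[:, :, 1] > scores[:, :, 0]))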

Lesson 4 #

Question: In PyTorch, there are two functions, torch.Tensor() and torch.tensor(). What is the difference between them?

Answer: torch.Tensor is a class in PyTorch; in fact, it is an alias for torch.FloatTensor. Calling torch.Tensor() invokes the constructor of the Tensor class and produces a tensor of float (float32) type.

torch.tensor(), on the other hand, is a function. Its signature is torch.tensor(data, dtype=..., ...), where data can be a scalar, list, tuple, NumPy array, or another data structure, and the dtype is inferred from data unless specified explicitly.
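
A short example makes the difference visible (the values are arbitrary):

import torch

a = torch.Tensor([1, 2, 3])                       # class constructor: always float32
b = torch.tensor([1, 2, 3])                       # factory function: dtype inferred as int64
c = torch.tensor([1, 2, 3], dtype=torch.float64)  # dtype can also be set explicitly
print(a.dtype, b.dtype, c.dtype)                  # torch.float32 torch.int64 torch.float64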

Lesson 5 #

Question: Suppose we have the following tensor.

>>> A=torch.tensor([[4,5,7], [3,9,8],[2,3,4]])
>>> A
tensor([[4, 5, 7],
        [3, 9, 8],
        [2, 3, 4]])

We want to extract the first element of the first row, the first and second elements of the second row, and the last element of the third row. How can we do it?

Answer:

>>> B=torch.Tensor([[1,0,0], [1,1,0],[0,0,1]]).type(torch.ByteTensor)
>>> B
tensor([[1, 0, 0],
        [1, 1, 0],
        [0, 0, 1]], dtype=torch.uint8)
>>> C=torch.masked_select(A,B)
>>> C
tensor([4, 3, 9, 4])

We just need to create a tensor with the same shape as A, set the positions we want to extract to 1, and convert it to torch.ByteTensor to obtain B. Finally, we apply masked_select exactly as before.
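
Note that newer PyTorch versions expect a boolean mask for masked_select, so an equivalent variant of the same idea is:

>>> B = torch.tensor([[1, 0, 0], [1, 1, 0], [0, 0, 1]], dtype=torch.bool)
>>> torch.masked_select(A, B)
tensor([4, 3, 9, 4])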

Lesson 6 #

Question: In PyTorch, which class should we inherit from when defining a dataset?

Answer: torch.utils.data.Dataset
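
For reference, a minimal (hypothetical) custom dataset only needs to implement __len__ and __getitem__:

import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data, labels):
        # data and labels are hypothetical tensors of the same length
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

dataset = MyDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
print(len(dataset), dataset[0])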

Lesson 7 #

Question: What is the role of the transforms module in torchvision?

Answer: The transforms module in torchvision provides commonly used image operations, such as random cropping, rotation, and conversion among the Tensor, NumPy, and PIL Image data types.
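
An illustrative pipeline (the specific operations and parameters here are just an example) chains several of these operations with transforms.Compose:

import torchvision.transforms as transforms

# Crop, flip, convert a PIL Image to a Tensor, then normalize
transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])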

Lesson 8 #

Question: Instantiate a VGG-16 network using the torchvision.models module.

Answer:

import torchvision.models as models
vgg16 = models.vgg16(pretrained=True)
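
Note that in newer torchvision releases (0.13 and later) the pretrained argument is deprecated in favor of weights, so with the same import the equivalent call is:

vgg16 = models.vgg16(weights=models.VGG16_Weights.DEFAULT)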

Lesson 9 #

Question: Think about it, can the stride be a value other than 1 when the padding is set to ‘same’?

Answer: No, it cannot. Padding 'same' is defined to keep the output size equal to the input size, which only holds when the stride is 1; PyTorch's nn.Conv2d raises an error if padding='same' is combined with any other stride.
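
You can confirm this directly; in recent PyTorch versions the second construction below raises a ValueError complaining that padding='same' is not supported for strided convolutions:

import torch.nn as nn

nn.Conv2d(3, 16, kernel_size=3, stride=1, padding='same')  # works
try:
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding='same')
except ValueError as e:
    print(e)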

Lesson 10 #

Question: Generate a random 128x128 feature map with 3 channels, then build a depthwise separable convolution (a 3x3 depthwise convolution followed by a pointwise convolution with 10 output channels) and apply it to the input data.

Answer:

import torch
import torch.nn as nn

# Generate a 128x128 feature map with 3 channels and add a batch dimension
x = torch.rand((3, 128, 128)).unsqueeze(0)
# Depthwise convolution: the groups parameter equals the number of input channels
dw = nn.Conv2d(x.shape[1], x.shape[1], 3, 1, groups=x.shape[1])
# Pointwise convolution: 1x1 kernels that map the 3 channels to 10 output channels
pw = nn.Conv2d(x.shape[1], 10, 1, 1)
out = pw(dw(x))
print(out.shape)  # torch.Size([1, 10, 126, 126])

Lesson 11 #

Question: Is a smaller value of the loss function always better?

Answer: No, the loss function we learned in this lesson is actually the average loss of the model on the training data. This type of loss function is called empirical risk. In fact, there is another aspect that we need to consider in practical work, which is the complexity of the model. Blindly pursuing the minimization of empirical risk can easily lead to overfitting of the model (you can review the previous content).

Therefore, we also need to constrain the complexity of the model, which we call structural risk. In actual development scenarios, the final loss function is composed of both empirical risk and structural risk. What we want is to minimize the sum of the two.
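
As a minimal sketch (the model and coefficient below are made up), the total loss in code often looks like the empirical risk plus a weighted structural-risk term, here an L2 penalty:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # a made-up model
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))

empirical_risk = criterion(model(x), y)
# Structural risk: an L2 penalty on the parameters, weighted by a small lambda
l2_lambda = 1e-3
structural_risk = sum((p ** 2).sum() for p in model.parameters())
loss = empirical_risk + l2_lambda * structural_risk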

Lesson 12 #

Question: Is deep learning necessarily based on backpropagation?

Answer: Not necessarily. Mainstream deep learning models are trained with backpropagation and gradient descent, but there are also second-order optimization algorithms, such as quasi-Newton methods, that do not rely on gradient descent. Their computational cost is very high, however, so they are rarely used; in practice, the industry mostly adopts backpropagation with gradient descent.

Lesson 13 #

Question: Is a larger batch size always better?

Answer: No. With a larger batch size the model is more likely to converge to a local optimum, while with a smaller batch size the gradient estimates are more easily affected by noise.

Lesson 14 #

Question: Build your own convolutional neural network for image classification on CIFAR-10.

Please construct your own convolutional neural network based on CIFAR-10 and train an image-classification model. Since you have not yet learned the principles of image classification, I have written the network structure for you; you need to complete the data loading, the loss function (cross-entropy loss), the optimization method (SGD), and so on.

Answer: The reference code is as follows.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.RandomResizedCrop((224,224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load CIFAR-10; here the smaller test split (train=False) is used for a quick demonstration
cifar10_dataset = torchvision.datasets.CIFAR10(root='./data',
                                               train=False,
                                               transform=transform,
                                               target_transform=None,
                                               download=True)

dataloader = DataLoader(dataset=cifar10_dataset,
                        batch_size=32,
                        shuffle=True,
                        num_workers=2)

class MyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3)
        # conv1 outputs feature maps of size 222x222
        self.fc = nn.Linear(16 * 222 * 222, 10)

    def forward(self, input):
        x = self.conv1(input)
        # Flatten the feature maps before entering the fully connected layer
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x
    
cnn = MyCNN()
optimizer = torch.optim.SGD(cnn.parameters(), lr=1e-5, weight_decay=1e-2, momentum=0.9)

# Train for 3 epochs
for epoch in range(3):
    for step, (images, target) in enumerate(dataloader):
        output = cnn(images)

        loss = nn.CrossEntropyLoss()(output, target)
        print('Epoch: {} Step: {} Loss: {}'.format(epoch + 1 , step, loss))
        cnn.zero_grad()
        loss.backward()
        optimizer.step()

Lesson 15 #

Question: Based on the examples in the Visdom quick-start guide, we now need to generate two sets of random numbers to represent Loss and Accuracy. During iteration, how can we plot the Loss and Accuracy data in the same chart at the same time?

Answer:

from visdom import Visdom
import numpy as np
import time

# Instantiate the window
viz = Visdom(port=6006)
# Initialize window parameters
viz.line([[0.,0.]], [0.],
         win='train',
         opts=dict(title='loss&acc', legend=['loss','acc'])
         )

for step in range(10):
    loss = 0.2 * np.random.randn() + 1
    acc = 0.1 * np.random.randn() + 0.5
    # Update window data
    viz.line([[loss, acc]], [step], win='train', update='append')
    time.sleep(0.5)

The running result is shown in the figure below:

[Figure: Visdom window showing the loss and acc curves in one chart]

Lesson 16 #

Question: In torch.distributed.init_process_group(backend="nccl"), what backends can be passed to the backend parameter, and how do they differ?

Answer: The backend parameter specifies the communication backend; the options include NCCL, MPI, and Gloo. NCCL is NVIDIA's official multi-GPU communication framework and is relatively efficient. MPI is a communication protocol commonly used in high-performance computing, but it requires an MPI implementation such as OpenMPI to be installed. Gloo is the built-in communication backend, but it is not as efficient.
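
A minimal initialization sketch looks like the following; it assumes the usual environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) have been set, for example by torchrun:

import torch.distributed as dist

# Use "nccl" for multi-GPU training on NVIDIA hardware, "gloo" for CPU-only setups
dist.init_process_group(backend="nccl")
print(dist.get_rank(), dist.get_world_size())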

Lesson 18 #

Question: Your boss wants your model to find as many Geek Time posters as possible online, allowing for some false positives. When training the model, should you focus on precision or recall?

Answer: Focus on recall. Since the goal is to find as many posters as possible and some false positives are acceptable, what matters is missing as few real posters as possible, which is exactly what recall measures.
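
A toy calculation with made-up counts shows the distinction: recall measures how many of the real posters are found, while precision measures how many of the reported hits are real.

tp, fp, fn = 80, 30, 10       # hypothetical true positives, false positives, false negatives
precision = tp / (tp + fp)    # about 0.73: some reported posters are wrong (acceptable here)
recall = tp / (tp + fn)       # about 0.89: most real posters are found (what the boss wants)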

Lesson 19 #

Question: Is it possible to output only one feature map for the kitten segmentation problem discussed in this lesson?

Answer: Yes, it is possible. Since kitten segmentation is a binary classification problem, the single output feature map can be passed through the sigmoid function to obtain a per-pixel probability, which is then thresholded to make the decision.
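
A minimal sketch of the single-feature-map case (the shapes are illustrative):

import torch

logits = torch.randn(1, 1, 256, 256)   # a single output channel: (N, 1, H, W)
prob = torch.sigmoid(logits)           # per-pixel probability of the "kitten" class
mask = (prob > 0.5).long()             # binary segmentation mask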

Lesson 20 #

Question: What are the evaluation metrics for image segmentation?

Answer: The most commonly used metrics are mIoU (mean Intersection over Union) and the Dice coefficient.
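
For binary masks, both metrics can be computed in a few lines; the helper below is a hypothetical sketch, not code from the course:

import torch

def iou_and_dice(pred, target, eps=1e-6):
    # pred and target are boolean masks of the same shape
    inter = (pred & target).sum().float()
    union = (pred | target).sum().float()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + target.sum() + eps)
    return iou.item(), dice.item()

pred = torch.randint(0, 2, (256, 256)).bool()
target = torch.randint(0, 2, (256, 256)).bool()
print(iou_and_dice(pred, target))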

Lesson 21 #

Question: What are the shortcomings of TF-IDF? Try to summarize them with the calculation process in mind.

Answer: TF-IDF assumes that words with a low document frequency are more important, that is, more discriminative. In reality this is not always true. For example, suppose a financial article contains the sentence “股价就跟火箭一样上了天” (the stock price has skyrocketed). The word “火箭” (rocket) would then receive a very high weight, which is clearly wrong. What can we do about it? In general, we can filter words by frequency, for example keeping only words whose frequency exceeds a certain threshold, or we can improve the TF-IDF formula itself. If you are interested, you can search for articles describing the specific improvements.
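
If you want to see the weights concretely, scikit-learn's TfidfVectorizer gives a quick illustration (this assumes a recent scikit-learn is installed, and the tiny corpus is made up):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the stock price went up like a rocket",
    "the rocket launch was delayed",
    "the stock market closed higher today",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))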

Lesson 22 #

Question: What is an appropriate length for word vectors? Is longer always better?

Answer: No. Longer word vectors can represent the position of words in the embedding space more precisely, but they also bring problems such as higher computational complexity and data sparsity. In practice we usually choose lengths such as 64, 128, or 256, and the specific value needs to be determined experimentally. Some papers suggest that the dimension n should be greater than 8.33 log N; whether that rule applies depends on the actual situation.

Lesson 23 #

Question: Using the model trained today, write a function called predict_sentiment. It should take a sentence as input and output the sentiment category and probability of that sentence.

For example:

text = "This film is terrible!"
predict_sentiment(text, model, tokenizer, vocab, device)
'''
Output: ('neg', 0.8874172568321228)
'''

Answer: The reference code is as follows.

# Prediction process
def predict_sentiment(text, model, tokenizer, vocab, device):
    tokens = tokenizer(text)
    ids = [vocab[t] for t in tokens]
    length = torch.LongTensor([len(ids)])
    tensor = torch.LongTensor(ids).unsqueeze(dim=0).to(device)
    prediction = model(tensor, length).squeeze(dim=0)
    probability = torch.softmax(prediction, dim=-1)
    predicted_class = prediction.argmax(dim=-1).item()
    predicted_probability = probability[predicted_class].item()
    predicted_class = 'neg' if predicted_class == 0 else 'pos'
    return predicted_class, predicted_probability

# Load the model
model.load_state_dict(torch.load('lstm.pt'))

text = "This film is terrible!"
predict_sentiment(text, model, tokenizer, vocab, device)

Lesson 24 #

Question: BERT limits its input to a maximum length of 512 tokens. What can be done when processing longer texts?

Answer: This is a very open question. The 512-token maximum is mainly an efficiency consideration, and there are many possible workarounds. For example, we mentioned keyword extraction earlier; another approach is to keep only a certain amount of content from the beginning, the middle, and the end. These are relatively simple methods, though. Do you have any better ideas? Feel free to leave a comment.
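
One of the simple approaches mentioned above can be sketched as a hypothetical helper that keeps tokens from the beginning and the end (special tokens such as [CLS] and [SEP] are ignored for simplicity):

def truncate_head_tail(tokens, max_len=512, head_len=128):
    # Keep the first head_len tokens and enough of the tail so that
    # the total does not exceed max_len
    if len(tokens) <= max_len:
        return tokens
    tail_len = max_len - head_len
    return tokens[:head_len] + tokens[-tail_len:]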

Lesson 25 #

Question: Since BERT was proposed in 2018 it has been hugely successful, and the academic community has put forward a series of related models, such as the BART we are learning today. Please search for other models in the BERT family and read the relevant papers to learn about their principles and characteristics.

Answer: This is an open-ended research question, so there is no single reference answer. Models worth looking into include RoBERTa, ALBERT, and DistilBERT, among others.

Lastly, I wish you a happy Year of the Tiger, progress in your learning, and smooth work!