17 Image Classification Principles and Models of Image Classification - Part 1 #

Hello, I’m Fang Yuan. Welcome to the study of image classification.

Through the previous lessons, we have built a solid understanding of PyTorch and the deep learning concepts behind it. To move from theory to practice, we will now focus on the most commonly used deep learning applications in computer vision and natural language processing, and explore how they are implemented in industry.

After completing this module, I believe you will not only consolidate your previous knowledge, but also be able to apply it to specific domains to analyze and solve problems.

Speaking of computer vision, one very common application is image classification. In fact, image classification is not far from our daily lives. Have you noticed that many smartphones automatically label the content of photos when taking pictures?

For example, take a look at the screenshot below. It shows that when I took a photo with my smartphone, it automatically recognized the content through the camera and labeled it as “cloudy”.

Image

You may also notice that smartphones can recommend beautification options based on the recognized content. So how is this achieved? It turns out that this is one of the most commonly used, widely applied, and fundamental applications of convolutional neural networks: image classification.

Today, we will dig into what image classification is all about. Over this lesson and the next, I will guide you through it: in this lesson we focus on the theory, the principles of image classification and the common convolutional neural network models; in the next lesson, building on today's material, we will build a complete image classification project.

Principles of Image Classification #

Let’s pick up the scenario from the earlier NumPy lesson: a large number of images are uploaded online every day, and your boss has asked you to design a model that automatically detects the images containing the GeekTime logo.

To translate this requirement, we need to establish an image classification model that can automatically recognize images with the GeekTime logo.

Let’s clarify what the model should do. It receives an image as input and outputs a pair of probabilities: the probability that the image contains the logo and the probability that it belongs to any other category. From these probabilities, we can decide whether the image belongs to the Logo class or the Other class. The diagram below illustrates this process:

Image

Perceptron #

Let’s further break down the model and see how we can obtain such an output.

The input image, denoted as X, can be flattened to obtain X = {x_1, x_2, …, x_n}. The model can be regarded as having two output nodes, one scoring the input image as Logo and the other as Other. At this point, however, the outputs are not yet probabilities; they are just a pair of numbers. The structure is shown in the figure below:

Image

This structure is actually a perceptron. The green node in the middle is called a neuron, which is the basic unit of a perceptron. In the perceptron shown above, there is only one layer of neurons (green nodes). If there are multiple layers of neurons, it is called a multilayer perceptron.

So what is a neuron? A neuron performs a linear transformation of its inputs. Each input x_j has a corresponding weight, giving neuron i the weights w_i1, w_i2, …, w_in. The output y_i in the figure is computed as follows:

\[y_i = \delta(w_{i1}x_1 + w_{i2}x_2 + \dots + w_{in}x_n + b_i), \quad i = 1, 2\]

Here, w_i1, w_i2, …, w_in are the weights of the neuron and b_i is its bias; both are parameters learned by the model. δ denotes the activation function, which is optional.
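To make the formula concrete, here is a minimal sketch (the input and weight values are made up for illustration) that computes y_1 and y_2 for a 3-dimensional input by hand and then reproduces the result with nn.Linear:

import torch
import torch.nn as nn

# made-up input with n = 3 elements
x = torch.tensor([0.5, -1.0, 2.0])

# made-up weights w_i1..w_i3 and biases b_i for the two neurons (i = 1, 2)
W = torch.tensor([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6]])
b = torch.tensor([0.1, -0.1])

# y_i = w_i1*x_1 + w_i2*x_2 + w_i3*x_3 + b_i (no activation here)
y_manual = W @ x + b
print(y_manual)
# Output: tensor([0.5500, 0.8000])

# the same computation with a fully connected layer
fc = nn.Linear(3, 2)
with torch.no_grad():
    fc.weight.copy_(W)
    fc.bias.copy_(b)
print(fc(x))  # same values as the manual computation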

How can we convert a set of values, y_1 and y_2, into a set of corresponding probabilities? This is where the Softmax function comes in. Its role is to convert a set of values into corresponding probabilities, with the probabilities summing up to 1.

The calculation formula for Softmax is as follows:

\[\delta(x_j) = \frac{e^{x_j}}{\sum_{k=1}^{m} e^{x_k}}\]

Please take a look at the code below. We use the Softmax function to transform the original output y into a set of probabilities:

import torch
import torch.nn as nn

# The values of y for 2 neurons are
y = torch.randn(2)
print(y)
# Output: tensor([0.2370, 1.7276])
m = nn.Softmax(dim=0)
out = m(y)
print(out)
# Output: tensor([0.1838, 0.8162])

After applying Softmax, the original outputs y are transformed into a set of probabilities that sum to 1, and the largest value of y maps to the highest probability.

Of course, Softmax is not required for every problem; depending on the task, we can use a different function. For example, for a binary output we can use the sigmoid activation function, which maps a single value to a probability between 0 and 1, as the sketch below shows.
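For a binary task like ours, a minimal sketch of the sigmoid alternative looks like this (the raw value below is made up): a single output neuron passes through sigmoid to give the Logo probability, and the Other probability is simply 1 minus it.

import torch
import torch.nn as nn

# a single made-up raw output value
y = torch.tensor([0.8])

prob_logo = nn.Sigmoid()(y)
print(prob_logo)
# Output: tensor([0.6900])
print(1 - prob_logo)  # probability of the Other class
# Output: tensor([0.3100])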

Now, let’s add the above process to our previous model, as shown in the diagram below:

Image

Fully Connected Layer #

Actually, the diagram above represents the principle of image classification. Specifically, the green layer is called the fully connected layer or fc layer for short. It is usually located at the end of the network and is used to obtain the final output, which is the probabilities of different categories.

Because the number of neurons in the fully connected layer is fixed, networks that use fully connected layers require input images of a fixed size. In reality, however, images collected online come in all sorts of sizes, so we first need to resize them to a unified size before PyTorch can process them further.
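As a small sketch of how this resizing is usually done, torchvision's transforms module can resize an image and convert it to a tensor (the file name here is just a hypothetical placeholder, and 128x128 matches the example that follows):

from PIL import Image
from torchvision import transforms

# resize any input image to a fixed 128x128 and convert it to a tensor
preprocess = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])

img = Image.open("some_photo.jpg")  # hypothetical image file
x = preprocess(img)
print(x.shape)
# Output: torch.Size([3, 128, 128]) for an RGB image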

Let’s assume that we resize the input images to 128x128 and see how the inference process of the fully connected layer is implemented in PyTorch:

x = torch.randint(0, 255, (1, 128*128), dtype=torch.float32)
fc = nn.Linear(128*128, 2)
y = fc(x)
print(y)
# Output: tensor([[ 72.1361, -120.3565]], grad_fn=<AddmmBackward>)
# Note that the shape of y is (1, 2)
output = nn.Softmax(dim=1)(y)
print(output)
# Output: tensor([[1., 0.]], grad_fn=<SoftmaxBackward>)

From the code, we can see that the fully connected layer in PyTorch is implemented using nn.Linear. Let’s take a look at some important parameters:

  • in_features: The number of input features, which is 128x128 in this example.
  • out_features: The number of output features, which is 2 in this example.
  • bias: Whether a bias term is needed. The default value is True.

Note that in a real classification network, the input to the fully connected layer is not the raw image data (as in the toy example above), but the features extracted by multiple convolutional layers.

That said, some networks can accept inputs of any scale. In the design we just discussed, the inputs x1 to xn of the fully connected layer are fixed in number and equal to the number of elements in the last feature map. The diagram below illustrates this:

Image

By slightly modifying the structure mentioned above, we can make the network accept inputs of any scale. We just need to add a global average pooling after the last feature map. This means averaging each feature map and replacing it with the average value. This way, the amount of data entering the fully connected layer is fixed regardless of the input scale.

As shown in the diagram below, the yellow circle represents the result of global average pooling:

Image
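Here is a minimal sketch of that idea (the layer sizes are made up): after a convolutional layer, global average pooling (nn.AdaptiveAvgPool2d in PyTorch) always produces one value per feature map, so the same fully connected layer works no matter what the input resolution is.

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # made-up feature extractor
gap = nn.AdaptiveAvgPool2d((1, 1))                 # global average pooling
fc = nn.Linear(16, 2)                              # Logo vs Other

for size in [(128, 128), (224, 224)]:              # two different input scales
    x = torch.randn(1, 3, *size)
    feat = gap(conv(x)).flatten(1)                 # always shape (1, 16)
    print(feat.shape, fc(feat).shape)
# Output: torch.Size([1, 16]) torch.Size([1, 2])  (for both input sizes)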

In the next lesson, we will introduce EfficientNet, which uses this approach to allow the network to be trained with images of any scale.

Convolutional Neural Networks #

The multilayer perceptron mentioned earlier is actually the predecessor of Convolutional Neural Networks (CNNs). Because of its shortcomings (a large number of parameters and difficulty in training), progress stalled for a time, until the emergence of CNNs broke the deadlock.

The main job of a CNN is to extract rich information from the input image and feed it into higher-level applications, such as the image classification task described above. Applying a CNN to the classification principle we just derived gives the model shown in the following figure:

Image

You need to pay attention to the definition of each layer in the diagram, as different layers have different names.

In the above diagram, the focus of the entire model or network is on the CNN block, so this is also our main focus.
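To tie the pieces together, here is a minimal sketch of the pipeline in the diagram. The CNN block here is only a toy two-layer stack made up for illustration, not a real model: convolutional layers extract features, a fully connected layer produces the raw scores, and Softmax turns them into the Logo/Other probabilities.

import torch
import torch.nn as nn

model = nn.Sequential(
    # toy CNN block: two conv layers that downsample the image
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),   # 128 -> 64
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 64 -> 32
    nn.ReLU(),
    nn.Flatten(),                # 16 * 32 * 32 features per image
    nn.Linear(16 * 32 * 32, 2),  # fully connected layer: Logo vs Other
)

x = torch.randn(1, 3, 128, 128)  # one fake 128x128 RGB image
probs = nn.Softmax(dim=1)(model(x))
print(probs)  # two probabilities that sum to 1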

So how do we find a suitable CNN? In practical work, we rarely design a neural network from scratch (because there are too many uncontrollable variables), but directly use networks designed by experts. With so many network models, how can we verify the reliability and usability of the networks proposed by these experts?

ImageNet #

In the industry, there is a benchmark called ImageNet, which is used to evaluate the quality of proposed models.

ImageNet itself contains a very large dataset, and since 2010, it has held the famous ImageNet Large Scale Visual Recognition Challenge (ILSVRC) annually, which includes tasks such as image classification, object detection, and image segmentation.

In the image classification competition, a large dataset with 1000 categories is used. Any model that stands out in this competition is considered a classic network structure, and these networks are basically our preferred choices for practical projects.

Since 2012, with the development of deep learning, almost every year, very classic network structures have been born. The table below shows the Top-5 error rates on ImageNet over the years.

You may wonder if it is really necessary to understand so many network models.

My answer is that being well prepared pays off, and machine learning has always been a research-driven field. In practice we rarely create a network model from scratch; we usually customize a classic design, so it is best to have an understanding of these classic networks.

Next, let’s select a few classic neural networks to take a look at.

VGG #

VGG achieved excellent results in ILSVRC 2014, finishing 2nd in the classification task and 1st in the localization task.

That year, the VGG paper presented six network configurations, labeled A through E (one of them an A variant with local response normalization); the letters essentially indicate different depths. Although VGG19 performed best, VGG16 is used more often in practical projects because of its overall model size and other trade-offs. You can find the specific network structures in the paper.

Let’s take a look at some key breakthroughs of VGG:

  1. It demonstrated that as the depth of the model increases, the performance of the model also improves.
  2. It used stacks of small 3x3 convolutions to replace the large convolution kernels (such as 11x11, 7x7, and 5x5) used in earlier networks like AlexNet.

Regarding the second point, VGG replaces a 5x5 convolution with two stacked 3x3 convolutions and a 7x7 convolution with three stacked 3x3 convolutions. This reduces the number of parameters and, while keeping the same receptive field, increases the depth of the network, allowing it to extract richer nonlinear features.
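A quick back-of-the-envelope check of the savings (ignoring bias terms and assuming C input and C output channels): one 5x5 convolution has 25 x C^2 weights, while two stacked 3x3 convolutions covering the same 5x5 receptive field have 2 x 9 x C^2 = 18 x C^2; similarly, a 7x7 convolution has 49 x C^2 weights versus 27 x C^2 for three 3x3 layers. The snippet below simply counts the parameters with PyTorch to confirm, using a made-up channel count of 64:

import torch.nn as nn

C = 64  # assume C input channels and C output channels

def n_params(m):
    return sum(p.numel() for p in m.parameters())

conv5x5 = nn.Conv2d(C, C, kernel_size=5, bias=False)
two_3x3 = nn.Sequential(nn.Conv2d(C, C, 3, bias=False),
                        nn.Conv2d(C, C, 3, bias=False))

print(n_params(conv5x5))
# Output: 102400  (25 * 64 * 64)
print(n_params(two_3x3))
# Output: 73728   (18 * 64 * 64)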

GoogLeNet #

The champion of the 2014 classification task was GoogLeNet, the same competition in which VGG took second place. The core of GoogLeNet is the Inception module; the version used at the time was Inception v1, which was later followed by v2, v3, and v4.

Let’s first look at what problem GoogLeNet solved. Researchers found that for images of the same category, the size of the main object varies in different images, as shown in the figure below.

If AlexNet or VGG are used with standard convolution, each layer can only extract features from the image using convolution kernels of the same size.

However, as shown in the above figure, objects may appear in different sizes in the image. Can we extract different features using convolution at different scales? Based on this idea, the Inception module was developed, as shown in the figure:

From the figure, we can see that instead of extracting features with a single kernel size, the module applies 1x1, 3x3, and 5x5 convolutions together with 3x3 max pooling in parallel and then concatenates the results. In this way, features are extracted from the image at multiple scales.

To reduce the computational cost of the network, the authors improved the module by adding a 1x1 convolution before the 3x3 and 5x5 convolutions and a 1x1 convolution after the pooling to reduce dimensionality. This gives the final form of the Inception module, sketched below.
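Below is a minimal sketch of an Inception-v1-style module along these lines (the branch channel numbers are made up, not the ones from the paper, and activations are omitted for brevity): 1x1 convolutions reduce the channel count before the 3x3 and 5x5 branches and after the pooling branch, and the four branch outputs are concatenated along the channel dimension.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        # branch 1: plain 1x1 convolution
        self.b1 = nn.Conv2d(in_ch, 16, 1)
        # branch 2: 1x1 reduction, then 3x3 convolution
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.Conv2d(16, 32, 3, padding=1))
        # branch 3: 1x1 reduction, then 5x5 convolution
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 8, 1), nn.Conv2d(8, 16, 5, padding=2))
        # branch 4: 3x3 max pooling, then 1x1 reduction
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1), nn.Conv2d(in_ch, 16, 1))

    def forward(self, x):
        # concatenate the four branches along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = InceptionBlock(64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)
# Output: torch.Size([1, 80, 32, 32])  (16 + 32 + 16 + 16 channels)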

Here is an extra tip: in interviews you will often be asked why 1x1 convolutions are used, or what their purpose is. A 1x1 convolution is used to increase or decrease the channel dimension.

GoogLeNet is a 22-layer network composed of the Inception modules described above. Despite having 22 layers, it has fewer parameters than AlexNet and VGG, which means the model is small and takes up less storage space. For the specific network structure, you can refer to the paper.

ResNet #

ResNet stands for Residual Neural Network. In the 2015 ImageNet competition, the model’s classification performance surpassed human-level accuracy, with a top-5 error rate of 3.57% for 1,000-class image classification.

In their paper, the authors presented ResNet models with 18, 34, 50, 101, and 152 layers. The 101-layer and 152-layer ResNet models achieved the best results, but due to hardware and inference time limitations, the 50-layer ResNet model is more commonly used in practical projects.
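If you just want to experiment with ResNet-50, torchvision ships a ready-made implementation; here is a minimal sketch (whether and how pretrained weights are loaded depends on your torchvision version, so this sketch uses random initialization):

import torch
from torchvision import models

model = models.resnet50()        # randomly initialized ResNet-50
model.eval()

x = torch.randn(1, 3, 224, 224)  # one fake ImageNet-sized image
with torch.no_grad():
    print(model(x).shape)
# Output: torch.Size([1, 1000]), one score per ImageNet class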

If you are interested, you can read the full paper for the specific network structures. Here, I will focus on the main breakthroughs of this network.

Network Degradation Problem #

Although research has shown that increasing the depth of a network can improve its overall performance, simply increasing the network depth can lead to two problems: overfitting and the vanishing/exploding gradient problem.

Although these two problems can be solved with continued research, the authors of the ResNet network found that even after avoiding these two problems, simply stacking convolutional layers still did not yield good results.

To verify this observation, the authors conducted an experiment by building a regular 20-layer convolutional neural network and a 56-layer convolutional neural network, and tested them on the CIFAR-10 dataset. The 56-layer network performed worse than the 20-layer network in terms of both training set and test set error rates. The following figure is from the paper.

This phenomenon is known as network degradation.

Network degradation refers to the phenomenon where, as a network’s depth increases, its accuracy saturates and then degrades rapidly, even though the network still converges. The drop in accuracy is not caused by overfitting: if it were, the 56-layer network’s accuracy on the training set would be higher than the 20-layer network’s, whereas in the experiment it is actually lower.

The authors considered this phenomenon unreasonable. Assuming the 20-layer network is already optimal, the extra 36 layers of the deeper network should, in theory, be able to learn an identity mapping, so the 56-layer network should be no worse than the 20-layer one. The authors therefore speculated that plainly stacked layers cannot easily learn an identity mapping (a mapping where f(x) = x).

Residual Learning #

As the degradation problem shows, simply stacked convolutional layers have difficulty learning an identity mapping. To address network degradation, the authors proposed a framework called deep residual learning.

Since the network does not easily learn an identity mapping on its own, the authors added one explicitly, as shown in the following figure (figure from the paper).

This was achieved by using a mechanism called a shortcut connection. In a residual neural network, the shortcut connection is an identity transform, as shown in the figure with the “x identity” curve. The set of layers that includes the shortcut connection is called a residual block.

A residual block is defined as follows:

\[y = F(x, W_i) + x\]

F can be a stack of two or three convolutional layers. In the end, the authors found that residual blocks made it possible to train deeper and better convolutional neural networks.
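Here is a minimal sketch of a two-layer residual block along the lines of the formula (channel counts are made up, and the projection shortcut used when dimensions change is omitted): F(x) is two 3x3 convolutions, and the input x is added back before the final activation.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with batch normalization
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # y = F(x, W_i) + x, the shortcut connection

block = ResidualBlock(64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)
# Output: torch.Size([1, 64, 32, 32])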

Summary #

Congratulations on completing this lesson! Let’s review the main content of this lesson.

We started with the Multilayer Perceptron, introducing you to the predecessor of Convolutional Neural Networks. Then, we derived the basic model for image classification. It’s important to note that the focus of the entire model or network is on the Convolutional Neural Network, so that’s where our focus should be as well.

Image

Next, we looked at some classic network structures, such as VGG, GoogLeNet, and ResNet, based on the evaluation results of the industry benchmark ImageNet. To help you quickly grasp the key points, I explained each network’s problem-solving approach and breakthroughs. I also recommend that you spend more time reading related papers to gain a more detailed and in-depth understanding.

Looking at the development of these network structures, each new generation improves on the last. Mastering them will put you in a good position for what comes next in deep learning. In the next lesson, we will work on a practical image classification project together to deepen your understanding. Stay tuned!

Reflection Questions #

Which neural network models from recent years have impressed you? Feel free to recommend them.

Feel free to communicate and interact with me in the comments section. Also, I encourage you to share this lesson with more colleagues and friends.