
18 Image Classification: How to Build an Image Classification Model - Part 2 #

Hello, I am Fang Yuan. Welcome to Lesson 18.

I believe that after studying the previous lesson, you have understood the principles of image classification and have gained some initial knowledge about classic convolutional neural networks.

As the saying goes, “Nothing can compare to hands-on practice.” Today, let us apply the theoretical knowledge from the previous lesson and work through a complete image classification project, from data preparation and model training to model evaluation.

You can download the course code from here.

Problem Review #

Let’s review the problem background first. The problem we need to solve is to automatically identify images containing the Geek Time logo from a large number of images. To achieve automatic recognition, we need to analyze what the images in the dataset look like.

Now let’s take a look at an image that includes the Geek Time logo, as shown below.

Image

As you can see, the logo occupies only a small portion of the whole image. If this were a real project, object detection would therefore be the better fit. However, we can reframe the problem slightly as automatically recognizing Geek Time promotional posters, which makes it a suitable image classification task.

Data Preparation #

Compared to object detection and image segmentation, data preparation for image classification is relatively simple. In image classification, we just need to place each category of images in their respective folders.

The following image shows my organization of the images, where each folder represents a category.

Image

The “logo” folder contains 10 images of Geek Time posters.

Image

As for the “others” folder, in theory, it should contain various types of images. However, to simplify the problem, I have placed only cat images in this folder.

Image

Model Training #

Alright, the data is ready, and now we enter the model training phase.

Today, I’d like to introduce you to a network that has been popular over the past two years: EfficientNet. It provides 8 model versions, B0 through B7, with different parameter sizes, and at a given parameter count EfficientNet achieves top-tier accuracy. These 8 models can therefore cover most of your needs.

EfficientNet #

Let me walk you through the EfficientNet paper. Here, I will focus on its core ideas and my understanding of them; if you have spare time, you can read the original paper after class.

EfficientNet comes in 8 models, B0 through B7, with parameter counts from smallest to largest and correspondingly increasing accuracy. You can check the evaluation metrics further down.

In previous work, people improved a network’s performance by scaling either its depth or its width, but no one had combined these dimensions. EfficientNet made this attempt by exploring the optimal combination of network depth, network width, and image resolution.

EfficientNet uses a compound scaling technique to scale the network’s depth, width, and resolution simultaneously (according to a fixed scaling rule), in order to strike a balance between accuracy and computational cost (FLOPS).

However, even with just these three dimensions, the search space is still huge. Therefore, the authors restricted the scaling to a single baseline network, B0 (proposed in the same paper).

First, the authors compared the effects of scaling each dimension separately. They concluded that scaling network depth, network width, or image resolution alone improves accuracy, but the gains diminish as the scale grows, as shown in the figure below:

Image

Then, the authors conducted a second experiment to vary the width with different combinations of r (resolution) and d (depth), resulting in the following figure:

Image

The conclusion was that achieving higher accuracy and efficiency relied on balancing the scaling factors (d, r, w) of network width, network depth, and image resolution.

Therefore, the authors proposed a compound scaling method, which uses a single coefficient \(\phi\) (the compound coefficient) to determine the scaling factors for all three dimensions:

Depth: \(d = \alpha^{\phi}\)

Width: \(w = \beta^{\phi}\)

Resolution: \(r = \gamma^{\phi}\)

\[s.t.\ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \geq 1,\ \beta \geq 1,\ \gamma \geq 1\]

In the first step, with \(\phi\) fixed at 1 (i.e., targeting roughly double the computation), a grid search was performed to find the best combination: \(\alpha=1.2, \beta=1.1, \gamma=1.15\).

In the second step, with \(\alpha=1.2, \beta=1.1, \gamma=1.15\) fixed, different values of \(\phi\) were used to obtain B1~B7.
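
To make the scaling rule concrete, here is a minimal Python sketch (illustrative only, not code from the paper) that computes the depth, width, and resolution multipliers for a few values of \(\phi\) using the grid-searched coefficients:

# Illustrative compound-scaling arithmetic: the coefficients come from
# the paper's grid search; the loop and printout are just for intuition.
alpha, beta, gamma = 1.2, 1.1, 1.15

for phi in range(4):  # phi = 0 corresponds to the B0 baseline itself
    d = alpha ** phi      # depth multiplier
    w = beta ** phi       # width multiplier
    r = gamma ** phi      # resolution multiplier
    # FLOPS grow roughly by (alpha * beta^2 * gamma^2)^phi ~ 2^phi
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    print(f"phi={phi}: d={d:.3f}, w={w:.3f}, r={r:.3f}, ~FLOPS x{flops:.2f}")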

The overall evaluation results are shown in the following figure:

Image

From the evaluation results, we can see that all versions of EfficientNet surpass some previous classical convolutional neural networks.

EfficientNet v2 has also been proposed, so if you have time, you can take a look at it yourself.

Let’s make use of EfficientNet’s GitHub repo, which includes a demo (examples/imagenet/main.py) for training on ImageNet. Next, let’s walk through its core code together and then simplify it so it can run (Torchvision also provides EfficientNet models, so you can try them yourself after class).

Now let’s recap the 3 key steps of machine learning:

  1. Data pre-processing
  2. Model training (constructing the model, defining the loss function and optimization method)
  3. Model evaluation

Next, we’ll go through these steps one by one. You need to clone https://github.com/lukemelas/EfficientNet-PyTorch first; we’ll only use the contents of the efficientnet_pytorch directory, which contains the model’s network structure.

Before we start, let me list the parameters the program needs; we will use them directly in the explanation below. When implementing the code later, you need to add these parameters to the program (you can use the argparse module).

Image
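
In case the parameter screenshot does not render for you, here is a minimal argparse sketch covering the parameters that the later code actually references; the default values are my own assumptions, not official ones.

import argparse

# Hypothetical defaults; adjust them to your own setup.
parser = argparse.ArgumentParser(description='EfficientNet logo classifier')
parser.add_argument('--arch', default='efficientnet-b0')
parser.add_argument('--train_data', default='./data/train')
parser.add_argument('--image_size', type=int, default=224)
parser.add_argument('--batch_size', type=int, default=4)
parser.add_argument('--workers', type=int, default=2)
parser.add_argument('--epochs', type=int, default=10)
parser.add_argument('--lr', type=float, default=0.01)
parser.add_argument('--momentum', type=float, default=0.9)
parser.add_argument('--weight_decay', type=float, default=1e-4)
parser.add_argument('--pretrained', action='store_true')
parser.add_argument('--advprop', action='store_true')
parser.add_argument('--save_interval', type=int, default=1)
parser.add_argument('--checkpoint_dir', default='./checkpoints')
args = parser.parse_args()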

Okay, let’s get started.

Loading Data #

First, let’s talk about loading the data. We create a dataset.py file to hold everything related to the data. Here’s the content of dataset.py:

from torchvision import datasets, transforms

# Standardization method provided by the author (advprop preprocessing):
# it maps pixel values from [0, 1] to [-1, 1]
def _norm_advprop(img):
    return img * 2.0 - 1.0

def build_transform(dest_image_size):
    normalize = transforms.Lambda(_norm_advprop)

    if not isinstance(dest_image_size, tuple):
        dest_image_size = (dest_image_size, dest_image_size)

    transform = transforms.Compose([
        transforms.RandomResizedCrop(dest_image_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize
    ])

    return transform

def build_data_set(dest_image_size, data):
    transform = build_transform(dest_image_size)
    dataset = datasets.ImageFolder(data, transform=transform, target_transform=None)

    return dataset

This code builds the dataset through build_data_set. Here, we use torchvision.datasets.ImageFolder, which generates a dataset from images organized into per-class folders.

In this example, I pass the training set path as './data/train', which you can see in the screenshot above.

ImageFolder automatically assigns labels based on the folder names: all the data in the ‘logo’ folder is treated as one class, and all the data in the ‘others’ folder as another.
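
You can verify the label assignment through the dataset’s class_to_idx attribute. A quick check (assuming the folder layout shown above; ImageFolder sorts the class folders alphabetically):

from dataset import build_data_set

dataset = build_data_set(224, './data/train')
print(dataset.class_to_idx)  # expected: {'logo': 0, 'others': 1}
print(len(dataset))          # total number of training images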

We only build the training set in this simplified version. When you read the official EfficientNet code, pay attention to the transform used for the validation set there.

I think there’s a problem here, because the size parameter of Resize behaves differently depending on whether it’s a tuple or an integer. If it’s a tuple, the image is resized to exactly that size. If it’s an integer, the shorter edge is matched to it; for example, if the height is greater than the width, the image is resized to (size * height/width, size).

In the author’s original code, image_size is an integer, not a tuple. So for images with a large aspect ratio, this approach of first resizing and then cropping does not produce good results.

Let’s verify this idea. I set image_size to 224 for the poster image from the beginning and processed it using the method above. Here’s the resulting image.

Image

As you can see, a lot of information is missing.
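
For reference, here is a minimal sketch of the resize-then-crop processing being discussed, assuming the poster is saved as poster.jpg (a hypothetical path):

from PIL import Image
from torchvision import transforms

# Resize(224) with an integer scales the SHORTER edge to 224 while
# keeping the aspect ratio; CenterCrop(224) then cuts a 224x224 patch
# out of the middle, discarding most of a wide poster.
transform = transforms.Compose([
    transforms.Resize(224, interpolation=Image.BICUBIC),
    transforms.CenterCrop(224),
])

img = Image.open('poster.jpg')  # hypothetical path to the example poster
transform(img).save('poster_cropped.jpg')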

Therefore, to use the author’s code in our example, we need some modifications: first convert image_size to a tuple if it isn’t one already, and skip the cropping. Here’s the modified code:

from PIL import Image

if not isinstance(image_size, tuple):
    image_size = (image_size, image_size)

transform = transforms.Compose([
    transforms.Resize(image_size, interpolation=Image.BICUBIC),
    transforms.ToTensor(),
    normalize,
])


The main training program is defined in main.py. In its main() function we load the data, then train the model by calling the train function once per epoch, as shown below.

# Importing some modules
import os

import torch

from efficientnet_pytorch import EfficientNet
from dataset import build_data_set

def main():
    # part1: Load the model (to be added later)
    # part2: Define the loss function and optimization method (to be added later)
    train_dataset = build_data_set(args.image_size, args.train_data)

    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=args.workers,
    )

    for epoch in range(args.epochs):
        # Call the train function for training (to be added later)
        train(train_loader, model, criterion, optimizer, epoch, args)
        # Save a checkpoint every save_interval epochs
        if epoch % args.save_interval == 0:
            if not os.path.exists(args.checkpoint_dir):
                os.mkdir(args.checkpoint_dir)
            torch.save(model.state_dict(), os.path.join(args.checkpoint_dir,
                'checkpoint.pth.tar.epoch_%s' % epoch))


Creating the Model #

Next, let’s see how to create the model. In the part1 section marked in the code comments above, we load the EfficientNet model with the following code:

args.classes_num = 2
if args.pretrained:
    model = EfficientNet.from_pretrained(args.arch, num_classes=args.classes_num,
                                         advprop=args.advprop)
    print("=> using pre-trained model '{}'".format(args.arch))
else:
    print("=> creating model '{}'".format(args.arch))
    model = EfficientNet.from_name(args.arch, override_params={'num_classes': args.classes_num})
# Add model.cuda() here if you have a GPU
    
This code means: if the pretrained parameter is True, the pretrained weights are automatically downloaded and loaded for training; otherwise, the model is initialized with random numbers. In both from_pretrained and from_name, we need to set num_classes to the number of classes in our project (2: logo and others), passed in through args.classes_num.

Finetuning the Model #

We have mentioned the concept of finetuning the model in [Lesson 8](https://time.geekbang.org/column/article/431420) and [Lesson 14](https://time.geekbang.org/column/article/442442), which is an important concept. Let's review it together.

A pretrained model is usually trained on ImageNet (or sometimes on other public datasets such as COCO or VOC). We can take the parameters trained on ImageNet as a starting point and continue training our own model on top of them; this is called model finetuning.

Therefore, **if there is a pretrained model, we will definitely use it for training, as it will converge faster**.

When using a pretrained model, there is one thing to note: the fully connected layer trained on ImageNet has 1000 output nodes. Therefore, when using a pretrained model, we load all the parameters except those of the fully connected layer.

In the EfficientNet.from_pretrained code mentioned above, the load_pretrained_weights function is called with load_fc set to False whenever the number of classes is not 1000:

load_pretrained_weights(model, model_name, load_fc=(num_classes == 1000), advprop=advprop)

The load_pretrained_weights function contains the following code. As mentioned, if the fully connected layer is not to be loaded, the weight and bias of the _fc layer are removed from the state dict first:

if load_fc:
    ret = model.load_state_dict(state_dict, strict=False)
    assert not ret.missing_keys, 'Missing keys when loading pretrained weights: {}'.format(ret.missing_keys)
else:
    state_dict.pop('_fc.weight')
    state_dict.pop('_fc.bias')
    ret = model.load_state_dict(state_dict, strict=False)


Setting the Loss Function and Optimization Method #

Finally, we need to define the loss function and optimization method. We will add the following code to the part2 section:

criterion = nn.CrossEntropyLoss() # Add .cuda() if you have a GPU

optimizer = torch.optim.SGD(model.parameters(), args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay)


After this, all the preparations for training are complete. We just need to add the train function, shown below. The principles behind this code were explained in [Lesson 13](https://time.geekbang.org/column/article/438639); if you don't remember them, you can go back and review.

def train(train_loader, model, criterion, optimizer, epoch, args):
    # switch to train mode
    model.train()

    for i, (images, target) in enumerate(train_loader):
        # compute output
        output = model(images)
        loss = criterion(output, target)
        print('Epoch ', epoch, loss)

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


In my program, I saved the model at several epochs. How should we choose among these checkpoints? This brings us to the model evaluation phase.

Model Evaluation #

There are many evaluation metrics for classification models, such as accuracy, precision, recall, and F1-Score. Among them, **I think the most intuitive and persuasive metrics are precision and recall**, which are also the main metrics I observed in the project. Let's take a look at them one by one.

Confusion Matrix #

Before explaining precision and recall, let's look at the concept of a confusion matrix; precision and recall are in fact calculated from it. The table below is a confusion matrix, where the positive class is logo and the negative class is others.

Image

Based on the combination of predicted results and true classes, there are four scenarios:

1. TP represents true positive, which means the true class is Logo and the model predicts it as Logo;
2. FP represents false positive, which means the true class is Others but the model predicts it as Logo;
3. FN represents false negative, which means the true class is Logo but the model predicts it as Others;
4. TN represents true negative, which means the true class is Others and the model predicts it as Others.

The calculation method for precision is:

\[precision = \frac{TP}{TP + FP}\]

The calculation method for recall is:

\[recall = \frac{TP}{TP + FN}\]

Precision and recall measure different aspects of the model's performance. Precision tells us, when the model predicts a picture as Logo, the probability that it really is a Logo picture. Recall measures how many of the Logo pictures in the whole validation set the model manages to find.
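
As a quick sanity check of the two formulas, here is a small helper that computes both metrics from the confusion-matrix counts (the numbers are the logo-class counts from the evaluation shown further below):

def precision_recall(tp, fp, fn):
    # precision: of everything predicted as Logo, how much really is Logo
    precision = tp / (tp + fp)
    # recall: of all real Logo pictures, how many the model found
    recall = tp / (tp + fn)
    return precision, recall

# Counts taken from the logo-class evaluation results shown below
print(precision_recall(tp=8, fp=0, fn=2))  # (1.0, 0.8)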

Now the question arises, how do we choose a model based on these two metrics? Different business needs focus on different metrics.

For example, in our project, suppose the boss can tolerate some Logo pictures going undetected, but whenever the model says a picture is Logo it must very likely be right; in that case, we should focus on precision. If instead the boss wants to catch as many Logo pictures online as possible, even at the cost of some false alarms, then we should focus on recall.

When it comes to calculating precision and recall, let me share some experience. In practical projects, I usually save the model's prediction for every picture into a txt file, which makes it much easier to screen the model's bad cases. Additionally, if the validation set is very large and needs adjusting, you can simply edit the txt file instead of asking the model to predict the whole validation set again.
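
For illustration, here is a hedged sketch of how such a file could be produced, in the six-column format described next; model and dataset (an ImageFolder built on the validation set) are assumptions, as is my reading of the column layout:

import torch
import torch.nn.functional as F

def save_predictions(model, dataset, out_path='predictions.txt'):
    # Sketch: iterate over an ImageFolder one image at a time so each
    # prediction can be paired with its file path (dataset.samples).
    # Assumes logo is class 0, as class_to_idx showed earlier.
    model.eval()
    with open(out_path, 'w') as f, torch.no_grad():
        for (image, target), (path, _) in zip(dataset, dataset.samples):
            probs = F.softmax(model(image.unsqueeze(0)), dim=1)[0]
            pred = int(probs.argmax())
            # columns: p_logo p_others true-one-hot pred-one-hot path
            f.write('%.5f %.5f %d %d %d %d %s\n' % (
                probs[0], probs[1],
                target == 0, target == 1,
                pred == 0, pred == 1,
                path))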

Below is part of the txt file. Each line records the probabilities of the logo class and the others class, whether the true class is logo or others, whether the predicted class is logo or others, and the filename.

14.jpeg is the poster from the example at the beginning; the model gives it a probability of 0.58476 of being logo and 0.41524 of being others.

...
0.64460 0.35540 1 0 1 0 ./data/val/logo/13.jpeg
0.58476 0.41524 1 0 1 0 ./data/val/logo/14.jpeg
...
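
As a sketch of how such a file can be reused, the snippet below parses the six numeric columns and recomputes the logo-class precision and recall. The column layout (logo probability, others probability, one-hot true label, one-hot predicted label, file path) is my reading of the format above, so double-check it against your own file.

# Hedged sketch: recompute logo-class metrics from a saved predictions file.
tp = fp = fn = 0
with open('predictions.txt') as f:  # hypothetical file name
    for line in f:
        parts = line.split()
        if len(parts) != 7:
            continue  # skip blank or malformed lines
        true_logo = parts[2] == '1'  # one-hot true label, logo column
        pred_logo = parts[4] == '1'  # one-hot predicted label, logo column
        if pred_logo and true_logo:
            tp += 1
        elif pred_logo and not true_logo:
            fp += 1
        elif true_logo and not pred_logo:
            fn += 1

print('precision:', tp / (tp + fp) if (tp + fp) else 0.0)
print('recall:', tp / (tp + fn) if (tp + fn) else 0.0)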

The following image shows the evaluation results of the B0 model trained for 10 epochs on the validation set (here, I used the training set as the validation set).

Image

From the confusion matrix, we can see that 8+0 pictures in the validation set were predicted as the logo class, so the precision of the logo class is 8 / (8 + 0) = 1. There are 8+2 logo pictures in total, 2 of which were misclassified, so the recall is 8 / (8 + 2) = 0.8.

The calculation for the others class is similar. You can try calculating it yourself.

Summary #

Congratulations on completing today’s learning task. Today we finished a hands-on image classification project. Although relatively small in scale, it covers every aspect of a real project: small, but complete.

Now let’s review the key points and practical experience in each aspect.

Data preparation is actually the most crucial step, and the quality of the data directly determines the quality of the model. Therefore, before starting the training, you should have a thorough understanding of your dataset. For example, whether the validation set can reflect the training set, whether there is any dirty data in the dataset, whether the data distribution is biased, etc.

After completing the data preparation, it’s time to move on to model training. Image classification tasks usually use mainstream convolutional neural networks, and the model structure is rarely modified.

In the final stage, model evaluation, you should focus on the business scenario: does the business need high precision or high recall? Then adjust your model accordingly.

Thought Questions #

The boss wants your model to find as many Geek Time posters online as possible, allowing for some false alarms. When training the model, should you focus on precision or recall?

I recommend that you try implementing today’s demo yourself. I also encourage you to share this lesson with more colleagues and friends, so you can learn and progress together.