
18 GPU Machine Learning Ready-to-Use #

Introduction to ECI GPU #

[Figure: ECI GPU instances providing GPU resources to user containers]

Compared to regular ECI instances, ECI GPU instances provide user containers with GPU resources to accelerate tasks such as machine learning, as shown in the figure above. ECI GPU instances come with GPU drivers pre-installed, eliminating the need for users to install and maintain them. They are also compatible with the CRI interface, so Kubernetes can schedule and orchestrate ECI GPU instances directly. Furthermore, by using official container images, users don't need to worry about setting up and deploying the CUDA Toolkit, TensorFlow, PyTorch, and so on, and can focus on developing and implementing their specific business functions.

With ECI GPU instances, users can deploy and run GPU-accelerated machine learning tasks with just a few clicks, making it simple and convenient.

Basic Implementation Principle of ECI GPU #

As we know, containers generally access host resources through kernel interfaces. However, containers cannot directly access GPU resources through kernel interfaces; instead, they can only interact with GPUs through vendor drivers.

So, how does an ECI GPU instance enable user containers to access GPU resources? Essentially, when a user container is created, ECI GPU mounts the necessary GPU driver dynamic libraries into it, allowing the container to access the GPUs on the host side through those mounted libraries.

[Figure: Basic implementation framework of ECI GPU]

The basic implementation framework of ECI GPU is shown in the figure above; all components represented by boxes run on the ECI HostOS side. ContainerAgent is a self-developed component, comparable to the kubelet, that accepts instructions from the control plane. nvidia-container-runtime-hook, in the upper-right corner, is NVIDIA's open-source implementation of a prestart hook conforming to the OCI standard; a prestart hook performs custom configuration before the user-specified command is executed in the container. libnvidia-container, in the middle-right position, is another open-source NVIDIA component; it mounts the host-side GPU driver libraries into the specified containers.

Let’s briefly introduce the container startup process in ECI GPU:

  1. containerd receives the command from ContainerAgent to create a container that uses GPUs and forwards it to runc.
  2. When runc creates the container, it invokes the prestart hook, nvidia-container-runtime-hook.
  3. The hook calls libnvidia-container to mount the necessary GPU driver dynamic libraries, such as libcuda.so and libnvml.so, into the container.
  4. After the container is created, the user container process can access and use GPU resources through those mounted libraries (a quick way to verify this is sketched below).
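
As a quick sanity check of this mechanism, you can look for the mounted driver libraries from inside a running GPU container. This is a minimal sketch, not part of the official workflow; it assumes the container's linker cache has been refreshed (libnvidia-container normally runs ldconfig after mounting):

```bash
# Inside a GPU container: the host-side driver libraries mounted by
# libnvidia-container should appear in the dynamic linker cache.
# (libnvidia-ml.so is the on-disk soname of the NVML library.)
ldconfig -p | grep -E 'libcuda|libnvidia-ml'
```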

How to Use ECI GPU #

[Figure: YAML fields for using GPUs, with the table of available ECI GPU instance specifications]

Currently, to use GPUs in ACK/ASK clusters, you only need to specify two fields in the YAML file, as indicated in the red box in the above figure.

The first field is k8s.aliyun.com/eci-use-specs, which specifies the ECI GPU instance specification. The available ECI GPU instance specifications on Alibaba Cloud are listed in the table on the left side of the figure.

The second field is nvidia.com/gpu, which specifies the number of GPUs the container will use. Note that the total number of GPUs requested by all containers in the spec cannot exceed the number of GPUs provided by the instance specification set in k8s.aliyun.com/eci-use-specs; otherwise, container creation will fail.
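
To make the two fields concrete, here is a minimal sketch of a pod manifest. It assumes k8s.aliyun.com/eci-use-specs is set as a pod annotation; the pod name, instance type, and image tag are illustrative, not values from the figure:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo                             # hypothetical pod name
  annotations:
    # ECI GPU instance specification; pick a value from the spec table
    k8s.aliyun.com/eci-use-specs: "ecs.gn5i-c4g1.xlarge"
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: nvidia/cuda:10.1-base             # any GPU-enabled image (tag illustrative)
    command: ["nvidia-smi"]                  # print GPU info and exit
    resources:
      limits:
        nvidia.com/gpu: 1                    # must not exceed the GPUs in the chosen spec
```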

Demonstration #

For the video walkthrough of this demonstration, please follow the video lesson link.

Finally, let's briefly demonstrate how to use GPUs to accelerate a machine learning task, taking MNIST (handwritten digit recognition) training in an ASK cluster as an example:

[Figure: YAML file defining the MNIST training task]

This task is defined by a YAML file, as shown in the figure above. In it we specify an ECI GPU instance type that includes an NVIDIA P4 GPU, set the container image to nvcr.io/nvidia/pytorch (provided by NVIDIA, with tools such as CUDA and PyTorch pre-packaged), and finally request one GPU via nvidia.com/gpu.
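
The YAML from the figure is not reproduced here, but based on the description it might look roughly like this sketch (the instance type and image tag are assumptions; substitute the actual values from the figure):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mnist-train                          # hypothetical pod name
  annotations:
    # assumed P4-backed spec; use the actual value from the figure
    k8s.aliyun.com/eci-use-specs: "ecs.gn5i-c4g1.xlarge"
spec:
  containers:
  - name: pytorch
    image: nvcr.io/nvidia/pytorch:19.10-py3  # NGC image; tag is illustrative
    command: ["sleep", "infinity"]           # keep the pod running so we can exec into it
    resources:
      limits:
        nvidia.com/gpu: 1                    # request the single P4 GPU
```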

As shown in the figure above, in the ASK cluster we choose to create an application from a template, paste the YAML content into the editor on the right-hand side of the template page, and click Create to create the GPU-enabled container.

[Figure: Screenshots of nvidia-smi output, the MNIST training run, and the final results]

After the container is created, we first log in to it using kubectl and then run the nvidia-smi command to confirm that the GPU is available. As shown in the top-left screenshot in the figure, nvidia-smi successfully returns information about the GPU, such as the model (P4), driver version (418.87.01), and CUDA version (10.1), indicating that the container we created can use GPU resources normally.
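
In command form, this step looks like the following (the pod name mnist-train matches the hypothetical YAML above):

```bash
# Open a shell in the container we created
kubectl exec -it mnist-train -- /bin/bash

# Inside the container: confirm that the GPU is visible
nvidia-smi
```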

Next, as shown in the right-hand screenshot in the figure, we navigate to the /workspace/examples/mnist directory and run python main.py to start the MNIST training task. The task first downloads the MNIST dataset, which may take some time given its size; once the download finishes, training begins.
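
In shell form, assuming the example lives at the path mentioned above:

```bash
# Inside the container: run the PyTorch MNIST example shipped with the image
cd /workspace/examples/mnist
python main.py    # downloads the MNIST dataset on first run, then trains
```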

After the MNIST task completes, the training results are printed on the screen, as shown in the bottom-left screenshot in the figure. The MNIST test set contains 10,000 images, and the results show that 9,845 handwritten digits were correctly identified, an accuracy of 98.45%. Interested students can compare this with the training time of the same MNIST task without a GPU.

Summary #

In summary, ECI GPU greatly accelerates the execution of tasks such as machine learning in the cloud, and its maintenance-free deployment lets users focus on implementing their specific business functions without worrying about the underlying environment, delivering a truly out-of-the-box experience. To address users' demands for computing power, vGPU instances will also be offered in the future, further reducing user costs. Stay tuned!