GPU-Enabled Container Instances for Faster Performance

By Xian Wei

Alibaba Cloud’s ACK Serverless service, part of Alibaba Cloud Container Service for Kubernetes (ACK), now officially supports GPU-enabled container instances. Built on the existing Alibaba Cloud Elastic Container Instance (ECI) service, these instances let you run AI computing tasks in the cloud in a serverless mode and get them started more quickly. They also reduce the O&M burden of an AI platform and improve overall computing efficiency.

It is now widely accepted in the industry that AI computing depends on the computing power of GPUs. However, building a GPU-enabled cluster environment from scratch can be difficult: it involves purchasing GPUs, preparing the machines, installing drivers, and setting up the container environment.

The serverless delivery mode of GPU-enabled resources demonstrates the core advantages of going serverless. It gives you a standardized, immediate supply of resources, so you do not need to purchase machines or log on to a node to install the GPU driver. This in turn greatly reduces the deployment complexity of an AI platform, allowing you to focus on developing and maintaining AI models and applications rather than the underlying infrastructure.

What’s more, this makes using GPU and CPU resources as simple and convenient as turning on a faucet. With the pay-as-you-go billing method, you pay only for the resources you actually use and avoid the high cost of purchasing resources that you may never need.

So, let’s get into the details. Creating a GPU-mounted pod in ACK Serverless is a relatively simple process. You specify the GPU type your service requires through an annotation, and the number of GPUs in resources.limits. Alternatively, you can specify the instance type directly. Each pod occupies its GPU exclusively; in other words, vGPU is not supported for the time being. GPU-enabled instances are billed at the same rate as ECS GPU instances, with no additional charges. Currently, Alibaba Cloud ECI provides the following types of GPU:

Below is a screenshot of the console.

As you can see, creating one of these instances is quite simple.

In the remainder of this blog, we take a quick look at an AI model that uses a simple image recognition algorithm. This example shows how the new GPU-enabled instances can be a powerful addition to your tool set, quickly running the deep learning tasks involved in ACK Serverless.

Use TensorFlow for an Image Recognition Algorithm

Recognizing what is shown in this image is a very easy task for you or me, but it is not necessarily an easy task for machines. To recognize the panda in the picture, a computer requires a large amount of data input for an appropriate algorithm model to be trained. These algorithms can be pretty intensive for a regular CPU instance. Let’s try to have the system recognize the panda in this image based on the existing TensorFlow model.

Here, we choose the example for getting started with TensorFlow.

The image is built based on the TensorFlow official image tensorflow/tensorflow:1.13.1-gpu-py3, and the models repository required for the example has been downloaded from:

In the Serverless cluster console, create the following YAML file based on the template, or use kubectl to deploy it. In the pod, set the GPU type to P4 and the number of GPUs to 1.
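As a minimal sketch of such a pod definition, the short Python script below builds a manifest and prints it as JSON, which kubectl accepts in the same way as YAML. Several details are assumptions rather than values taken from this article: the annotation key k8s.aliyun.com/eci-gpu-type, the placeholder image name, and the command that runs the ImageNet classification tutorial script from the models repository. Check the current ECI documentation for the exact annotation before relying on it.

```python
import json


def gpu_pod_manifest(name, image, command, gpu_type="P4", gpu_count=1):
    """Build a pod manifest that requests GPUs of a given type (sketch only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Assumption: annotation key used by ECI to select the GPU type.
            "annotations": {"k8s.aliyun.com/eci-gpu-type": gpu_type},
        },
        "spec": {
            # Retry the container only if it fails; a finished job stays finished.
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": command,
                    # The number of GPUs goes in resources.limits.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest(
    name="tensorflow",
    # Hypothetical image name; substitute the image described above.
    image="registry.example.com/demo/tensorflow-imagenet:latest",
    # Assumption: the image contains the models repository at this path.
    command=["sh", "-c", "python models/tutorials/image/imagenet/classify_image.py"],
)

print(json.dumps(manifest, indent=2))
```

If the script is saved as, say, make_pod.py, its output can be piped straight into the cluster with `python make_pod.py | kubectl apply -f -`.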

Create the pod and wait for the execution to complete. Check the pod log:

According to the pod log, the model has detected a panda in the picture. In the entire machine learning and computing process, we ran only a single pod. When the pod terminates, the task is complete. We did not prepare an ECS environment, buy any GPUs, install NVIDIA GPU drivers, or install Docker. The computing power, like water and electricity, was simply used on demand.
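Waiting for the pod to finish can also be scripted. The sketch below assumes you capture the output of `kubectl get pod tensorflow -o json` and pass it to a small helper; the helper name pod_finished and the hand-written sample document are illustrative only. It relies on the standard Kubernetes pod phases, where Succeeded and Failed are the two terminal states.

```python
import json


def pod_finished(pod_json: str) -> bool:
    """Return True once the pod has reached a terminal phase."""
    phase = json.loads(pod_json).get("status", {}).get("phase", "Unknown")
    return phase in ("Succeeded", "Failed")


# Trimmed, hand-written stand-in for `kubectl get pod tensorflow -o json` output.
sample = '{"metadata": {"name": "tensorflow"}, "status": {"phase": "Succeeded"}}'

print(pod_finished(sample))
```

Once the helper returns True, `kubectl logs tensorflow` shows the classification result.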


Virtual nodes in ACK support GPUs based on ECI in the same way as ACK Serverless. However, the pod must either be explicitly scheduled to the virtual node or be created in a namespace that has the virtual-node-affinity-injection=enabled label. The virtual node-based method flexibly supports various deep learning frameworks and tools, such as Kubeflow, Arena, and other custom CRDs.

Consider the following example:
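A minimal sketch of the two placement options, again written as Python that emits JSON manifests for kubectl. The namespace label comes from the text above; everything else is an assumption made for illustration: the virtual node name virtual-kubelet, the namespace name vk, and the placeholder image. The GPU annotation and resource limit are left out here to keep the focus on scheduling.

```python
import copy
import json

# Assumption: the name under which the virtual node is registered in the cluster.
VIRTUAL_NODE_NAME = "virtual-kubelet"

# Option 2: a namespace carrying the label mentioned in the text.
namespace = {
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "name": "vk",  # hypothetical namespace name
        "labels": {"virtual-node-affinity-injection": "enabled"},
    },
}

# A bare pod to place; the image name is a placeholder.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tensorflow"},
    "spec": {
        "containers": [
            {
                "name": "tensorflow",
                "image": "registry.example.com/demo/tensorflow-imagenet:latest",
            }
        ]
    },
}


def pin_to_virtual_node(pod, node_name=VIRTUAL_NODE_NAME):
    """Option 1: return a copy of the pod bound directly to the virtual node."""
    pinned = copy.deepcopy(pod)
    pinned["spec"]["nodeName"] = node_name
    return pinned


def place_in_namespace(pod, ns):
    """Option 2: return a copy of the pod created in the labeled namespace."""
    placed = copy.deepcopy(pod)
    placed["metadata"]["namespace"] = ns["metadata"]["name"]
    return placed


print(json.dumps(pin_to_virtual_node(pod), indent=2))
print(json.dumps(place_in_namespace(pod, namespace), indent=2))
```

Setting nodeName bypasses the scheduler entirely, which keeps the sketch short. The namespace label is the less intrusive option when many pods should land on the virtual node, because the pod definitions themselves stay unchanged.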
