Experience RAPIDS Data Science Acceleration in Alibaba Cloud Container Service

Image for post
Image for post

By Biran

Preface

Algorithms, data, and computing power are the three major components of AI. Without computing power, AI cannot be implemented. The computing power of the NVIDIA GPU makes it the top choice for accelerating AI model training, but it is expensive. The solution composed of Alibaba Cloud Container Service, Elastic Container Instance (ECI), and Arena provides an easy, effective, and cost-effective way to take advantage of the computing power of NVIDIA GPU.

Deep Learning is the first thing that strikes anyone while mentioning the NVIDIA GPU. Traditional machine learning and data analysis methods rarely use GPUs. In fact, NVIDIA has an excellent project called RAPIDS, which is a GPU-accelerated library launched by NVIDIA for data science and machine learning. For more information about RAPIDS, visit the official RAPIDS website. This project aims to bring GPU acceleration to traditional algorithms and provides the same operations and user experience as Pandas and scikit-learn. RAPIDS has three modules: cuDF is equivalent to Pandas, cuML is equivalent to scikit-learn, and cuGraph is used to process graph data. Given its sound compatibility, combine RAPIDS with the deep learning framework, use cuDF to accelerate data processing through a GPU, and then use the TensorFlow and PyTorch deep learning model frameworks to perform tasks.

Image for post
Image for post

This article shows how to use TensorFlow and RAPIDS to perform search by image on Alibaba Cloud Container Service and use ECI to apply for GPU resources. GPU resources are prepared in seconds and released when no longer needed. This implies that there is no need to prepare GPU instances in advance. Additionally, there is no need to deal with the Kubernetes infrastructure. Run Arena commands to build and run a RAPIDS environment that contains GPUs and manage the GPU infrastructure.

Procedure

Step 1: Prepare the Cluster

To begin with, prepare a managed Kubernetes cluster. A managed Kubernetes cluster controls nodes running on Alibaba Cloud resources, with Alibaba Cloud being responsible for operations and maintenance (O&M) costs. Meanwhile, this cluster creates a virtual Kubelet node.

In case, a Kubernetes cluster for container service is already created, select a managed Kubernetes cluster.

As the system component container needs to be run, the node must contain at least one worker node.

1) For more information on installing virtual nodes, see Virtual Nodes Documentation.

2) Configure virtual-kubelet-autoscaler. If the GPU resources in the cluster are insufficient, use virtual-kubelet-autoscaler to remove ECI container groups that are using the GPU.

Step 2: Run Arena to Create the RAPIDS Service

1) Install Arena using the commands below.

$ wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.3.0-b556a36-linux-amd64.tar.gz
$ tar -xvf arena*.tar.gz
$ cd arena-installer
$ ./install.sh

2) Now, run an Arena command to view the cluster’s GPU resources. As the following snippet shows, this user’s cluster has one real node and does not contain GPU resources. In addition, a virtual node is present, which does not physically exist and therefore is not billed. This node provides unlimited GPU resources, which can be scaled.

$ arena top node
arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-shanghai.192.168.1.248 192.168.1.248 <none> ready 0 0
virtual-kubelet 172.20.2.18 agent ready 1000 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1000 (0%)

3) Before submitting the RAPIDS task, complete the necessary preparation to accelerate the creation process and simplify access operations.

3.1 Set the method to access LoadBalancer. Note that this method is used only for the sake of simplicity. We recommend that you disallow external IP addresses to access the production environment.

$ find /charts/ -name "*.yaml" | xargs sed -i "s/NodePort/LoadBalancer/g"

3.2 Next, accelerate the startup speed using the step below.

3.2.1 The GPU container image is normally very large in size. For example, the RAPIDS container image used in this experiment has a capacity of 14.7 GB. Generally, the startup time is about 10 minutes. However, the image caching function may reduce this time to 20 seconds.

docker images | grep rapids
registry.cn-shanghai.aliyuncs.com/tensorflow-samples/rapids-samples 0.8.2-cuda10.0-runtime-ubuntu16.04 4597a0334d41 12 days ago 14.7GB

3.2.2 In serverless Kubernetes, just create an ImageCache CRD to directly use the image caching function.

$ cat > imagecache.yaml << EOF
apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
name: imagecache-rapids
spec:
images:
- registry.cn-shanghai.aliyuncs.com/tensorflow-samples/rapids-samples:0.8.2-cuda10.0-runtime-ubuntu16.04
imageCacheSize:
50
EOF

3.2.3 Wait a moment after submission. Check the status of ImageCache. The CACHEID may use the snapshot-id specified earlier while submitting a previous task.

$ kubectl get imagecache
NAME AGE CACHEID PHASE PROGRESS
imagecache-rapids 3d9h imc-uf6dxdji7txxxxx Ready 100%

4) Submit the RAPIDS development environment as shown below.

$ arena serve custom \
--name=rapids \
--selector=type=virtual-kubelet \
--toleration=all \
--annotation=k8s.aliyun.com/eci-image-snapshot-id=imc-uf6dxdji7txxxxx \
--annotation=k8s.aliyun.com/eci-instance-type=ecs.gn5i-c8g1.2xlarge \
--gpus=1 \
-e=PASSWORD=mypassw0rd \
--restful-port=80 \
--image=registry.cn-shanghai.aliyuncs.com/tensorflow-samples/rapids-samples:0.8.2-cuda10.0-runtime-ubuntu16.04
configmap/rapids-201912011815-custom-serving created
configmap/rapids-201912011815-custom-serving labeled
service/rapids-201912011815 created
deployment.extensions/rapids-201912011815-custom-serving created

Let’s take a quick look at the commnds used in the preceding code snippet.

  • --selector=type=virtual-kubelet: It indicates that the pod is started by using a virtual node.
  • --annotation=k8s.aliyun.com/eci-instance-type=ecs.gn5i-c8g1.2xlarge: It specifies the ECI container group type.
  • ecs.gn5i-c8g1.2xlarge: It refers to the Alibaba Cloud P4 model. For more information about the detailed specifications, see the relevant document.
  • --annotation=k8s.aliyun.com/eci-image-snapshot-id=imc-uf6dxdji7txxxxx: It specifies the CACHEID in Step 3.2.3.
  • -e=PASSWORD=mypassw0rd: It indicates that RAPIDS notebook is accessed by setting the PASSWORD environment variable.
  • --gpus=1: It indicates the number of GPUs applied for.

5) Now, view the access address, which is a combination of ENDPOINT_ADDRESS and PORTS. In this example, its value is 106.15.173.2:80. At the same time, see that this task switches to the Running state in 32 seconds.

$ arena serve list
NAME TYPE VERSION DESIRED AVAILABLE ENDPOINT_ADDRESS PORTS
rapids CUSTOM 201911181827 1 1 105.13.58.3 restful:80

6) Check the GPU usage of the cluster again. Note that the GPU resources are already being used.

$ arena top node
NAME IPADDRESS ROLE STATUS GPU(Total) GPU(Allocated)
cn-shanghai.192.168.1.248 192.168.1.248 <none> ready 0 0
virtual-kubelet 172.20.2.20 agent ready 1000 1
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/1000 (0%)

7) To query the pods that use this GPU, append “-d” to the original command to view specific pod names.

$ arena top node -d

8) Use the access address and port from Step 4 to visit the address in the local browser. Enter http://{ENDPOINT ADDRESS}:{ENDPOINT PORT}. In this example, the address is http://105.13.58.3:80.

Note: We recommend using the Chrome browser.

9) Enter the logon password set in the preceding command and then click on the Log in button. In this example, the password is mypassw0rd.

Image for post
Image for post

Step 3: Perform the Image Search Demo

1) Go to the cuml directory where the demo is located.
2) Double-click the cuml_knn.ipynb file.
3) Click on start icon.

Note: Click once to run one cell. Click until the demo is fully executed. For more detailed instructions, refer to the following section on Demo execution process.

Image for post
Image for post

Demo Execution Process

The image search demo process contains the following three steps. The demo results demonstrate the different performance of GPU-accelerated RAPIDS cuml KNN and scikit-learn KNN that uses only CPUs.

1) Process the Dataset

1.1) Download and Decompress the Dataset

In this demo, the STL-10 dataset contains 100,000 unlabeled images with the dimensions of 96 x 96 x 3. Use other datasets to extract image features, but make sure that these datasets use images of the same size.

Use the download_and_extract(data_dir) method to download and decompress the STL-10 dataset. In the RAPIDS image, the dataset is downloaded to the ./data directory. Use the download_and_extract() method to decompress the dataset.

Image for post
Image for post

1.2) Read the Images.

Data decompressed from the dataset is binary. Use the read_all_images(path_to_data) method to load and convert the data to the NHWC format (batch, height, width, and channels). This format allows Tensorlow to extract image features.

Image for post
Image for post

1.3) Display an Image.

Use the show_image(image) method to display a random image from the dataset.

Image for post
Image for post

1.4) Split the Dataset.

Split the dataset into two parts at a 9:1 ratio. One part is used to create image index libraries, and the other is used to search for images.

Image for post
Image for post

2) Extract Image Features

Use TensorFlow and Keras to extract image features. Use the pre-trained model ResNet50 (notop), which is based on the ImageNet dataset.

2.1) Set TensorFlow Parameters.

By default, TensorFlow uses all GPU memory capacities. Reserve some GPU memory capacities for cuML. Use the following methods to set the GPU memory parameters:

  • Method 1: Allocate the memory according to operational requirements.
config.gpu_options.allow_growth = True
  • Method 2: Set a ratio to determine the amount of memory that is used by TensorFlow.

This demo uses method 2 and sets the ratio to 0.3. This indicates that TensorFlow may use 30% of the GPU memory. Adjust the ratio as needed.

config.gpu_options.per_process_gpu_memory_fraction = 0.3
Image for post
Image for post

2.2 Download the pre-trained model ResNet50 (notop). Connect TensorFlow to the public network and download the model. The size of the model is about 91 MB. The model will be downloaded to the /root/.keras/models/ directory.

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

Now, run the model.summary() method to view the network structure of your model.

Image for post
Image for post

2.3 Extract Image Features

Call the model.predict() method on the split datasets to extract image features.

Image for post
Image for post

3) Search for Similar Images

3.1 Use cuML KNN to Search for Similar Images.

Set K to 3 (k=3) to search for the three most similar images. Adjust the value of K as needed. Use the knn_cuml.fit() method while creating indices. Use the knn_cuml.kneighbors() method while searching for neighbors.

Image for post
Image for post

It takes 791 milliseconds for KNN to retrieve vectors.

Use scikit-learn KNN to search for similar images. Set K to 3 (n_neighbors=3), and specify n_jobs=-1 to use all the CPUs to search for the nearest neighbors.

Note: The ecs.gn5i-c8g1.2xlarge model is configured with 8 vCPUs.

Image for post
Image for post

It takes 7 minutes and 34 seconds for KNN to retrieve vectors.

Compare the search results of cuML KNN and scikit-learn KNN. Compare the vector retrieval speeds of cuML KNN and scikit-learn KNN. GPU-accelerated cuML KNN takes only 791 milliseconds, whereas scikit-learn KNN that uses CPUs takes 7 minutes and 34 seconds. The retrieval speed of cuML KNN is nearly 600 times faster than that of scikit-learn KNN.

Check whether the search results of cuML KNN and scikit-learn KNN are the same. Compare the following output arrays:

  • Distance: The K smallest distance values. K indicates the number of distance values. This demo searches 10,000 images. The value of K is 3. Therefore, distance.shape=(10000,3).
  • Indices: The corresponding image indices. indices.shape=(10000, 3).

The dataset used in this demo contains identical images. Identical images may have different indices. Therefore, you must use distances instead of indices to compare the results. A calculation deviation may exist. If the deviation of the three smallest distance values for the 10,000 images is smaller than 1 in both methods, the results are considered to be identical.

Image for post
Image for post

Image Search Results

This demo selects five random images from the 10,000 images and uses them to search for similar images. The search results are displayed in five rows and four columns.

The first column displays the original five images. The second, third, and fourth columns display similar images. From the second column to the fourth column, the similarity between the original and similar images decreases. The title of each similar image is the calculated distance. A larger value indicates a lower similarity.

Image for post
Image for post

Step 4: Clean Up

Execute the following commands to clean up.

$ arena serve delete rapids
service "rapids-201912011815" deleted
deployment.extensions "rapids-201912011815-custom-serving" deleted
configmap "rapids-201912011815-custom-serving" deleted
INFO[0000] The Serving job rapids with version 201912011815 has been deleted successfully

Summary

This article describes how to experience fast and simple use of the RAPIDS accelerated data science solution at a low cost through Arena and Alibaba Cloud Serverless Kubernetes.

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store