Monitor GPU Metrics of a Container Service Kubernetes Cluster

When you train artificial intelligence (AI) models on an Alibaba Cloud Container Service for Kubernetes cluster built on GPU-equipped ECS instances, you need to know the GPU status of each pod. For example, you may need to know the video memory usage, GPU utilization, and GPU temperature to ensure the stability of your services. This document describes how to quickly build a GPU monitoring solution on Alibaba Cloud based on Prometheus and Grafana.

What Is Prometheus?

Prometheus is an open-source service monitoring system and time series database. Since its inception in 2012 and its public release on GitHub in 2015, Prometheus has attracted many companies and organizations. Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project, after Kubernetes, and graduated from the CNCF in August 2018.

As a next-generation open-source monitoring solution, Prometheus embodies many operations and maintenance (O&M) ideas that coincide with those of Google's Site Reliability Engineering (SRE) practice.

Set Up Container Service for Kubernetes

Prerequisites: You have created a Kubernetes cluster consisting of GPU ECS hosts through Container Service.

Log on to the Container Service console and select Container Service - Kubernetes. Choose Application > Deployment and click Create by Template.

Select your GPU cluster and a namespace. (For example, you can select the kube-system namespace.) Fill in the YAML configuration template to deploy Prometheus and the GPU-Exporter.

Deploy Prometheus

If you use a namespace other than kube-system, you need to modify the serviceAccount namespace referenced in the ClusterRoleBinding section of the YAML file accordingly.
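A minimal Prometheus manifest might look like the following sketch. The resource names, image tag, and the cluster-admin binding are illustrative assumptions, not the exact template from the console; a production setup would also mount a ConfigMap containing a prometheus.yml with Kubernetes service discovery (kubernetes_sd_configs) so that Prometheus can find the exporter pods.

```yaml
# Illustrative sketch only: names and image tag are assumptions.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin   # a dedicated read-only ClusterRole is preferable in production
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system   # change this if you deploy to another namespace
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:v2.4.3   # illustrative tag
        ports:
        - containerPort: 9090   # Prometheus web UI and API
```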

Deploy the Prometheus GPU-Exporter
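The GPU-Exporter runs as a DaemonSet so that one exporter pod is scheduled on every GPU node. The sketch below is a hypothetical example: the exporter image, the GPU node label, and the scrape annotations are assumptions (the annotation-based discovery only works if the Prometheus configuration includes a matching kubernetes_sd_configs scrape job).

```yaml
# Illustrative sketch only: image, node label, and port are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-exporter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gpu-exporter
  template:
    metadata:
      labels:
        app: gpu-exporter
      annotations:
        prometheus.io/scrape: "true"   # assumes annotation-based discovery in Prometheus
        prometheus.io/port: "9445"
    spec:
      nodeSelector:
        aliyun.accelerator/nvidia_count: "1"   # assumed label; schedules only on GPU nodes
      containers:
      - name: gpu-exporter
        image: mindprince/nvidia_gpu_prometheus_exporter:0.1   # illustrative exporter image
        ports:
        - containerPort: 9445
        volumeMounts:
        - name: nvidia
          mountPath: /usr/local/nvidia   # exporter needs the host NVIDIA libraries
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/local/nvidia
```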

Deploy Grafana
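Grafana can be deployed with a LoadBalancer Service so that it receives an external endpoint through Alibaba Cloud SLB. The following is a minimal sketch; the image tag is an assumption, and the Service name monitoring-grafana matches the endpoint referenced in the next step.

```yaml
# Illustrative sketch only: image tag is an assumption.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:5.3.4   # illustrative tag
        ports:
        - containerPort: 3000   # Grafana web UI
---
apiVersion: v1
kind: Service
metadata:
  name: monitoring-grafana
  namespace: kube-system
spec:
  type: LoadBalancer   # exposes an external endpoint via Alibaba Cloud SLB
  ports:
  - port: 80
    targetPort: 3000
  selector:
    app: grafana
```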

Choose Application > Service. Select the corresponding cluster and the kube-system namespace, and click the external endpoint of monitoring-grafana. The Grafana logon page is displayed. Log on with the initial username and password, which are both admin; you can change the password or add other accounts after logging on. On the Dashboard, you can view the node and pod GPU monitoring information.

Node GPU Monitoring

Pod GPU Monitoring
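The dashboard panels are driven by PromQL queries against the exporter's metrics. The metric names below (nvidia_gpu_duty_cycle, nvidia_gpu_memory_used_bytes, nvidia_gpu_temperature_celsius) are assumptions that match the illustrative exporter image used earlier; other exporters use different names and labels.

```
# Average GPU utilization per node (metric and label names vary by exporter)
avg by (instance) (nvidia_gpu_duty_cycle)

# Video memory in use per node, in GiB
sum by (instance) (nvidia_gpu_memory_used_bytes) / 1024 / 1024 / 1024

# GPU temperature, per GPU
nvidia_gpu_temperature_celsius
```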

Deploy Applications

If you have already installed Arena, you can use it to submit a training task.

After the task is submitted, you can see the pod deployed through Arena and the pod GPU monitoring information.

You can also see the GPU and load information about each node.
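As a sketch of this workflow, a GPU training task could be submitted and inspected with Arena roughly as follows. The task name, image, and training script are placeholders, not part of the original setup.

```shell
# Hypothetical Arena submission; name, image, and script are placeholders.
arena submit tf \
    --name=gpu-demo \
    --gpus=1 \
    --image=tensorflow/tensorflow:1.12.0-gpu \
    "python train.py"

# Inspect the submitted task and its pods
arena get gpu-demo

# Show GPU allocation and load per node
arena top node
```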


Follow me to keep abreast of the latest technology news, industry insights, and developer trends.