Auto Scaling Kubernetes Clusters Based on GPU Metrics

In a deep learning system, trained models provide services through the Serving service. This document describes how to construct a Serving service that supports auto scaling in a Kubernetes cluster.

Kubernetes uses the Horizontal Pod Autoscaler (HPA) to perform auto scaling, by default based on metrics such as CPU and memory usage. The native Heapster-based HPA module of Kubernetes does not support auto scaling based on GPU metrics, but it does support auto scaling based on custom metrics. You can therefore deploy a Prometheus Adapter as a CustomMetricServer, which exposes Prometheus metrics through the custom metrics API that HPA calls, and then configure HPA to use GPU metrics as custom metrics for auto scaling.


Prerequisites

  1. A Container Service Kubernetes cluster has been created.
  2. A GPU monitor has been deployed as described in this guide.
  3. Prometheus has been deployed for monitoring GPU metrics. The data it collects will be used as the reference for auto scaling.

Note: After the native Heapster-based HPA module is configured to use custom metrics for auto scaling, it will be unable to use CPU and memory metrics for auto scaling.

Procedure

Log on to the master nodes and execute the script that generates the Prometheus Adapter certificate.
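A minimal sketch of such a certificate step, assuming the common prometheus-adapter conventions (a self-signed serving certificate stored in a Secret named cm-adapter-serving-certs in the custom-metric namespace; all names here are assumptions, not the script's actual contents):

```shell
# Generate a self-signed serving certificate for the adapter
# (CN and file names are illustrative assumptions).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout serving.key -out serving.crt -subj "/CN=custom-metrics-apiserver"

# Store it as a Secret the adapter Pod can mount.
kubectl -n custom-metric create secret generic cm-adapter-serving-certs \
  --from-file=serving.crt=serving.crt \
  --from-file=serving.key=serving.key
```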

Deploy the Prometheus Adapter as a CustomMetricServer.
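The key part of this deployment is registering the adapter with the Kubernetes aggregation layer so that the APIServer forwards custom metrics requests to it. A sketch of that APIService object, assuming the adapter runs behind a Service named custom-metrics-apiserver in the custom-metric namespace:

```yaml
# Registers the adapter as the backend for the custom metrics API group.
# Service name and namespace are assumptions based on common conventions.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.custom.metrics.k8s.io
spec:
  service:
    name: custom-metrics-apiserver   # assumed adapter Service name
    namespace: custom-metric
  group: custom.metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
```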

Assign permissions to roles. If you use a namespace other than custom-metric, you need to change the value of the namespace parameter in the template.
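One of the bindings this template typically contains grants the HPA controller's service account read access to custom metrics. A hedged sketch (the ClusterRole name is an assumption; the horizontal-pod-autoscaler ServiceAccount in kube-system is the standard one):

```yaml
# Lets the HPA controller read metrics from the custom metrics API.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hpa-controller-custom-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: custom-metrics-server-resources   # assumed ClusterRole name
subjects:
- kind: ServiceAccount
  name: horizontal-pod-autoscaler
  namespace: kube-system
```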

Call the APIServer through the CustomMetricServer to verify that the Prometheus Adapter has been successfully deployed as a CustomMetricServer.
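For example, querying the custom metrics API group directly through the aggregated APIServer should return a JSON listing of the metrics the adapter exposes:

```shell
# A non-empty resource list confirms the adapter is registered and serving.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
```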

Modify the kube-controller-manager configuration so that HPA uses custom metrics for auto scaling. Log on to each of the three master nodes and execute the script that modifies the HPA configuration on the APIServer.
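The switch this script flips is most likely the controller-manager flag that moves HPA off Heapster and onto the metrics API clients. A hedged sketch of that change on one master node (the exact script contents are not shown in this guide; the flag itself is a real kube-controller-manager option):

```shell
# Switch the HPA controller to the metrics API clients. Editing the static
# Pod manifest makes kubelet restart kube-controller-manager automatically.
sed -i 's/--horizontal-pod-autoscaler-use-rest-clients=false/--horizontal-pod-autoscaler-use-rest-clients=true/' \
  /etc/kubernetes/manifests/kube-controller-manager.yaml
```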

Test the modified configurations.

Auto Scaling Metrics

After the Prometheus CustomMetricServer is deployed, use the ConfigMap named adapter-config to configure the metrics that the Prometheus CustomMetricServer exposes to the APIServer. GPU metrics such as dutyCycle (the percentage of time over the sampling period during which the GPU was busy) are supported.
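As an illustration, a rule in adapter-config that exposes a GPU duty-cycle series as the custom metric dutyCycle might look like the following. This is a sketch only: the Prometheus series and label names depend on the GPU exporter actually deployed and are assumptions here.

```yaml
# prometheus-adapter rule: map a per-pod GPU utilization series to the
# custom metric "dutyCycle" that HPA can consume.
rules:
- seriesQuery: 'nvidia_gpu_duty_cycle{kubernetes_namespace!="",kubernetes_pod_name!=""}'
  resources:
    overrides:
      kubernetes_namespace: {resource: namespace}
      kubernetes_pod_name: {resource: pod}
  name:
    matches: nvidia_gpu_duty_cycle
    as: dutyCycle
  metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```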

Auto Scaling Based on GPU Metrics

Deploy the Serving Deployment.
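A minimal sketch of such a Deployment, requesting one GPU per replica (the names, image, and port are illustrative assumptions; only the nvidia.com/gpu resource limit is the essential part):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fast-style-transfer-serving   # assumed name
  labels:
    app: fast-style-transfer-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fast-style-transfer-serving
  template:
    metadata:
      labels:
        app: fast-style-transfer-serving
    spec:
      containers:
      - name: serving
        image: registry.example.com/fast-style-transfer:latest   # assumed image
        ports:
        - containerPort: 5000
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per replica
```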

Create an HPA that supports auto scaling based on GPU metrics.
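A hedged sketch of such an HPA, scaling on the dutyCycle custom metric as a per-pod average (the target Deployment name, replica bounds, and threshold value are assumptions):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: fast-style-transfer-hpa   # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fast-style-transfer-serving   # assumed Deployment name
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metricName: dutyCycle
      # Scale out when average GPU duty cycle across pods exceeds 40%.
      targetAverageValue: 40
```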

View HPA metrics and their values.
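For example (the HPA name here is an assumption):

```shell
# TARGETS shows the current dutyCycle value against its threshold.
kubectl get hpa fast-style-transfer-hpa
kubectl describe hpa fast-style-transfer-hpa
```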

Use the fast-style-transfer algorithm to deploy a stress-testing application, which continuously sends images to the Serving service.

After the application is deployed, you can observe the GPU metric changes on the monitoring panel.

You can also observe the metric changes through HPA.

After the test has started, you can see that the pods have been scaled out.

You can also see the scaled-out pod and GPU metrics on the monitoring panel.

Run the following command to stop the stress-testing application:
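Assuming the stress-testing application was created from a manifest file (the filename below is an assumption), deleting it stops the load:

```shell
# Remove the stress-testing resources; GPU load then drops to zero.
kubectl delete -f fast-style-transfer-press.yaml
```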

(You can also perform the scaling operation on the console.)

Verify that the value of dutyCycle has been changed to 0 in HPA.

After a period of time, check whether the pod has been scaled in.

To learn more about Alibaba Cloud Container Service for Kubernetes, visit


Follow me to keep abreast with the latest technology news, industry insights, and developer trends.
