Advance Deep Learning with Alibaba Open-Source and Pluggable Scheduling Tool for GPU Sharing

Cluster Scheduling: Kubernetes GPU Sharing

GPU sharing for cluster scheduling lets more model development and prediction services share a GPU, which improves Nvidia GPU utilization in a cluster. This requires dividing GPU resources, and they are divided along two dimensions: GPU memory and CUDA kernel threads. Generally, cluster-level GPU sharing involves the following Kubernetes mechanisms:

  • Extended resource definition
  • Scheduler extender mechanism
  • Device plugin mechanism
  • Kubectl extension mechanism

User Scenarios

  • A cluster administrator: “I want to improve the GPU utilization of the cluster. During development, multiple users share the model development environment.”
  • An application developer: “I hope to be able to run multiple logic tasks on the Volta GPU at the same time.”


Goals

  • Users can describe a request for a shared GPU resource through the API, and the resource can be scheduled accordingly.


Non-Goals

  • Isolation of the shared resource is not supported.
  • Overselling is not supported.

Design Principle

1. Clarify the problem and simplify the design. The first step covers only scheduling and deployment; control of run-time memory is implemented afterwards.
Many customers have a clear requirement to schedule multiple AI applications onto the same GPU. They can accept controlling memory size at the application level, using gpu_options.per_process_gpu_memory_fraction to limit the memory usage of the application. The first problem to solve is therefore to use GPU memory as the scheduling unit and to pass the allocated memory size to the container as a parameter.
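
To illustrate how a memory size passed into the container might be turned into a value for gpu_options.per_process_gpu_memory_fraction, here is a minimal Python sketch. The environment variable names GPU_MEM_CONTAINER_MIB and GPU_MEM_DEVICE_MIB are hypothetical placeholders, not names confirmed by this article. The sketch only assumes that both the container's share and the card's total memory are available in MiB.

```python
import os


def memory_fraction(container_mib: int, device_mib: int) -> float:
    """Return the share of one GPU card's memory granted to this container."""
    if device_mib <= 0:
        raise ValueError("device memory must be positive")
    if not 0 < container_mib <= device_mib:
        raise ValueError("container memory must be in (0, device memory]")
    return container_mib / device_mib


def fraction_from_env(env=os.environ) -> float:
    # Hypothetical variable names, used here for illustration only.
    container_mib = int(env["GPU_MEM_CONTAINER_MIB"])
    device_mib = int(env["GPU_MEM_DEVICE_MIB"])
    return memory_fraction(container_mib, device_mib)


if __name__ == "__main__":
    # A 1024 MiB share of a card that reports 16276 MiB.
    demo_env = {"GPU_MEM_CONTAINER_MIB": "1024", "GPU_MEM_DEVICE_MIB": "16276"}
    print(round(fraction_from_env(demo_env), 4))
```

In a TensorFlow 1.x application, the resulting value could then be supplied as per_process_gpu_memory_fraction in the session's GPU options. Note that this limit is enforced by the application itself, which matches the stated non-goal that isolation is not provided by the scheduler.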

Detailed Design


1. The Extended Resource definition of Kubernetes is still used, but the minimum unit of measurement is changed from one GPU card to one MiB of GPU memory. For example, if the node has a single GPU card with 16 GiB of memory, its corresponding resource is 16276 MiB.

Core Function Modules:

  • GPU Share Scheduler Extender: It uses the Kubernetes scheduler extension mechanism. When the global scheduler filters and binds, the extender determines whether a single GPU card on the node can provide enough GPU memory. At binding time, it records the GPU allocation result in the Pod spec as annotations, so that subsequent filtering can check the allocation results.
  • GPU Share Device Plugin: It uses the Device Plugin mechanism. Kubelet on the node calls it to allocate GPU cards, and it performs the allocation based on the result recorded by the Scheduler Extender.
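
The filtering rule described for the Scheduler Extender, that a single card rather than the node as a whole must have enough free memory, can be sketched as follows. This is an illustrative Python model, not the project's implementation. The function name and the representation of a node as a list of per-card free memory in MiB are assumptions.

```python
from typing import List


def node_fits(free_mib_per_card: List[int], request_mib: int) -> bool:
    """Return True if at least one single card can hold the whole request."""
    return any(free >= request_mib for free in free_mib_per_card)


if __name__ == "__main__":
    # Two cards with 4096 MiB free each: 8192 MiB free in total on the node.
    cards = [4096, 4096]
    print(node_fits(cards, 3072))  # request fits on one card
    print(node_fits(cards, 6144))  # fits the node total, but no single card
```

The second call shows why summing free memory across cards is not enough: a request cannot be split across GPU cards in this model.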

Detailed Process:

1. Resource reporting

  • The Scheduler Extender finds the best GPU card ID in the node according to the binpack policy. “Best” here means that, among the GPU cards in the same node whose free resources satisfy the request, the card with the least remaining resources is preferred. The selected card ID is saved as ALIYUN_COM_GPU_MEM_IDX in the annotation of the Pod. The GPU memory requested by the Pod and the time of the decision are also saved in the Pod annotation, as ALIYUN_COM_GPU_MEM_POD and ALIYUN_COM_GPU_MEM_ASSUME_TIME. These annotations are written when the Pod is bound to the selected node.
  • The Scheduler Extender calls the Kubernetes API to bind the Pod to the node.
  • The Device Plugin lists all GPU Share Pods on this node that are in Pending status and have ALIYUN_COM_GPU_MEM_ASSIGNED set to false.
  • The Device Plugin selects the Pod whose ALIYUN_COM_GPU_MEM_POD value (in the Pod annotation) equals the amount of GPU memory requested in the Allocate call. If multiple Pods meet the condition, the Pod with the earliest ALIYUN_COM_GPU_MEM_ASSUME_TIME is selected.
  • The Device Plugin sets ALIYUN_COM_GPU_MEM_ASSIGNED in the Pod annotation to true, converts the GPU information in the Pod annotation into environment variables, and returns them to Kubelet, which then actually creates the Pod.
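
The two selection rules in this list, binpack card selection and matching a Pod to an Allocate call, can be modeled in a few lines of Python. This is a sketch under stated assumptions rather than the project's code: the function and class names are hypothetical, annotation values are assumed to be strings ("false" for unassigned, integer strings for memory and time), and the input list is assumed to contain only Pending GPU Share Pods on the node.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


def pick_card_binpack(free_mib_per_card: List[int], request_mib: int) -> Optional[int]:
    """Return the index of the card with the least free memory that still fits."""
    candidates = [(free, idx) for idx, free in enumerate(free_mib_per_card)
                  if free >= request_mib]
    if not candidates:
        return None
    return min(candidates)[1]  # ties resolve to the lowest card index


@dataclass
class PendingPod:
    name: str
    annotations: Dict[str, str]


def pick_pod_for_allocate(pods: List[PendingPod], allocate_mib: int) -> Optional[PendingPod]:
    """Choose the unassigned Pod whose requested memory matches the Allocate call."""
    matches = [
        p for p in pods
        if p.annotations.get("ALIYUN_COM_GPU_MEM_ASSIGNED") == "false"
        and int(p.annotations["ALIYUN_COM_GPU_MEM_POD"]) == allocate_mib
    ]
    if not matches:
        return None
    # The earliest assume time wins; timestamps are modeled as integer strings.
    return min(matches, key=lambda p: int(p.annotations["ALIYUN_COM_GPU_MEM_ASSUME_TIME"]))


if __name__ == "__main__":
    # Card 1 is too small; card 2 has the least free memory among those that fit.
    print(pick_card_binpack([8000, 2000, 4000], 3000))

    def pod(name: str, mem: str, time: str, assigned: str) -> PendingPod:
        return PendingPod(name, {
            "ALIYUN_COM_GPU_MEM_POD": mem,
            "ALIYUN_COM_GPU_MEM_ASSUME_TIME": time,
            "ALIYUN_COM_GPU_MEM_ASSIGNED": assigned,
        })

    pods = [pod("a", "1024", "200", "false"),
            pod("b", "1024", "100", "false"),
            pod("c", "2048", "50", "false")]
    chosen = pick_pod_for_allocate(pods, 1024)
    print(chosen.name if chosen else None)
```

Preferring the fullest card that still fits keeps the other cards as empty as possible, which leaves room for later Pods with larger memory requests.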

Related Projects

Currently, the project has been open sourced in


See the Deployment documentation.

Test Sample

1. First, create an application that uses

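apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # MiB; resource name assumed to be aliyun.com/gpu-mem
            aliyun.com/gpu-mem: 1024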


See Usage documentation.


See How to build.


  • Optional support for Nvidia MPS is available in the Device Plugin;
  • The solution can be deployed automatically in the Kubernetes cluster initiated by kubeadm;
  • Scheduler Extender availability is improved;
  • A general solution for GPU, RDMA and flexible network cards is provided.



Alibaba Cloud
