GPU Sharing Scheduler Extender Now Supports Fine-Grained Kubernetes Clusters


The Kubernetes services of the major container cluster vendors all support scheduling NVIDIA GPU containers, but they generally do so by allocating a whole GPU card to a container. This gives good isolation and ensures that an application using a GPU is not affected by other applications. It suits deep learning model training, but it is wasteful for model development and model prediction. The demand is to let more prediction services share the same GPU card, thus improving NVIDIA GPU utilization in the cluster. This requires partitioning GPU resources. Here, the dimensions of partitioning are GPU memory and CUDA kernel threads. Generally, cluster-level GPU sharing is mainly about two things:

  1. Scheduling
  2. Isolation

The scheduling part builds on the following existing Kubernetes mechanisms:

  • Extended resource definition
  • Scheduler extender mechanism
  • Device plugin mechanism

User Scenarios

  • As a cluster administrator, I want to improve the GPU utilization of the cluster, and I want multiple users to be able to share the model development environment during development.
  • As an application developer, I want to run multiple logic tasks on a Volta GPU at the same time.


Goal

  • Users can describe a request for a shared GPU resource through the API and have that resource scheduled.


Non-goals

  • Isolation of the shared resource is not supported.
  • Overselling is not supported.

Design Principle

  1. Clarify the problem and simplify the design. In the first step, only scheduling and deployment are addressed; run-time memory control is implemented afterwards.
  2. No intrusive modification.
  3. Memory-based scheduling and card-based scheduling can coexist in the same cluster, but they are mutually exclusive within a single node. On any given node, resources are allocated either by the number of cards or by memory.

Detailed Design


  1. The Extended Resource definition of Kubernetes is still used, but the minimum unit of measurement changes from one GPU card to one MiB of GPU memory. If the node has a single GPU card with 16 GiB of memory, its corresponding resource is 16276 MiB (slightly below the nominal 16 × 1024 = 16384 MiB).
  2. User demand for shared GPUs typically comes from model development and model prediction scenarios. Therefore, the GPU resources a user requests cannot exceed one card; that is, the upper limit of a request is a single card.
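
The two premises above come down to simple arithmetic, which the following Python sketch illustrates. It is not code from the project: the function names `gib_to_mib` and `validate_request` are hypothetical, and the capacity of 16276 MiB is simply the figure quoted above.

```python
MIB_PER_GIB = 1024


def gib_to_mib(gib: int) -> int:
    """Convert a nominal size in GiB to MiB, the scheduling unit used here."""
    return gib * MIB_PER_GIB


def validate_request(request_mib: int, card_capacity_mib: int) -> bool:
    """A shared-GPU request must be positive and must fit on a single card."""
    return 0 < request_mib <= card_capacity_mib


# Hypothetical node with one card that reports 16276 MiB.
CARD_CAPACITY_MIB = 16276

print(gib_to_mib(16))                             # nominal size of a 16 GiB card
print(validate_request(1024, CARD_CAPACITY_MIB))  # a 1 GiB request fits
print(validate_request(20000, CARD_CAPACITY_MIB)) # larger than one card: rejected
```

A request that exceeds one card is rejected outright rather than split across cards, which matches the single-card upper limit stated in the second premise.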

Core Function Modules:

  • GPU Share Scheduler Extender: It uses the Kubernetes scheduler extender mechanism. When the global scheduler filters and binds, it determines whether a single GPU card on the node can provide enough GPU memory. At binding time, it records the GPU allocation result in the Pod Spec as annotations, so that subsequent filtering can check the allocation results.
  • GPU Share Device Plugin: It uses the Device Plugin mechanism and is called by Kubelet on the node to allocate the GPU card, acting on the allocation result of the Scheduler Extender.
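
The filter condition described for the Scheduler Extender can be sketched as follows. This is a minimal Python illustration under assumed data structures, not the project's implementation: the per-node dictionary of free memory per card and the names `node_can_host` and `filter_nodes` are hypothetical.

```python
from typing import Dict, List


def node_can_host(free_mib_per_card: Dict[int, int], request_mib: int) -> bool:
    """True if at least one single card on the node has enough free memory.

    The total free memory of the node is deliberately not used: a request
    cannot be split across cards.
    """
    return any(free >= request_mib for free in free_mib_per_card.values())


def filter_nodes(nodes: Dict[str, Dict[int, int]], request_mib: int) -> List[str]:
    """Return the names of the nodes that pass the filter, in input order."""
    return [name for name, cards in nodes.items()
            if node_can_host(cards, request_mib)]


# Hypothetical cluster state: node name -> {card index: free MiB}.
nodes = {
    "node-a": {0: 512, 1: 512},   # 1024 MiB free in total, but no single card fits
    "node-b": {0: 2048, 1: 100},  # card 0 can hold the request
}
print(filter_nodes(nodes, 1024))
```

The point of the example is `node-a`: its total free memory equals the request, yet it is filtered out because no single card can provide 1024 MiB.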

Detailed Process:

1. Resource reporting

At binding time, the Scheduler Extender does two things:

  • It finds the best GPU card ID in the node according to the binpack policy. "Best" here means that, among the GPU cards in the same node whose free memory satisfies the request, the card with the least remaining free memory is preferred. The selected card ID is saved as ALIYUN_COM_GPU_MEM_IDX in the annotations of the Pod. In addition, the GPU memory requested by the Pod is saved as ALIYUN_COM_GPU_MEM_POD, together with ALIYUN_COM_GPU_MEM_ASSUME_TIME, in the annotations of the Pod, and the Pod is bound to the selected node at this time.
  • It calls the Kubernetes API to bind the node and the Pod.
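
The binpack selection and the annotations written at binding time can be sketched as below. This is a hedged Python illustration, not the project's code: `pick_card_binpack` and `build_annotations` are hypothetical names, ties are broken by the lower card index as an assumption, and the format of the assume-time value (nanoseconds since the epoch) is also an assumption. Only the three annotation keys come from the text above.

```python
import time
from typing import Dict, Optional


def pick_card_binpack(free_mib_per_card: Dict[int, int],
                      request_mib: int) -> Optional[int]:
    """Pick the card with the least free memory that still fits the request."""
    candidates = {idx: free for idx, free in free_mib_per_card.items()
                  if free >= request_mib}
    if not candidates:
        return None
    # Assumption: ties are broken by the lower card index.
    return min(candidates, key=lambda idx: (candidates[idx], idx))


def build_annotations(card_idx: int, request_mib: int,
                      now_ns: Optional[int] = None) -> Dict[str, str]:
    """Build the Pod annotations; Kubernetes annotation values are strings."""
    if now_ns is None:
        now_ns = time.time_ns()  # assumed timestamp format
    return {
        "ALIYUN_COM_GPU_MEM_IDX": str(card_idx),
        "ALIYUN_COM_GPU_MEM_POD": str(request_mib),
        "ALIYUN_COM_GPU_MEM_ASSUME_TIME": str(now_ns),
    }


# Hypothetical node: card index -> free MiB.
free = {0: 8000, 1: 2000, 2: 500}
chosen = pick_card_binpack(free, 1024)
if chosen is not None:
    print(build_annotations(chosen, 1024))
```

Choosing the fullest card that still fits keeps the emptier cards available for later, larger requests, which is the purpose of a binpack policy.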

Related Projects

Currently, the project has been open sourced on GitHub.


See Deployment Documentation.

Test example

1. First, create an application that requests shared GPU memory (1024 MiB in this example):

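apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GPU memory in MiB. The resource key was lost from this listing;
            # aliyun.com/gpu-mem is the name used by the gpushare project.
            aliyun.com/gpu-mem: 1024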


See Usage Documentation.


See How to Build.

Future Work

  • Isolation with NVIDIA MPS
  • Support for deploying this solution in Kubernetes clusters set up with kubeadm
  • High availability for Scheduler Extender
  • A general solution for GPU, RDMA and flexible network cards


Alibaba Cloud
