GPU Sharing Scheduler Extender Now Supports Fine-Grained Kubernetes Clusters
By Bi Ran
The Kubernetes services of major container cluster vendors around the world all provide the capability to schedule Nvidia GPU containers, but they generally do so by allocating a whole GPU card to a container. This gives better isolation and ensures that an application using a GPU is not affected by other applications. It suits deep learning model training, but it is wasteful for model development and model prediction. The demand is to let more prediction services share the same GPU card, thus improving Nvidia GPU utilization in the cluster. This requires partitioning GPU resources, where the dimensions of partitioning are GPU memory and CUDA kernel threads. Generally, cluster-level GPU sharing is mainly about two things: scheduling and isolation.
This article mainly describes scheduling. The isolation solution will be implemented based on Nvidia MPS in the future.
For fine-grained GPU card scheduling, the Kubernetes community does not currently have a good solution. This is because Kubernetes defines extended resources, such as GPUs, in a way that only supports adding and subtracting integer quantities and cannot express the allocation of complex resources. For example, if Pod A should occupy half of a GPU card, the current Kubernetes architecture has no way to record that allocation or to look it up later. The underlying mismatch is that sharing GPUs across multiple cards involves a vector resource (the free memory of each card), while an Extended Resource describes a scalar resource.
Therefore, we have designed an out-of-tree GPU Share Scheduling Solution, which relies on the existing working mechanism of Kubernetes:
- Extended resource definition
- Scheduler extender mechanism
- Device plugin mechanism
User stories:
- As a cluster administrator, I want to improve the GPU utilization of the cluster, so that multiple users can share the model development environment during development.
- As an application developer, I want to run multiple logic tasks on a Volta GPU at the same time.
Goals:
- Users can request a shared resource through the API and have it scheduled.
Non-goals:
- Isolation of the shared resource is not supported.
- Overselling is not supported.
Design principles:
- Clarify the problem and simplify the design. The first step covers only scheduling and deployment; run-time memory control is implemented afterwards.
Many customers have a clear requirement to schedule multiple AI applications onto the same GPU. They can accept controlling the memory size at the application level, using gpu_options.per_process_gpu_memory_fraction to limit the memory usage of the application. The first problem we need to solve is therefore simplified to using memory as the scheduling unit and passing the memory size to the container in the form of parameters.
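As a rough illustration of this application-level control, the sketch below converts a memory quota into the fraction expected by gpu_options.per_process_gpu_memory_fraction. The environment variable names GPU_MEM_CONTAINER_MIB and GPU_MEM_DEVICE_MIB are hypothetical placeholders for whatever parameters are passed to the container; they are not taken from the project.

```python
import os


def memory_fraction(env=os.environ):
    """Return the share of one GPU card's memory this container may use."""
    # Hypothetical variable names, used for illustration only.
    requested = int(env["GPU_MEM_CONTAINER_MIB"])  # MiB granted to this container
    total = int(env["GPU_MEM_DEVICE_MIB"])         # MiB available on the whole card
    if total <= 0 or not 0 < requested <= total:
        raise ValueError("requested memory must be in (0, total]")
    return requested / total


if __name__ == "__main__":
    demo_env = {"GPU_MEM_CONTAINER_MIB": "8138", "GPU_MEM_DEVICE_MIB": "16276"}
    print(memory_fraction(demo_env))  # 0.5
```

In TensorFlow 1.x, the resulting value would typically be passed as tf.GPUOptions(per_process_gpu_memory_fraction=fraction) when building the session configuration.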
- No intrusive modification
This design does not modify the core of Kubernetes: the design of the Extended Resource, the implementation of the Scheduler, the mechanism of the Device Plugin, and the related design of the Kubelet are all left unchanged. The Extended Resource is reused to describe the API for requesting shared resources. The advantage is a portable solution that users can run on native Kubernetes.
- Memory-based scheduling and card-based scheduling can coexist in the cluster, but within the same node they are mutually exclusive: a node's resources are allocated either by the number of cards or by memory.
- The Extended Resource definition of Kubernetes is still used, but the minimum unit of measurement is changed from 1 GPU card to 1 MiB of GPU memory. If the node has a single GPU card with 16 GiB of memory, its corresponding resource is 16276 MiB.
- User requirements for shared GPUs typically relate to model development and model prediction scenarios. Therefore, the GPU resources requested by a user cannot exceed a single card.
First, we define two new Extended Resources: gpu-mem, corresponding to GPU memory, and gpu-count, corresponding to the number of GPU cards. Together, these two scalar resources describe the vector resource and provide a mechanism to support GPU Share. The basic architecture diagram is as follows:
Core Function Modules:
- GPU Share Scheduler Extender: It uses the Kubernetes scheduler extender mechanism. When the global scheduler filters and binds, the extender determines whether a single GPU card on the node can provide enough gpu-mem. At bind time, it records the GPU allocation result in the Pod Spec as annotations, so that subsequent filtering can check the allocation results.
- GPU Share Device Plugin: It uses the Device Plugin mechanism and is called by Kubelet on the node to allocate GPU cards, executing the allocation decided by the Scheduler Extender.
1. Resource reporting
GPU Share Device Plugin uses the NVML library to query the number of GPU cards and the memory of each GPU card, and uses ListAndWatch() to report the total GPU memory of the node (quantity * memory) to Kubelet as an additional Extended Resource. Kubelet then reports it to the Kubernetes API Server. For example, if a node contains two GPU cards and each card has 16276 MiB, then from the user's perspective the GPU resources of the node are 16276 * 2 = 32552 MiB. The number of GPU cards on the node, which is 2, is also reported as an additional Extended Resource.
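The arithmetic behind resource reporting can be sketched as follows. This is only an illustration: the real plugin obtains the per-card values from NVML and reports them through ListAndWatch(), whereas the function name node_extended_resources and the plain list input here are assumptions made for the example.

```python
def node_extended_resources(card_memory_mib):
    """Aggregate per-card memory into the two node-level scalar resources."""
    return {
        "gpu-mem": sum(card_memory_mib),    # total GPU memory on the node, in MiB
        "gpu-count": len(card_memory_mib),  # number of GPU cards on the node
    }


if __name__ == "__main__":
    # Two cards with 16276 MiB each, as in the example above.
    print(node_extended_resources([16276, 16276]))
    # {'gpu-mem': 32552, 'gpu-count': 2}
```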
2. Extended scheduling
GPU Share Scheduler Extender stores the allocation information in the Pod Spec as annotations when it allocates gpu-mem to a Pod. At filtering time, it uses this information to determine whether each GPU card has enough available gpu-mem.
2.1. The default Kubernetes scheduler calls the Filter method of the GPU Share Scheduler Extender over HTTP after it has performed all of its own filter actions. This is because, when computing Extended Resources, the default scheduler can only determine whether the node as a whole has enough free resources to meet the demand; it cannot determine whether the demand can be met on a single card. Therefore, it is up to the GPU Share Scheduler Extender to check whether a single card has enough available resources.
The following figure is used as an example. In a Kubernetes cluster composed of 3 nodes, each containing 2 GPU cards, when a user requests gpu-mem = 8138, the default scheduler scans all nodes and finds that the remaining resources of N1 are 16276 * 2 - 16276 - 12207 = 4069, which does not meet the demand, so node N1 is filtered out.
The remaining resources of nodes N2 and N3 are both 8138 MiB, which satisfies the default scheduler from the perspective of the node as a whole. At this point, the default scheduler delegates secondary filtering to the GPU Share Scheduler Extender, which must determine whether a single card meets the scheduling requirement. Node N2 has 8138 MiB available in total, but GPU0 and GPU1 each have only 4069 MiB available, which cannot satisfy a single-card demand of 8138 MiB. Node N3 also has 8138 MiB available in total, but all of it belongs to GPU0, so it satisfies the single-card demand. In this way, the filtering of the GPU Share Scheduler Extender achieves precise, card-level filtering.
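The two-stage filtering in this example can be sketched as below. It is a simplified model, not the extender's actual code: each node is represented only by a list of free MiB per card, the function names are invented for the illustration, and the per-card split for N1 (one card full, the other with 4069 MiB free) is an assumption, since the text gives only the node total.

```python
# Free GPU memory per card (MiB) for the three example nodes.
NODES = {
    "N1": [0, 4069],     # assumed split; only the total of 4069 is given
    "N2": [4069, 4069],
    "N3": [8138, 0],
}


def default_filter(nodes, request):
    """Node-level check: the node's total free memory covers the request."""
    return [name for name, cards in nodes.items() if sum(cards) >= request]


def extender_filter(nodes, request):
    """Card-level check: at least one single card covers the request."""
    return [
        name for name, cards in nodes.items()
        if any(free >= request for free in cards)
    ]


if __name__ == "__main__":
    candidates = default_filter(NODES, 8138)
    print(candidates)  # ['N2', 'N3']
    survivors = extender_filter({n: NODES[n] for n in candidates}, 8138)
    print(survivors)   # ['N3']
```

Summing the list models the scalar view that the default scheduler has, while iterating over it models the vector view that only the extender can check.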
2.2. When the scheduler finds a node that meets the condition, it entrusts the bind method of the GPU Share Scheduler Extender to bind the node and the Pod. Here, the Extender needs to perform two operations:
- To find the best GPU card ID in the node according to the binpack policy. "Best" here means that, among the GPU cards in the same node whose free resources satisfy the request, the card with the least remaining resources is preferred. The selected card ID is saved as ALIYUN_COM_GPU_MEM_IDX in the annotation of the Pod. In addition, the GPU memory requested by the Pod is saved as ALIYUN_COM_GPU_MEM_POD, and the time of the assignment as ALIYUN_COM_GPU_MEM_ASSUME_TIME, in the annotation of the Pod, and the Pod is bound to the selected node at this time.
Note: The Pod annotation ALIYUN_COM_GPU_MEM_ASSIGNED is also saved and initialized to "false". It means that the Pod has been assigned to a GPU card during scheduling, but has not actually been created on the node. ALIYUN_COM_GPU_MEM_ASSUME_TIME represents the time at which the assignment was assumed.
If no GPU on the allocated node meets the condition, no binding is performed at this time and the bind step exits directly without reporting an error. The default scheduler reschedules the Pod after the assume times out.
- To call the Kubernetes API to bind the node and the Pod.
As shown in the following figure, when the GPU Share Scheduler Extender binds the Pod requesting gpu-mem 8138 to the selected node N1, it first compares the available resources of the node's GPUs: GPU0 (12207), GPU1 (8138), GPU2 (4069) and GPU3 (16276). GPU2 is discarded because its remaining resources do not meet the requirement. Among the other 3 GPUs, GPU1 has the least remaining resources while still satisfying the request, so GPU1 is selected.
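A minimal sketch of the binpack selection described above, using the numbers from this example. The function name pick_card_binpack is hypothetical, and breaking ties by the lowest card ID is an assumption; the article does not say how ties are resolved.

```python
def pick_card_binpack(free_by_card, request):
    """Return the ID of the fitting card with the least free memory, or None."""
    fitting = {card: free for card, free in free_by_card.items() if free >= request}
    if not fitting:
        return None  # no single card can hold the request
    # Least remaining memory first; ties broken by lowest card ID (assumption).
    return min(fitting, key=lambda card: (fitting[card], card))


if __name__ == "__main__":
    free = {0: 12207, 1: 8138, 2: 4069, 3: 16276}
    print(pick_card_binpack(free, 8138))  # 1
```

Preferring the fullest card that still fits keeps larger contiguous amounts of memory free on the other cards for later, larger requests.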
3. Run on the node
When Kubelet receives the event that the Pod is bound to the node, it creates the actual Pod on the node. In this process, Kubelet calls the Allocate method of the GPU Share Device Plugin, passing the gpu-mem requested by the Pod as the parameter. In the Allocate method, the corresponding Pod is run according to the scheduling decision of the GPU Share Scheduler Extender.
3.1. All the GPU Share Pods on this node that are in Pending status and have ALIYUN_COM_GPU_MEM_ASSIGNED set to false are listed.
3.2. The Pod whose ALIYUN_COM_GPU_MEM_POD value (in the Pod annotation) equals the amount requested in the Allocate call is selected. If multiple Pods meet the condition, the Pod with the earliest ALIYUN_COM_GPU_MEM_ASSUME_TIME is selected.
3.3. ALIYUN_COM_GPU_MEM_ASSIGNED in the Pod annotation is set to true, and the GPU information in the Pod annotation is converted into environment variables and returned to Kubelet, which then actually creates the Pod.
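The matching logic of steps 3.1 to 3.3 can be sketched as follows. This is an illustrative model, not the plugin's implementation: Pods are represented as plain dictionaries, annotation values are assumed to be strings with the assume time stored as an integer timestamp, and the conversion of annotations into environment variables is left out.

```python
def pick_pod_for_allocate(pods, requested_mib):
    """Choose which pending Pod an Allocate request of requested_mib belongs to."""
    # 3.1 and 3.2: pending, not yet assigned, and requesting the same amount.
    candidates = [
        pod for pod in pods
        if pod["phase"] == "Pending"
        and pod["annotations"].get("ALIYUN_COM_GPU_MEM_ASSIGNED") == "false"
        and int(pod["annotations"].get("ALIYUN_COM_GPU_MEM_POD", -1)) == requested_mib
    ]
    if not candidates:
        return None
    # 3.2: among several matches, the earliest assume time wins.
    chosen = min(
        candidates,
        key=lambda pod: int(pod["annotations"]["ALIYUN_COM_GPU_MEM_ASSUME_TIME"]),
    )
    # 3.3: mark the Pod as assigned.
    chosen["annotations"]["ALIYUN_COM_GPU_MEM_ASSIGNED"] = "true"
    return chosen


def make_pod(name, mem, assume_time, phase="Pending", assigned="false"):
    """Build a simplified Pod record for the demo (hypothetical helper)."""
    return {"name": name, "phase": phase, "annotations": {
        "ALIYUN_COM_GPU_MEM_ASSIGNED": assigned,
        "ALIYUN_COM_GPU_MEM_POD": str(mem),
        "ALIYUN_COM_GPU_MEM_ASSUME_TIME": str(assume_time),
    }}


if __name__ == "__main__":
    pods = [make_pod("a", 8138, 200), make_pod("b", 8138, 100), make_pod("c", 4069, 50)]
    print(pick_pod_for_allocate(pods, 8138)["name"])  # b
```

Matching on the requested amount is needed in this model because the Allocate call carries only the gpu-mem quantity, so the plugin has to infer which Pod the request belongs to.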
Currently, the project has been open sourced on GitHub.
1. First, create an application that uses gpu-mem. An excerpt of its Deployment spec:
    replicas: 1
    selector: # define how the deployment finds the pods it manages
      matchLabels:
        app: binpack-1
    template: # define the pods specifications
      metadata:
        labels:
          app: binpack-1
      spec:
        containers:
        - name: binpack-1
See Usage Documentation.
See How to Build.
Roadmap:
- Isolation with Nvidia MPS
- Support for deploying this solution in Kubernetes clusters set up with kubeadm
- High availability for Scheduler Extender
- A general solution for GPU, RDMA and flexible network cards