Advance Deep Learning with Alibaba Open-Source and Pluggable Scheduling Tool for GPU Sharing

Cluster Scheduling: Kubernetes GPU Sharing

  • Extended resource definition
  • Scheduler extender mechanism
  • Device plugin mechanism
  • Kubectl extension mechanism

User Scenarios

  • A cluster administrator: “I want to improve the GPU utilization of the cluster. During development, multiple users share the model development environment.”
  • An application developer: “I hope to be able to run multiple logic tasks on the Volta GPU at the same time.”

Goal

  • Users can request a share of a resource through the API, and that request can be scheduled.

Non-Goal

  • Isolation of the shared resource is not supported.
  • Overselling is not supported.

Design Principle

Detailed Design

Premise:

Core Function Modules:

  • GPU Share Scheduler Extender: It uses the Kubernetes scheduler extender mechanism. When the global scheduler filters and binds, the extender determines whether a single GPU card on the node can provide enough GPU memory. At bind time, it records the GPU allocation result on the Pod through annotations, so that subsequent filtering can check the allocation results.
  • GPU Share Device Plugin: It uses the Device Plugin mechanism and is called by the Kubelet on the node. It allocates the GPU card according to the allocation result recorded by the Scheduler Extender.
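The filter check described for the Scheduler Extender can be illustrated with a minimal sketch. It is not the project's implementation: the function names and the plain dictionary that maps node names to per-card free memory (in MiB) are assumptions made for illustration. The point it shows is that the request must fit on one card, so a node whose cards only add up to the requested amount is still rejected.

```python
from typing import Dict, List


def node_fits(free_mem_per_gpu: List[int], request_mib: int) -> bool:
    """Return True if at least one single GPU card can hold the whole request."""
    return any(free >= request_mib for free in free_mem_per_gpu)


def filter_nodes(nodes: Dict[str, List[int]], request_mib: int) -> List[str]:
    """Keep only the nodes on which some single card has enough free memory.

    `nodes` is a hypothetical structure: node name -> free MiB per GPU card.
    """
    return [name for name, gpus in nodes.items() if node_fits(gpus, request_mib)]


if __name__ == "__main__":
    cluster = {
        "node-a": [512, 512],   # 1024 MiB free in total, but split across cards
        "node-b": [2048, 256],  # the first card alone can hold 1024 MiB
    }
    print(filter_nodes(cluster, 1024))
```

In this example only node-b passes the filter, because node-a has no single card with 1024 MiB free.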

Detailed Process:

  • Find the best GPU card ID in the node according to the binpack policy. “Best” here means that, among the GPU cards in the same node, the card with the least remaining free resources that still satisfies the request is selected first. Its ID is saved as ALIYUN_COM_GPU_MEM_IDX in the annotation of the Pod. The GPU memory requested by the Pod is also saved as ALIYUN_COM_GPU_MEM_POD, together with ALIYUN_COM_GPU_MEM_ASSUME_TIME, in the annotation of the Pod, and the Pod is bound to the selected node at this time.
  • Call the Kubernetes API to bind the Pod to the node.
  • List all GPU Share Pods on this node that are in Pending status and have ALIYUN_COM_GPU_MEM_ASSIGNED set to false.
  • Select the Pod whose ALIYUN_COM_GPU_MEM_POD value (in the Pod annotation) equals the amount requested in the Allocate call. If multiple Pods meet the condition, select the Pod with the earliest ALIYUN_COM_GPU_MEM_ASSUME_TIME.
  • Set ALIYUN_COM_GPU_MEM_ASSIGNED in the Pod annotation to true, convert the GPU information in the Pod annotation into environment variables, and return them to the Kubelet, which then actually creates the Pod.
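The two selection rules in the steps above (binpack card selection on the scheduler side, Pod matching on the device plugin side) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's code: the function names, the dictionary layout used for a Pod, and the choice to store ALIYUN_COM_GPU_MEM_ASSUME_TIME as an integer timestamp string are all hypothetical. Only the annotation keys come from the text.

```python
from typing import List, Optional


def pick_gpu_binpack(free_mem_per_gpu: List[int], request_mib: int) -> Optional[int]:
    """Binpack: return the index of the card with the least free memory
    that still satisfies the request, or None if no card fits."""
    best = None
    for idx, free in enumerate(free_mem_per_gpu):
        if free < request_mib:
            continue  # this card cannot hold the request
        if best is None or free < free_mem_per_gpu[best]:
            best = idx
    return best


def match_pending_pod(pods: List[dict], allocate_mib: int) -> Optional[dict]:
    """Pick the Pending, not-yet-assigned Pod whose requested memory equals
    the Allocate amount; ties go to the earliest assume time.

    Each pod is a hypothetical dict: {"name", "phase", "annotations"}.
    Annotation values are strings, as Kubernetes annotations are.
    """
    candidates = []
    for pod in pods:
        ann = pod["annotations"]
        if pod["phase"] != "Pending":
            continue
        if ann.get("ALIYUN_COM_GPU_MEM_ASSIGNED") != "false":
            continue
        if int(ann.get("ALIYUN_COM_GPU_MEM_POD", "-1")) != allocate_mib:
            continue
        candidates.append(pod)
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda p: int(p["annotations"]["ALIYUN_COM_GPU_MEM_ASSUME_TIME"]),
    )


if __name__ == "__main__":
    # Cards have 4096, 1536 and 512 MiB free; a 1024 MiB request goes to card 1.
    print(pick_gpu_binpack([4096, 1536, 512], 1024))
```

Choosing the fullest card that still fits keeps the emptier cards available for larger requests, which is the purpose of a binpack policy.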

Related Projects

Deploy

Test Sample

apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # MiB
            aliyun.com/gpu-mem: 1024
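
Because overselling is not supported, the aliyun.com/gpu-mem limit in the sample bounds how many such Pods can share one card. The arithmetic can be sketched as below; the function name and the 16384 MiB card size are assumptions chosen for illustration, not values from this article.

```python
def max_pods_per_gpu(gpu_total_mib: int, request_mib: int) -> int:
    """Upper bound on Pods that can share one card when the sum of
    requests may not exceed the card's memory (no overselling)."""
    if request_mib <= 0:
        raise ValueError("request_mib must be positive")
    return gpu_total_mib // request_mib


if __name__ == "__main__":
    # Hypothetical 16384 MiB card, 1024 MiB per Pod as in the sample above.
    print(max_pods_per_gpu(16384, 1024))
```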

Use

Build

Roadmap

  • Optional support for Nvidia MPS is available in the Device Plugin;
  • The solution can be deployed automatically in the Kubernetes cluster initiated by kubeadm;
  • Scheduler Extender availability is improved;
  • A general solution for GPU, RDMA and flexible network cards is provided.


Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com
