The Burgeoning Kubernetes Scheduling System — Part 3: Binpack Scheduling That Supports Batch Jobs

Preface

The Burgeoning Kubernetes Scheduling System series presents our experiences, technical thinking, and implementation details to Kubernetes users and developers. We hope the articles can help you understand the powerful capabilities and future trends of the Kubernetes scheduling system.

Why Binpack?

The default resource scheduling policy of Kubernetes is LeastRequestedPriority: nodes with the least consumed resources are preferred, so workloads are spread evenly across all nodes in the cluster. However, this spreading policy tends to leave resource fragments on individual nodes, which is especially wasteful for expensive resources such as GPUs. Binpack scheduling does the opposite: it fills one node up before moving on to the next, which reduces fragmentation and improves overall utilization.

Implementation Method

Constructing the Scoring Function

Constructing the scoring function is simple: users define the score that corresponds to each resource utilization rate, and this mapping drives the scheduler's placement decision. For Binpack, the score increases with utilization, so nodes that are already heavily allocated are preferred.

Scoring

Users can define which resources are referenced in the Binpack calculation and the weight of each one. For example, both CPU and GPU can be given a weight:

resourcetoweightmap:
  "cpu": 1
  "nvidia.com/gpu": 1
Each node is then scored as follows, where line() is the piecewise-linear function defined by functionshape and each resource's utilization is its requested-to-capacity ratio on the node once the pod is placed there:

Score = (line(resource1_utilization) * weight1 + line(resource2_utilization) * weight2 + ...) / (weight1 + weight2 + ...)
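To make the formula concrete, here is a minimal, self-contained Go sketch of the calculation. It is only an illustration, not the plugin's actual source code, and the names point, line, and binpackScore are ours: each resource's utilization is interpolated along the configured functionshape, and the per-resource scores are combined using the weights from resourcetoweightmap.

package main

import "fmt"

// point mirrors one entry of functionshape: a utilization percentage
// and the score assigned at that utilization.
type point struct {
    utilization, score float64
}

// line interpolates the score for a utilization value (0-100) along the
// configured shape points, clamping outside the configured range.
func line(shape []point, utilization float64) float64 {
    if utilization <= shape[0].utilization {
        return shape[0].score
    }
    for i := 1; i < len(shape); i++ {
        if utilization <= shape[i].utilization {
            p0, p1 := shape[i-1], shape[i]
            t := (utilization - p0.utilization) / (p1.utilization - p0.utilization)
            return p0.score + t*(p1.score-p0.score)
        }
    }
    return shape[len(shape)-1].score
}

// binpackScore combines the per-resource scores into a single node score,
// weighted by resourcetoweightmap, following the formula above.
func binpackScore(shape []point, utilization, weights map[string]float64) float64 {
    var weighted, totalWeight float64
    for resource, weight := range weights {
        weighted += line(shape, utilization[resource]) * weight
        totalWeight += weight
    }
    return weighted / totalWeight
}

func main() {
    shape := []point{{0, 0}, {100, 100}} // the linear shape used in the configuration below
    weights := map[string]float64{"cpu": 1, "nvidia.com/gpu": 1}

    // A node that would be 50% CPU / 75% GPU utilized after placing the pod
    // outscores an emptier node, so Binpack fills the fuller node first.
    fmt.Println(binpackScore(shape, map[string]float64{"cpu": 50, "nvidia.com/gpu": 75}, weights)) // 62.5
    fmt.Println(binpackScore(shape, map[string]float64{"cpu": 10, "nvidia.com/gpu": 25}, weights)) // 17.5
}

With the linear 0-to-0, 100-to-100 shape, fuller nodes always score higher: the first node above scores (50*1 + 75*1) / (1 + 1) = 62.5 while the emptier one scores only 17.5, so the scheduler keeps packing the fuller node until its resources run out.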

Binpack Usage

Configuration

1. Create /etc/kubernetes/scheduler-config.yaml. Users can also configure other priority policies in this file according to their needs.

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
clientConnection:
  kubeconfig: "REPLACE_ME_WITH_KUBE_CONFIG_PATH"
plugins:
  score:
    enabled:
    - name: RequestedToCapacityRatio
      weight: 100
    disabled:
    - name: LeastRequestedPriority
pluginConfig:
- name: RequestedToCapacityRatio
  args:
    functionshape:
    - utilization: 0
      score: 0
    - utilization: 100
      score: 100
    resourcetoweightmap: # Define the type of resources for Binpack operations. Users can set weight for multiple resources.
      "cpu": 1
      "nvidia.com/gpu": 1

Demo Example

This demo shows the scheduling result of Binpack for a distributed TensorFlow training job. The test cluster has two nodes, each with four GPU cards. First, deploy the tf-operator using the Kubernetes artifacts from the Arena repository:

git clone https://github.com/kubeflow/arena.git
kubectl create ns arena-system
kubectl create -f arena/kubernetes-artifacts/jobmon/jobmon-role.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-crd.yaml
kubectl create -f arena/kubernetes-artifacts/tf-operator/tf-operator.yaml
Check that the tf-operator is running:

$ kubectl get pods -n arena-system
NAME                                READY   STATUS    RESTARTS   AGE
tf-job-dashboard-56cf48874f-gwlhv   1/1     Running   0          54s
tf-job-operator-66494d88fd-snm9m    1/1     Running   0          54s
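Then submit the following TFJob. It consists of one parameter server (PS) and four workers; each worker requests one GPU, and the pod-group labels with min-available "5" mark all five pods as a single group for batch scheduling: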
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
name: "tf-smoke-gpu"
spec:
tfReplicaSpecs:
PS:
replicas: 1
template:
metadata:
creationTimestamp: null
labels:
pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
pod-group.scheduling.sigs.k8s.io/min-available: "5"
spec:
containers:
- args:
- python
- tf_cnn_benchmarks.py
- --batch_size=32
- --model=resnet50
- --variable_update=parameter_server
- --flush_stdout=true
- --num_gpus=1
- --local_parameter_device=cpu
- --device=cpu
- --data_format=NHWC
image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources:
limits:
cpu: '1'
workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
restartPolicy: OnFailure
Worker:
replicas: 4
template:
metadata:
creationTimestamp: null
labels:
pod-group.scheduling.sigs.k8s.io/name: tf-smoke-gpu
pod-group.scheduling.sigs.k8s.io/min-available: "5"
spec:
containers:
- args:
- python
- tf_cnn_benchmarks.py
- --batch_size=32
- --model=resnet50
- --variable_update=parameter_server
- --flush_stdout=true
- --num_gpus=1
- --local_parameter_device=cpu
- --device=gpu
- --data_format=NHWC
image: registry.cn-hangzhou.aliyuncs.com/kubeflow-images-public/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
name: tensorflow
ports:
- containerPort: 2222
name: tfjob-port
resources:
limits:
nvidia.com/gpu: 1
workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
restartPolicy: OnFailure
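Save the spec to a file and submit it with kubectl create -f. With the Binpack policy in place, the PS and all four workers are packed onto the same node, cn-shanghai.192.168.0.129, so its four GPU cards are fully used: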
$ kubectl get pods -o wide
NAME                    READY   STATUS    AGE   IP             NODE
tf-smoke-gpu-ps-0       1/1     Running   15s   172.20.0.210   cn-shanghai.192.168.0.129
tf-smoke-gpu-worker-0   1/1     Running   17s   172.20.0.206   cn-shanghai.192.168.0.129
tf-smoke-gpu-worker-1   1/1     Running   17s   172.20.0.207   cn-shanghai.192.168.0.129
tf-smoke-gpu-worker-2   1/1     Running   17s   172.20.0.209   cn-shanghai.192.168.0.129
tf-smoke-gpu-worker-3   1/1     Running   17s   172.20.0.208   cn-shanghai.192.168.0.129
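For comparison, this is the result for the same job when the Binpack policy is not enabled: the default spreading behavior places the pods across both nodes, leaving unused GPU cards scattered on each.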
$ kubectl get pods -o wide
NAME                    READY   STATUS    AGE   IP             NODE
tf-smoke-gpu-ps-0       1/1     Running   7s    172.20.1.72    cn-shanghai.192.168.0.130
tf-smoke-gpu-worker-0   1/1     Running   8s    172.20.0.214   cn-shanghai.192.168.0.129
tf-smoke-gpu-worker-1   1/1     Running   8s    172.20.1.70    cn-shanghai.192.168.0.130
tf-smoke-gpu-worker-2   1/1     Running   8s    172.20.0.215   cn-shanghai.192.168.0.129
tf-smoke-gpu-worker-3   1/1     Running   8s    172.20.1.71    cn-shanghai.192.168.0.130

Summary

This article introduced how to enable Binpack scheduling with RequestedToCapacityRatio, a scoring policy native to the Kubernetes scheduler. Doing so reduces resource fragmentation and improves GPU utilization, and the approach is efficient, clear, and simple to use. The following articles in this series will stay on the topic of GPU utilization, focusing on GPU sharing scheduling.

