Getting Started with Kubernetes | Understanding Kubernetes RuntimeClass and Using Multiple Container Runtimes
By Jiazhuo, Senior Development Engineer at Alibaba Cloud
Source of RuntimeClass Requirements
Evolution of Container Runtimes
The evolution of container runtimes can be divided into three phases:
Phase 1: June 2014
Kubernetes was officially made open source, and Docker was the default and only container runtime at the time.
Phase 2: Kubernetes 1.3
rkt was merged into the Kubernetes main tree and became the second supported container runtime.
Phase 3: Kubernetes 1.5
An increasing number of container runtimes wanted to integrate with Kubernetes. Building each of them into Kubernetes in the same way as rkt and Docker would have made the Kubernetes code hard to maintain and its quality hard to guarantee.
To solve this problem, the community introduced the Container Runtime Interface (CRI) in Kubernetes 1.5. The CRI decouples container runtimes from Kubernetes, so Kubernetes developers no longer need to adapt to every container runtime in the community or worry about version maintenance caused by the different release cycles of container runtimes and Kubernetes. For example, the CRI plug-in of containerd allows container runtimes such as runc, Kata Containers, and gVisor to connect through containerd.
As more container runtimes emerge, different container runtimes are used in different scenarios, resulting in the need to run multiple container runtimes. Before running multiple container runtimes, you must be able to answer the following questions:
- What container runtimes are available in the cluster?
- How do I select an appropriate container runtime for a pod?
- How do I schedule a pod to a node that has a specific container runtime configured?
- Container runtimes incur extra overhead beyond what the workload itself consumes. How do I measure that overhead?
To answer these questions, the community introduced RuntimeClass. RuntimeClass was first introduced in Kubernetes 1.12 as a CustomResourceDefinition (CRD). In Kubernetes 1.14, it was reintroduced as a built-in cluster resource. Kubernetes 1.16 builds on 1.14 by adding the Scheduling and Overhead capabilities.
The following describes the workflow of RuntimeClass in Kubernetes 1.16. The preceding figure shows the RuntimeClass workflow on the left and a YAML file on the right.
The YAML file consists of two parts. The first part is used to create a RuntimeClass object named runv. The second part is used to create a pod that references the RuntimeClass named runv through spec.runtimeClassName.
The RuntimeClass object has a handler at its core. The handler identifies the program that receives the container creation request and corresponds to a container runtime. In this example, the containers in the pod are created by the container runtime runv. The Scheduling field determines the node to which the pod is scheduled.
The left part of the preceding figure shows the RuntimeClass workflow as follows:
- The Kubernetes master receives a pod creation request.
- Three types of nodes are available. Each node has a label to indicate the supported container runtimes. Each node has one or more handlers, each of which corresponds to a container runtime. For example, the second node includes the handlers that support the runc and runv container runtimes, respectively. The third node includes the handler that supports the runhcs container runtime.
- According to the scheduling.nodeSelector field, the pod is scheduled to the second node, and a pod is created by the runv handler.
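The workflow above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the scheduler's implementation. The node names, the `runtime` label key, and the assumption that the first node supports only runc are hypothetical.

```python
# Hypothetical cluster: node names, labels, and handler sets are illustrative.
NODES = [
    {"name": "node-1", "labels": {"runtime": "runc"}, "handlers": {"runc"}},
    {"name": "node-2", "labels": {"runtime": "runv"}, "handlers": {"runc", "runv"}},
    {"name": "node-3", "labels": {"runtime": "runhcs"}, "handlers": {"runhcs"}},
]

# Simplified view of the RuntimeClass named runv.
RUNTIME_CLASS = {
    "name": "runv",
    "handler": "runv",
    "scheduling": {"nodeSelector": {"runtime": "runv"}},
}


def matches(node_labels, selector):
    """A node matches when it carries every label in the selector."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def place_pod(runtime_class, nodes):
    """Return (node name, handler) for the first node that fits, or None."""
    selector = runtime_class["scheduling"]["nodeSelector"]
    handler = runtime_class["handler"]
    for node in nodes:
        if matches(node["labels"], selector) and handler in node["handlers"]:
            return node["name"], handler
    return None


print(place_pod(RUNTIME_CLASS, NODES))  # ('node-2', 'runv')
```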
Functions of RuntimeClass
Definition of the RuntimeClass Structure
Here, we will use the RuntimeClass in Kubernetes 1.16 as an example. The RuntimeClass structure is defined as follows:
A RuntimeClass object represents a container runtime. Its structure includes the Handler, Overhead, and Scheduling fields.
- The Handler field indicates the program that receives the pod creation request. A handler corresponds to a container runtime.
- The Overhead field was introduced in Kubernetes 1.16 and indicates the extra resources a pod consumes beyond those requested by its workload containers.
- The Scheduling field was introduced in Kubernetes 1.16. Its setting is automatically injected into the pod’s nodeSelector.
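As a rough model of the three fields, the structure can be mirrored with Python dataclasses. This is a sketch for illustration, not the Kubernetes type definition. The Python field names and the sample values (`250m`, `120Mi`, the `runtime` label) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Overhead:
    # Resource name -> quantity string, for example {"cpu": "250m"}.
    pod_fixed: Dict[str, str] = field(default_factory=dict)


@dataclass
class Scheduling:
    # Labels a node must carry to support this RuntimeClass.
    node_selector: Dict[str, str] = field(default_factory=dict)
    # Tolerations added to pods that reference this RuntimeClass.
    tolerations: List[dict] = field(default_factory=list)


@dataclass
class RuntimeClass:
    name: str
    handler: str  # Name of the container runtime handler on the node.
    overhead: Optional[Overhead] = None
    scheduling: Optional[Scheduling] = None


# Hypothetical sample values.
runv = RuntimeClass(
    name="runv",
    handler="runv",
    overhead=Overhead(pod_fixed={"cpu": "250m", "memory": "120Mi"}),
    scheduling=Scheduling(node_selector={"runtime": "runv"}),
)
print(runv)
```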
Example of RuntimeClass Resource Definition
RuntimeClass can be referenced in a pod by setting a RuntimeClass name in the runtimeClassName field.
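A minimal sketch of such a definition follows, written as Python dictionaries and printed as JSON, which kubectl accepts in place of YAML. It assumes the `node.k8s.io/v1beta1` API version served by Kubernetes 1.16. The overhead values, the `runtime` label, the pod name, and the container image are hypothetical.

```python
import json

# RuntimeClass named runv with all three fields set (values are illustrative).
runtime_class = {
    "apiVersion": "node.k8s.io/v1beta1",
    "kind": "RuntimeClass",
    "metadata": {"name": "runv"},
    "handler": "runv",
    "overhead": {"podFixed": {"cpu": "250m", "memory": "120Mi"}},
    "scheduling": {"nodeSelector": {"runtime": "runv"}},
}

# Pod that references the RuntimeClass by name through spec.runtimeClassName.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "runv-pod"},
    "spec": {
        "runtimeClassName": runtime_class["metadata"]["name"],
        "containers": [{"name": "app", "image": "nginx"}],
    },
}

print(json.dumps(runtime_class, indent=2))
print(json.dumps(pod, indent=2))
```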
Definition of the Scheduling Structure
The Scheduling field is related to the scheduling of the pod that references the RuntimeClass object.
The Scheduling field consists of two fields: NodeSelector and Tolerations. These two fields are similar to NodeSelector and Tolerations for a pod.
NodeSelector provides a list of labels that a node must carry to indicate that it supports a certain RuntimeClass. After a pod references the RuntimeClass, the RuntimeClass admission merges the NodeSelector of the RuntimeClass into the NodeSelector of the pod. If the two contain a label with the same key but different values, the RuntimeClass admission rejects the pod. The RuntimeClass does not automatically set labels for nodes. You need to set labels in advance.
Tolerations provide the toleration list of the RuntimeClass. After a pod references the RuntimeClass, the RuntimeClass admission merges the toleration lists of the pod and the RuntimeClass. A toleration that appears in both lists is kept only once in the merged list.
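The two merge rules can be sketched as follows. This is a simplified illustration of the behavior described above, not the admission controller's code. The function names and sample labels are hypothetical.

```python
def merge_node_selector(pod_selector, class_selector):
    """Merge the RuntimeClass selector into the pod selector.

    Raises ValueError when both define the same key with different values,
    which models the admission rejecting the pod.
    """
    merged = dict(pod_selector)
    for key, value in class_selector.items():
        if key in merged and merged[key] != value:
            raise ValueError(
                f"conflicting nodeSelector for {key!r}: {merged[key]!r} vs {value!r}"
            )
        merged[key] = value
    return merged


def merge_tolerations(pod_tolerations, class_tolerations):
    """Union of both lists; identical tolerations are kept only once."""
    merged = list(pod_tolerations)
    for toleration in class_tolerations:
        if toleration not in merged:
            merged.append(toleration)
    return merged


sandbox = {"key": "sandbox", "operator": "Exists", "effect": "NoSchedule"}
selector = merge_node_selector({"zone": "a"}, {"runtime": "runv"})
tolerations = merge_tolerations([sandbox], [dict(sandbox)])

print(selector)          # {'zone': 'a', 'runtime': 'runv'}
print(len(tolerations))  # 1
```

Calling `merge_node_selector({"runtime": "runc"}, {"runtime": "runv"})` raises `ValueError`, which corresponds to the rejected pod.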
Why Was the Pod Overhead Field Introduced?
The left part of the preceding figure shows a Docker pod, and the right part shows a Kata pod. The Docker pod includes a conventional container and a pause container. The pause container is ignored when the resource usage of the pod is calculated. For the Kata pod, likewise, only the resources of its containers are counted. The Kata agent, pause container, and guest kernel are not. These uncounted components may consume up to 100 MB, so they cannot be ignored.
This is why the Pod Overhead field was introduced. Its structure is defined as follows:
Its definition is simple and only contains one field: PodFixed. This field is a key-value map in which the key is a resource name and the value is a quantity. Each quantity is the amount of that resource consumed as overhead. You can set PodFixed to specify the CPU and memory overhead of a pod.
Scenarios and Limits of the Pod Overhead Field
The Pod Overhead field is used in three scenarios:
Pod scheduling
Before overhead was introduced, a pod could be scheduled to a node if the available resources of the node were no less than the amount requested by the pod. With the Pod Overhead field, a pod can be scheduled to a node only when the available resources of the node are no less than the amount requested by the pod plus the pod overhead.
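The change in the fit check can be expressed in one comparison. The sketch below uses memory in MiB and hypothetical numbers. The real scheduler evaluates every resource type, not just memory.

```python
def fits(node_free_mib, pod_request_mib, overhead_mib=0):
    """A pod fits when free memory covers its request plus its overhead."""
    return node_free_mib >= pod_request_mib + overhead_mib


# Hypothetical node with 600 MiB free and a pod that requests 500 MiB.
print(fits(600, 500))                    # True: overhead not considered
print(fits(600, 500, overhead_mib=200))  # False: 500 + 200 > 600
```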
ResourceQuota
A resource quota places a limit on the resources that can be used in a namespace. For example, assume that a namespace has a memory quota of 1 GB and that each pod requests 500 MB of memory. A maximum of two such pods can be scheduled in the namespace. If each pod also carries 200 MB of overhead, a maximum of one such pod can be scheduled in this namespace.
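The arithmetic behind this example is an integer division. The sketch below treats 1 GB as 1024 MB, which is an assumption. The result is the same with 1000 MB.

```python
def max_pods(quota_mb, request_mb, overhead_mb=0):
    """Number of identical pods that fit under a namespace memory quota."""
    return quota_mb // (request_mb + overhead_mb)


QUOTA_MB = 1024  # Assumes 1 GB = 1024 MB.

print(max_pods(QUOTA_MB, 500))                   # 2
print(max_pods(QUOTA_MB, 500, overhead_mb=200))  # 1
```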
Kubelet pod eviction
After the Overhead field was introduced, overhead was included in a node’s resource usage. This increases the proportion of used resources and affects Kubelet pod eviction.
Next, let’s look at the limits and precautions of using the Pod Overhead field.
The Pod Overhead field is permanently injected into the pod and cannot be manually modified. The Pod Overhead field persists and remains valid even when the RuntimeClass is deleted or updated.
Currently, the Pod Overhead field can be automatically injected only by the RuntimeClass admission, and cannot be manually added or modified. Any manual actions are denied.
The Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) aggregate container metrics and are not affected by the Pod Overhead field.
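The first two limits can be illustrated with a simplified model of the injection step. This is a sketch under stated assumptions, not the RuntimeClass admission code: it rejects any pod that arrives with overhead already set and copies the overhead into the pod at admission time. The function name and sample values are hypothetical.

```python
import copy


def admit(pod, runtime_classes):
    """Return a copy of the pod with overhead injected from its RuntimeClass."""
    spec = pod["spec"]
    if "overhead" in spec:
        # Simplification: reject any overhead supplied by the user.
        raise ValueError("spec.overhead cannot be set manually")
    admitted = copy.deepcopy(pod)
    class_name = spec.get("runtimeClassName")
    if class_name is not None:
        pod_fixed = runtime_classes[class_name].get("overhead", {}).get("podFixed")
        if pod_fixed:
            # The values are copied, so the pod no longer depends on the class.
            admitted["spec"]["overhead"] = dict(pod_fixed)
    return admitted


classes = {"runv": {"handler": "runv", "overhead": {"podFixed": {"memory": "120Mi"}}}}
pod = {"metadata": {"name": "runv-pod"}, "spec": {"runtimeClassName": "runv"}}

admitted = admit(pod, classes)
del classes["runv"]  # Deleting the RuntimeClass later does not change the pod.
print(admitted["spec"]["overhead"])  # {'memory': '120Mi'}
```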
Example of Running Multiple Container Runtimes
Alibaba Cloud’s ACK security sandbox container can run multiple container runtimes. The following describes how to run multiple container runtimes in the environment shown in the preceding figure.
As shown in the preceding figure, the pod on the left has the container runtime runc, which corresponds to the RuntimeClass runc. The pod on the right has the container runtime runv, which references the RuntimeClass runv. Requests are highlighted in different colors. Blue requests are runc requests, and red requests are runv requests. In the lower part of the figure, the core component is containerd. Multiple container runtimes can be configured in containerd, which forwards requests.
When the Kubernetes API server receives a runc request, it forwards the request to Kubelet, which forwards the request to the CRI plug-in. The CRI plug-in queries the containerd configuration file to identify the handler for runc and determines that the request is served by containerd-shim through the shim API runtime v1. Then, a pod is created by containerd-shim. This is the workflow for runc.
The workflow for runv is similar. A request is forwarded by the Kubernetes API server, Kubelet, and the CRI plug-in in sequence. The CRI plug-in queries the containerd configuration file and determines that the request is served by containerd-shim-kata-v2 through the shim API runtime v2. Then, a Kata pod is created by containerd-shim-kata-v2.
Now, let’s look at the containerd configuration.
By default, the containerd configuration is stored in /etc/containerd/config.toml. The core configuration is in the plugins.cri.containerd section. Each runtimes configuration item has the same prefix in its name, that is, plugins.cri.containerd.runtimes. The name is suffixed with runc or runv, which corresponds to the handler name in the aforementioned RuntimeClass object. The configuration item plugins.cri.containerd.runtimes.default_runtime indicates that a pod that does not specify a RuntimeClass but is scheduled to the current node uses the container runtime runc by default.
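The lookup that the CRI plug-in performs can be modeled as a dictionary lookup with a default. The sketch below is an in-memory stand-in for the runtimes section, not a parser for config.toml. The `runtime_type` values are assumptions based on common containerd setups and may differ in a given cluster.

```python
# Hypothetical stand-in for the plugins.cri.containerd.runtimes section.
RUNTIMES = {
    "runc": {"runtime_type": "io.containerd.runtime.v1.linux"},  # shim v1
    "runv": {"runtime_type": "io.containerd.kata.v2"},           # shim v2
}
DEFAULT_RUNTIME = "runc"


def resolve_runtime(handler=None):
    """Map a RuntimeClass handler to its configured runtime type.

    A pod without a RuntimeClass passes no handler and gets the default.
    """
    name = handler or DEFAULT_RUNTIME
    if name not in RUNTIMES:
        raise KeyError(f"no runtime configured for handler {name!r}")
    return name, RUNTIMES[name]["runtime_type"]


print(resolve_runtime("runv"))  # ('runv', 'io.containerd.kata.v2')
print(resolve_runtime())        # ('runc', 'io.containerd.runtime.v1.linux')
```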
The following example creates two RuntimeClass objects: runc and runv. You can view all available container runtimes by using kubectl get runtimeclass.
The following figure shows how to create a runc pod and a runv pod. Pay special attention to the runtimeClassName field, which references the container runtimes runc and runv for the two pods.
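As a rough equivalent of these definitions, the sketch below builds the two RuntimeClass objects and the two pods as Python dictionaries and prints them as one JSON list that could be passed to kubectl. It assumes the `node.k8s.io/v1beta1` API version of Kubernetes 1.16. The helper names, pod names, and container image are hypothetical.

```python
import json


def runtime_class_manifest(name):
    """RuntimeClass whose handler has the same name as the object."""
    return {
        "apiVersion": "node.k8s.io/v1beta1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": name,
    }


def pod_manifest(pod_name, runtime_class_name, image="nginx"):
    """Pod that selects its container runtime through runtimeClassName."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": "app", "image": image}],
        },
    }


items = [runtime_class_manifest(name) for name in ("runc", "runv")]
items += [pod_manifest("runc-pod", "runc"), pod_manifest("runv-pod", "runv")]

print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```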
After pods are created, run the kubectl command to view the container running state of the pods and the container runtimes used by the pods. The cluster includes two pods: runc pod and runv pod. One pod references the RuntimeClass runc, and the other pod references the RuntimeClass runv. Both pods are in the running state.
Let’s summarize what we have learned in this article.
- RuntimeClass is a built-in cluster resource of Kubernetes that is used to run multiple container runtimes.
- You can set the Scheduling field in RuntimeClass to automatically schedule a pod to the node with a specified container runtime. For automatic scheduling, you need to configure labels for this node in advance.
- You can set the Overhead field in RuntimeClass to account for the resources that a pod consumes beyond those of its workload containers. Scheduling, resource quotas, and Kubelet pod eviction then reflect the actual resource usage of the pod more accurately.