From Confused to Proficient: Kubernetes Authentication and Scheduling

Dec 24, 2019

By Sheng Dong, Alibaba Cloud After-Sales Technical Expert

It is worth noting that, most of the time, we no longer use a system through a command line or a graphical window. When we browse Weibo or shop online, we are operating a cluster rather than a single device.

Generally, such a cluster has hundreds of nodes, and each node is a physical machine or a virtual machine. Clusters are usually located in data centers, far away from users. A cluster needs an operating system that allows its nodes to collaborate with each other and provide consistent, efficient services externally. Kubernetes is one such operating system.

By analogy with a single-node operating system, Kubernetes plays the role of the kernel. It manages the hardware and software resources of the cluster and provides a central portal through which users use and communicate with the cluster.

Programs running on clusters are very different from common programs. They are “in a cage” because the way they are produced, deployed, and used is unusual. You need a closer look to understand their essence.

Caged Program

Code

Use the Go language to write a simple web server program, app.go, which listens on port 2580. When the root path of the service is accessed over HTTP, the service returns the string “This is a small app for kubernetes…”.

package main

import (
	"github.com/gorilla/mux"
	"log"
	"net/http"
)

// about answers requests for the root path with a fixed string.
func about(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("This is a small app for kubernetes...\n"))
}

func main() {
	r := mux.NewRouter()
	r.HandleFunc("/", about)
	// Listen on all interfaces, port 2580.
	log.Fatal(http.ListenAndServe("0.0.0.0:2580", r))
}

Running the go build command compiles the program and generates an executable file named app. This is an ordinary executable file: it runs in the operating system and depends on library files in the system.

# ldd app
linux-vdso.so.1 => (0x00007ffd1f7a3000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f554fd4a000)
libc.so.6 => /lib64/libc.so.6 (0x00007f554f97d000)
/lib64/ld-linux-x86-64.so.2 (0x00007f554ff66000)

Cage

To make this program independent from the library files in the operating system, create a container image, that is, an isolated running environment.

Dockerfile is the “recipe” used to create the container image. The Dockerfile here involves the following two steps:

1. Download the CentOS base image.
2. Save the executable file of the app to the /usr/local/bin directory of the image.

FROM centos
ADD app /usr/local/bin

Address

After the image is built and stored locally, upload it to an image repository, which plays the role of an app store. You can use an image repository of Alibaba Cloud. After the upload, the image has the following address.

registry.cn-hangzhou.aliyuncs.com/kube-easy/app:latest

An image address is split into four parts: the repository address, namespace, image name, and image version. In the preceding address, the image repository is in Hangzhou, the namespace is kube-easy, the image name is app, and the image version is latest.
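The four-part split can be sketched in Go. This is only an illustration, not the parser that Docker or Kubernetes uses: the ImageRef type and the parseImage function are hypothetical names, and the sketch assumes the address always has the exact form registry/namespace/name:tag.

```go
package main

import (
	"fmt"
	"strings"
)

// ImageRef holds the four parts of an image address.
// Hypothetical type, introduced only for this illustration.
type ImageRef struct {
	Registry, Namespace, Name, Tag string
}

// parseImage splits an address of the form registry/namespace/name:tag.
// Assumption: exactly three slash-separated parts and an explicit tag.
func parseImage(addr string) (ImageRef, error) {
	parts := strings.Split(addr, "/")
	if len(parts) != 3 {
		return ImageRef{}, fmt.Errorf("expected registry/namespace/name:tag, got %q", addr)
	}
	nameTag := strings.SplitN(parts[2], ":", 2)
	if len(nameTag) != 2 || nameTag[0] == "" || nameTag[1] == "" {
		return ImageRef{}, fmt.Errorf("missing image name or tag in %q", addr)
	}
	return ImageRef{
		Registry:  parts[0],
		Namespace: parts[1],
		Name:      nameTag[0],
		Tag:       nameTag[1],
	}, nil
}

func main() {
	ref, err := parseImage("registry.cn-hangzhou.aliyuncs.com/kube-easy/app:latest")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ref)
}
```

Real image references are more flexible than this (the registry and tag can be omitted, for example), so treat the strict format check as a simplification.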

We now have a “caged” program that can run in a Kubernetes cluster.

Get In

Portal

As an operating system, Kubernetes has the concept of APIs, just like common operating systems. The API gives a cluster a portal through which it can be accessed.

The Kubernetes API is implemented as an API server running on a cluster node. An API server is a typical web server program, which provides services by exposing the HTTP or HTTPS interface.

Let’s create an Alibaba Cloud Kubernetes cluster. Once you log in to the cluster management page, you see the API server portal on the public network.

API Server intranet endpoint: https://xx.xxx.xxx.xxx:6443

Bidirectional Digital Certificate Verification

The API server of Alibaba Cloud Kubernetes uses CA signature-based bidirectional digital certificate verification to ensure secure communication with the client. For beginners, this is explained step by step below.

Conceptually, a digital certificate is a file used to verify the identity of network communication participants, similar to a graduation diploma issued by a school. The school is the trusted third-party CA, and the students are the communication participants. If society trusts the reputation of a school, it also accepts the graduation diplomas issued by that school. Participant certificates and CA certificates are analogous to graduation diplomas and school licenses, respectively.

There are two kinds of participants (CAs and common participants), two kinds of certificates (CA certificates and participant certificates), and two kinds of relationships (issuance and trust). The relationships are crucial.

Let’s discuss issuance relationships first. The following figure shows two CA certificates and three participant certificates. The CA certificate at the top issues two certificates: the CA certificate in the middle and the participant certificate on the right.

The CA certificate in the middle issues the two participant certificates below it. The five certificates are linked by issuance relationships, forming a tree-like certificate issuance diagram.

However, certificates and issuance relationships alone do not ensure trusted communication among participants. In the preceding figure, assume that the rightmost participant is a website and the leftmost participant is a browser. The browser trusts the website data not because the website has a certificate, nor because that certificate was issued by a CA, but because the browser trusts the top CA certificate. That is the trust relationship.

Having covered CA certificates, participant certificates, issuance relationships, and trust relationships, let’s discuss CA signature-based bidirectional digital certificate verification.

The client and the API server are common communication participants, and each has a certificate. The two certificates are issued by two CAs: the cluster CA issues the API server certificate, and the client CA issues the client certificate. The client trusts the cluster CA and therefore trusts the API server, whose certificate is issued by the cluster CA. In turn, the API server needs to trust the client CA before it communicates with the client.

In an Alibaba Cloud Kubernetes cluster, the cluster CA certificate and the client CA certificate are in fact the same certificate. Therefore, the relationship diagram looks like the one shown below.

KubeConfig File

Log on to the cluster management console to obtain the KubeConfig file. The file contains a client certificate and a cluster CA certificate. Because the certificates are Base64-encoded, decode them with Base64 first and then use OpenSSL to view them.
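The same decode-and-inspect step can also be done in Go instead of with the base64 and openssl tools. The following is a minimal sketch: parseEncodedCert and sampleEncodedCert are hypothetical helper names, and so that the example runs without a real KubeConfig file it builds a throwaway self-signed certificate with a made-up subject (example-user) rather than reading actual cluster data.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/base64"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// parseEncodedCert decodes a Base64 string that wraps a PEM certificate,
// the encoding used for certificate data in a KubeConfig file, and parses it.
func parseEncodedCert(b64 string) (*x509.Certificate, error) {
	pemBytes, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return nil, err
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil || block.Type != "CERTIFICATE" {
		return nil, errors.New("no PEM certificate found")
	}
	return x509.ParseCertificate(block.Bytes)
}

// sampleEncodedCert builds a throwaway self-signed certificate and encodes it
// the same way. The subject values are made up for this illustration.
func sampleEncodedCert() (string, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return "", err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{Organization: []string{"system:users"}, CommonName: "example-user"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return "", err
	}
	pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	return base64.StdEncoding.EncodeToString(pemBytes), nil
}

func main() {
	encoded, err := sampleEncodedCert()
	if err != nil {
		panic(err)
	}
	cert, err := parseEncodedCert(encoded)
	if err != nil {
		panic(err)
	}
	// These are the fields the article reads from the OpenSSL output.
	fmt.Println("Issuer:   ", cert.Issuer)
	fmt.Println("Subject:  ", cert.Subject)
	fmt.Println("Not After:", cert.NotAfter)
}
```

With a real KubeConfig file, the string passed to parseEncodedCert would be the Base64 value of the client certificate or the cluster CA certificate copied from the file.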

1. The client certificate issuer CN is the cluster ID c0256a3b8e4b948bb9c21e66b0e1d9a72 while the certificate CN is a RAM user 252771643302762862.

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 787224 (0xc0318)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=c0256a3b8e4b948bb9c21e66b0e1d9a72, OU=default, CN=c0256a3b8e4b948bb9c21e66b0e1d9a72
        Validity
            Not Before: Nov 29 06:03:00 2018 GMT
            Not After : Nov 28 06:08:39 2021 GMT
        Subject: O=system:users, OU=, CN=252771643302762862

2. The preceding client certificate passes API server verification only if the API server trusts the client CA certificate. The kube-apiserver process uses the client-ca-file parameter to specify the trusted client CA certificate file, /etc/kubernetes/pki/apiserver-ca.crt. This file contains two client CA certificates. One is related to cluster management and is not explained here. The other has the same CN as the client certificate issuer, as shown below.

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 787224 (0xc0318)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=c0256a3b8e4b948bb9c21e66b0e1d9a72, OU=default, CN=c0256a3b8e4b948bb9c21e66b0e1d9a72
        Validity
            Not Before: Nov 29 06:03:00 2018 GMT
            Not After : Nov 28 06:08:39 2021 GMT
        Subject: O=system:users, OU=, CN=252771643302762862

3. The certificate used by the API server is determined by the tls-cert-file parameter of the kube-apiserver process. This parameter points to the certificate /etc/kubernetes/pki/apiserver.crt. The CN of this certificate is kube-apiserver, and its issuer is c0256a3b8e4b948bb9c21e66b0e1d9a72, that is, the cluster CA certificate.

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 2184578451551960857 (0x1e512e86fcba3f19)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=c0256a3b8e4b948bb9c21e66b0e1d9a72, OU=default, CN=c0256a3b8e4b948bb9c21e66b0e1d9a72
        Validity
            Not Before: Nov 29 03:59:00 2018 GMT
            Not After : Nov 29 04:14:23 2019 GMT
        Subject: CN=kube-apiserver

4. The client needs to verify the preceding API server certificate. Therefore, the KubeConfig file contains the certificate of its issuer, the cluster CA. Comparing the cluster CA certificate with the client CA certificate shows that they are the same.

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 786974 (0xc021e)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CN, ST=ZheJiang, L=HangZhou, O=Alibaba, OU=ACS, CN=root
        Validity
            Not Before: Nov 29 03:59:00 2018 GMT
            Not After : Nov 24 04:04:00 2038 GMT
        Subject: O=c0256a3b8e4b948bb9c21e66b0e1d9a72, OU=default, CN=c0256a3b8e4b948bb9c21e66b0e1d9a72

Access

With the principle understood, perform a simple test. Pass the certificates to cURL as parameters, access the API server, and check that the expected result is returned.

# curl --cert ./client.crt --cacert ./ca.crt --key ./client.key https://xx.xx.xx.xxx:6443/api/
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "192.168.0.222:6443"
    }
  ]
}

Best Choice

Two Types of Nodes and One Type of Tasks

As mentioned at the beginning, Kubernetes is an operating system that manages multiple nodes in a cluster. These nodes do not all need to play the same role. A Kubernetes cluster has two types of nodes: master and worker.

This role differentiation is really a specialization of responsibilities. The master node manages the entire cluster, while worker nodes run ordinary tasks. The main components running on the master node are the cluster management components, including the API server that implements the cluster portal.

In a Kubernetes cluster, a task is defined as a pod. A pod is the atomic unit that carries tasks in a cluster. The term is sometimes translated as “container group” because a pod can encapsulate multiple containerized apps. In principle, the containers encapsulated in a pod are tightly coupled.

Best Choice

The job of the scheduling algorithm is to select a comfortable “residence” for the pod, so that the task defined by the pod can complete on that node.

The Kubernetes cluster scheduling algorithm adopts a two-step strategy to achieve “best choice.” The first step is to exclude nodes that do not meet the conditions from all nodes, that is, pre-selection. The second step is to score the remaining nodes. The winner with the highest score is the best choice.
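The two-step strategy can be sketched as a small Go program. This is a simplified illustration, not the kube-scheduler implementation: the Node, Predicate, and Priority types, the pickNode function, and the sample nodes are hypothetical, and ties are resolved in favor of the first node seen, which is an assumption of this sketch.

```go
package main

import "fmt"

// Node is a simplified cluster node (hypothetical type for this sketch).
type Node struct {
	Name         string
	FreeMilliCPU int
}

// Predicate is a pre-selection rule: false means the node is excluded.
type Predicate func(Node) bool

// Priority is a scoring rule: a higher value means a better fit.
type Priority func(Node) int

// passesAll reports whether a node satisfies every pre-selection rule.
func passesAll(n Node, predicates []Predicate) bool {
	for _, fits := range predicates {
		if !fits(n) {
			return false
		}
	}
	return true
}

// pickNode filters the nodes, scores the survivors, and returns the winner.
// The boolean is false when no node passes pre-selection.
func pickNode(nodes []Node, predicates []Predicate, priorities []Priority) (Node, bool) {
	var best Node
	bestScore, found := 0, false
	for _, n := range nodes {
		if !passesAll(n, predicates) {
			continue
		}
		score := 0
		for _, prio := range priorities {
			score += prio(n)
		}
		if !found || score > bestScore {
			best, bestScore, found = n, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{{"node-a", 500}, {"node-b", 3000}, {"node-c", 2000}}
	enoughCPU := func(n Node) bool { return n.FreeMilliCPU >= 1000 }
	moreFreeCPU := func(n Node) int { return n.FreeMilliCPU / 1000 }

	if best, ok := pickNode(nodes, []Predicate{enoughCPU}, []Priority{moreFreeCPU}); ok {
		fmt.Println("best choice:", best.Name)
	}
}
```

In this example node-a is excluded during pre-selection because it has too little free CPU, and node-b outscores node-c among the remaining nodes.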

Next, use the image created at the beginning of the article to create a pod, and use logs to analyze in detail how the pod is scheduled to a cluster node.

Pod Configuration

First, create a pod configuration file in JSON format. The configuration file has three key points: the image address, the command, and the container port.

{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "app"
    },
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "registry.cn-hangzhou.aliyuncs.com/kube-easy/app:latest",
                "command": [
                    "app"
                ],
                "ports": [
                    {
                        "containerPort": 2580
                    }
                ]
            }
        ]
    }
}

Log Level

The cluster scheduling algorithm is implemented as a system component running on the master node, like the API server. The corresponding process is named kube-scheduler, and it supports log output at multiple levels. However, the community does not provide detailed documentation of the log levels. To view how the scheduling algorithm filters and scores nodes, increase the log level to 10 by adding the parameter --v=10.

kube-scheduler --address=127.0.0.1 --kubeconfig=/etc/kubernetes/scheduler.conf --leader-elect=true --v=10

Pod Creation

Using cURL, pass the certificates and the pod configuration file as parameters and send a POST request to the API server interface to create the corresponding pod in the cluster.

# curl -X POST -H 'Content-Type: application/json;charset=utf-8' --cert ./client.crt --cacert ./ca.crt --key ./client.key https://47.110.197.238:6443/api/v1/namespaces/default/pods -d@app.json

Pre-selection

Pre-selection is the first phase of Kubernetes scheduling. It filters out the nodes that do not meet the conditions according to predefined rules. The pre-selection rules implemented by Kubernetes vary considerably between versions, but the general trend is that the set of rules keeps growing.

The two common pre-selection rules are PodFitsResourcesPred and PodFitsHostPortsPred. The former rule determines whether the remaining resources on a node meet the pod requirements. The latter rule checks whether a port on a node has been used by other pods.
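The two rules can be sketched as simple checks in Go. This is an illustration of the idea only, not the scheduler's code: the Resources, NodeState, and PodSpec types and the fitsResources and fitsHostPorts functions are hypothetical, and the sample pod is assumed to request a host port, which the test pod in this article does not do.

```go
package main

import "fmt"

// Resources is a simplified resource vector (hypothetical type).
type Resources struct {
	MilliCPU int64
	MemoryMB int64
}

// NodeState describes what a node offers and what is already taken.
type NodeState struct {
	Name        string
	Allocatable Resources
	Requested   Resources    // sum of the requests of pods already on the node
	UsedPorts   map[int]bool // host ports already claimed by other pods
}

// PodSpec holds only the fields these two rules look at.
type PodSpec struct {
	Request   Resources
	HostPorts []int
}

// fitsResources mirrors the idea of PodFitsResourcesPred: the remaining
// resources on the node must cover what the pod requests.
func fitsResources(pod PodSpec, n NodeState) bool {
	return n.Requested.MilliCPU+pod.Request.MilliCPU <= n.Allocatable.MilliCPU &&
		n.Requested.MemoryMB+pod.Request.MemoryMB <= n.Allocatable.MemoryMB
}

// fitsHostPorts mirrors the idea of PodFitsHostPortsPred: none of the host
// ports the pod wants may already be in use on the node.
func fitsHostPorts(pod PodSpec, n NodeState) bool {
	for _, p := range pod.HostPorts {
		if n.UsedPorts[p] {
			return false
		}
	}
	return true
}

func main() {
	// Assumption: this sample pod asks for host port 2580.
	pod := PodSpec{Request: Resources{MilliCPU: 500, MemoryMB: 256}, HostPorts: []int{2580}}
	nodes := []NodeState{
		{Name: "node-a", Allocatable: Resources{2000, 4096}, Requested: Resources{1800, 1024}},
		{Name: "node-b", Allocatable: Resources{2000, 4096}, Requested: Resources{500, 1024}, UsedPorts: map[int]bool{2580: true}},
		{Name: "node-c", Allocatable: Resources{2000, 4096}, Requested: Resources{500, 1024}},
	}
	for _, n := range nodes {
		fmt.Printf("%s: resources=%v hostPorts=%v\n", n.Name, fitsResources(pod, n), fitsHostPorts(pod, n))
	}
}
```

Here node-a fails the resource check, node-b fails the port check, and only node-c passes both rules.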

The following figure shows the pre-selection rule logs output by the scheduling algorithm when it processes the test pod. The logs record the execution of the CheckVolumeBindingPred pre-selection rule. Persistent volumes (PVs) of some types attach only to one node. This rule filters out the nodes that do not meet the pod requirements on PVs.

The app orchestration file indicates that the pod has no requirement on PVs. Therefore, this condition does not filter out any nodes.

Preference

Preference is the second phase of the scheduling algorithm, where kube-scheduler scores the remaining nodes based on the available resources and other rules of the nodes.

Currently, CPU and memory are the two main resources evaluated by the scheduling algorithm, but the evaluation is not simply a matter of “the more CPU and memory remaining, the higher the score.”

The logs record two calculation methods: LeastResourceAllocation and BalancedResourceAllocation. The former calculates, for each node, the ratio of the CPU and memory that would remain after the pod is scheduled to the node to the node’s total CPU and memory. The higher the ratio, the higher the score.

The latter calculates the absolute value of the difference between the CPU usage ratio and the memory usage ratio on the node. The larger the absolute value, the lower the score.

The former method tends to select the nodes with lower resource usage, and the latter method selects the nodes with close resource usage. The two methods are slightly contradictory and ultimately balanced by a certain weight.
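The two calculations can be sketched in Go as follows. This is a sketch under stated assumptions rather than the scheduler's actual code: it assumes each item is scored on a scale of 0 to 10, the function names and the sample figures are hypothetical, and the exact formulas differ between Kubernetes versions.

```go
package main

import (
	"fmt"
	"math"
)

// maxScore is the assumed upper bound of a single score item.
const maxScore = 10

// leastRequested scores one resource: the larger the share that remains
// after the pod is placed, the higher the score.
func leastRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * maxScore / capacity
}

// leastRequestedScore averages the CPU and memory scores.
func leastRequestedScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (leastRequested(cpuReq, cpuCap) + leastRequested(memReq, memCap)) / 2
}

// balancedScore penalizes the gap between the CPU and memory usage ratios:
// the larger the absolute difference, the lower the score.
func balancedScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
	cpuFraction := float64(cpuReq) / float64(cpuCap)
	memFraction := float64(memReq) / float64(memCap)
	if cpuFraction >= 1 || memFraction >= 1 {
		return 0
	}
	diff := math.Abs(cpuFraction - memFraction)
	return int64((1 - diff) * maxScore)
}

func main() {
	// Requested values include the pod being scheduled (made-up figures).
	// Node 1: 1000m of 4000m CPU and 2048 MB of 8192 MB memory requested.
	fmt.Println(leastRequestedScore(1000, 4000, 2048, 8192), balancedScore(1000, 4000, 2048, 8192))
	// Node 2: 3000m of 4000m CPU and 2048 MB of 8192 MB memory requested.
	fmt.Println(leastRequestedScore(3000, 4000, 2048, 8192), balancedScore(3000, 4000, 2048, 8192))
}
```

With these figures the first node scores 7 and 10, and the second scores 4 and 5: it has less CPU left, and its CPU and memory usage ratios are further apart.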

In addition to resources, the preference algorithm also considers other factors, such as the affinity between pods and nodes, or the dispersion degree of multiple pods on different nodes when a service consists of multiple identical pods, which is a strategy to ensure high availability.

Scores

Finally, the scheduling algorithm multiplies all score items by their weight and then sums the result to get the final score for each node. The test cluster uses the default scheduling algorithm that sets the weight to 1 for the score items in the logs. Therefore, if the scores are calculated based on the score items recorded in the logs, the final scores of the three nodes are 29, 28, and 29.

The scores in the log output are different from those calculated here because the log does not contain all score items. The missing policy may be NodePreferAvoidPodsPriority. The policy has a weight of 10000 and each node has a score of 10. In this case, the final log output is obtained.
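The weighted sum and the hypothesis about the missing item can be checked with a few lines of Go. The scoreItem type and finalScore function are hypothetical names for this sketch. The subtotals 29, 28, and 29, the weight of 10000, and the score of 10 are the figures given above.

```go
package main

import "fmt"

// scoreItem is one scoring rule's result for a node (hypothetical type).
type scoreItem struct {
	Name   string
	Score  int
	Weight int
}

// finalScore multiplies every score by its weight and sums the results.
func finalScore(items []scoreItem) int {
	total := 0
	for _, it := range items {
		total += it.Score * it.Weight
	}
	return total
}

func main() {
	// Sums of the score items recorded in the logs, each with weight 1.
	loggedSubtotals := []int{29, 28, 29}
	for i, subtotal := range loggedSubtotals {
		total := finalScore([]scoreItem{
			{"items recorded in the logs", subtotal, 1},
			// Hypothesis from the article: the item missing from the logs.
			{"NodePreferAvoidPodsPriority", 10, 10000},
		})
		fmt.Printf("node %d: %d\n", i+1, total)
	}
}
```

Under that hypothesis the three nodes end up with 100029, 100028, and 100029: the extra item adds the same 100000 to every node, so it does not change their ranking.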

Conclusion

This article uses a simple containerized web program as an example to analyze how the API server of a Kubernetes cluster authenticates a client and how the scheduler allocates container applications to appropriate nodes.

During the analysis, the article sets aside convenient tools such as kubectl and the console and instead relies on small low-level experiments, such as taking the KubeConfig file apart and reading scheduler logs, to examine how authentication and the scheduling algorithm work. We hope this helps you better understand Kubernetes clusters.

Original Source:

From Confused to Proficient: Kubernetes Authentication and Scheduling
