Getting Started with Kubernetes | Observability: Monitoring and Logging

1) Background

2) Monitoring

Monitoring Types

Monitoring Evolution in Kubernetes

  • Summary API
  • Kubelet API
  • Prometheus API
  • The first reason is poor scalability. Customer requirements are ever-changing: you may collect basic data with Heapster today, but tomorrow you may want to develop a data API that exposes the number of online users within the application, add that API to your own monitoring system to display the data, and consume the data in the same way as the HPA controller does. Is this scenario possible with Heapster? The answer is no, which shows that Heapster's scalability is poor.
  • The second reason is that Heapster provides a lot of sinks for caching data, including InfluxDB, Simple Log Service (SLS), and DingTalk. These sinks are mainly responsible for collecting and caching the data. Many customers use InfluxDB to cache the data and connect InfluxDB to monitoring data visualization software, such as Grafana, to display the monitoring data.

Data Monitoring API Standards in Kubernetes

Prometheus — A Monitoring Standard for the Open-source Community

  • First, Prometheus is a graduated project of the Cloud Native Computing Foundation (CNCF). An increasing number of open-source projects now use Prometheus as their monitoring standard; common programs such as Spark, TensorFlow, and Flink each have a standard Prometheus collection API.
  • Second, common programs such as databases and middleware have corresponding Prometheus collection clients: etcd, ZooKeeper, MySQL, and PostgreSQL all have their own Prometheus APIs. If no Prometheus API is available, the community provides a corresponding exporter to implement one.
  • The first is the push model. In this model, data is collected and cached through a Pushgateway, and Prometheus then pulls the data from the Pushgateway. This method mainly addresses scenarios where a task runs only for a short time. In the pull model, which is the most common collection method in Prometheus, data may be lost if the task's lifetime is shorter than the collection interval. For example, if the collection interval is 30 seconds but the task runs for only 15 seconds, the scrape may miss the task's metrics entirely. The simplest solution is to have the task push its metrics to a Pushgateway, and then have Prometheus pull the data from the Pushgateway. In this way, metrics from transient tasks are preserved even after the tasks have exited.
  • The second is the standard pull model. In this model, data is directly pulled from corresponding tasks.
  • The third is the Prometheus on the Prometheus model. In this model, data is synchronized from one Prometheus instance to another Prometheus instance.
  • A simple and powerful access standard. To collect data, developers only need to implement the Prometheus client as the API standard.
  • Various data collection and caching methods. Prometheus allows you to collect and cache data in pushing, pulling, and Prometheus on Prometheus modes.
  • Compatible with Kubernetes.
  • Extensive plug-in mechanisms and ecosystems.
  • The Prometheus operator. The Prometheus operator is probably the most complex operator we have seen, but it is also the operator that best exploits the scalability of Prometheus. If you are using Prometheus on Kubernetes, we recommend that you use the Prometheus operator to deploy and maintain it.
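To make the simplicity of the access standard concrete, the sketch below renders metrics in the Prometheus text exposition format by hand. This is a minimal illustration only: real applications would normally use an official client library (such as prometheus_client), and the metric names here are invented for the example.

```python
# Minimal sketch of the Prometheus text exposition format.
# Real applications normally use an official client library
# (e.g. prometheus_client); the metric names below are invented
# for illustration only.

def render_metrics(metrics):
    """Render (name, type, help, value, labels) records as
    Prometheus text exposition format."""
    lines = []
    for name, mtype, help_text, value, labels in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    body = render_metrics([
        ("app_online_users", "gauge",
         "Current number of online users.", 42, {}),
        ("app_http_requests_total", "counter",
         "Total HTTP requests served.", 1024,
         {"method": "GET", "code": "200"}),
    ])
    print(body)
```

A scrape target simply serves this text over HTTP on a path such as /metrics; Prometheus pulls it on each collection cycle, which is the standard pull model described above.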
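For the push model described above, a short-lived job can push its metrics to a Pushgateway before it exits; the Pushgateway accepts the same text format via an HTTP PUT to /metrics/job/&lt;job&gt;. A minimal sketch follows, with an illustrative address (localhost:9091) and job name:

```python
# Sketch of pushing metrics from a short-lived job to a Pushgateway.
# The Pushgateway address and job/instance names are illustrative.
import urllib.request

def pushgateway_url(base, job, instance=None):
    """Build the Pushgateway endpoint /metrics/job/<job>[/instance/<inst>]."""
    url = f"{base}/metrics/job/{job}"
    if instance:
        url += f"/instance/{instance}"
    return url

def push_metrics(base, job, body, instance=None):
    """PUT a text-format metrics payload to the Pushgateway."""
    req = urllib.request.Request(
        pushgateway_url(base, job, instance),
        data=body.encode(),
        method="PUT",
        headers={"Content-Type": "text/plain"},
    )
    # Prometheus later scrapes the Pushgateway on its normal pull cycle,
    # so the metrics survive even though the job itself has exited.
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    body = "batch_job_duration_seconds 15\n"
    # Requires a running Pushgateway:
    # push_metrics("http://localhost:9091", "nightly-batch", body)
    print(pushgateway_url("http://localhost:9091", "nightly-batch"))
```

This is how the 15-second task in the example above avoids losing its metrics: it pushes them once at exit, and the 30-second scrape picks them up from the Pushgateway instead of from the task itself.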

Kube-eventer — Kubernetes Event Caching Tool

3) Logging

Logging Scenarios

  • Kernel logs of the host help developers locate and diagnose common issues. The first common issue is network stack exceptions. When such an exception occurs, messages about the conntrack table or iptables may appear in the kernel logs.
  • The second common issue is driver exceptions. They are common in some network solutions and in GPU-related scenarios.
  • The third common issue is file system exceptions. In the early days, when Docker was still immature, these exceptions frequently occurred in the overlay file system (OverlayFS) or the advanced multi-layered unification filesystem (AUFS). At that time, developers had no way to monitor and diagnose these problems. Today, such exceptions can be spotted directly in the kernel logs of the host.
  • The fourth common issue is node exceptions. For example, kernel panics and out-of-memory (OOM) events in the kernel are also reflected in the kernel logs of the host.
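Each of the node problems above leaves a recognizable signature in the kernel log. As a toy illustration of spotting them (the patterns and sample lines below are illustrative, not an exhaustive or authoritative list), a collector might flag matching lines like this:

```python
# Toy sketch: flag kernel-log lines that hint at the node problems
# described above. Patterns and sample lines are illustrative only.
import re

KERNEL_LOG_PATTERNS = {
    "conntrack table full": re.compile(r"nf_conntrack: table full"),
    "OOM kill":             re.compile(r"Out of memory: Kill(ed)? process"),
    "kernel panic":         re.compile(r"Kernel panic"),
    "filesystem error":     re.compile(r"(EXT4-fs error|overlayfs:.*error)", re.I),
}

def classify_kernel_lines(lines):
    """Return (issue, line) pairs for lines matching a known signature."""
    hits = []
    for line in lines:
        for issue, pattern in KERNEL_LOG_PATTERNS.items():
            if pattern.search(line):
                hits.append((issue, line))
    return hits

if __name__ == "__main__":
    sample = [
        "nf_conntrack: table full, dropping packet",
        "Out of memory: Killed process 1234 (java)",
        "systemd[1]: Started Docker.",
    ]
    for issue, line in classify_kernel_lines(sample):
        print(issue, "->", line)
```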

Log Collection

  • The first method is to collect logs from host files. Commonly, the container writes its log files to the host through a component such as a volume. The log files are then rotated according to the host's log rotation policy, and finally an agent on the host collects them.
  • The second method is to collect log files from within containers. How are these files commonly handled? A sidecar streaming container reads the application's log files and writes them to its own stdout, which the container runtime redirects to a corresponding log file on the host. The log files are then rotated locally, and finally an external agent collects them.
  • The third method is to write logs directly to stdout. This is a common logging policy. The remote end collects the logs either through an agent on the node or by calling standard APIs, such as some serverless APIs.
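The second method above boils down to a tiny streaming loop: the sidecar tails the application's log file on the shared volume and copies new lines to its own stdout, where the runtime and the node agent take over. A minimal sketch, with an illustrative file path:

```python
# Minimal sketch of the sidecar streaming pattern: tail an application
# log file from a shared volume and copy new lines to stdout, where the
# container runtime captures them for the node-level agent.
import sys
import time

def tail_to_stdout(path, poll_interval=1.0, max_polls=None):
    """Follow `path` and forward appended lines to stdout.

    `max_polls` bounds the wait-on-empty iterations so the sketch can
    terminate; a real sidecar would run forever (max_polls=None).
    """
    polls = 0
    with open(path, "r") as f:
        while max_polls is None or polls < max_polls:
            line = f.readline()
            if line:
                # The runtime redirects this stdout to the pod's log file.
                sys.stdout.write(line)
            else:
                time.sleep(poll_interval)
                polls += 1

if __name__ == "__main__":
    import os
    import tempfile
    # Stand-in for the shared-volume file, e.g. /var/log/app/app.log
    # (path illustrative); in a pod the app container writes it and
    # this sidecar reads it.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("app started\n")
    tail_to_stdout(path, poll_interval=0.01, max_polls=1)
    os.remove(path)
```

Note the design trade-off the bullets describe: this pattern keeps the application writing plain files while still funneling everything through the stdout pipeline that node agents already collect.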

4) Summary

Monitoring System in Alibaba Cloud Container Service

Features Enhanced by Alibaba Cloud

Logging System in Alibaba Cloud Container Service

