Log Platform Solution in the Cloud Native Architecture
By Ford, an FDD architect with 12 years of software development experience. He is mainly responsible for cloud native architecture design with a focus on infrastructure, and is a Service Mesh advocate and a practitioner of continuous delivery and agile development.
Characteristics of the Log System in the Cloud Native Architecture
In recent years, many businesses have experienced rapid growth that puts increasing pressure on traditional software architectures, and many teams have responded by moving to microservices. As a result, the number of online applications has multiplied after horizontal and vertical scaling. The log query and analysis methods used for traditional monolithic applications, such as tail, grep, and awk, can no longer meet these requirements, nor can they cope with the huge increase in application logs and the complex runtime environment of distributed systems in a cloud native architecture.
During the transformation to a cloud native architecture, observability has become an enterprise-level concern for quickly locating and diagnosing faults in complex, dynamic cluster environments. Logs are particularly important as one of the three pillars of observability, alongside metrics and traces. A log system is no longer limited to diagnosing application problems; it now also serves business, operations, BI, audit, and security scenarios. The ultimate goal of the log platform is the digitization and intelligent analysis of every aspect of the cloud native architecture.
Log solutions for cloud native architectures differ considerably from those for physical machines and virtual machines. For example:
- Dynamic log environment: In a Kubernetes cluster, automatic scaling of applications, the destruction and drift of pods, and the enabling and disabling of worker nodes are all normal. In these cases logs are transient and are destroyed together with their pods, so log data must be collected into centralized storage in real time. This also places new requirements on the scalability and adaptability of the log collector in complex, dynamic environments.
- Resource consumption: In the traditional ELK architecture, the JVM-based Logstash and the Go-based Filebeat typically consume about 500 MB and 12 MB of memory respectively. In microservice and cloud native architectures, services are split into many small pieces, so log collection should consume as few resources of the services as possible.
- O&M cost of the log platform: Operating and maintaining log collection and management platforms across a set of dynamic environments is complex and cumbersome. The log platform should therefore be offered in a SaaS-like manner on top of the underlying infrastructure, with one-click deployment and dynamic adaptation.
- Convenient log analysis: The core function of a log system is troubleshooting; how quickly a problem can be located directly determines incident response time and the resulting losses. Visual, high-performance, and intelligent analysis functions help users locate problems quickly.
Log System Design in the Cloud Native Architecture
Solution Selection
Log collection solutions in the cloud native architecture
Based on the advantages and disadvantages above, Solution 3 is chosen because it best balances scalability, resource consumption, deployment, and maintenance.
The following figures show the architecture of each solution.
Solution Implementation and Verification
Solution Description
When the cluster starts, a Fluent-bit agent is started on each node in DaemonSet mode to collect logs and ship them to Elasticsearch. Each agent mounts the directory /var/log/containers/ and uses the Fluent-bit tail input plug-in to scan the log files of each container, sending the log records directly to Elasticsearch.
The logs under /var/log/containers/ are mapped from the container logs on each Kubernetes node, as shown in the following figures:
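On a Docker-based node, the files under /var/log/containers/ are symbolic links that ultimately point at the JSON log files written by the Docker engine. The chain roughly looks as follows (the names are illustrative and the exact layout varies with the Kubernetes and Docker versions); this is also why the DaemonSet below mounts /var/lib/docker/containers read-only:

/var/log/containers/<pod>_<namespace>_<container>-<container-id>.log
  -> /var/log/pods/<namespace>_<pod>_<pod-uid>/<container>/0.log
  -> /var/lib/docker/containers/<container-id>/<container-id>-json.log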
Monitoring of Fluent-bit and Input configuration
@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE output-elasticsearch.conf

input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
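The input-kubernetes.conf above is only one key of the fluent-bit-config ConfigMap referenced by the DaemonSet below. The [SERVICE] section that precedes the @INCLUDE lines, as well as the included filter and output files, are omitted here; a minimal sketch that follows the official Fluent-bit Kubernetes examples looks roughly as follows (the Elasticsearch host and port are taken from the environment variables defined in the DaemonSet):

    [SERVICE]
        # Flush records every second and expose the built-in HTTP server,
        # so Prometheus can scrape /api/v1/metrics/prometheus on port 2020.
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

filter-kubernetes.conf: |
    [FILTER]
        # Enrich each record with pod, namespace, and container metadata.
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Merge_Log           On
        K8S-Logging.Parser  On

output-elasticsearch.conf: |
    [OUTPUT]
        # Ship all records to Elasticsearch using Logstash-style index names.
        Name            es
        Match           *
        Host            ${FLUENT_ELASTICSEARCH_HOST}
        Port            ${FLUENT_ELASTICSEARCH_PORT}
        Logstash_Format On
        Retry_Limit     False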
The collection agent is deployed through the Kubernetes cluster itself: when nodes are scaled out, Fluent-bit agents are automatically scheduled onto the new nodes by Kubernetes.
The Elasticsearch and Kibana services are currently provided by a cloud vendor. They ship with the X-Pack plug-in and support permission management, a feature that is only available in the commercial edition.
Implementation
1. Configure the Fluent-bit collector, including the service, input, filter, and output sections (see the configuration shown above).
2. Create the RBAC permissions for Fluent-bit in the Kubernetes cluster.
- fluent-bit-service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
- fluent-bit-role.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
- fluent-bit-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging
3. Deploy Fluent-bit to the Kubernetes cluster nodes in DaemonSet mode.
- fluent-bit-ds.yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.2.1
        imagePullPolicy: Always
        ports:
        - containerPort: 2020
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
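Once these manifests are applied, one Fluent-bit pod is expected on every node. A minimal sketch of the commands, assuming the file names used in this article (fluent-bit-configmap.yaml for the ConfigMap of step 1 is an assumed name):

# Create the ConfigMap, the RBAC objects, and the DaemonSet
kubectl apply -f fluent-bit-configmap.yaml
kubectl apply -f fluent-bit-service-account.yaml -f fluent-bit-role.yaml -f fluent-bit-role-binding.yaml
kubectl apply -f fluent-bit-ds.yaml

# One fluent-bit pod per node should appear in the logging namespace
kubectl get pods -n logging -o wide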
Log Platform Verification
- Simple Query
- Word Segmentation Query
- Precise Query
- Compound Query (AND, OR)
- Field-based Query
- Project Filtering
- Machine and Node Filtering
- Regular Expression Query
- Range (Interval) Query
- Context Query
- Customized Display of Log List
- Time Selection
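Most of the query types listed above can be expressed directly in the Kibana search bar. A few illustrative query strings, assuming the default Lucene query syntax and the field names (log, kubernetes.namespace_name, kubernetes.pod_name) produced by the docker parser and the kubernetes filter; response_code is an illustrative field name:

Simple query:            error
Precise (phrase) query:  "connection refused"
Compound / field query:  log:error AND kubernetes.namespace_name:logging
Wildcard query:          kubernetes.pod_name:fluent-bit-*
Regular expression:      log:/user-[0-9]+/
Range query:             response_code:[500 TO 599]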
Collection of Cluster Audit Logs
In this solution, Fluent-bit also collects the audit logs of the Kubernetes cluster, that is, the events that kube-apiserver generates for state-changing operations. The following kubernetes-audit-policy.yaml defines which audit events are recorded; for it to take effect, the kube-apiserver startup configuration must reference this file through the --audit-policy-file flag.
- kubernetes-audit-policy.yaml
apiVersion: audit.k8s.io/v1 # This is required.
kind: Policy
# Don't generate audit events for all requests in RequestReceived stage.
omitStages:
  - "RequestReceived"
rules:
  # Log pod changes at RequestResponse level
  - level: RequestResponse
    resources:
    - group: ""
      # Resource "pods" doesn't match requests to any subresource of pods,
      # which is consistent with the RBAC policy.
      resources: ["pods"]

  # Log "pods/log", "pods/status" at Metadata level
  - level: Metadata
    resources:
    - group: ""
      resources: ["pods/log", "pods/status"]

  # Don't log requests to a configmap called "controller-leader"
  - level: None
    resources:
    - group: ""
      resources: ["configmaps"]
      resourceNames: ["controller-leader"]

  # Don't log watch requests by the "system:kube-proxy" on endpoints or services
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
    resources:
    - group: "" # core API group
      resources: ["endpoints", "services"]

  # Don't log authenticated requests to certain non-resource URL paths.
  - level: None
    userGroups: ["system:authenticated"]
    nonResourceURLs:
    - "/api*" # Wildcard matching.
    - "/version"

  # Log the request body of configmap changes in kube-system.
  - level: Request
    resources:
    - group: "" # core API group
      resources: ["configmaps"]
    # This rule only applies to resources in the "kube-system" namespace.
    # The empty string "" can be used to select non-namespaced resources.
    namespaces: ["kube-system"]

  # Log configmap and secret changes in all other namespaces at the Metadata level.
  - level: Metadata
    resources:
    - group: "" # core API group
      resources: ["secrets", "configmaps"]

  # Log all other resources in core and extensions at the Request level.
  - level: Request
    resources:
    - group: "" # core API group
    - group: "extensions" # Version of group should NOT be included.

  # A catch-all rule to log all other requests at the Metadata level.
  - level: Metadata
    # Long-running requests like watches that fall under this rule will not
    # generate an audit event in RequestReceived.
    omitStages:
      - "RequestReceived"
Summary
As distributed systems in the cloud native architecture grow more complex, logs become scattered, which makes application monitoring and troubleshooting difficult and inefficient. The centralized log platform for the Kubernetes cluster described in this article aims to solve these problems: the collection, retrieval, and analysis of cluster logs, application logs, and security logs, together with web-based management, are handled centrally by the platform. It enables quick troubleshooting and becomes an important means of solving problems efficiently.
In production deployments, whether to introduce a Kafka queue can be decided based on the capacity of the business system. In an offline or development environment a simple deployment without Kafka is sufficient, and a Kafka queue can be introduced later when the business system needs to scale out.
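If a Kafka buffer is introduced, one option is to switch the Fluent-bit output from Elasticsearch to the kafka output plug-in and let a downstream consumer write to Elasticsearch. A minimal sketch of such an output section, where the broker address and topic name are assumptions:

[OUTPUT]
    # Illustrative broker address and topic name
    Name     kafka
    Match    kube.*
    Brokers  kafka:9092
    Topics   k8s-logs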
The Elasticsearch and Kibana services in this article are provided by cloud vendors. For cost-saving reasons, Helm can be used to quickly build an offline development environment, for example:
- Use Helm to quickly deploy Elasticsearch.
helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com/
helm install --name elasticsearch stable/elasticsearch \
--set master.persistence.enabled=false \
--set data.persistence.enabled=false \
--namespace logging
- Use Helm to quickly deploy Kibana.
helm install --name kibana stable/kibana \
--set env.ELASTICSEARCH_URL=http://elasticsearch-client:9200 \
--namespace logging
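When this Helm-deployed cluster is used as the backend, the environment variables in the Fluent-bit DaemonSet should point at the client Service created by the chart (the service name elasticsearch-client matches the ELASTICSEARCH_URL used for Kibana above):

        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch-client"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"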
References
- https://github.com/fluent/fluentd
- https://kubernetes.io/docs/tasks/debug-application-cluster/audit/
- https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/fluentd-elasticsearch
- https://github.com/fluent/fluent-bit
- https://github.com/anjia0532/gcr.io_mirror