Log Platform Solution in the Cloud Native Architecture
By Ford, an FDD architect with 12 years of software development experience. He is mainly responsible for cloud native architecture design with a focus on infrastructure, and is a Service Mesh advocate and a practitioner of continuous delivery and agile development.
Characteristics of the Log System in the Cloud Native Architecture
In recent years, many businesses have experienced rapid growth that puts increasing pressure on traditional software architectures, and many teams have responded by moving to microservices. As a result, the number of online applications has multiplied after horizontal and vertical scaling. The log query and analysis methods used for traditional monolithic applications, such as tail, grep, and awk, can no longer meet these requirements, nor can they cope with the huge increase in application logs and the complex runtime environment of distributed systems in a cloud native architecture.
During the transformation to a cloud native architecture, observability has become an enterprise-level concern for quickly locating and diagnosing faults in complex, dynamic cluster environments. Logs are particularly important as one of the three pillars of observability, alongside metrics and traces. A log system is no longer limited to diagnosing application problems; it now also serves business, operations, BI, audit, and security scenarios. The ultimate goal of the log platform is the digitization and intelligent analysis of every aspect of the cloud native architecture.
Log solutions for cloud native architectures differ considerably from those for physical machines and virtual machines. For example:
- Dynamic log environment: In a Kubernetes cluster, automatic scaling of applications, the destruction and drift of pods, and the enabling and disabling of worker nodes are all normal. In these cases logs are transient and are destroyed together with their pods, so log data must be collected into centralized storage in real time. This also places new requirements on the scalability and adaptability of the log collector in complex, dynamic environments.
- Resource consumption: In the traditional ELK architecture, the JVM-based Logstash and the Go-based Filebeat typically consume about 500 MB and 12 MB of memory respectively. In microservice and cloud native architectures, services are split into many small pieces, so log collection should consume as few resources of the services as possible.
- O&M cost of the log platform: Operating and maintaining log collection and management platforms across a set of dynamic environments is complex and cumbersome. The log platform should therefore be offered in a SaaS-like manner on top of the underlying infrastructure, with one-click deployment and dynamic adaptation.
- Convenient log analysis: The core function of a log system is troubleshooting; how quickly a problem can be located directly determines incident response time and the resulting losses. Visual, high-performance, and intelligent analysis functions help users locate problems quickly.
Log System Design in the Cloud Native Architecture
Solution Selection
Log collection solutions in the cloud native architecture
Based on the advantages and disadvantages above, Solution 3 is chosen because it best balances scalability, resource consumption, deployment, and maintenance.
The following figures show the architecture of each solution.
Solution Implementation and Verification
Solution Description
When the cluster starts, a Fluent-bit agent is started on each node in DaemonSet mode to collect logs and ship them to Elasticsearch. Each agent mounts the directory /var/log/containers/ and uses the Fluent-bit tail input plug-in to scan the log files of each container, sending the log records directly to Elasticsearch.
The logs under /var/log/containers/ are mapped from the container logs on each Kubernetes node, as shown in the following figures:
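On a Docker-based node, the files under /var/log/containers/ are symbolic links that ultimately point at the JSON log files written by the Docker engine. The chain roughly looks as follows (the names are illustrative and the exact layout varies with the Kubernetes and Docker versions); this is also why the DaemonSet below mounts /var/lib/docker/containers read-only:

/var/log/containers/<pod>_<namespace>_<container>-<container-id>.log
  -> /var/log/pods/<namespace>_<pod>_<pod-uid>/<container>/0.log
  -> /var/lib/docker/containers/<container-id>/<container-id>-json.log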
Monitoring of Fluent-bit and Input configuration
@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE output-elasticsearch.conf

input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
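The input-kubernetes.conf above is only one key of the fluent-bit-config ConfigMap referenced by the DaemonSet below. The [SERVICE] section that precedes the @INCLUDE lines, as well as the included filter and output files, are omitted here; a minimal sketch that follows the official Fluent-bit Kubernetes examples looks roughly as follows (the Elasticsearch host and port are taken from the environment variables defined in the DaemonSet):

    [SERVICE]
        # Flush records every second and expose the built-in HTTP server,
        # so Prometheus can scrape /api/v1/metrics/prometheus on port 2020.
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

filter-kubernetes.conf: |
    [FILTER]
        # Enrich each record with pod, namespace, and container metadata.
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Merge_Log           On
        K8S-Logging.Parser  On

output-elasticsearch.conf: |
    [OUTPUT]
        # Ship all records to Elasticsearch using Logstash-style index names.
        Name            es
        Match           *
        Host            ${FLUENT_ELASTICSEARCH_HOST}
        Port            ${FLUENT_ELASTICSEARCH_PORT}
        Logstash_Format On
        Retry_Limit     False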
The collection agent is deployed through the Kubernetes cluster itself: when nodes are scaled out, Fluent-bit agents are automatically scheduled onto the new nodes by Kubernetes.
The Elasticsearch and Kibana services are currently provided by a cloud vendor. They ship with the X-Pack plug-in and support permission management, a feature that is only available in the commercial edition.
Implementation
1. Configure the Fluent-bit collector, including the service, input, filter, and output sections (see the configuration shown above).
2. Create the RBAC permissions for Fluent-bit in the Kubernetes cluster.
- fluent-bit-service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
- fluent-bit-role.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
- fluent-bit-role-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: logging
3. Deploy Fluent-bit to the Kubernetes cluster nodes in DaemonSet mode.
- fluent-bit-ds.yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    k8s-app: fluent-bit-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
        version: v1
        kubernetes.io/cluster-service: "true"
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "2020"
        prometheus.io/path: /api/v1/metrics/prometheus
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.2.1
        imagePullPolicy: Always
        ports:
        - containerPort: 2020
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
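Once these manifests are applied, one Fluent-bit pod is expected on every node. A minimal sketch of the commands, assuming the file names used in this article (fluent-bit-configmap.yaml for the ConfigMap of step 1 is an assumed name):

# Create the ConfigMap, the RBAC objects, and the DaemonSet
kubectl apply -f fluent-bit-configmap.yaml
kubectl apply -f fluent-bit-service-account.yaml -f fluent-bit-role.yaml -f fluent-bit-role-binding.yaml
kubectl apply -f fluent-bit-ds.yaml

# One fluent-bit pod per node should appear in the logging namespace
kubectl get pods -n logging -o wide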
Log Platform Verification
- Simple Query
- Word Segmentation Query
- Precise Query
- Compound Query (AND, OR)
- Field-based Query
- Project Filtering
- Machine and Node Filtering
- Regular Expression Query
- Range (Interval) Query
- Context Query
- Customized Display of Log List
- Time Selection
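Most of the query types listed above can be expressed directly in the Kibana search bar. A few illustrative query strings, assuming the default Lucene query syntax and the field names (log, kubernetes.namespace_name, kubernetes.pod_name) produced by the docker parser and the kubernetes filter; response_code is an illustrative field name:

Simple query:            error
Precise (phrase) query:  "connection refused"
Compound / field query:  log:error AND kubernetes.namespace_name:logging
Wildcard query:          kubernetes.pod_name:fluent-bit-*
Regular expression:      log:/user-[0-9]+/
Range query:             response_code:[500 TO 599]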
Collection of Cluster Audit Logs
In this solution, Fluent-bit also collects the audit logs of the Kubernetes cluster, that is, the events that kube-apiserver generates for state-changing operations. The following kubernetes-audit-policy.yaml defines which audit events are recorded; for it to take effect, the kube-apiserver startup configuration must reference this file through the --audit-policy-file flag.
- kubernetes-audit-policy.yaml
apiVersion: audit.k8s.io/v1 # This is required.
kind: Policy
# Don't generate audit events for all requests in RequestReceived stage.
omitStages:
  - "RequestReceived"
rules:
  # Log pod changes at RequestResponse level
  - level: RequestResponse
    resources:
    - group: ""
      # Resource "pods" doesn't match requests to any subresource of pods,
      # which is consistent with the RBAC policy.
      resources: ["pods"]

  # Log "pods/log", "pods/status" at Metadata level
  - level: Metadata
    resources:
    - group: ""
      resources: ["pods/log", "pods/status"]

  # Don't log requests to a configmap called "controller-leader"
  - level: None
    resources:
    - group: ""
      resources: ["configmaps"]
      resourceNames: ["controller-leader"]

  # Don't log watch requests by the "system:kube-proxy" on endpoints or services
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
    resources:
    - group: "" # core API group
      resources: ["endpoints", "services"]

  # Don't log authenticated requests to certain non-resource URL paths.
  - level: None
    userGroups: ["system:authenticated"]
    nonResourceURLs:
    - "/api*" # Wildcard matching.
    - "/version"

  # Log the request body of configmap changes in kube-system.
  - level: Request
    resources:
    - group: "" # core API group
      resources: ["configmaps"]
    # This rule only applies to resources in the "kube-system" namespace.
    # The empty string "" can be used to select non-namespaced resources.
    namespaces: ["kube-system"]

  # Log configmap and secret changes in all other namespaces at the Metadata level.
  - level: Metadata
    resources:
    - group: "" # core API group
      resources: ["secrets", "configmaps"]

  # Log all other resources in core and extensions at the Request level.
  - level: Request
    resources:
    - group: "" # core API group
    - group: "extensions" # Version of group should NOT be included.

  # A catch-all rule to log all other requests at the Metadata level.
  - level: Metadata
    # Long-running requests like watches that fall under this rule will not
    # generate an audit event in RequestReceived.
    omitStages:
      - "RequestReceived"
Summary
As distributed systems in the cloud native architecture grow more complex, logs become scattered, which makes application monitoring and troubleshooting difficult and inefficient. The centralized log platform for the Kubernetes cluster described in this article aims to solve these problems: the collection, retrieval, and analysis of cluster logs, application logs, and security logs, together with web-based management, are handled centrally by the platform. It enables quick troubleshooting and becomes an important means of solving problems efficiently.
In production deployments, whether to introduce a Kafka queue can be decided based on the capacity of the business system. In an offline or development environment a simple deployment without Kafka is sufficient, and a Kafka queue can be introduced later when the business system needs to scale out.
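If a Kafka buffer is introduced, one option is to switch the Fluent-bit output from Elasticsearch to the kafka output plug-in and let a downstream consumer write to Elasticsearch. A minimal sketch of such an output section, where the broker address and topic name are assumptions:

[OUTPUT]
    # Illustrative broker address and topic name
    Name     kafka
    Match    kube.*
    Brokers  kafka:9092
    Topics   k8s-logs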
The Elasticsearch and Kibana services in this article are provided by cloud vendors. For cost-saving reasons, Helm can be used to quickly build an offline development environment, for example:
- Use Helm to quickly deploy Elasticsearch.
helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com/
helm install --name elasticsearch stable/elasticsearch \
--set master.persistence.enabled=false \
--set data.persistence.enabled=false \
--namespace logging
- Use Helm to quickly deploy Kibana.
helm install --name kibana stable/kibana \
--set env.ELASTICSEARCH_URL=http://elasticsearch-client:9200 \
--namespace logging
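When this Helm-deployed cluster is used as the backend, the environment variables in the Fluent-bit DaemonSet should point at the client Service created by the chart (the service name elasticsearch-client matches the ELASTICSEARCH_URL used for Kibana above):

        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch-client"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"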
References
- https://github.com/fluent/fluentd
- https://kubernetes.io/docs/tasks/debug-application-cluster/audit/
- https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/fluentd-elasticsearch
- https://github.com/fluent/fluent-bit
- https://github.com/anjia0532/gcr.io_mirror