Cloud-Native Prometheus Solution: High Performance, High Availability, and Zero O&M

By Yuanyi

Alibaba Cloud Log Service (SLS) aims to be a DevOps data mid-end: a shared data platform that provides rich capabilities, including host data access, storage, analysis, and visualization. This article describes how SLS supports Prometheus to provide a cloud-native Prometheus engine that features high performance, high availability, and zero O&M.

Prometheus — De-facto Standard for Cloud-Native Monitoring

Prometheus was the second project to graduate from CNCF, after Kubernetes, and has become the most popular CNCF project apart from Kubernetes. It is no exaggeration to call Prometheus the de facto standard for cloud-native monitoring. If building a Kubernetes environment is the first step toward cloud native, then adopting Prometheus is the first step toward cloud-native monitoring.

After you deploy apps in Kubernetes, you need to check the running status of the cluster and the apps. However, some monitoring methods used in virtual machine (VM) environments are no longer applicable. Although there are several alternatives, Prometheus is the best choice for many applications because of the following advantages:

  1. Prometheus is easy to deploy, especially in a Kubernetes environment. After Prometheus Operator is deployed, you need only a few YAML files to configure Prometheus and its monitoring items. Grafana and its rich set of Prometheus dashboard templates then give you an overview of the monitored products.
  2. Prometheus has a wealth of service discovery mechanisms. In particular, it can collect metrics from Kubernetes pods that declare a single simple annotation.
  3. Prometheus exporters cover almost all open-source software systems and are supported by many commercial products, such as Alibaba Cloud CloudMonitor, which provides a Prometheus Exporter module.
  4. Prometheus provides software development kits (SDKs) for almost all languages so that you can expose metrics from an app. These SDKs are elegantly designed and make exposing metrics convenient.
  5. Prometheus is an open-source project under CNCF. Therefore, you do not need to worry that its development will stop in a few years.
  6. If you take a closer look at the Kubernetes code, you will find that all Kubernetes components expose Prometheus metrics, which makes Prometheus indispensable for monitoring Kubernetes.
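
To make the idea of "exposing metrics" concrete, here is a minimal sketch of what an SDK ultimately serves to Prometheus: plain text lines in the Prometheus text exposition format. It uses only the Python standard library rather than an official SDK, and the function names and sample metrics are hypothetical, chosen for illustration.

```python
from typing import Dict, List, Tuple

# One sample: metric name, label set, numeric value.
Sample = Tuple[str, Dict[str, str], float]


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(samples: List[Sample]) -> str:
    """Render samples as lines of the Prometheus text exposition format."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Sort labels so the output is deterministic.
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metrics, for illustration only.
    print(render_metrics([
        ("http_requests_total", {"method": "GET", "code": "200"}, 1027),
        ("process_open_fds", {}, 12),
    ]))
```

A real SDK also handles metric types, HELP and TYPE comment lines, and the HTTP endpoint. The sketch shows only the line format that a scrape returns.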

Prometheus’ Challenges in a Production Environment

  1. Memory usage: Prometheus caches all data from the last two hours in memory. As the number of pods increases, the number of metrics in the system also rises, which may eventually cause an out-of-memory (OOM) issue. In some cases, a 100-node cluster needs 64 GB of memory dedicated to Prometheus.
  2. Recovery from exceptions: Prometheus persists the data written in real time to a write-ahead log (WAL), similar to a binlog, and replays the log to recover data after a crash. However, recovery may take a long time because two hours of data must be rebuilt in memory. If Prometheus restarts because of an OOM issue, the replay can exhaust memory again, so Prometheus may restart over and over.
  3. Storage duration: This is the most frequent complaint about Prometheus. By default, Prometheus retains data for 15 days. You can adjust the startup parameters to set a longer retention period, but long-term storage may be infeasible because of the limits of a single instance.
  4. Single instance: Prometheus runs as a single instance. Data capture, storage, and calculation all happen at a single point, which makes it difficult to use Prometheus in a large-scale cluster. The community provides a variety of distributed solutions to address this issue, such as Cortex, Thanos, and M3DB.
  5. AIOps: Prometheus still works with traditional monitoring metrics. PromQL focuses on arithmetic operations and does not support time series AI algorithms, such as prediction, outlier detection, change point detection, break point detection, and multi-cycle estimation.
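
The memory pressure described in the first item can be approximated with simple arithmetic. The sketch below counts how many samples a two-hour in-memory window holds, then multiplies by per-sample and per-series costs. Those two costs are hypothetical inputs supplied by the caller, not figures from this article or from Prometheus itself.

```python
def head_samples(active_series: int, scrape_interval_s: int,
                 window_s: int = 2 * 3600) -> int:
    """Samples held in memory: one per series per scrape over the window."""
    return active_series * (window_s // scrape_interval_s)


def estimate_head_bytes(active_series: int, scrape_interval_s: int,
                        bytes_per_sample: int, bytes_per_series: int) -> int:
    """Rough footprint from caller-supplied (hypothetical) per-unit costs."""
    samples = head_samples(active_series, scrape_interval_s)
    return samples * bytes_per_sample + active_series * bytes_per_series


if __name__ == "__main__":
    # Placeholder inputs, not measurements.
    print(head_samples(1_000_000, 15))
    print(estimate_head_bytes(1_000_000, 15,
                              bytes_per_sample=2, bytes_per_series=4096))
```

The point of the sketch is the scaling: the footprint grows linearly with the number of active series, which is why adding pods pushes a single instance toward its memory limit.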

SLS and Cloud-native Technologies

[Figure: projects in the CNCF landscape that SLS supports for data access]

SLS provides a wide range of data access methods, many of which relate to cloud-native observability. The preceding figure shows the projects in the CNCF landscape that SLS supports for data access. For monitoring, logging, and tracing alike, SLS supports CNCF graduated projects such as Prometheus, Fluentd, and Jaeger. The main reasons for storing Prometheus monitoring data in SLS are as follows:

  1. SLS can store data persistently. Many users want to keep key Prometheus metrics in SLS for the long term.
  2. Many users already store their logging and tracing data in SLS and want to store Prometheus data there as well, to build an integrated observability data solution and reduce O&M workloads.
  3. SLS provides many metric-related AIOps algorithms, such as multi-cycle estimation, prediction, outlier detection, and time series classification. Users also expect to use Prometheus data more intelligently.
  4. SLS also supports data pipeline models. When Prometheus metrics are connected to downstream stream computing systems, alerting becomes faster. When they are connected to data warehouses, offline statistical analysis becomes possible.
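
As an illustration of the kind of analysis that plain PromQL arithmetic does not cover, the sketch below flags outliers in a series with a simple z-score test. This is a deliberately basic stand-in chosen for illustration, not the algorithm that SLS uses, and the function name and threshold are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def zscore_outliers(values: List[float], threshold: float = 3.0) -> List[int]:
    """Return the indexes of values more than `threshold` standard
    deviations away from the mean of the series."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series has no outliers.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


if __name__ == "__main__":
    # Twenty steady samples followed by one spike.
    series = [10.0] * 20 + [100.0]
    print(zscore_outliers(series))
```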

SLS Solutions for Prometheus

Compared with community-provided distributed extensions of Prometheus, such as Cortex, Thanos, M3DB, FiloDB, and VictoriaMetrics, the distributed implementation in SLS is closer to the community's goal of removing the limitations of native Prometheus.

  1. Compatibility: The SLS implementation reuses the Prometheus code without modification, so it can keep pace with official updates.
  2. Global view: SLS is a SaaS-based service that supports multitenancy and multiple instances. Therefore, the data of multiple clusters can be written to the same instance to provide a global view.
  3. Persistent storage: SLS supports both a TTL mechanism and persistent storage of data.
  4. High availability: Each instance contains multiple shards, and different shards are allocated to different hosts. If the hosts that serve some shards fail, overall write performance is not compromised. Each shard has three replicas on Apsara Distributed File System, which ensures the reliability of each shard.

In addition to supporting these requirements of the community, SLS can provide the following advantages for Prometheus:

  1. Larger storage: SLS is a fully cloud-based service. The storage space for each user is unlimited.
  2. Lower costs: In terms of labor cost, SLS’s Prometheus access method does not require the operation and maintenance of Prometheus instances. In terms of usage, SLS MetricStore uses a pay-as-you-go model without the need to separately purchase hosts and disks for data calculation and storage.
  3. Faster speed: The architecture of SLS separates storage from computing and makes full use of cluster capabilities, which enables faster end-to-end processing, especially for massive data.
  4. More intelligent algorithms: All of SLS’s metric-related AI algorithms can be applied to Prometheus data, such as multi-cycle estimation, prediction, outlier detection, and time series classification, to add AI power to Prometheus.
  5. More extensive ecosystems: SLS features sound connectivity with upstream and downstream ecosystems. Therefore, you can integrate Prometheus metrics with stream computing for faster alarming, with data warehouses for offline statistical analysis, and with OSS for archival storage.
  6. Better support: Observability requires full connectivity between metrics, logging data, and tracing data. SLS is committed to building a unified OpenTelemetry storage platform to act as an underlying data foundation for all kinds of intelligent data apps.

Cloud-native Kubernetes Monitoring

Before You Begin

  1. Create a MetricStore instance in SLS.

Installation on a Self-Managed Kubernetes Cluster

If your cluster is self-managed rather than Alibaba Cloud Kubernetes, see the official instructions of the Helm package for installation. Before the installation, you need to create a secret and change the default configuration. For more information, see the following description of installation on Alibaba Cloud Kubernetes.

Installation on Alibaba Cloud Kubernetes

1. Create a Secret

  • In the left-side navigation pane, select Namespace and create a namespace named monitoring.
  • In the left-side navigation pane, choose Configuration > Secrets. Select the monitoring namespace you just created. If this namespace does not appear, refresh the page.
  • Click Create to start creating a secret. Set the secret name to sls-ak, and add two key-value pairs with the keys username and password. Set their values to your Alibaba Cloud AccessKeyId and AccessKeySecret, respectively. Use a RAM user account and grant it only the SLS write permission. For more information about authorization, see Permission to write data to a specified project.
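
For clusters where you prefer a manifest over the console, the same secret can be described as a Kubernetes Secret object. The following is a minimal sketch that builds such a manifest as JSON, using the names from the steps above (sls-ak, monitoring, username, password). The function name and the placeholder credentials are hypothetical.

```python
import base64
import json
from typing import Dict


def build_sls_secret(access_key_id: str, access_key_secret: str,
                     name: str = "sls-ak",
                     namespace: str = "monitoring") -> Dict:
    """Build a Kubernetes Secret manifest holding the AccessKey pair."""

    def b64(text: str) -> str:
        # Values under `data` in a Secret must be base64-encoded.
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            "username": b64(access_key_id),
            "password": b64(access_key_secret),
        },
    }


if __name__ == "__main__":
    # Placeholder credentials; never commit real keys.
    manifest = build_sls_secret("<AccessKeyId>", "<AccessKeySecret>")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML.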

2. Install Prometheus Operator

  1. In the left-side navigation pane, choose Marketplace > App Catalog.
  2. Click ack-prometheus-operator.
  3. On the installation page that appears, click the Parameters tab and modify the configuration items. The major modifications are as follows:
  • Change the value of retention under prometheusSpec. The value 1d or 12h is recommended.
  • Set enable under prometheusSpec to true, add the following remoteWrite configuration, and change the url value to match your own project:

    remoteWrite:
    - basicAuth:
        username:
          name: sls-ak
          key: username
        password:
          name: sls-ak
          key: password
      queueConfig:
        batchSendDeadline: 20s
        maxBackoff: 5s
        maxRetries: 10
        minBackoff: 100ms
      ### The URL is https://{sls-endpoint}/prometheus/{project}/{metricstore}/api/v1/write.
      ### For the sls-endpoint value, see the SLS service endpoint documentation.
      ### Replace project and metricstore values with your own project and metricstore.
      url: https://{sls-endpoint}/prometheus/{project}/{metricstore}/api/v1/write
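
To avoid typos when filling in the URL, the pattern from the comment above can be assembled programmatically. This is a small sketch; the function name and the example endpoint, project, and MetricStore values are placeholders, not values from this article.

```python
def remote_write_url(endpoint: str, project: str, metricstore: str) -> str:
    """Assemble the remote write URL in the form
    https://{sls-endpoint}/prometheus/{project}/{metricstore}/api/v1/write."""
    for label, value in (("endpoint", endpoint),
                         ("project", project),
                         ("metricstore", metricstore)):
        # Reject empty parts and stray slashes that would break the path.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return f"https://{endpoint}/prometheus/{project}/{metricstore}/api/v1/write"


if __name__ == "__main__":
    # Placeholder values; substitute your own region endpoint and names.
    print(remote_write_url("cn-hangzhou.log.aliyuncs.com",
                           "my-project", "my-metricstore"))
```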

Query and Analysis of Diversified Time Series Data

SLS provides three query modes for time series data. SQL plays the dominant role in time series queries, and because SQL can call PromQL, queries combine simpler syntax with powerful functionality. In addition, SLS supports calling PromQL directly for compatibility with the open-source ecosystem, such as integration with Grafana.

1. Pure PromQL Queries

2. Pure SQL Queries

3. SQL + PromQL Hybrid Queries

Pure PromQL queries:

SELECT promql_query('up') FROM metrics
SELECT promql_query_range('up', '1m') FROM metrics

PromQL as subqueries:

SELECT sum(value) FROM (SELECT promql_query('up') FROM metrics)

Complicated SQL queries with PromQL as subqueries:

SELECT ts_predicate_arma(time, value, 5, 1, 1, 1, 1, true)
FROM (
  SELECT (time/1000) AS time, value
  FROM (
    SELECT promql_query_range('1 - avg(irate(node_cpu_seconds_total{instance=~".*",mode="idle"}[10m]))', '10m') AS t
    FROM metrics
  )
  ORDER BY time ASC
)
LIMIT 10000

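Currently, SLS supports the following frequently used Prometheus query APIs: query(varchar), query_range(varchar, varchar?), labels(), label_values(varchar), and series(varchar). In SQL, they are called with the promql_ prefix, as in the promql_query and promql_query_range examples above.

The second parameter of query_range, the step, is optional. When it is empty, the step is chosen automatically.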
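
When such queries are generated from code, a small helper keeps the quoting consistent. The sketch below wraps a PromQL expression in the promql_query_range call shown earlier. It rests on two assumptions that the article does not confirm: that a single quote inside the expression is escaped by doubling it, as in standard SQL, and that an "empty" step means the second argument can simply be left out.

```python
from typing import Optional


def promql_range_sql(promql: str, step: Optional[str] = None,
                     table: str = "metrics") -> str:
    """Wrap a PromQL expression in a promql_query_range SQL statement."""
    # Assumption: standard SQL escaping (double the single quote).
    quoted = promql.replace("'", "''")
    if step is None:
        # Assumption: omitting the step lets the service choose it.
        return f"SELECT promql_query_range('{quoted}') FROM {table}"
    return f"SELECT promql_query_range('{quoted}', '{step}') FROM {table}"


if __name__ == "__main__":
    print(promql_range_sql("up", "1m"))
    print(promql_range_sql('sum(rate(http_requests_total{code="200"}[5m]))'))
```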

Multiple Visualization Features

Access to Prometheus Data in SLS

Access to Prometheus Data in Grafana

Native Prometheus provides no authentication by default. By contrast, the Prometheus interface provided by SLS supports HTTPS and requires BasicAuth authentication, which makes data more secure.

Note: Make sure you are using HTTPS.
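
The same HTTPS and BasicAuth requirements apply to any client, not only Grafana. As a sketch, the code below prepares (but does not send) an instant query request with the Python standard library. It assumes that the standard Prometheus HTTP API path /api/v1/query is served under a base URL of the form https://{sls-endpoint}/prometheus/{project}/{metricstore}; the function name and the example host are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_query_request(base_url: str, promql: str,
                        access_key_id: str,
                        access_key_secret: str) -> urllib.request.Request:
    """Prepare an instant query request with a BasicAuth header."""
    if not base_url.startswith("https://"):
        raise ValueError("the SLS Prometheus interface must be reached over HTTPS")
    query = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
    # BasicAuth: base64 of "username:password", here the AccessKey pair.
    raw = f"{access_key_id}:{access_key_secret}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Placeholder host and credentials; nothing is sent over the network.
    req = build_query_request("https://example.invalid/prometheus/p/m",
                              "up", "<AccessKeyId>", "<AccessKeySecret>")
    print(req.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request. The sketch stops short of that so it can run without credentials.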

1. Add a data source, and select Prometheus.

2. Configure the URL.

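Enter the URL of your MetricStore described above. Grafana appends the Prometheus API paths (such as /api/v1/query) to the data source URL itself, so the /api/v1/write suffix used for remote write presumably does not belong here.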

3. Enable Basic Auth, and enter the AK information.
