Prometheus: The Unicorn in Metrics


Overview

Prometheus is an all-in-one monitoring and alerting platform originally developed at SoundCloud. It has few external dependencies and a complete feature set.

Key Functions:

  • Multi-dimensional data model: time series are identified by a metric name and a set of labels.
  • Flexible query language: PromQL. A single query can combine multiple metrics with operations such as multiplication, addition, joins, and quantile calculation.
  • Independent deployment: a single server works out of the box and does not rely on distributed storage.
  • It collects data based on HTTP using the pull method.
  • It is compatible with the push method through the push gateway.
  • It obtains monitored items through static configuration or service discovery.
  • It supports charts, dashboards and more.

Core Components:

  • Prometheus server: scrapes and stores time series data.
  • Client libraries: instrument application code and expose its metrics to the Prometheus server.
  • Push gateway: accepts metrics pushed by short-lived and batch jobs and holds them for Prometheus to scrape. It is mainly used for business data reporting.
  • Special-purpose exporters (for example, for HAProxy, StatsD, and Graphite): plug-ins that report metrics from machines and third-party systems.
  • Alert manager: Prometheus is configured with rules and periodically evaluates them against the data. When a condition is met, the alert is pushed to the configured alert manager.
  • A variety of support tools.

Advantages and Disadvantages:

  • In terms of scenarios: compared with InfluxDB, PTSDB is suited to numerical time series data, but not to log-type time series data or to metric statistics used for billing. InfluxDB aims to be a universal time series platform that also covers scenarios such as log monitoring, while Prometheus focuses on metrics. The two systems are similar in many respects, including collection, storage, alerting, and display.
  • The InfluxData stack: Telegraf + InfluxDB + Kapacitor + Chronograf
  • The Prometheus stack: Exporter + Prometheus server + Alert Manager + Grafana
  • On the collection side, Prometheus mainly uses the pull method and supports the push method through the push gateway. Telegraf, the collection tool of InfluxData, mainly uses the push method.
  • In terms of storage, the two share the same basic idea but differ in key points, such as how series are indexed and how out-of-order data is handled.
  • InfluxDB supports a multi-valued model, the String type, and more, so the range of data it can store is richer.
  • Kapacitor is a single tool that combines what Prometheus provides as data processing, recording rules, alert rules, and alert notification. The Prometheus alert manager additionally provides grouping, deduplication, and more.
  • InfluxDB has removed the cluster mode it previously provided, retaining only relay-based high availability; clustering is now a feature of the commercial version. Prometheus takes a different approach to scaling out: multiple Prometheus nodes are aggregated through multi-level proxying.
    Prometheus also supports remote storage, so the two systems can be integrated, with InfluxDB acting as the remote storage of Prometheus.
  • The data model of OpenTSDB is almost the same as that of Prometheus. PromQL is the more concise query language, while OpenTSDB offers more functionality. OpenTSDB relies on the Hadoop ecosystem, while Prometheus has grown up in the Kubernetes ecosystem.

Data Model

  • A single-valued model is adopted. The core concepts of the data model are metrics, labels, and samples.
  • Format: <metric name>{<label name>=<label value>, …}
  • For example: http_requests_total{method="POST",endpoint="/api/tracks"}.
  • The metric name carries the business meaning. For example: http_requests_total.
  • There are four metric types: Counter, Gauge, Histogram, and Summary.
  • Labels represent dimensions. Samples consist of a timestamp and a numeric value.
  • Jobs and instances: Prometheus automatically attaches the job and instance of each scraped target as labels. For example:
  • job: api-server
  • instance 1: 1.2.3.4:5670
  • instance 2: 1.2.3.4:5671
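
The data model above can be sketched in a few lines of Python. This is only an illustration: the `Series` class and the `series_key` helper are hypothetical names, not part of any Prometheus client library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def series_key(metric: str, labels: Dict[str, str]) -> str:
    """Render a series identity as <metric name>{<label>="<value>",...}."""
    # Sort labels so the same label set always yields the same key.
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{body}}}"


@dataclass
class Series:
    metric: str
    labels: Dict[str, str]
    # Each sample is a (timestamp in milliseconds, numeric value) pair.
    samples: List[Tuple[int, float]] = field(default_factory=list)

    def append(self, ts_ms: int, value: float) -> None:
        self.samples.append((ts_ms, value))


s = Series("http_requests_total", {"method": "POST", "endpoint": "/api/tracks"})
s.append(1_600_000_000_000, 1027.0)
print(series_key(s.metric, s.labels))
```

Each distinct combination of metric name and labels is its own series, which is why adding a label with many possible values multiplies the number of series.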

Overall Design Concept

  • Configuration: parses, validates, and loads configuration items.
  • Scrape discovery manager: the service discovery manager communicates with the scrape manager through a synchronization channel. When the configuration changes, the discovery service must be restarted for the change to take effect.
  • Scrape manager: scrapes metrics and sends them to the storage components.
  • Storage:
  • Fanout storage: the proxy abstraction layer of storage. It hides the details of the underlying local and remote storage, writes samples to both (double write), and merges the results on read.
  • Remote storage: creates a queue manager that sends data in turn according to load, and a client that reads and merges data from the remote endpoints.
  • Local storage: a lightweight time series database based on the local disk.
  • PromQL engine: parses a query expression into an abstract syntax tree and an executable query, and loads data lazily.
  • Rule manager: manages the alert rules.
  • Notifier: manages the dispatch of notifications.
  • Notifier discovery: service discovery for notification targets.
  • Web UI and API: the embedded management interface. It can evaluate query expressions and display the results.
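
The fanout storage item above (double write, merged read) can be illustrated with a minimal sketch. `MemoryStorage` and `FanoutStorage` are hypothetical stand-ins written for this example and do not mirror the actual Prometheus interfaces.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (timestamp, value)


class MemoryStorage:
    """Hypothetical stand-in for a local or remote storage backend."""

    def __init__(self) -> None:
        self.data: Dict[str, List[Sample]] = {}

    def append(self, series: str, sample: Sample) -> None:
        self.data.setdefault(series, []).append(sample)

    def select(self, series: str) -> List[Sample]:
        return list(self.data.get(series, []))


class FanoutStorage:
    """Writes to every backend and merges reads from all of them."""

    def __init__(self, primary: MemoryStorage, secondaries: List[MemoryStorage]) -> None:
        self.backends = [primary, *secondaries]

    def append(self, series: str, sample: Sample) -> None:
        for backend in self.backends:  # double write
            backend.append(series, sample)

    def select(self, series: str) -> List[Sample]:
        merged: Dict[int, float] = {}
        for backend in self.backends:
            for ts, value in backend.select(series):
                merged.setdefault(ts, value)  # first backend wins on duplicates
        return sorted(merged.items())
```

Callers only ever talk to the fanout layer, so they do not need to know whether a sample came from local disk or a remote endpoint.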

PTSDB Overview

This article focuses on the local storage, PTSDB. The core of PTSDB is an inverted index combined with time-windowed storage blocks.

Core Data Structure and Storage Format

The core data structure of PTSDB is the HeadAppender. When the Appender commits, the data is encoded into the WAL, flushed to disk, and written to the head block.


Write Ahead Log

The WAL has three record encodings: series, samples (data points), and deletions. In general, segments are rolled over based on file size and cleaned up based on the minimum time held in memory.

  • During compaction, the WAL is cleaned up based on time: WAL entries older than the minimum in-memory time of the block are deleted.
  • On restart, the latest segment is opened first, and data is reloaded from the log into memory.
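
The rollover and cleanup policy described above can be sketched as follows. This is a simplified in-memory model: the `SegmentedWAL` class and its method names are hypothetical, and real segments are files on disk rather than Python lists.

```python
from typing import List, Tuple

Record = Tuple[int, bytes]  # (timestamp, encoded payload)


class SegmentedWAL:
    def __init__(self, max_segment_bytes: int) -> None:
        self.max_segment_bytes = max_segment_bytes
        self.segments: List[List[Record]] = [[]]
        self._current_size = 0

    def log(self, ts: int, payload: bytes) -> None:
        # Roll over to a new segment once the current one would exceed the limit.
        if self._current_size and self._current_size + len(payload) > self.max_segment_bytes:
            self.segments.append([])
            self._current_size = 0
        self.segments[-1].append((ts, payload))
        self._current_size += len(payload)

    def truncate(self, min_time: int) -> None:
        # Drop whole closed segments whose newest record is older than min_time.
        # The active (last) segment is always kept.
        kept = [
            seg for seg in self.segments[:-1]
            if seg and max(ts for ts, _ in seg) >= min_time
        ]
        self.segments = kept + [self.segments[-1]]

    def replay(self) -> List[Record]:
        # On restart, the surviving records are read back to rebuild memory state.
        return [rec for seg in self.segments for rec in seg]
```

Truncating whole segments rather than individual records keeps cleanup cheap: it is a file delete instead of a rewrite.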

Metadata File

The meta.json file records details about the block: for example, which smaller blocks a newly compacted block was built from, and statistics for the block such as its minimum and maximum time and its number of series and data points.


Index

The index is first partially written into the head block, and is flushed to disk when compaction is triggered.

  • Series stores two kinds of information. One is the symbol-table references of the label key-value pairs. The other is the index from the series to the data files: the location of each data block record is split and stored by time window, so that a query can quickly skip the many records outside its window.
  • Postings store the inverted index: for each label pair, the list of series reference IDs (refid) that carry it.
  • OffsetTable adds a layer of mapping to speed up lookups, and this part of the data is loaded into memory. It mainly points to the LabelIndex and Postings data blocks. The TOC records the position offset of each data block; if a block holds no data, the search can skip it.
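
The postings idea above is a classic inverted index. A minimal sketch, assuming equality matchers only (the `InvertedIndex` class is a hypothetical illustration, not the on-disk format):

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class InvertedIndex:
    def __init__(self) -> None:
        # (label name, label value) -> set of series reference IDs
        self.postings: Dict[Tuple[str, str], Set[int]] = defaultdict(set)

    def add(self, ref: int, labels: Dict[str, str]) -> None:
        for pair in labels.items():
            self.postings[pair].add(ref)

    def lookup(self, matchers: Dict[str, str]) -> List[int]:
        # Intersect the postings lists of every requested label pair.
        sets = [self.postings.get(pair, set()) for pair in matchers.items()]
        if not sets:
            return []
        return sorted(set.intersection(*sets))


idx = InvertedIndex()
idx.add(1, {"__name__": "http_requests_total", "code": "200"})
idx.add(2, {"__name__": "http_requests_total", "code": "500"})
idx.add(3, {"__name__": "up", "code": "200"})
print(idx.lookup({"__name__": "http_requests_total", "code": "200"}))
```

A query with several label matchers becomes a set intersection over small ID lists, which is far cheaper than scanning every series.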

Chunks

Data points are stored in the chunks directory. Each segment file is up to 512 MB by default. Values are encoded with XOR compression. Chunks are addressed by a refid, which consists of the segment ID and the offset inside the file.
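
Two ideas from this paragraph can be sketched briefly: a refid that packs a segment ID and a file offset into one integer, and the XOR of consecutive float values that makes the encoding compact. The 32/32-bit split below is an assumption made for illustration, and the function names are hypothetical.

```python
import struct


def pack_ref(segment_id: int, offset: int) -> int:
    # Assumed layout: upper 32 bits = segment file, lower 32 bits = byte offset.
    return (segment_id << 32) | offset


def unpack_ref(ref: int) -> tuple:
    return ref >> 32, ref & 0xFFFFFFFF


def float_bits(value: float) -> int:
    # Reinterpret a float64 as its 64-bit integer pattern.
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_delta(prev: float, cur: float) -> int:
    # XOR of consecutive values: identical values give 0, and similar values
    # give a pattern with long runs of zero bits that is cheap to store.
    return float_bits(prev) ^ float_bits(cur)


print(unpack_ref(pack_ref(3, 4096)))
print(hex(xor_delta(12.0, 24.0)))
```

Because monitoring values often repeat or change slowly, most XOR results are zero or have only a few meaningful bits, which is how the average sample size gets down to a couple of bytes.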


Tombstones

Deletions are recorded as markers (tombstones) rather than applied immediately; the data is physically removed when compaction and reloading are performed. The deletion records are stored per time window.
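
A minimal sketch of mark-based deletion, assuming tombstones are kept as inclusive time intervals per series. The `Tombstones` class and the `visible` helper are hypothetical names.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (min_time, max_time)
Sample = Tuple[int, float]


class Tombstones:
    def __init__(self) -> None:
        self.deleted: Dict[int, List[Interval]] = {}

    def delete(self, series_ref: int, mint: int, maxt: int) -> None:
        # Only record the marker; no sample data is touched here.
        self.deleted.setdefault(series_ref, []).append((mint, maxt))

    def is_deleted(self, series_ref: int, ts: int) -> bool:
        return any(lo <= ts <= hi for lo, hi in self.deleted.get(series_ref, []))


def visible(series_ref: int, samples: List[Sample], tombstones: Tombstones) -> List[Sample]:
    # Reads filter out marked samples; compaction would later drop them for good.
    return [(ts, v) for ts, v in samples if not tombstones.is_deleted(series_ref, ts)]


stones = Tombstones()
stones.delete(series_ref=7, mint=10, maxt=20)
print(visible(7, [(5, 1.0), (15, 2.0), (25, 3.0)], stones))
```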


Query PromQL

The query language of Prometheus is PromQL. The PromQL engine handles parsing the query into an AST, building the execution plan, and aggregating the data. The fanout module sends the query to the local and remote endpoints simultaneously, and PTSDB is responsible for local data retrieval. Each PromQL expression below is followed by a roughly equivalent SQL statement.

http_requests_total
select * from http_requests_total where timestamp between xxxx and xxxx
http_requests_total{code="200", handler="query"}
select * from http_requests_total where code="200" and handler="query" and timestamp between xxxx and xxxx
http_requests_total{code=~".*20.*"}
select * from http_requests_total where code like "%20%" and timestamp between xxxx and xxxx
http_requests_total > 100
select * from http_requests_total where value > 100 and timestamp between xxxx and xxxx
http_requests_total[5m]
select * from http_requests_total where timestamp between xxxx-5m and xxxx
count(http_requests_total)
select count(*) from http_requests_total where timestamp between xxxx and xxxx
sum(http_requests_total)
select sum(value) from http_requests_total where timestamp between xxxx and xxxx
topk(3, http_requests_total)
select * from http_requests_total where timestamp between xxxx and xxxx order by value desc limit 3
rate(http_requests_total[5m])
select code, handler, instance, job, method, (max(value) - min(value))/300 AS value from http_requests_total where timestamp between xxxx-5m and xxxx group by code, handler, instance, job, method;
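
The SQL analogies above are only approximate, especially for the rate functions. As a simplified sketch of what per-second rates over a counter mean, the helpers below compute an average rate over a window and an instantaneous rate from the last two samples. They ignore counter resets and the extrapolation that Prometheus applies, and the function names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    # Average per-second increase between the first and last sample.
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples: List[Sample]) -> float:
    # Per-second increase between the last two samples only.
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


window = [(0.0, 0.0), (100.0, 100.0), (200.0, 400.0)]
print(simple_rate(window), simple_irate(window))
```

The averaged form smooths over the whole window, while the instantaneous form reacts to the most recent change, so the two can differ noticeably on bursty counters.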

Key Technical Points of PTSDB

Handling Out-of-Order Data

PTSDB handles out-of-order data with a minimum time window: it defines a minimum valid timestamp, and samples older than that timestamp are discarded without being processed.
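
The rule can be sketched as a single check at append time. The `Head` class here is a hypothetical, heavily simplified model of the in-memory block.

```python
from typing import List, Tuple


class Head:
    def __init__(self, min_valid_time: int) -> None:
        self.min_valid_time = min_valid_time
        self.samples: List[Tuple[int, float]] = []

    def append(self, ts: int, value: float) -> bool:
        # Samples older than the minimum valid timestamp are dropped.
        if ts < self.min_valid_time:
            return False
        self.samples.append((ts, value))
        return True


head = Head(min_valid_time=1000)
print(head.append(999, 1.0), head.append(1000, 2.0))
```

Rejecting late samples up front keeps the write path append-only, at the cost of losing data that arrives too late.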

Memory Management

mmap is used to read the large files produced by compression and merging, which avoids holding too many file handles open.
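
As an illustration of the technique (not of the PTSDB code itself), the sketch below memory-maps a file read-only with Python's standard `mmap` module and reads a slice at a given offset. The file contents and the `read_slice` helper are made up for the example.

```python
import mmap
import os
import tempfile


def read_slice(path: str, offset: int, length: int) -> bytes:
    # Map the file read-only; the OS loads pages on demand instead of
    # the program reading the whole file into its own buffers.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Usage: write a small stand-in "chunk file", then read 5 bytes at offset 6.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"chunk-AAAAA-chunk-BBBBB")
    path = tmp.name
print(read_slice(path, 6, 5))
os.remove(path)
```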

Compaction

The main operations of compaction are merging blocks, deleting expired data, and rewriting chunk data.

  • To make deletion efficient, the location of deleted time series data is only recorded when a delete is issued. When all the data in a block must be deleted, the entire block directory is removed.
  • The size of merged blocks must also be limited, to avoid retaining too much deleted data (extra space usage).
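
A rough sketch of two of these steps, dropping expired blocks as a whole and merging adjacent blocks up to a size limit. The `Block` type, the `compact` function, and its parameters are hypothetical, and real compaction also rewrites index and chunk data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    mint: int  # minimum timestamp covered by the block
    maxt: int  # maximum timestamp covered by the block


def compact(blocks: List[Block], now: int, retention: int, max_block_range: int) -> List[Block]:
    # 1. Expired blocks are dropped as a whole (like removing their directory).
    cutoff = now - retention
    live = [b for b in sorted(blocks, key=lambda b: b.mint) if b.maxt >= cutoff]
    # 2. Adjacent blocks are merged while the result stays within the limit.
    merged: List[Block] = []
    for b in live:
        if merged and b.maxt - merged[-1].mint <= max_block_range:
            merged[-1] = Block(merged[-1].mint, b.maxt)
        else:
            merged.append(b)
    return merged


blocks = [Block(0, 2), Block(2, 4), Block(4, 6), Block(6, 8)]
print(compact(blocks, now=8, retention=5, max_block_range=4))
```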

Snapshot

PTSDB provides a snapshot function for data backup. You can generate snapshots through the admin/snapshot API. The snapshot data is stored in the data/snapshots/- directory.
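
In current Prometheus releases the snapshot endpoint is POST /api/v1/admin/tsdb/snapshot, and it is only available when the server is started with --web.enable-admin-api. The sketch below builds that request without sending it. The host and port are assumptions, and `snapshot_request` is a hypothetical helper.

```python
from urllib.request import Request


def snapshot_request(base_url: str) -> Request:
    # Build (but do not send) the POST request that asks the server for a snapshot.
    return Request(f"{base_url}/api/v1/admin/tsdb/snapshot", method="POST")


# Assumed address of a local Prometheus server.
req = snapshot_request("http://localhost:9090")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would trigger the snapshot on a server that has the admin API enabled.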

PTSDB Best Practices

  • Generally, each sample stored in Prometheus occupies about 1–2 bytes. To plan the local disk capacity of the Prometheus server, use the following formula:
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
  • PTSDB is limited by its lack of clustering and replication. Therefore, when a node goes down, the data for that time window is lost.
    If the business does not have extremely demanding data reliability requirements, the local disk can still hold persistent data for several years.
    When PTSDB data is corrupted, the server can be recovered by removing the storage directory, or only the directory of the affected time window.
  • High availability, clustering, and long-term retention of historical data can be achieved for PTSDB through external solutions, which are not covered in this article.
  • For historical reasons, early versions of PTSDB stored each series in its own file. This approach has many drawbacks. For example:
    The disk-flushing pressure of snapshots; the burden of regular file cleanup; low-cardinality, long-range queries need to open a large number of files; and growth in the number of series can exhaust inodes.
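
The capacity formula above can be worked through with sample numbers. The retention period and ingestion rate below are made-up inputs chosen for illustration, and 2 bytes per sample is the upper end of the range quoted above.

```python
def needed_disk_space(retention_time_seconds: int,
                      ingested_samples_per_second: int,
                      bytes_per_sample: float) -> float:
    return retention_time_seconds * ingested_samples_per_second * bytes_per_sample


retention = 15 * 24 * 3600   # 15 days, in seconds
ingest_rate = 100_000        # samples per second (assumed)
size_bytes = needed_disk_space(retention, ingest_rate, 2)
print(f"{size_bytes / 1e9:.1f} GB")
```

With these inputs the estimate comes to roughly 259 GB, so doubling either the retention or the ingestion rate doubles the disk requirement.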

Challenges for PTSDB

In practice, PTSDB has also run into some problems. For example:

  • After the cold start, the CPU and memory usage will increase during the push phase.
  • Issues, such as a CPU spike, may occur during high-speed writing.

Summary

As the de facto standard for storing time series data in Kubernetes monitoring solutions, PTSDB has steadily gained influence and popularity in the time series field. Alibaba TSDB currently supports acting as remote storage for Prometheus through an adapter.
