A Comprehensive Analysis of Open-Source Time Series Databases (4)
By Zhaofeng Zhou (Muluo)
This concludes a series of articles analyzing open-source time series databases. The series compares and analyzes HBase-based OpenTSDB; Cassandra-based KairosDB, Blueflood, and Heroic; and finally InfluxDB, which ranks first among TSDBs. InfluxDB is a fully self-developed TSDB. When reviewing the relevant materials, I was most interested in its underlying TSM, a storage engine based on the LSM idea and optimized for time series workloads. The materials walk through InfluxDB's whole journey, from its initial use of LevelDB, to its replacement with BoltDB, and finally to the decision to build the self-developed TSM, describing in depth the pain points of each stage, the core problems that drove the transition to the next stage, and the core design ideas behind the final TSM. I like this kind of sharing, which walks you through the whole process of technological evolution step by step instead of simply telling you which technology is best. The deep analysis of the problems encountered at each stage, and of the reasons behind the final technology choice, is impressive, and there is a lot to learn from it.
However, the available material on InfluxDB's TSM is not detailed enough; it focuses more on describing policies and behaviors. I recently read the article "Writing a Time Series Database from Scratch". Although the content is somewhat broader than the title suggests, it introduces genuinely useful skills and experience, describing the design ideas behind a TSDB storage engine. Moreover, this storage engine is not a concept or a toy, but is actually used in production: it is the completely rewritten storage engine of Prometheus 2.0, released in November 2017. The new engine claims to bring "huge performance improvements". Because of the significant changes, backward compatibility could not be maintained, but presumably the engine really does deliver a lot; otherwise there would be no such claim.
This article is mainly an interpretation of the article mentioned above, most of which is from the original text, with a few deletions. For more details, see the original English text. I’d welcome any corrections if I’ve misunderstood anything.
Like other mainstream time series databases (TSDBs), Prometheus defines its data model in terms of metric names, labels (similar to tags), and metric values. Every time series is uniquely identified by its metric name and a set of key-value pairs known as labels. Prometheus supports querying time series by labels, using both simple and complex conditions. Based on the characteristics of time series data, the storage engine design mainly considers data storage (more writes than reads), data retention, and data query; Prometheus does not yet address data analysis.
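The idea that a series is identified by its metric name plus a sorted set of label pairs can be sketched as follows. This is a minimal illustration, not Prometheus's actual implementation; the `series_key` helper and its string format are made up here.

```python
# A minimal sketch (not Prometheus's actual code) of how a time series is
# uniquely identified by its metric name plus its label key-value pairs.

def series_key(metric, labels):
    """Build a canonical identity string for a series."""
    # Sorting makes the key independent of label insertion order.
    parts = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{parts}}}"

key = series_key("http_requests_total", {"status": "500", "method": "POST"})
# key == 'http_requests_total{method="POST",status="500"}'
```

Because the labels are sorted before joining, two writers that supply the same labels in different orders still produce the same series identity.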
The above picture is a simple view of the distribution of data points, with the horizontal axis being time and the vertical axis being the timeline; points in the region are data points. Each time Prometheus scrapes data, it receives a vertical line in this area, as shown in the picture. This expression is very vivid: each timeline produces only one data point per scrape, but many timelines produce data points at the same moment, and connecting these data points gives you a vertical line. This characteristic is very important, because it affects the optimization strategies for data writing and compression.
V2 Storage Engine
This article mainly describes the design ideas of the V3 storage engine, the successor to the V2 storage engine. The V2 storage engine stores the data points generated by each timeline in a separate file. We will discuss several topics based on this design:
- Optimization for writes: Sequential and batched writes are the ideal write pattern for both HDDs and SSDs. As mentioned above, the data Prometheus receives each time is a vertical line consisting of many data points, but these data points belong to different timelines. In the current design, the data for each timeline is stored in a separate file, so each write has to put a very small amount of data into many different files. To solve this problem, the V2 storage engine writes larger chunks of data sequentially: the data generated by a single timeline is written in batches, which requires accumulating data points over a certain period of time at the timeline level. Besides enabling batched writes, writing data in larger chunks also improves hot-data query efficiency and the data compression ratio. The V2 storage engine uses the same compression algorithm as Facebook's Gorilla, which compresses 16-byte data points to an average of 1.37 bytes, reducing memory and disk usage by a factor of about 12. Writing data in chunks requires accumulating data in server memory for a certain period of time, which means that hot data is essentially kept in memory, giving very high query efficiency.
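A key reason Gorilla-style compression works so well is delta-of-delta timestamp encoding: samples scraped at a regular interval produce second-order deltas that are almost all zero. The sketch below only illustrates this idea at the list level; the real Gorilla format packs variable-length bit sequences, which is not shown here.

```python
# A simplified illustration of the delta-of-delta idea behind Gorilla-style
# timestamp compression. The real algorithm encodes these values as
# variable-length bit sequences; this sketch only shows why regularly
# spaced timestamps compress extremely well.

def delta_of_delta(timestamps):
    """Return the first timestamp, the first delta, and the delta-of-deltas."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

# Samples scraped every 15 seconds reduce to a run of zeros:
t0, d0, dods = delta_of_delta([1000, 1015, 1030, 1045, 1060])
# dods == [0, 0, 0] -- each zero needs only a single bit in the real format.
```

Since a scrape interval rarely drifts, almost every stored timestamp collapses to that single-bit zero case, which is where most of the 16-to-1.37-byte saving comes from.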
- Query optimization: There are many different query scenarios for time series data: data at a certain time point on a certain timeline, data at multiple time points on multiple timelines, or data from multiple timelines over a certain period of time. As shown in the data model diagram above, a query is a rectangular data block on the two-dimensional plane. For both SSDs and HDDs, a read-friendly layout should allow each query to be completed by reading only a few random locations on the disk and then reading large chunks of data sequentially. This is closely related to how data is distributed on disk, and essentially to how it is written. However, the layout does not have to be optimal at write time; it can be improved by subsequent data reorganization.
The V2 storage engine provides some good optimization strategies, mainly writing data in larger chunks and caching hot data in memory. These two optimization strategies are carried over to V3. However, V2 also has many drawbacks:
- The number of files increases with the number of timelines, which will exhaust the available inodes sooner or later.
- Even with the chunk-based approach, the IOPS requirements are still high if a single write involves too many timelines.
- It is infeasible to keep all files open for reads and writes all the time. To serve a query, a large number of files must be opened, the relevant data points found and read into memory, and the files closed again, resulting in high query latency.
- Data deletion requires scanning a large number of files and rewriting them, both of which take a considerably long time.
- Data chunks need to accumulate in memory for a certain period of time before they are written to disk. V2 periodically writes checkpoints to avoid losing the in-memory data. However, the checkpoint interval is generally longer than the window of acceptable data loss, and restoring from a checkpoint also takes quite a long time.
With regard to timeline indexes, the V2 storage engine uses LevelDB to store the mappings between labels and timelines. When the number of timelines grows beyond a certain point, query efficiency becomes very low. Normally, timeline cardinality is relatively small, because the application environment rarely changes, and the cardinality stays stable while operations are stable. With improper label settings, however, for example when a dynamic value such as the program version number is used as a label, the label value changes on every upgrade. Over time, more and more inactive timelines accumulate (known as "series churn" in the Prometheus context), the total number of timelines keeps growing, and index query efficiency suffers.
V3 Storage Engine
The V3 engine was completely redesigned to address the problems of the V2 engine. It can be seen as a simplified version of LSM, optimized for time series scenarios, and it is easiest to understand through the lens of LSM design ideas. Let's first look at the file directory structure of the V3 engine.
All data is stored in the data directory. At the top level, there is a sequence of numbered blocks prefixed with "b-". Each block holds an index file, a chunks directory, and a meta.json file. Each chunks directory holds the raw chunks of data points for various series. A V3 chunk is the same as a V2 chunk, with one difference: a V3 chunk contains data points from many timelines, not just one. The index file indexes the chunks within a block; it can be used to quickly locate a timeline, and the chunks holding its data, from a label. meta.json is a simple, human-readable description file that makes it easy to understand the state of the storage and the data a block contains. To understand the design of the V3 engine, you only need to answer a few questions: 1. What is the storage format of a chunk file? 2. What is the index storage format, and how does it achieve fast lookups? 3. Why does the last block contain a "wal" directory instead of chunks?
Prometheus divides data into multiple non-overlapping blocks along the time dimension. Each block acts as a fully independent database containing all the time series data for its time window. A block becomes immutable once its data is dumped to chunk files; new data can only be written to the most recent block. All new data is written to an in-memory database, and to prevent data loss it is also appended to a temporary write-ahead log (WAL).
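The write path just described (append to the WAL for durability, then to the in-memory head) can be sketched as follows. This is a toy illustration under invented assumptions: the `Head` class, the JSON record format, and the file handling are all made up here and do not reflect Prometheus's actual WAL format.

```python
# A minimal sketch of the write path: new samples go into an in-memory head
# block and are also appended to a write-ahead log (WAL) so they can be
# replayed after a crash. Record format and file layout are illustrative.
import json, os, tempfile

class Head:
    """In-memory head block backed by an append-only WAL file."""
    def __init__(self, wal_path):
        self.samples = []
        self.wal = open(wal_path, "a")

    def append(self, series, ts, value):
        record = {"series": series, "ts": ts, "v": value}
        self.wal.write(json.dumps(record) + "\n")  # durability first
        self.wal.flush()
        self.samples.append(record)                # then the in-memory copy

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
head = Head(wal_path)
head.append("up", 1000, 1.0)
# Replaying the WAL after a crash reconstructs the in-memory state:
replayed = [json.loads(line) for line in open(wal_path)]
```

The ordering matters: the sample is written to the log before it is acknowledged in memory, so a crash at any point loses at most data that was never fully accepted.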
V3 borrows heavily from the LSM design idea and adds some optimizations based on the characteristics of time series data. This brings many benefits:
- When querying data for a time range, we can quickly exclude irrelevant blocks. Each block has an independent index, which can effectively solve the “Series Churn” problem of V2.
- The in-memory data is dumped to chunk files, which allows efficiently writing larger chunks of data in sequence, and is good for both SSDs and HDDs.
- Similar to V2, the most recent data is held in memory. The most recent data is also the most frequently accessed (hottest) data. Holding this data in memory allows for the most efficient data queries.
- The deletion of old data becomes very simple and efficient. We only need to delete a small number of directories.
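The first benefit above, excluding irrelevant blocks, follows directly from the non-overlapping time partitioning and can be sketched as a simple interval-overlap filter. The block metadata fields (`min_t`, `max_t`) are illustrative names, not Prometheus's actual schema.

```python
# A sketch of block selection for a range query: with non-overlapping,
# time-partitioned blocks, a query only touches blocks whose time window
# intersects the queried range.

def blocks_for_range(blocks, q_min, q_max):
    """Select blocks whose [min_t, max_t) window overlaps [q_min, q_max)."""
    return [b for b in blocks if b["min_t"] < q_max and q_min < b["max_t"]]

blocks = [
    {"min_t": 0, "max_t": 2},
    {"min_t": 2, "max_t": 4},
    {"min_t": 4, "max_t": 6},
]
# A query over [3, 5) only needs the second and third blocks:
hits = blocks_for_range(blocks, 3, 5)
```

Every block outside the hit list can be skipped without opening a single file, which is what keeps short-range queries cheap regardless of total retention.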
In V3, blocks are cut at a reasonable interval of two hours. This interval can be neither too long nor too short: a longer interval requires more memory, because data must be kept in memory for the whole interval, while a shorter interval produces too many blocks, forcing queries to touch more files than necessary. The two-hour interval is a trade-off between these considerations. Still, when querying a larger time range, it is inevitable to query multiple files; for example, a week-long query may need to touch 84 files. Like LSM, V3 uses compaction for query optimization: it merges small blocks into larger ones, and can also modify existing data along the way, for example dropping deleted data or restructuring sample chunks for better query performance. The cited article does not say much about V3's compaction; if you are interested, take a look at how InfluxDB does it: InfluxDB has a variety of compaction strategies for different situations.
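The merge-and-clean behavior of compaction can be illustrated with a toy example. The block structure, the tombstone set, and the merge policy below are all invented for illustration; they are not Prometheus's actual compaction strategy.

```python
# A toy illustration of compaction: adjacent small blocks are merged into one
# larger block covering their combined window, and samples of deleted series
# (tracked via tombstones) are dropped along the way.

def compact(blocks, tombstones):
    """Merge a run of adjacent blocks into one, dropping deleted series."""
    return {
        "min_t": min(b["min_t"] for b in blocks),
        "max_t": max(b["max_t"] for b in blocks),
        "samples": [s for b in blocks for s in b["samples"]
                    if s["series"] not in tombstones],
    }

b1 = {"min_t": 0, "max_t": 2, "samples": [{"series": "a", "v": 1}]}
b2 = {"min_t": 2, "max_t": 4, "samples": [{"series": "b", "v": 2}]}
big = compact([b1, b2], tombstones={"b"})
# big covers [0, 4) and contains only series "a".
```

Because blocks are immutable, compaction writes a brand-new block and then atomically swaps it in for the old ones, which is also the moment deletions physically take effect.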
The above example is a schematic diagram of expired data deletion in V3, which is much simpler than in V2. If an entire block has expired, we can delete its folder directly. For blocks with only partially expired data, however, we cannot delete the whole folder; we have to wait for all the data to expire, or perform compaction. Note that the older data gets, the larger the blocks may become, as we keep compacting previously compacted blocks. An upper limit has to be applied so blocks do not grow to span the entire database; generally, this limit is set based on the retention window.
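The retention rule above, deleting only blocks that fall entirely outside the retention window, can be sketched as follows. The block fields and the time arithmetic are illustrative assumptions, not Prometheus's actual configuration.

```python
# A sketch of retention enforcement: whole blocks whose entire time window is
# older than the retention cutoff are simply deleted as directories; blocks
# that are only partially expired are left alone until compaction.

def expired(blocks, now, retention):
    """Return blocks whose entire window is older than the retention cutoff."""
    cutoff = now - retention
    return [b for b in blocks if b["max_t"] <= cutoff]

blocks = [
    {"dir": "b-000001", "max_t": 100},
    {"dir": "b-000002", "max_t": 500},
]
# With now=600 and a retention of 200, only the first block is fully expired:
to_delete = expired(blocks, now=600, retention=200)
```

Deleting a directory is a constant-time metadata operation, which is why V3 retention is so much cheaper than V2's scan-and-rewrite deletion.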
Basically, that’s all about the data storage design of Prometheus. It’s relatively clear and simple. Similar to other TSDBs, Prometheus also has an index. The index structure of V3 is relatively simple. Here I’d like to directly use the example given in the cited article:
According to that article, V3 does not use LevelDB as V2 does. In a persisted block, the index is immutable. For the most recent block, which still receives writes, V3 keeps all indexes in an in-memory structure and persists them to disk files when the block is closed. This is quite simple: by maintaining the mappings between timelines and IDs, and between labels and ID lists, we can ensure high query efficiency. Prometheus makes the assumption that "a real-world dataset of ~4.4 million series with about 12 labels each has less than 5,000 unique labels", which occupies a surprisingly small amount of memory. InfluxDB uses a similar strategy, while some other TSDBs use Elasticsearch as the indexing engine.
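The label-to-ID-list mapping described above is an inverted index: each label pair points to a sorted posting list of series IDs, and a query with several label matchers intersects those lists. The sketch below mirrors that idea with an in-memory dictionary; the sample data and the `lookup` helper are invented for illustration and do not reflect Prometheus's on-disk index format.

```python
# A minimal sketch of the inverted index idea: each label key-value pair maps
# to a sorted list of series IDs ("postings"); a query with multiple label
# matchers intersects the posting lists.

index = {
    ("__name__", "requests_total"): [1, 2, 4],
    ("job", "api"):                 [2, 3, 4],
    ("status", "500"):              [1, 4],
}

def lookup(index, matchers):
    """Intersect the posting lists of all requested label pairs."""
    postings = [set(index.get(m, [])) for m in matchers]
    return sorted(set.intersection(*postings)) if postings else []

ids = lookup(index, [("__name__", "requests_total"), ("status", "500")])
# ids == [1, 4]
```

Because posting lists are kept sorted, the real engine can intersect them with a linear merge instead of building hash sets, which keeps lookups fast even with millions of series.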
For time series data, which is written more than it is read, LSM-style storage engines have many advantages. Some TSDBs are built directly on open-source distributed databases with LSM engines, such as HBase and Cassandra; some are developed on top of LevelDB or RocksDB; and there are also fully self-developed TSDBs, represented by InfluxDB and Prometheus. So many different kinds of TSDBs exist because there is still a lot of room to optimize for the specific characteristics of time series data, for example in indexing and compaction strategies. The design of the Prometheus V3 engine is very similar to that of InfluxDB, and their optimization ideas are highly consistent. More changes will surely come as new requirements emerge.