A Comprehensive Analysis of Open-Source Time Series Databases (3)

InfluxDB

  • Simple architecture: For a stand-alone InfluxDB, only a binary needs to be installed and it can be used without any external dependencies. Here are a few negative examples. The bottom layer of OpenTSDB is HBase, so ZooKeeper, HDFS and others must be used together. If you are not familiar with the Hadoop technology stack, it is generally difficult to operate and maintain, which is also one of the most common complaints. KairosDB is slightly better. It depends on Cassandra and ZooKeeper, and H2 can be used for stand-alone testing. In general, a TSDB, which depends on an external distributed database, is a little more complex in architecture than a fully self-contained TSDB. After all, a mature distributed database is itself very complex, which, however, has been completely eliminated in the era of cloud computing.
  • TSM engine: The self-developed TSM storage engine is used in the bottom layer. TSM is also based on the idea of LSM, and provides extremely powerful writing ability and high compaction ratio. A more detailed analysis is made in the following sections.
  • InfluxQL: A SQL-Like query language is provided, which greatly facilitates the use of the database. The ultimate goal of database evolution in usability is to provide query languages.
  • Continuous query: With CQ, the database can support auto-rollup and pre-aggregation. For common query operations, CQ can be used to accelerate through pre-compute.
  • TimeSeries index: Tags are indexed for efficient retrievals. Compared with OpenTSDB, KairosDB and others, this function makes InfluxDB more efficient for tag retrieval. OpenTSDB has made a lot of query optimizations on tags retrieval. However, it is restricted by HBase functions and data models, and these optimizations do not work. However, the implementation in the current stable version uses the memory-based index, which is relatively simple in implementation and has the highest query efficiency. However, this also comes with many problems, which are described in detail in the following sections.
  • Plugin support: The database supports custom plug-ins and can be scaled to support a variety of protocols, such as Graphite, collectd, and OpenTSDB.

Basic Concepts

INSERT machine_metric,cluster=Cluster-A,hostname=host-a cpu=10 1501554197019201823
  • Measurement: The concept of a measurement is similar to a metric of OpenTSDB, and it represents the name of the monitoring indicator for the data. For example, the example above is the monitoring of machine indicators, so its Measurement is named machine_metric.
  • Tag: Similar to the concept of tags in OpenTSDB, tags are used to describe the different dimensions of the subject. One or more tags are allowed, and each tag is also composed of a tag key and a tag value.
  • Field: In the logical data model of OpenTSDB, a row of metric data corresponds to a value. In InfluxDB, one row of measurement data can correspond to multiple values, and each value is distinguished by field.
  • Timestamp: A required attribute of time series data. It represents the time point for the data. You can see that the time granularity of InfluxDB can be accurate to nanoseconds.
  • TimeSeries: A combination of Measurement + Tags, which is called TimeSeries in InfluxDB. TimeSeries is the time series. A certain time point can be located based on time, so a certain value can be located using TimeSeries + Field + Timestamp . This is an important concept and is mentioned in subsequent sections.
SELECT * FROM "machine_metric" WHERE time > now() - 1h;  
SELECT * FROM "machine_metric" WHERE "cluster" = "Cluster-A" AND time > now() - 1h;
SELECT * FROM "machine_metric" WHERE "cluster" = "Cluster-A" AND cpu > 5 AND time > now() - 1h;

TSM

  1. Data sharding: Data is divided into different shards according to different time ranges. Time series data writes are generated linearly over time, so the generated shards also increases linearly over time. Writes are usually made in the latest partition, without being hashed to multiple shards. The advantage of sharding is that the physical deletion of retention is very simple. You can simply delete the entire shard. The disadvantage is that the precision of retention is relatively large, which is the whole shard, while the time granularity of retention depends on the time span of the shard. Sharding can be implemented either at the application layer or at the storage engine layer. For example, a column family of RocksDB can be used as a data shard. InfluxDB adopts this model, and the data under the default retention policy forms a shard in a 7-day time span.
  2. TTL: The bottom-layer data engine directly provides the function of automatic data expiration. You can set a time to live for each data entry, and the storage engine automatically deletes the physical data when the time is reached. The advantage of this method is that the precision of retention is very high, reaching the second-level and row-level of retention. The disadvantage is that the physical deletion occurs at the time of compaction on the implementation of LSM, which is less timely. RocksDB, HBase, Cassandra and Alibaba Cloud Table Store all provide TTL functions.
  1. Write Ahead Log (WAL): Data is written to WAL first, and then flows into the memory-index and the cache. Data written to WAL is flushed to the disk synchronously to ensure data persistence. The data in the cache is asynchronously flushed into the TSM file. If the process crashes before the data in the cache is persisted to the TSM File, the data in the WAL is used to restore the data in the cache, this behavior is similar to that of LSM.
  2. Cache: The cache of TSM is similar to the MemoryTable of LSM. The internal data is the data in the WAL that is not persisted to the TSM file. If a failover occurs for the process, the data in the cache is rebuilt based on the data in WAL. The data in the cache is stored in a SortedMap, and the Key of the map is composed of TimeSeries + Timestamp. Therefore, the data in the memory is organized by TimeSeries, and the data in TimeSeries is stored in time sequence.
  3. TSM files: The TSM file is similar to LSM’s SSTable. The TSM file consists of four parts: header, block, index, and footer. The most important parts are block and index:
  • Block: Each block stores the values of a TimeSeries over a period of time, that is, all the values of a field corresponding to a tag set of a certain measurement in a certain period of time. Different compaction policies are adopted within the block based on the types of different values of the field to achieve the optimal compaction efficiency.
  • Index: The index information in the file stores the location information of all data blocks under each TimeSeries. The index data is sorted according to the lexicographic order of the TimeSeries keys. The complete index data, which is very large, is not loaded into memory. Instead, only some keys are indexed, which is called indirectIndex. The indirectIndex contains some auxiliary positioning information, such as the minimum and maximum time, and the minimum and maximum keys in the file. The most important is that the file offset information of some keys and its index data is saved. To locate the index data of a TimeSeries, you need to first find the most similar index offset based on some Key information in the memory, scan the file contents sequentially from the starting point, and then locate the index data position of the key accurately.
  • LevelCompaction: InfluxDB divides the TSM file into 4 levels (Level 1–4). The compaction only occurs within files of the same level. Files of the same level are promoted to the next level after being compacted. From this rule, based on the attributes of time series data generation, the higher the level, the earlier the data generation time and the lower the access heat. The TSM file generated by the data in the cache for the first time is called the snapshot. The Level1 TSM file is generated after multiple snapshots are compacted, the Level2 TSM file is generated after the Level1 files are compacted, and so on. Different algorithms are used for the compaction of low-level and high-level files. The low CPU consumption method is used for the compaction of low-level files. For example, block-decompaction and block-merging are not performed. Block-decompaction and block-merging are performed for the compaction of high-level files to further improve the compaction ratio. I understand that this design is a trade-off. The comparison usually works in the background. To avoid affecting real-time data writing, the resources consumed by the compaction are strictly controlled, but the speed of the compaction is bound to be affected under the condition of limited resources. However, the lower the level, the newer and hotter the data is, which requires a compaction that accelerates the query faster. Therefore, InfluxDB adopts a compaction policy with low resource consumption at the lower level, which is completely designed according to the writing and query attributes of time series data.
  • IndexOptimizationCompaction: When Level4 files are accumulated to a certain number, the index becomes very large and the query efficiency becomes relatively low. The main factor for the low efficiency of the query is that the same TimeSeries data is contained by multiple TSM files, so data integration across multiple files is inevitable. Therefore, IndexOptimizationCompaction is mainly used to integrate data under the same TimeSeries into the same TSM file, to minimize the overlap ratio of the TimeSeries between different TSM files.
  • FullCompaction: After InfluxDB determines that a shard will not have data written for a long time, it performs a full compaction on the data. FullCompaction is the integration of LevelCompaction and IndexOptimization. After a full compaction, no more compactions will be performed for the shard, unless new data is written or deletion occurs. This policy is a collation for the cold data, mainly aimed at improving the compaction ratio.

Continuous Query

CREATE CONTINUOUS QUERY "mean_cpu" ON "machine_metric_db"
BEGIN
SELECT mean("cpu") INTO "average_machine_cpu_5m" FROM "machine_metric" GROUP BY time(5m),cluster,hostname
END

TimeSeries Index

Memory-Based Index

// Measurement represents a collection of time series in a database. It also
// contains in memory structures for indexing tags. Exported functions are
// goroutine safe while un-exported functions assume the caller will use the
// appropriate locks.
type Measurement struct {
database string
Name string `json:"name,omitempty"`
name []byte // cached version as []byte
mu sync.RWMutex
fieldNames map[string]struct{}
// in-memory index fields
seriesByID map[uint64]*Series // lookup table for series by their id
seriesByTagKeyValue map[string]map[string]SeriesIDs // map from tag key to value to sorted set of series ids
// lazyily created sorted series IDs
sortedSeriesIDs SeriesIDs // sorted list of series IDs in this measurement
}
// Series belong to a Measurement and represent unique time series in a database.
type Series struct {
mu sync.RWMutex
Key string
tags models.Tags
ID uint64
measurement *Measurement
shardIDs map[uint64]struct{} // shards that have this series defined
}
  • Key: the serialized string of the corresponding measurement + tags.
  • Tag: all the tag keys and tag values under the Timeseries.
  • ID: a unique integer ID.
  • measurement: the measurement to which the Series belongs.
  • shardIDs: a list of all ShardIDs that contain the Series.
  • seriesByID: a map that queries the Series through the SeriesID.
  • seriesByTagKeyValue: A two-layer map. The first layer is all tag values corresponding to the tag key, and the second layer is the ID of all Series corresponding to the tag values. As you can see, when the base of the TimeSeries becomes large, this map takes up quite a lot of memory.
  • sortedSeriesIDs: a list of sorted SeriesIDs.
  • The base of TimeSeries is limited mainly by the memory size. If the number of TimeSeries exceeds the upper limit, the entire database is unavailable. This type of problem is generally caused by the incorrect tagkey design. For example, a tag key is a random ID. Once this problem occurs, recovery is difficult. You can only manually delete the data.
  • If the process restarts, it takes a long time to recover the data, because the full TimeSeries information needs to be loaded from all TSM files to build an index in the memory.

Disk-Based Index

Summary

--

--

--

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

VS-Code Plugin for WSO2-Identity Server(IS)

The Most Lightweight New Mock Tool of Alibaba for Unit Testing Is Open-Source!

Why Agility Is the Best Ability

Git folder relocation, part 1

Understanding CSS3 Animations

First Steps with Agile

Fundamentals of Dart Streams

Integrating Apache Hudi and Apache Flink for New Data Lake Solutions

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

More from Medium

How to setup Airflow Sensor’s mode as Reschedule

How can I manage my Kafka Artefacts?

Community Star Series | 1 Don’t know how to use Apache DolphinScheduler?