By Zhaofeng Zhou (Muluo)
What Is Time Series Data?
Time series data is a set of data points indexed by time. Simply put, it describes the measurements of a subject at each time point within a time range.
Modeling of time series data includes three important parts: subject, time point, and measurements. Applying this model, you will find that you are in constant contact with this type of data in your daily work and life.
- If you are a stockholder, the stock price of a stock is a type of time series data. It records the stock price of the stock at each time point.
- If you are O&M personnel, the monitoring data is a type of time series data. For example, the monitoring data of the CPU records the actual usage of the CPU at each time point.
The world is made up of data, and every object in the world is producing data all the time. The exploitation and use of these types of data is silently changing people’s lifestyles in this era. For example, the core of the personal health management feature of wearable devices is to continuously collect your personal health data. Such data includes heart rate and body temperature. After collecting such data, the device uses a model to calculate and evaluate your health status.
If you broaden your vision and imagination, you will find that almost all data in your daily life can be exploited and used. Objects that generate data include your mobile phone, car, air conditioner, refrigerator, and so on. The core idea of the currently hot Internet of Things (IoT) technology is to build a network that collects the data generated by all these objects and exploits its value. Data collected by such a network is typical time series data.
Time series data describes how the state of an object changes over the historical time dimension, and analyzing it is the process of trying to understand and master the rules behind those changes. Time series data has experienced explosive growth with the development of IoT, big data, and artificial intelligence (AI) technologies. To better support the storage and analysis of such data, a variety of database products have come into being and are available on the market. These products were invented to address the shortcomings of conventional relational databases in storing and analyzing time series data, and are uniformly classified as time series databases (TSDBs).
As can be seen from the ranking of the most popular database management systems on DB-Engines, the popularity of TSDBs has maintained a high growth rate over the last two years.
Later on, I will write a few articles to analyze:
- The basic concepts of time series data, including models, characteristics, basic queries, and processing operations.
- The analysis of the underlying implementation of several popular open source TSDBs.
Characteristics of Time Series Data
The characteristics of time series data will be analyzed from three dimensions: data writing, query, and storage. By analyzing these characteristics, we can figure out the basic requirements for a time series database.
Data Writing Characteristics
- Smooth, continuous, highly concurrent, and high-throughput data writing: The writing of time series data is relatively stable, which is different from that of application data. Application data volume is usually proportional to the application's page views and has peaks and valleys, while time series data is generated at a fixed frequency, unrestricted by other factors, so its generation speed is relatively stable. Time series data is generated independently by individual objects. When there are a large number of objects, the write concurrency and throughput are relatively high, especially in IoT scenarios. Both can be estimated simply from the number of objects and the frequency of data generation. For example, if you have 1,000 objects that each generate a data point every 10 seconds, the average write throughput is 100 data points per second.
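The back-of-the-envelope arithmetic above can be sketched in a few lines (the function name and the 1,000-object, 10-second numbers are just the example's assumptions):

```python
# Estimate average write throughput for a fleet of independent data sources.
def write_throughput(num_objects: int, interval_seconds: float) -> float:
    """Average data points written per second across all objects."""
    return num_objects / interval_seconds

# 1,000 objects, one data point every 10 seconds.
print(write_throughput(1000, 10))  # 100.0 points per second on average
```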
- More writes than reads: 95%-99% of the operations on time series data are writes. This is determined by the data characteristics. Taking monitoring data as an example, we may have a lot of monitoring items, but we usually read only the data of several key metrics, or read other metrics only in specific cases such as troubleshooting.
- The most recent data is generated in real time and is not updated: Time series data is written in real time, and the data written each time is the most recent. This is determined by how the data is generated: new points are produced as time passes and are written immediately. In the time dimension, all written data is new; existing data is never updated, except for manual revisions.
Characteristics of Data Query and Analysis
- Data is read by time range: Generally, we need data of a period, rather than data at a particular point. Therefore, time series data is basically read by the time range.
- Recent data is more likely to be read: The more recent the data, the more likely it is to be read. Taking monitoring data as an example, we usually only care about what happened in the last few hours or days, rather than a month or a year ago.
- Multi-precision query: The precision of time series data is determined by the density of data points. For example, if the interval between two adjacent data points is 10 seconds, the precision is 10 seconds; if the interval is 30 seconds, the precision is 30 seconds. The shorter the interval, the higher the precision, and the more detailed and accurate the restored historical states can be. However, higher precision also means more data points to store. It is very similar to picture resolution: the higher the resolution, the clearer the picture, and the larger the file. Queries of time series data do not always need high precision; the query precision is jointly determined by actual needs and cost. Again, take monitoring data as an example. Monitoring data is usually displayed as a line chart so that we can observe changes with our own eyes. If the data points per unit length are too dense, the chart becomes hard to read, so a coarser precision often serves the actual need better. There is also a trade-off in query efficiency: for a time range of one month, a precision of 10 seconds requires the query to return 259,200 data points, while a precision of 60 seconds requires only 43,200. In terms of cost, the higher the precision, the more data to store, and the higher the storage cost. Generally, we do not need high precision for historical data. For query and processing, the precision is generally determined by the length of the time range; for historical data storage, we usually keep downsampled data.
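The point counts in the one-month example can be reproduced directly (assuming, as a rough convention, a 30-day month):

```python
# Number of data points a query must return for a given time range and precision.
def points_in_range(range_seconds: int, precision_seconds: int) -> int:
    return range_seconds // precision_seconds

MONTH = 30 * 24 * 3600  # a 30-day month in seconds

print(points_in_range(MONTH, 10))  # 259200 points at 10-second precision
print(points_in_range(MONTH, 60))  # 43200 points at 60-second precision
```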
- Multidimensional analysis: Time series data is generated by different objects with different attributes, which may or may not belong to the same dimension. Again, take monitoring data as an example. When I monitor the network traffic of each machine within a cluster, querying the traffic of a particular machine is a query along one dimension, while querying the traffic of the entire cluster is a query along another dimension.
- Data mining: With the development of big data and artificial intelligence technologies, and with data storage, computing resources, and cloud computing now highly developed, the threshold for high-value data mining is no longer so high. Time series data contains high value and is worth exploiting.
Data Storage Features
- Large amount of data: Take monitoring data as an example. If the collection interval is 1 second, a single monitoring item generates 86,400 data points every day, and 10,000 monitoring items generate 864,000,000 data points every day. In IoT scenarios, this number is even larger. The total data size may be measured in TB or even PB.
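The daily totals in the monitoring example can be verified with a quick calculation:

```python
# Daily data-point volume at a 1-second collection interval.
SECONDS_PER_DAY = 24 * 3600

points_per_item = SECONDS_PER_DAY            # one point per second per item
total_points = points_per_item * 10_000      # 10,000 monitoring items

print(points_per_item)  # 86400
print(total_points)     # 864000000
```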
- Cold and hot data separation: Time series data has very typical hot and cold characteristics. The older the data, the less likely it is to be queried and analyzed.
- Timeliness: Time series data is time-sensitive. TSDBs usually have a retention window; data beyond this window can be considered inactive and safely deleted. Old data usually does not have much value, so deleting low-value data saves storage cost.
- Multi-precision data storage: As mentioned before, we need multi-precision query in consideration of the storage cost and query efficiency of time series data. Actually, we also need multi-precision data storage.
Basic Requirements of TSDB
Based on the analysis on the above characteristics of time series data in terms of data writing, query, and storage, we can summarize the following basic requirements for TSDBs:
- The ability to support highly concurrent and high-throughput writes: As mentioned before, time series data is written far more often than it is read, with 95%-99% of operations being writes. Therefore, we should focus on a TSDB's write ability. In most cases, a TSDB must be able to support highly concurrent and high-throughput data writes.
- Interactive aggregate query: The latency of interactive queries must be very low even when the queried data is enormous in size (measured in TB).
- The ability to store massive amounts of data: The data size is determined by the characteristics of the scenarios. In most cases, time series data is measured in TB, and even PB.
- High availability: Online services usually have high availability requirements.
- Distributed architecture: Considering the requirements of data writing and storage, the underlying layer must use a distributed architecture.
According to the analysis of the characteristics of time series data and the basic requirements for TSDBs, NoSQL databases that use LSM-tree-based storage engines (such as HBase, Cassandra, and Alibaba Cloud TableStore) have significant advantages over relational database management systems (RDBMSs) that use B+ tree-based storage engines. The basic theory of the LSM tree is not described here; in short, the LSM tree is designed to optimize write performance. The write performance of LSM-tree-based storage can be an order of magnitude higher than that of B+ tree-based storage, at the cost of considerably poorer read performance. Therefore, LSM-tree-based TSDBs are particularly suitable for scenarios with more writes than reads. Currently, among the well-known open source TSDBs, OpenTSDB uses HBase as the underlying storage engine, BlueFlood and KairosDB use Cassandra, InfluxDB uses its self-developed TSM storage engine (similar to an LSM tree), and Prometheus directly uses a LevelDB-based storage engine. We can see that mainstream TSDBs generally build on LSM-tree-based storage in a distributed architecture; the difference is that some products directly use existing mature databases, while others use self-developed or LevelDB-based engines.
The LSM-tree-based distributed architecture can easily meet the writing requirements of time series data, but it is rather weak in terms of data query. These databases can meet the needs of multi-dimensional aggregate queries over small amounts of data; however, for multi-dimensional aggregate queries over large amounts of data without indexes, their performance is rather poor. Therefore, in the open source world, some other products focus on solving such query and analysis problems. For example, Druid mainly focuses on OLAP requirements for time series data, allowing fast query and analysis of massive amounts of data without pre-aggregation, and supports drilling down on any dimension. The open source community also provides Elasticsearch-based solutions for analysis-oriented scenarios.
In short, the diversified TSDBs come with their own benefits and drawbacks. There is no best solution that works for all scenarios. You can only choose the one that best fits your service needs.
Time Series Data Models
A data model of time series data mainly consists of the following parts:
- Subject: The subject to be measured. A subject may have attributes with multiple dimensions. Taking server status monitoring as an example, the measured subject is a server, and its attributes may include the cluster name, host name, and so on.
- Measurements: A subject may have one or more measurements, each corresponding to a specific metric. In the case of the server status monitoring, the measured metrics may include CPU usage and IOPS. The value of CPU usage is a percentage, and the value of IOPS is the number of I/Os during the measurement period.
- Timestamp: Each measurement report carries a timestamp attribute to indicate when the measurement was taken.
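The three parts of the model can be sketched as a simple record type (the field names and sample values are illustrative, not any particular TSDB's schema):

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    subject: dict        # dimensional attributes, e.g. {"cluster": "c1", "host": "web-01"}
    measurements: dict   # metric name -> value, e.g. {"cpu_usage": 0.73, "iops": 1200}
    timestamp: int       # Unix epoch seconds of the measurement

point = DataPoint(
    subject={"cluster": "c1", "host": "web-01"},
    measurements={"cpu_usage": 0.73, "iops": 1200},
    timestamp=1_700_000_000,
)
```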
Currently, mainstream TSDBs use two modeling methods: modeling by data source and modeling by metrics. I will use two examples to illustrate the difference between these two methods.
Modeling by Data Source
In modeling by data source, the measurements of all metrics of the same data source at a certain time point are stored in the same row. This model is used by Druid and InfluxDB.
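As a hypothetical illustration of this layout (hosts and values invented for the example), each row carries one source's full set of metrics at one timestamp:

```python
# Modeling by data source: one row holds all metrics of one source at one time.
rows_by_source = [
    # (timestamp, cluster, host, cpu_usage, iops)
    (1700000000, "c1", "web-01", 0.73, 1200),
    (1700000000, "c1", "web-02", 0.55, 900),
]
```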
Modeling by Metrics
In modeling by metrics, each row of data represents the measurement of one metric of one data source at a certain time point. This model is used by OpenTSDB and KairosDB.
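Under the same hypothetical data as before, the metric-per-row layout splits one source's measurements across rows:

```python
# Modeling by metrics: one row holds one metric of one source at one time.
rows_by_metric = [
    # (timestamp, metric, cluster, host, value)
    (1700000000, "cpu_usage", "c1", "web-01", 0.73),
    (1700000000, "iops",      "c1", "web-01", 1200),
]
```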
Neither model is absolutely better than the other. If the underlying architecture adopts columnar storage with an index on each column, modeling by data source may be better. If the underlying architecture is similar to HBase or Cassandra, storing multiple metric values in the same row may reduce the efficiency of querying or filtering on a single metric, so we typically choose to model by metrics.
Processing of Time Series Data
This section mainly describes the processing of time series data. In addition to the basic data writing and storage, query and analysis are the most important features of a TSDB. The processing of time series data mainly includes filter, aggregation, GroupBy, and downsampling. To better support GroupBy queries, some TSDBs will pre-aggregate the data. Downsampling is done through rollups. To support faster and more real-time rollups, TSDBs usually support auto-rollups.
Filter is the simplest processing operation: it queries for all data that meets the given conditions on different dimensions. In time series data analysis, filtering usually starts from a high dimension and then refines the query with more fine-grained dimensional conditions.
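This high-to-low dimensional filtering can be sketched over a few invented monitoring rows:

```python
# A simple dimensional filter over rows modeled by metrics (illustrative data).
rows = [
    {"metric": "net_in", "cluster": "c1", "host": "web-01", "value": 120},
    {"metric": "net_in", "cluster": "c1", "host": "web-02", "value": 80},
    {"metric": "net_in", "cluster": "c2", "host": "db-01",  "value": 300},
]

# Start from a high dimension (cluster), then refine by a finer one (host).
c1_rows = [r for r in rows if r["cluster"] == "c1"]
web01_rows = [r for r in c1_rows if r["host"] == "web-01"]

print([r["value"] for r in web01_rows])  # [120]
```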
Aggregation is the most basic function of time series data query and analysis. Time series data records the original state change information, but when querying and analyzing it, we usually do not need the original information itself; we need statistics derived from it. Aggregation provides the basic computations for such statistics, most commonly SUM, AVG, MAX, and TopN. For example, when analyzing server traffic, you would care about the average amount of traffic, the total amount of traffic, or the peak traffic.
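The common aggregations map directly onto a queried series (the traffic values here are invented):

```python
# Basic aggregations over a queried series of traffic samples (illustrative values).
traffic = [120, 80, 300, 150]

total = sum(traffic)                       # SUM: total amount of traffic
average = sum(traffic) / len(traffic)      # AVG: average traffic
peak = max(traffic)                        # MAX: peak traffic
top2 = sorted(traffic, reverse=True)[:2]   # TopN with N = 2

print(total, average, peak, top2)  # 650 162.5 300 [300, 150]
```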
GroupBy and Pre-Aggregation
GroupBy is the process of converting low-dimensional time series data into high-dimensional statistics. It is usually performed during query: after the original data is queried, the result is obtained through real-time computation. This process may be very slow, depending on the size of the queried data. Mainstream TSDBs optimize this process through pre-aggregation: after data is written in real time, it is pre-aggregated according to given rules to generate the GroupBy results, which allows queries to return the results directly without re-computation.
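A minimal GroupBy sketch, assuming the same invented per-host traffic rows as before, rolls the host dimension up into per-cluster totals:

```python
from collections import defaultdict

# GroupBy: aggregate per-host rows (low dimension) into per-cluster sums (high dimension).
rows = [
    ("c1", "web-01", 120),
    ("c1", "web-02", 80),
    ("c2", "db-01",  300),
]

per_cluster: dict[str, int] = defaultdict(int)
for cluster, _host, value in rows:
    per_cluster[cluster] += value

print(dict(per_cluster))  # {'c1': 200, 'c2': 300}
```

A TSDB with pre-aggregation would maintain the `per_cluster` result incrementally at write time instead of recomputing it per query.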
Downsampling, Rollup and Auto-Rollup
Downsampling is the process of converting high-resolution time series data into low-resolution time series data, and this process is called rollup. It is similar to GroupBy, but with a key difference: GroupBy aggregates data of different dimensions at the same time granularity, so the time granularity of the converted data remains the same while the dimension becomes higher; downsampling aggregates data of the same dimension across time, so the time granularity becomes coarser while the dimension remains the same.
A simple example of downsampling is aggregating 10-second resolution data into 30-second resolution data by taking the statistical average of each 30-second window.
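That 10-second-to-30-second rollup can be sketched as follows (the series values are invented for illustration):

```python
# Downsample 10-second resolution points into 30-second averages.
points = [(0, 3.0), (10, 6.0), (20, 9.0), (30, 2.0), (40, 4.0), (50, 6.0)]

buckets: dict[int, list[float]] = {}
for ts, value in points:
    bucket = ts - ts % 30  # align each timestamp to its 30-second window
    buckets.setdefault(bucket, []).append(value)

downsampled = {b: sum(v) / len(v) for b, v in buckets.items()}
print(downsampled)  # {0: 6.0, 30: 4.0}
```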
Downsampling is divided into storage downsampling and query downsampling. Storage downsampling reduces the storage cost of data, especially historical data. Query downsampling mainly serves queries over larger time ranges by reducing the number of returned data points. Both benefit from auto-rollup, which performs rollups automatically as data arrives rather than waiting for a query. Similar to pre-aggregation, this can effectively improve query efficiency. It is a feature that mainstream TSDBs have implemented or plan to implement: currently Druid, InfluxDB, and KairosDB support auto-rollup, while OpenTSDB does not, but it provides an API to import results after rollups are performed externally.
This article mainly analyzes the characteristics, models, and basic query and processing operations of time series data, and reveals the basic requirements for TSDBs. In the next article, we will analyze the implementation of several popular open source TSDBs. You may find that although there are many TSDBs, their basic functions are similar. Each has its own characteristics and implementation methods, but all are designed around trade-offs among the writing, storage, query, and analysis of time series data. There is no one-size-fits-all TSDB that solves all potential problems; it is important to choose the most suitable one from a business perspective.