Table Store Time Series Data Storage — Architecture

Background

With the growth of IoT in recent years, the volume of time series data has exploded. According to the database category trends tracked by DB-Engines over the past two years, time series databases have grown rapidly. The major open source time series databases differ in their implementations, and none of them is perfect. However, their strengths can be combined to build a more complete time series database.

What Is Time Series Data?

Time Series Data Model

Before defining the time series data model, we first make an abstract representation of the time series data.

  • Individual or group (WHO): Describes the subject that produces the data. This subject can be a person, a monitoring metric, or an object. An individual generally has multi-dimensional attributes and can be located by a unique ID, for example a person by a person ID or a device by a device ID. An individual can also be located through its multi-dimensional attributes, for example a process by its cluster, machine ID, and process name.
  • Time (WHEN): Time is the most important feature of time series data and is the key attribute that distinguishes it from other data.
  • Location (WHERE): A location is usually expressed as a two-dimensional coordinate of latitude and longitude. Fields related to scientific computing, such as meteorology, use a three-dimensional coordinate of latitude, longitude, and altitude.
  • Status (WHAT): Describes the status of a specific individual at a certain moment. Monitoring time series data usually describes status numerically, while trace data describes status as events, which are expressed differently in different scenarios.

Monitoring-oriented time series databases typically model a data point with the following fields:

  • Metric: Describes the monitoring metric.
  • Tags: Locate the monitored object, which is described using one or more tags.
  • Timestamp: The point in time when the monitoring value is collected.
  • Value: The collected monitoring value, which is usually numeric.

A more complete model, which also covers status and location data, uses the following fields:

  • Name: Defines the type of the data.
  • Tags: Describe the metadata of the individual.
  • Location: The spatio-temporal information of the data.
  • Timestamp: The time when the data is generated.
  • Values: The values or statuses corresponding to the data. Multiple values or statuses can be provided, and they do not have to be numeric.
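
The Name, Tags, Location, Timestamp, and Values fields can be written down as a small record type. The following is a minimal sketch, not an actual Table Store schema: the class name `TimeSeriesPoint`, the millisecond timestamp unit, the (latitude, longitude) tuple, and the `timeline_key` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union

# Values may be numeric or non-numeric (for example, a status string).
Value = Union[int, float, str, bool]


@dataclass
class TimeSeriesPoint:
    """One data point in the model described above (hypothetical type)."""
    name: str                      # type of the data, e.g. "cpu"
    tags: Dict[str, str]           # metadata that identifies the individual
    timestamp_ms: int              # assumed unit: milliseconds since epoch
    values: Dict[str, Value]       # one or more values or statuses
    location: Optional[Tuple[float, float]] = None  # assumed (lat, lon)

    def timeline_key(self) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
        """Identify the timeline: name plus tags, independent of tag order."""
        return (self.name, tuple(sorted(self.tags.items())))
```

Points that share the same name and tags belong to the same timeline, so `timeline_key()` returns the same result for them regardless of the order in which the tags were supplied.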

Time Series Data Query, Computing, and Analysis

Time series data has its own specific query and computing methods, such as metadata retrieval, range query, aggregation, and downsampling, which the following sections refer to.

Time Series Data Processing Procedure

  • Data model: A standard definition of time series data. Collected time series data must conform to this model, which includes all the characteristic attributes of time series data.
  • Stream computing: Pre-aggregation, downsampling, and post-aggregation of the time series data.
  • Data storage: The storage system provides high-throughput, massive-volume, and low-cost storage, and supports separation of hot and cold data as well as efficient range queries.
  • Metadata retrieval: Provides storage and retrieval of timeline metadata on the order of tens of millions to hundreds of millions of timelines, and supports different retrieval methods (multidimensional filtering and location query).
  • Data analysis: Provides time series analysis and computing capabilities for time series data.
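
To make the downsampling step concrete, here is a minimal sketch that groups the points of one timeline into fixed-size time buckets and aggregates each bucket. The function name `downsample`, the `(timestamp, value)` tuple layout, and the use of seconds as the unit are assumptions for illustration. A real pipeline would typically run this kind of logic in a stream computing engine rather than over an in-memory list.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple


def downsample(
    points: Iterable[Tuple[int, float]],
    interval_s: int,
    agg: Callable[[List[float]], float] = mean,
) -> List[Tuple[int, float]]:
    """Aggregate (timestamp, value) points into buckets of interval_s seconds.

    Each bucket is labeled with its start timestamp. The result is sorted
    by bucket start time.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % interval_s  # align to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, agg(vals)) for start, vals in sorted(buckets.items())]


# Four 30-second samples reduced to two 60-second averages.
raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]
print(downsample(raw, 60))  # [(0, 2.0), (60, 6.0)]
```

Passing a different aggregation function, such as `max` or `min`, changes how each bucket is summarized without changing the bucketing logic.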

Open Source Time Series Databases

  • Data storage: All of these databases use distributed NoSQL (LSM engine) storage. This includes open source distributed databases such as HBase and Cassandra, cloud platforms such as Bigtable, and self-developed storage engines.
  • Aggregation: Pre-aggregation relies entirely on external stream computing engines, such as Storm or Spark Streaming. Post-aggregation at query time is an interactive process, so it generally does not rely on a stream computing engine. Different time series databases compute it either with a simple single-threaded method or with concurrent computation. Automatic downsampling is also a form of post-aggregation, but it is a stream process rather than an interactive one. This kind of computation is well suited to a stream computing engine, yet these databases do not implement it that way.
  • Metadata storage and retrieval: The classic OpenTSDB has no dedicated metadata storage and does not support metadata retrieval. Metadata is obtained and queried by scanning the row keys of the data table. KairosDB stores metadata in a Cassandra table, but retrieval is very inefficient because the table has to be scanned. Heroic was developed on top of KairosDB. It uses Elasticsearch for metadata storage and indexing, and therefore supports better metadata retrieval. InfluxDB and Prometheus implement their own indexes, but this is not easy, because the index has to handle timeline metadata on the order of tens of millions to hundreds of millions of timelines. Earlier versions of InfluxDB implemented an in-memory metadata index, which is more restrictive. For example, the number of timelines is limited by the size of the memory, and building the in-memory index requires scanning all timeline metadata, which lengthens node failover time.
  • Data analysis: Apart from post-aggregation queries, most TSDBs provide no analysis capabilities. Elasticsearch is the exception, and this is an important advantage that allows it to keep a foothold in the field of time series analysis.
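
The difference between scanning metadata and indexing it can be illustrated with a small inverted index over tags. This is a minimal in-memory sketch and does not reproduce how any of the databases above implement their index: the class `TimelineIndex` and its methods are hypothetical names. `query` intersects posting sets per tag, while `query_by_scan` checks every timeline, which is the behavior a table scan implies.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class TimelineIndex:
    """Toy inverted index: (tag key, tag value) -> set of timeline IDs."""

    def __init__(self) -> None:
        self._postings: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self._timelines: Dict[str, Dict[str, str]] = {}

    def add(self, timeline_id: str, tags: Dict[str, str]) -> None:
        """Register a timeline and index each of its tags."""
        self._timelines[timeline_id] = dict(tags)
        for item in tags.items():
            self._postings[item].add(timeline_id)

    def query(self, **filters: str) -> Set[str]:
        """Multidimensional filter using the index (set intersection)."""
        if not filters:
            return set(self._timelines)
        sets = [self._postings.get(item, set()) for item in filters.items()]
        return set.intersection(*sets)

    def query_by_scan(self, **filters: str) -> Set[str]:
        """Same filter without an index: every timeline is inspected."""
        return {
            tid
            for tid, tags in self._timelines.items()
            if all(tags.get(k) == v for k, v in filters.items())
        }


index = TimelineIndex()
index.add("t1", {"cluster": "a", "host": "h1"})
index.add("t2", {"cluster": "a", "host": "h2"})
index.add("t3", {"cluster": "b", "host": "h1"})
print(sorted(index.query(cluster="a", host="h1")))  # ['t1']
```

Both methods return the same result. The indexed query only touches the posting sets for the requested tags, whereas the scan cost grows with the total number of timelines, which is why scanning becomes impractical at tens of millions of timelines.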

Table Store Time Series Data Storage

Table Store is a distributed NoSQL database developed by Alibaba Cloud that uses the same Wide Column data model as Bigtable. The product is well suited to time series data scenarios in terms of storage model, data size, and write and query capabilities. It already supports monitoring time series products such as CloudMonitor, status time series products such as AliHealth’s drug tracking, and core services such as postal package tracing. There is also a complete computing ecosystem to support the computing and analysis of time series data. Our future plans include optimizations specific to time series scenarios in metadata retrieval, time series data storage, computing and analysis, and cost reduction.
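
One reason a wide column store with sorted row keys fits time series data is that the points of a timeline can be kept adjacent and read back with a range query. The sketch below imitates that idea with a sorted in-memory list. It is not the Table Store SDK and makes no claim about Table Store's actual key design: the class `SortedRowStore`, the `(timeline_id, timestamp)` row key, and the method names are assumptions for illustration.

```python
import bisect
from typing import Dict, List, Tuple

RowKey = Tuple[str, int]  # assumed key layout: (timeline_id, timestamp)


class SortedRowStore:
    """In-memory stand-in for a store that keeps rows sorted by row key."""

    def __init__(self) -> None:
        self._keys: List[RowKey] = []
        self._rows: Dict[RowKey, Dict[str, float]] = {}

    def put(self, timeline_id: str, timestamp: int,
            columns: Dict[str, float]) -> None:
        """Insert or overwrite a row, keeping the key list sorted."""
        key = (timeline_id, timestamp)
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = dict(columns)

    def range_query(self, timeline_id: str, start: int,
                    end: int) -> List[Tuple[int, Dict[str, float]]]:
        """Return rows of one timeline with start <= timestamp < end."""
        lo = bisect.bisect_left(self._keys, (timeline_id, start))
        hi = bisect.bisect_left(self._keys, (timeline_id, end))
        return [(k[1], self._rows[k]) for k in self._keys[lo:hi]]


store = SortedRowStore()
store.put("cpu|host=h1", 10, {"usage": 0.2})
store.put("cpu|host=h1", 20, {"usage": 0.4})
store.put("cpu|host=h1", 30, {"usage": 0.6})
store.put("cpu|host=h2", 15, {"usage": 0.9})
print(store.range_query("cpu|host=h1", 10, 30))
```

Because the keys are sorted first by timeline and then by timestamp, the range query locates its start and end positions with two binary searches and then reads a contiguous slice, without touching rows of other timelines.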

Summary

Table Store is a general-purpose distributed NoSQL database that supports multiple data models. The data models currently available include Wide Column (BigTable) and Timeline (message data model).

Alibaba Cloud
