Table Store Time Series Data Storage Architecture

What Is Time Series Data?

Time Series Data Model

  1. Individual or group (WHO): Describes the subject that produces the data, which can be a person, a monitoring metric, or an object. An individual generally has multi-dimensional attributes and can be located by a unique ID, for example a person ID for a person or a device ID for a device. An individual can also be located through a combination of attributes, for example using cluster, machine ID, and process name to locate a process.
  2. Time (WHEN): Time is the most important feature of time series data and is a key attribute that distinguishes it from other data.
  3. Location (WHERE): A location is usually given as a two-dimensional coordinate of latitude and longitude. In scientific computing fields such as meteorology, it is given as a three-dimensional coordinate of latitude, longitude, and altitude.
  4. Status (WHAT): Describes the status of a specific individual at a certain moment. Time series data for monitoring usually describes status numerically, while trace data expresses status as events. The expression differs from scenario to scenario.
  1. Metric: Used for describing the monitoring metric.
  2. Tags: Used for locating the monitored object, which is described by one or more tags.
  3. Timestamp: The time point when the monitoring value is collected.
  4. Value: The collected monitoring value, which is usually numeric.
  1. Name: Defines the type of the data.
  2. Tags: Describes the metadata of the individual.
  3. Location: The location of the data.
  4. Timestamp: The time at which the data is generated.
  5. Values: The value or status corresponding to the data. Multiple values or statuses can be provided, which are not necessarily numeric.
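
The five-field model above can be sketched as a small data structure. This is only an illustrative sketch, not Table Store's actual schema: the class names `SeriesKey` and `Point`, the field names, and the helper `from_monitoring_sample` are all assumptions made for the example. It also shows how a monitoring sample in the four-field form (metric, tags, timestamp, value) might map onto the more general model.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union

# Values are not necessarily numeric, so several scalar types are allowed.
Value = Union[int, float, str, bool]


@dataclass(frozen=True)
class SeriesKey:
    """Hypothetical series identity: the data type (name) plus the individual's tags."""
    name: str
    tags: Tuple[Tuple[str, str], ...]  # sorted (key, value) pairs, so the key is hashable

    @staticmethod
    def of(name: str, tags: Dict[str, str]) -> "SeriesKey":
        # Sorting makes the key independent of the order in which tags were given.
        return SeriesKey(name, tuple(sorted(tags.items())))


@dataclass
class Point:
    """Hypothetical data point: one observation of a series at a moment in time."""
    key: SeriesKey
    timestamp_ms: int
    values: Dict[str, Value]                         # one or more values or statuses
    location: Optional[Tuple[float, float]] = None   # (latitude, longitude), if any


def from_monitoring_sample(metric: str, tags: Dict[str, str],
                           timestamp_ms: int, value: float) -> Point:
    """Map a (metric, tags, timestamp, value) sample onto the general model."""
    return Point(SeriesKey.of(metric, tags), timestamp_ms, {"value": value})


p = from_monitoring_sample(
    "cpu.usage", {"host": "h1", "cluster": "c1"}, 1_500_000_000_000, 0.42
)
```

Keeping the name and tags together in one hashable key is a design choice for the sketch: it lets the same key serve both as the series identity in storage and as the unit of metadata that gets indexed.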

Time Series Data Query, Computing, and Analysis

Time Series Data Processing Procedure

  1. Data model: Provides the standard definition of time series data. Collected time series data must conform to this model and include all of its characteristic attributes.
  2. Stream computing: Performs pre-aggregation, downsampling, and post-aggregation on the time series data.
  3. Data storage: The storage system provides high-throughput, large-scale, low-cost storage, and supports hot/cold data separation and efficient range queries.
  4. Metadata retrieval: Stores and retrieves time series metadata on the order of 10 million to 100 million entries, and supports different retrieval methods (multidimensional filtering and location query).
  5. Data analysis: Provides time series analysis and computing capabilities for time series data.
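
As a rough illustration of the downsampling step mentioned above, the sketch below averages raw points into fixed-width time buckets. It is a minimal in-memory version under assumed inputs; the function name `downsample`, the `(timestamp_ms, value)` tuple layout, and the choice of the mean as the aggregate are all assumptions for the example, and a real pipeline would typically run this logic inside a stream computing engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def downsample(points: Iterable[Tuple[int, float]],
               bucket_ms: int) -> List[Tuple[int, float]]:
    """Average (timestamp_ms, value) points into fixed-width time buckets."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for ts, value in points:
        bucket = ts - ts % bucket_ms  # start of the bucket containing ts
        sums[bucket] += value
        counts[bucket] += 1
    # One output point per bucket, ordered by bucket start time.
    return [(b, sums[b] / counts[b]) for b in sorted(sums)]


raw = [(0, 1.0), (20_000, 3.0), (60_000, 10.0), (90_000, 20.0), (125_000, 5.0)]
per_minute = downsample(raw, 60_000)
# per_minute == [(0, 2.0), (60000, 15.0), (120000, 5.0)]
```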

Open Source Time Series Databases

  1. Data storage: All the databases use distributed NoSQL (LSM engine) storage: distributed databases such as HBase and Cassandra, cloud products such as BigTable, or self-developed storage engines.
  2. Aggregation: Pre-aggregation can only rely on external stream computing engines, such as Storm or Spark Streaming. Post-aggregation happens at query time and is an interactive process, so it generally does not rely on a stream computing engine. Different time series databases compute it either with a simple single-threaded method or concurrently. Automatic downsampling is also a form of post-aggregation, but it is a stream process rather than an interactive one. It is well suited to a stream computing engine, but it is not implemented that way.
  3. Metadata storage and retrieval: The classic OpenTSDB has no dedicated metadata storage and does not support metadata retrieval. Metadata is retrieved and queried by scanning the row keys of the data table. KairosDB stores metadata in a Cassandra table, but retrieval is very inefficient because the table must be scanned. Heroic, which was developed based on KairosDB, uses Elasticsearch for metadata storage and indexing and supports better metadata retrieval. InfluxDB and Prometheus implement their own indexes, but indexing is not easy because it must handle time series metadata on the order of 10 million to 100 million entries. An earlier version of InfluxDB implemented an in-memory metadata index, which has a number of restrictions. For example, memory size limits the number of time series, and building the in-memory index requires scanning all time series metadata, which causes a long failover time for the node.
  4. Data analysis: Most TSDBs have no native analysis capabilities beyond querying and post-aggregation. Elasticsearch is the exception, and this is an important advantage that helps it keep a foothold in the time series field.
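
To make the difference between scanning and indexed metadata retrieval more concrete, here is a minimal sketch of an in-memory inverted index over tags. It is not how any of the databases above implement indexing; the class `TagIndex`, its methods, and the series IDs are hypothetical. Each `(tag key, tag value)` pair maps to the set of series that carry it, so a multidimensional filter becomes a set intersection instead of a scan over every row key.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class TagIndex:
    """Hypothetical in-memory inverted index from (tag key, tag value) to series IDs."""

    def __init__(self) -> None:
        self._postings: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add(self, series_id: str, tags: Dict[str, str]) -> None:
        # Register the series under every tag pair it carries.
        for pair in tags.items():
            self._postings[pair].add(series_id)

    def find(self, **filters: str) -> Set[str]:
        """Return IDs of series that match every tag=value filter."""
        if not filters:
            return set()
        sets = [self._postings.get(pair, set()) for pair in filters.items()]
        # Intersection builds a new set, so stored postings are never mutated.
        return set.intersection(*sets)


index = TagIndex()
index.add("s1", {"cluster": "c1", "host": "h1"})
index.add("s2", {"cluster": "c1", "host": "h2"})
index.add("s3", {"cluster": "c2", "host": "h1"})

matches = index.find(cluster="c1", host="h1")
```

Because the whole index lives in memory and must be rebuilt from all metadata on startup, a sketch like this shares the restrictions described for the early InfluxDB index: memory bounds the number of series, and rebuild time grows with the metadata volume.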

Table Store Time Series Data Storage





Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:
