PostgreSQL Time-Series Database Plug-in TimescaleDB Deployment Practices


In the real world, the data generated by many businesses has the characteristics of time-series data: data is written sequentially along the time dimension, and a large share of requests are statistical queries over time ranges.

Examples include business feed data, time-series data generated by the Internet of Things (such as weather sensors and vehicle trajectories), and real-time data from the financial industry.

PostgreSQL UDFs and BRIN (block-range indexes) are ideal for processing time-series data. For details, see the following two examples.

Implementation of On-demand Slicing in PostgreSQL — plpgsql Schemaless Implementation of the Automatic Slicing Feature of the TimescaleDB Plug-in

PostgreSQL Time-series Best Practices — Design a Stock Exchange System Database — Alibaba Cloud RDS for PostgreSQL Best Practices

In fact, the PostgreSQL ecosystem has produced a time-series plug-in named TimescaleDB, designed specifically for processing time-series data. Timescale's improvements include enhancements to the SQL optimizer (it supports "merge append", which makes aggregation over time shards very efficient), a rotation interface, and automatic sharding.

Many investors are also interested in TimescaleDB, and it has already received USD 50 million in funding, which indirectly suggests that time-series databases will be very popular with users in the future.

The Advantage of TimescaleDB

First, TimescaleDB shards data automatically, and the sharding is completely transparent to users. When the data volume is very large, write performance does not deteriorate. (This mainly applies to disks with lower IOPS; on disks with higher IOPS, PostgreSQL still performs well after a large amount of data has been written.)

Secondly, Timescale improves the SQL optimizer by adding a "merge append" execution node. When a GROUP BY is performed over small time shards, it does not need to hash or group over the entire timestamp range; instead it computes per shard, which is very efficient.

Finally, Timescale adds a number of APIs that make it very efficient for users to write, query, and maintain time-series data.

These APIs are described in the "Common APIs for TimescaleDB" section later in this article.

Deploy TimescaleDB

Take CentOS 7.x x64 as an example.

1. Install PostgreSQL

Please see PostgreSQL on Linux Best Deployment Manual

2. Install cmake3
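On CentOS 7, cmake3 is typically available from the EPEL repository; a sketch, assuming a stock CentOS 7 host (package names may differ in your environment):

```shell
# Assumes a stock CentOS 7 host with yum access; adjust for your environment.
sudo yum install -y epel-release
sudo yum install -y cmake3 gcc git

# TimescaleDB's build scripts expect a "cmake" binary; point it at cmake3.
sudo ln -sf /usr/bin/cmake3 /usr/local/bin/cmake
```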

3. Compile TimescaleDB
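A minimal build sketch, assuming PostgreSQL's pg_config is on the PATH (the bootstrap flow follows the TimescaleDB README of that era; verify against the version you check out):

```shell
# Fetch and build the extension; requires pg_config from your PostgreSQL install.
git clone https://github.com/timescale/timescaledb.git
cd timescaledb
./bootstrap          # generates the ./build directory via cmake
cd build && make
sudo make install    # installs into the PostgreSQL extension directory
```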

4. Configure postgresql.conf to automatically load the timescale lib library when the database is started
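The key setting is shared_preload_libraries; a postgresql.conf fragment:

```
# postgresql.conf -- requires a database restart to take effect
shared_preload_libraries = 'timescaledb'
```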

5. Create plug-ins for databases that need to use TimescaleDB
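For example:

```sql
-- Run once in every database that will use TimescaleDB
CREATE EXTENSION IF NOT EXISTS timescaledb;
```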

6. Parameters related to TimescaleDB

TimescaleDB usage example 1 — perspective analysis of New York taxi data

The first example uses real data, taken from New York City taxi cabs.

1. Download sample data

2. Extract
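Steps 1 and 2 can be sketched as follows (the download URL follows the TimescaleDB NYC taxi tutorial and may have changed since):

```shell
wget https://timescaledata.blob.core.windows.net/datasets/nyc_data.tar.gz
tar zxvf nyc_data.tar.gz
```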

3. Create a table, then use the create_hypertable API to convert the ordinary table into a time-series storage table.

Some of the truncated nyc_data.sql content is as follows:
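A reconstruction of the rides DDL (the column list is approximated from the TimescaleDB NYC taxi tutorial):

```sql
CREATE TABLE "rides"(
    vendor_id         TEXT,
    pickup_datetime   TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    dropoff_datetime  TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    passenger_count   NUMERIC,
    trip_distance     NUMERIC,
    pickup_longitude  NUMERIC,
    pickup_latitude   NUMERIC,
    rate_code         INTEGER,
    dropoff_longitude NUMERIC,
    dropoff_latitude  NUMERIC,
    payment_type      INTEGER,
    fare_amount       NUMERIC,
    tip_amount        NUMERIC,
    total_amount      NUMERIC
);
```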

This statement converts the "rides" table into a time-series storage table.
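A sketch of that conversion:

```sql
-- Convert the ordinary table into a hypertable, sharded on pickup_datetime
SELECT create_hypertable('rides', 'pickup_datetime');
```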

Create an index
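For example (the column choices mirror the tutorial's nyc_data.sql):

```sql
CREATE INDEX ON rides (vendor_id, pickup_datetime DESC);
CREATE INDEX ON rides (rate_code, pickup_datetime DESC);
```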

4. Import test data
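A sketch of the import (the database and CSV file names are assumptions based on the tutorial):

```shell
psql -d nyc_data -c "\COPY rides FROM nyc_data_rides.csv CSV"
```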

5. Execute some test SQL on the "rides" table, which has been converted into a time-series storage table; its performance is better than that of ordinary PostgreSQL tables.

What is the average charge for transactions with more than two passengers per day?
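A query along these lines (column names per the rides table above):

```sql
SELECT date_trunc('day', pickup_datetime) AS day,
       avg(fare_amount)
FROM rides
WHERE passenger_count > 2
GROUP BY day
ORDER BY day;
```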

6. The performance of some queries is even more than 20 times better

How many transactions are there every day?
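For example:

```sql
SELECT date_trunc('day', pickup_datetime) AS day,
       count(*)
FROM rides
GROUP BY day
ORDER BY day;
```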

Timescale adds the "merge append" execution optimization, so aggregating at small granularity over time shards is highly efficient. The more data there is, the more pronounced the performance improvement becomes.

For example, TimescaleDB introduces a time-based “merge append” optimization to minimize the number of groups which must be processed to execute the following (given its knowledge that time is already ordered).

For our 100M row table, this results in query latency that is 396x faster than PostgreSQL (82ms vs. 32566ms).

7. Execute some functions specific to TimescaleDB, such as time_bucket; some acceleration algorithms built into TimescaleDB are also used here.

Each 5-minute interval is a bucket; the query returns the number of orders generated in each interval.
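A sketch using time_bucket (the one-day time filter is illustrative):

```sql
-- Bucket pickups into 5-minute intervals and count orders per bucket
SELECT time_bucket('5 minute', pickup_datetime) AS five_min,
       count(*)
FROM rides
WHERE pickup_datetime < '2016-01-02'
GROUP BY five_min
ORDER BY five_min;
```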

8. Execute some statistical analysis SQL

The volume of taxi transactions in each city.

Statistics of taxi rides in some cities in January 2016 (including longest distance, shortest distance, average number of passengers, and hours)
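A sketch of such a statistics query; rate_code is used here as a stand-in for the ride category, since the exact grouping column in the original is not shown:

```sql
SELECT rate_code,
       count(*)             AS rides_count,
       max(trip_distance)   AS longest,
       min(trip_distance)   AS shortest,
       avg(passenger_count) AS avg_passengers
FROM rides
WHERE pickup_datetime >= '2016-01-01'
  AND pickup_datetime <  '2016-02-01'
GROUP BY rate_code
ORDER BY rate_code;
```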

9. Automatic data sharding and execution plan
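Running EXPLAIN on a time-range query shows the planner touching only the relevant chunks (child tables) rather than the whole hypertable; for example:

```sql
EXPLAIN (costs off)
SELECT count(*)
FROM rides
WHERE pickup_datetime >= '2016-01-01'
  AND pickup_datetime <  '2016-01-08';
```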

10. You can also check the shards directly

Shards Are Completely Transparent to Users

Shard metadata:
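TimescaleDB keeps shard metadata in internal catalog schemas; a sketch (catalog names follow TimescaleDB 0.x/1.x and may differ in your version):

```sql
SELECT * FROM _timescaledb_catalog.hypertable;
SELECT * FROM _timescaledb_catalog.chunk LIMIT 10;
```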

Timescaledb + Postgis Combination — Spatial-Temporal Database

Combining the TimescaleDB time-series plug-in with the PostGIS spatial plug-in gives PostgreSQL strong spatio-temporal capabilities; PostgreSQL is already very good at handling spatial data.

1. Create the PostGIS spatial extension in the database
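For example:

```sql
CREATE EXTENSION IF NOT EXISTS postgis;
```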

2. Add a spatial type field
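Per the tutorial, two geometry columns are added, using a projected coordinate system local to New York (SRID 2163):

```sql
ALTER TABLE rides ADD COLUMN pickup_geom  geometry(POINT, 2163);
ALTER TABLE rides ADD COLUMN dropoff_geom geometry(POINT, 2163);
```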

3. Update the data into the geometry field. (The point is built from the two existing fields holding longitude and latitude. Strictly speaking, the update is optional: PostgreSQL supports expression indexes, so you could build an expression-based spatial index directly on those two fields.)
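A sketch of the update:

```sql
-- Build the geometry points from the existing longitude/latitude columns
UPDATE rides SET
  pickup_geom  = ST_Transform(
      ST_SetSRID(ST_MakePoint(pickup_longitude, pickup_latitude), 4326), 2163),
  dropoff_geom = ST_Transform(
      ST_SetSRID(ST_MakePoint(dropoff_longitude, dropoff_latitude), 4326), 2163);
```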

4. Examples of Spatio-Temporal Analysis.

How many cars are hailed every 30 minutes within 400 meters of the point (40.7589, -73.9851) (latitude, longitude)?
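A sketch of this spatio-temporal query (the afternoon cut-off is illustrative):

```sql
SELECT time_bucket('30 minutes', pickup_datetime) AS thirty_min,
       count(*) AS near_pickups
FROM rides
WHERE ST_Distance(
        pickup_geom,
        ST_Transform(ST_SetSRID(ST_MakePoint(-73.9851, 40.7589), 4326), 2163)
      ) < 400
  AND pickup_datetime < '2016-01-01 14:00'
GROUP BY thirty_min
ORDER BY thirty_min;
```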

Example 2 — Sensor Data and Weather Data

No more details are given here.

Common APIs for TimescaleDB

1. Create a Time-Series Table


Required Arguments

- main_table: Identifier of table to convert to hypertable
- time_column_name: Name of the column containing time values

Optional Arguments

- partitioning_column: Name of an additional column to partition by. If provided, number_partitions must be set.
- number_partitions: Number of hash partitions to use for partitioning_column when this optional argument is supplied. Must be > 0.
- chunk_time_interval: Interval in event time that each chunk covers. Must be > 0. Default is 1 month.
- create_default_indexes: Boolean whether to create default indexes on time/partitioning columns. Default is TRUE.
- if_not_exists: Boolean whether to print a warning if the table is already converted to a hypertable, or raise an exception. Default is FALSE.
- partitioning_func: The function to use for calculating a value's partition.
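A usage sketch for create_hypertable (the conditions table and its columns are hypothetical):

```sql
-- Time-only sharding
SELECT create_hypertable('conditions', 'time');

-- Time sharding plus 4 hash partitions on an additional column
SELECT create_hypertable('conditions', 'time', 'location', 4);
```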

2. Add a Multi-Level Sharded Field

Both hash and interval sharding are supported.


Required Arguments

- main_table: Identifier of hypertable to add the dimension to.
- column_name: Name of the column to partition by.

Optional Arguments

- number_partitions: Number of hash partitions to use on column_name. Must be > 0.
- interval_length: Interval that each chunk covers. Must be > 0.
- partitioning_func: The function to use for calculating a value's partition (see create_hypertable instructions).
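A usage sketch (the conditions table and location column are hypothetical):

```sql
-- Add a hash-partitioned dimension on location with 4 partitions
SELECT add_dimension('conditions', 'location', number_partitions => 4);
```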

3. Delete a Shard

Delete shards older than the specified time.


Required Arguments

- older_than: Timestamp of the cut-off point for data to be dropped, i.e., anything older than this should be removed.

Optional Arguments

- table_name: Hypertable name from which to drop chunks. If not supplied, all hypertables are affected.
- schema_name: Schema name of the hypertable from which to drop chunks. Defaults to public.
- cascade: Boolean on whether to CASCADE the drop on chunks, thereby removing dependent objects on the chunks to be removed. Defaults to FALSE.
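A usage sketch (argument order follows the TimescaleDB 0.x API; verify against your version):

```sql
-- Drop all chunks of conditions older than three months
SELECT drop_chunks(interval '3 months', 'conditions');
```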

4. Set the Time Interval of Shards


Required Arguments

- main_table: Identifier of hypertable to update the interval for.
- chunk_time_interval: Interval in event time that each new chunk covers. Must be > 0.
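A usage sketch; note that some early versions expected the interval in microseconds rather than as an interval value, so verify against your version:

```sql
-- New chunks of conditions will each cover 24 hours (existing chunks are unaffected)
SELECT set_chunk_time_interval('conditions', interval '24 hours');
```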

5. Analysis Function — the First Record


Required Arguments

- value: The value to return (anyelement)
- time: The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

For example, find the earliest uploaded temperature values for all sensors.
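A sketch (the metrics table and its columns are hypothetical):

```sql
SELECT device_id, first(temperature, time) AS earliest_temperature
FROM metrics
GROUP BY device_id;
```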

This can also be done by using recursive SQL:

Applications of PostgreSQL Recursive SQL — Geeks and Normal People

6. Analysis Function — the Last Record


Required Arguments

- value: The value to return (anyelement)
- time: The timestamp to use for comparison (TIMESTAMP/TIMESTAMPTZ or integer type)

For example, find the latest temperature value of each sensor every 5 minutes.
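A sketch (the metrics table and its columns are hypothetical):

```sql
SELECT time_bucket('5 minutes', time) AS five_min,
       device_id,
       last(temperature, time) AS latest_temperature
FROM metrics
GROUP BY five_min, device_id
ORDER BY five_min;
```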

This can also be done by using recursive SQL:

Applications of PostgreSQL Recursive SQL — Geeks and Normal People

7. Analysis Function — Histogram


Required Arguments

- value: A set of values to partition into a histogram
- min: The histogram's lower bound used in bucketing
- max: The histogram's upper bound used in bucketing
- nbuckets: The integer value for the number of histogram buckets (partitions)

For example:

The battery-level range 20 to 60 is divided into five bucket intervals, and an array of 5 + 2 values is returned (the number of records in each bucket interval); the first and last values count the records that fall below the lower bound and above the upper bound, respectively.
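A sketch (the readings table is hypothetical; the histogram call follows the argument list above):

```sql
-- 5 buckets across [20, 60); returns an integer array of 5 + 2 counts per device
SELECT device_id, histogram(battery_level, 20, 60, 5)
FROM readings
GROUP BY device_id;
```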

8. Analysis Function — Time Interval

This is similar to date_trunc, but more powerful: it can truncate to any interval, which makes it very convenient to use.


Required Arguments

- bucket_width: A PostgreSQL time interval for how long each bucket is (interval)
- time: The timestamp to bucket (timestamp/timestamptz/date)

Optional Arguments

- offset: The time interval to offset all buckets by (interval)
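A sketch using the optional offset (the metrics table and cpu column are hypothetical):

```sql
-- Average CPU per 5-minute bucket, with bucket boundaries shifted by 2 minutes
SELECT time_bucket('5 minutes', time, interval '2 minutes') AS five_min,
       avg(cpu)
FROM metrics
GROUP BY five_min
ORDER BY five_min;
```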

9. The View Function for Data Overview — Time-Series Table Overview
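In TimescaleDB 0.x/1.x, this overview was exposed through size-reporting functions such as hypertable_relation_size_pretty (deprecated in later versions; treat the name as an assumption and verify against your version):

```sql
SELECT * FROM hypertable_relation_size_pretty('rides');
```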


10. The View Function for Data Overview — Shard Size
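Per-chunk sizes can be inspected with the matching chunk-level function (name per TimescaleDB 0.x/1.x; verify against your version):

```sql
SELECT * FROM chunk_relation_size_pretty('rides');
```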


11. View Function for Data Overview — Index Size
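Index sizes have a corresponding function (name per TimescaleDB 0.x/1.x; verify against your version):

```sql
SELECT * FROM indexes_relation_size_pretty('rides');
```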


12. Export Time-Series Metadata
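Early TimescaleDB releases shipped a helper script for this in the source tree; a sketch (the script path and output file are assumptions, so verify in your checkout):

```shell
psql -d nyc_data -f scripts/dump_meta_data.sql > timescaledb_meta.txt
```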


TimescaleDB is a very useful time-series data processing plug-in: it hides the sharding logic (which is transparent to users) and provides a large number of API interfaces and performance optimizations. It is a great fit for time-series scenarios.

Combined with PostGIS plug-in, PostgreSQL is more powerful in spatio-temporal processing.

