Testing the Performance of PipelineDB for Real-Time Statistics on Virtual Machines

By Digoal


PipelineDB is a type of streaming relational database developed based on PostgreSQL (PipelineDB 0.8.1 is developed based on PostgreSQL 9.4.4). This type of database features automatic processing of streaming data. Instead of raw data, PipelineDB only stores the processed data, so it is very suitable for common real-time streaming data processing scenarios.

Such scenarios include website traffic statistics, monitoring statistics for IT services, App Store access statistics, real-time statistics for IoT sensor data, and real-time statistics for logistics orders.

Introduction to PipelineDB

PipelineDB is based on, and is wire compatible with, PostgreSQL 9.4 and has added functionality including continuous SQL queries, probabilistic data structures, sliding windowing, and stream-table joins.

According to PostgreSQL’s website,

“PipelineDB’s fundamental abstraction is what is called a continuous view. These are much like regular SQL views, except that their defining SELECT queries can include streams as a source to read from. The most important property of continuous views is that they only store their output in the database. That output is then continuously updated incrementally as new data flows through streams, and raw stream data is discarded once all continuous views have read it.”

Streaming Statistics Example

Create a continuous view without defining a table, which is similar to NoSQL.

Activate the continuous view.

Write data into streams.

If you do not want to receive the streaming data, stop the streams.

Query the continuous view.

How to Deploy PipelineDB

Install PipelineDB.

Configure PipelineDB.

Initialize the database.

Configure stream processing parameters, such as setting the memory size, enabling or disabling synchronization, or setting the merged batch and number of worker processes.

Activate the database. As you can see, the native database supports PostgreSQL.

Check which functions are added to PipelineDB. Some functions are added as plug-ins, such as PostGIS. Some functions can be used as references or used directly.

Conduct Performance Testing on a Virtual Machine (VM) on Your Own Laptop

Create five continuous views. A continuous view is a view for which you do not need to create a base table.

Activate stream statistics.

View the data dictionary.

Conduct the batch insert test.

The following provides the test result. Note that you must use “simple” or “extended” here. If you use “prepared”, only the last SQL statement takes effect. It is not clear yet whether this is a PipelineDB or pgbench bug.

About 0.58 million records are written to the database every second, and the statistics for five continuous views are compiled.

All these operations are completed in-memory, so the speed is very fast. PipelineDB uses the worker process to merge data. The execution result of the top command during stress testing is as follows:

After one billion streaming data records are written, the database size is still only 13 MB. This is because the streaming data is located in the memory and discarded after being processed.

If your application has a similar scenario, this is the best solution.

Data from a Test Conducted on the Physical Machine

The results of the test conducted on the E5–2650 are as follows.

In this test, 10 PipelineDB instances are deployed and the preceding cases are executed. About 6 million data records are processed per second.

Combined with LVS, HAproxy, or JDBC LB, you can conduct large-scale real-time processing. This also indicates that the single-server performance of PipelineDB still has a lot of room for improvement.

Earlier versions of PipelineDB could not fully utilize the CPU. In the latest version, you do not need to deploy multiple instances, as you need only one instance for the entire CPU.


Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.