Probabilistic Data Structures and Algorithms in PipelineDB


By Sangli.

PipelineDB uses five aggregate data types for physical storage. These data types play an important role in processing continuous data. In this article, we will look at what these data types are, along with their structures and algorithms, and then walk through how PipelineDB uses each one of them.

Of course, you can learn more about PipelineDB’s data types, data structures, and algorithms by checking out the official documentation; the relevant page is this one. Here, we’re going to keep things a bit simpler.

So first, let’s quickly go over the five data types.

  • Bloom Filter: a space-optimized data structure designed to estimate set cardinalities and to determine, with a high degree of likelihood, whether a set contains a particular element.
  • Count-Min (CM) Sketch: a data structure similar to a Bloom filter, except that instead of testing set membership, it estimates the frequency of each element added to it.
  • Filtered-Space Saving (FSS) Top-K: a data structure and algorithm set for accurately estimating the top-k most frequent values appearing in a stream, using a constant, minimal memory footprint.
  • HyperLogLog (HLL): a data structure and algorithm set that, like Bloom filters, estimates the cardinality of sets, with a very high degree of accuracy.
  • T-Digest: a data structure that supports very accurate estimations of rank-based statistics such as percentiles and medians while only using a constant amount of space.

We will go over how PipelineDB uses each of these data types in the sections that follow.

Of course, for all of these, you can learn more from the official documentation, specifically the pages PipelineDB-specific Types and PipelineDB-specific Functions.

Those pages list the corresponding data types and functions. These data types can also be used in ordinary tables if you do not want to use them in Continuous Views (CVs), which is how we will use them in this article.

The specific code is implemented under src/backend/utils/adt/. Now, with all of that out of the way, let's look at how PipelineDB uses each of the above data types.

Bloom Filter

So let’s start with the Bloom filter. In PipelineDB, the Bloom filter is implemented in src/backend/pipeline/bloom.c.

The Bloom Filter data type is used very frequently and is a built-in PipelineDB data type. We can use it to store data in Continuous Views (CVs).

To illustrate this point, consider the following example. When creating a CV, you can use UDFs directly to generate Bloom Filter-typed fields, as shown below.
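
Here is a minimal sketch, assuming PipelineDB 0.9.x syntax; the stream and view names are illustrative:

    -- Create a stream; the continuous view's bloom_agg column is
    -- physically stored as a Bloom filter
    CREATE STREAM s_bloom (x integer);

    CREATE CONTINUOUS VIEW v_bloom AS
        SELECT bloom_agg(x) FROM s_bloom;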

After doing that, you’ll want to insert some test data and query the filter, as shown below.
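
For example (again, the values are illustrative):

    -- Add 10,000 distinct values to the filter
    INSERT INTO s_bloom (x) SELECT generate_series(1, 10000);

    -- Membership tests and estimated cardinality
    SELECT bloom_contains(bloom_agg, 500),    -- true: 500 was added
           bloom_contains(bloom_agg, 50000),  -- false: never added
           bloom_cardinality(bloom_agg)       -- roughly 10,000
    FROM v_bloom;

Note that bloom_contains can return false positives, but never false negatives.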

There are also many more UDFs available; see this page.

Count-Min Sketch

Next, there’s the Count-Min Sketch. In PipelineDB, the Count-Min (CM) Sketch is implemented in src/backend/pipeline/cmsketch.c.

As discussed before, a CM sketch is similar to a Bloom filter, but while a Bloom filter only records whether an element exists, a CM sketch estimates the frequency of each element in a large stream. Like the Bloom filter, the CM sketch is approximate, so its frequency estimates carry some error.

Now consider the following example:
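
The sketch below assumes the freq_agg aggregate named in the PipelineDB documentation (older releases used cmsketch-prefixed names); the stream and view names are illustrative:

    CREATE STREAM s_cmsketch (x integer);

    -- freq_agg accumulates all input values into a Count-Min Sketch
    CREATE CONTINUOUS VIEW v_cmsketch AS
        SELECT freq_agg(x) FROM s_cmsketch;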

Into this, let’s insert some test data, as shown below.
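
Something along these lines, with freq() returning the estimated frequency of a value:

    INSERT INTO s_cmsketch (x) VALUES (1), (1), (1), (2), (2), (3);

    -- CM sketch estimates may overcount slightly, but never undercount
    SELECT freq(freq_agg, 1),  -- about 3
           freq(freq_agg, 2),  -- about 2
           freq(freq_agg, 5)   -- about 0
    FROM v_cmsketch;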

To learn more about this, you can check out its description in PipelineDB’s official documentation.

Filtered-Space Saving (FSS) Top-K

In PipelineDB, Filtered-Space Saving (FSS) Top-K is implemented in src/backend/pipeline/fss.c.

FSS Top-K is used to find the top-k elements in data streams. PipelineDB’s implementation is based on this paper.

In a real environment, for example when computing the top-10 elements over the last 5 minutes, FSS Top-K can save a lot of time and memory. The following shows how FSS Top-K is used in PipelineDB.
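
A minimal sketch (0.9.x syntax; names are illustrative), tracking the top-5 values of a column:

    CREATE STREAM s_fss (x integer);

    -- fss_agg(expression, k): monitor the k most frequent values of x
    CREATE CONTINUOUS VIEW v_fss AS
        SELECT fss_agg(x, 5) FROM s_fss;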

Now, insert six data records into the stream and query v_fss, as shown below.
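
Continuing the sketch from above:

    -- Six distinct values, but the structure only monitors the top 5
    INSERT INTO s_fss (x) VALUES (1), (2), (3), (4), (5), (6);

    SELECT fss_topk(fss_agg) FROM v_fss;        -- (value, frequency) pairs
    SELECT fss_topk_values(fss_agg) FROM v_fss; -- just the values
    SELECT fss_topk_freqs(fss_agg) FROM v_fss;  -- just the frequencies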

After this, querying fss_topk returns five records, each with a frequency of 1; the inserted value 6 is missing, because only the top 5 are kept. fss_topk_values outputs just the values, while fss_topk_freqs outputs the frequency of each element.

Now, insert the value 5 again, as shown below.
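
Continuing the sketch:

    INSERT INTO s_fss (x) VALUES (5);

    SELECT fss_topk(fss_agg) FROM v_fss;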

The frequency of the value 5 is now 2. Next, insert the value 6.
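
Again continuing the sketch:

    INSERT INTO s_fss (x) VALUES (6);

    SELECT fss_topk(fss_agg) FROM v_fss;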

The value 6 now appears in the top-k, taking the place of 4.

If you’d like to learn more, check out the description provided in PipelineDB’s official documentation.

HyperLogLog

In PipelineDB, HyperLogLog (HLL) is implemented in src/backend/pipeline/hll.c.

In most practical applications, HyperLogLog (HLL) is used to solve the count(distinct) problem; in PipelineDB, it is also used as the physical storage format for count(distinct) state.

Consider this example:
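
A minimal sketch (0.9.x syntax; names are illustrative):

    CREATE STREAM s_hll (x integer);

    -- count(DISTINCT ...) in a continuous view is stored as an HLL
    CREATE CONTINUOUS VIEW v_hll AS
        SELECT count(DISTINCT x) FROM s_hll;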

The physical storage format of the count(distinct) field in the created v_hll is HLL. During query processing, the parser resolves this field into the following function:

hll_count_distinct_final

You can inspect the execution plan with a plain EXPLAIN over the continuous view.
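
For instance (plan output omitted here):

    -- The plan shows the hll_count_distinct_final call mentioned above
    EXPLAIN SELECT * FROM v_hll;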

Here’s an example:
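
The data below is illustrative; the point is that duplicates are not double counted:

    -- Insert the same 1,000 values twice
    INSERT INTO s_hll (x) SELECT generate_series(1, 1000);
    INSERT INTO s_hll (x) SELECT generate_series(1, 1000);

    SELECT count FROM v_hll;  -- approximately 1000, not 2000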

Regardless of how many insert operations are performed, the result is still the distinct count.

In PipelineDB versions earlier than 0.9.7, this had rather mediocre efficiency, with scan operations sometimes hurting performance. In PipelineDB 0.9.7, however, this function was improved significantly, mitigating these issues. To learn more about the releases and versions of PipelineDB, check out this page. This issue is also discussed in detail in this article.

T-Digest

Last, there’s T-Digest. In PipelineDB, T-Digest is implemented in src/backend/pipeline/tdigest.c.

As discussed above, T-Digest is mainly used for estimating rank-based statistics such as percentiles and medians. It is similar to the database’s built-in percentile_cont function. Compared with the other types, it is seldom used.

The following briefly introduces how to use T-Digest in PipelineDB.

First, consider the following example:
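
A minimal sketch (0.9.x syntax; names are illustrative). In PipelineDB, percentile_cont over a stream is backed by a t-digest:

    CREATE STREAM s_tdigest (x double precision);

    -- Continuously maintain the 0.5 and 0.6 quantiles of x
    CREATE CONTINUOUS VIEW v_tdigest AS
        SELECT percentile_cont(ARRAY[0.5, 0.6])
               WITHIN GROUP (ORDER BY x)
        FROM s_tdigest;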

Now, insert some data into it as follows.
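
The values below are chosen for illustration: with 1, 2, and 3.5, the exact quantiles work out to the numbers discussed next, and the t-digest estimates should be very close.

    INSERT INTO s_tdigest (x) VALUES (1), (2), (3.5);

    SELECT * FROM v_tdigest;  -- approximately {2, 2.3}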

Here, the quantile at 0.5 (the median) is 2, and the quantile at 0.6 is 2.3.

If you’re interested in learning more about T-Digest, I recommend checking out its description in the PipelineDB documentation, where you can read more about how to use these probabilistic data structures and algorithms.

Original Source
