Data Sampling in PostgreSQL

Image for post
Image for post

Background

The online database is large. Therefore, we usually build a test database using data sampling. In addition to evenly extracting data, data sampling also encrypts data by hiding or encrypting some sensitive fields.

PostgreSQL 9.5 and later versions support the TABLESAMPLE syntax. For more information, see the following webpages:

For a version earlier than 9.5, customize functions to implement data sampling.

Example

Use functions for data sampling in PostgreSQL versions earlier than 9.5. PostgreSQL v.9.5 and later versions provide the SQL syntax for data sampling.

1) Specify the schema, table name, number of records to be sampled, and dispersion of sampling, and return the sampling records that meet the requirements. The dispersion ranges from 0 to 100, indicating the data range for sampling. The code is as follows:

The statistics relpages and reltuples are used to evaluate probability. Therefore, the table must contain the statistics, which are automatically collected by PG without manual intervention as long as autovacuum is enabled.

Consider the test below.

Randomly sample 10 records from the whole table.

Next, randomly sample 10 records from 20% of the table.

Also, randomly sample 10 records from 1% of the table.

Both the dispersion and performance meet the requirements.

Now, just remove ctid as shown below.

The following snippet shows the result.

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store