Compatibility between PostgreSQL and Oracle — Data Sampling and Desensitization

Background

Data sampling and desensitization are common test functions. For example, if you use online services to create a test database, you cannot extract the entire database but need to encrypt sensitive data.

Consider the following Oracle example.

PostgreSQL also provides the sampling function as shown below.

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]

  • BERNOULLI: corresponding to Oracle SAMPLE(), row-based sampling
  • SYSTEM: corresponding to Oracle SAMPLE BLOCK(), block-based sampling
  • REPEATABLE: corresponding to Oracle SEED, which is a sampling seed

Test Data

Consider the following as test data.

Sampling

For efficient sampling in versions earlier than 9.5, see Data Sampling in PostgreSQL.

In version 9.5 and later, use the TABLESAMPLE syntax for sampling (note that the sampling filter is used before the where condition filter).

The syntax is as follows, refer to this page for more details.

Example 1) BERNOULLI (Percentage) Sampling

Scan the full table to return the sampling result according to the percentage of sampling parameters.

Example 2) SYSTEM (Percentage) Sampling

Perform block-based sampling to return the sampling result according to the percentage of sampling parameters (all records in the sampled data block are returned). Therefore, the dispersion of SYSTEM is lower than that of BERNOULLI, but the efficiency is much higher.

To customize a sampling method visit this website.

Desensitization

Many desensitization methods are available for the many different scenarios in which users must desensitize data. Common examples include:

1) Use asterisks (*) to hide the content in the middle of a string but keep the original length.
2) Use asterisks (*) to hide the content in the middle of a string but do not keep the original length.
3) Return the encrypted value.

In all cases, the desensitization operation converts the original value to the target value. PostgreSQL allows using functions to implement such conversion. For different requirements, just write different conversion logic.

For example, use asterisks (*) to hide the content in the middle of a string, with only the first two and the last one characters of the string displayed.

For a more complex conversion, write a PostgreSQL UDF to change the field values.

There are also many methods to extract sampling results on other platforms, such as copying to StdOut or ETL tools.

Consider the example below.

References

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.