Compatibility between PostgreSQL and Oracle — Data Sampling and Desensitization

Image for post
Image for post

Background

Data sampling and desensitization are common test functions. For example, if you use online services to create a test database, you cannot extract the entire database but need to encrypt sensitive data.

Consider the following Oracle example.

SELECT COUNT(innerQuery.C1) FROM (
SELECT ? AS C1FROM RM_SALE_APPORTION SAMPLE BLOCK (?, ?) SEED (?) "RM_SALE_APPORTION"
) innerQuery

SAMPLE [ BLOCK ]
(sample_percent)
[ SEED (seed_value) ]
A variant of the SAMPLE clause is SAMPLE BLOCK, where each block of
records has the same chance of being selected, 20% in our example.
Since records are selected at the block level, this offers a performance improvement for
large tables and should not adversely impact the randomness of the sample.
sample_clause
The sample_clause lets you instruct Oracle to select from a random sample of rows from the table, rather than from the entire table.
BLOCK
BLOCK instructs Oracle to perform random block sampling instead of random row sampling.
sample_percent
sample_percent is a number specifying the percentage of the total row or block count to be included in the sample. The value must be in the range .000001 to (but not including) 100.
Restrictions on Sampling During Queries
You can specify SAMPLE only in a query that selects from a single table. Joins are not supported.
However, you can achieve the same results by using a CREATE TABLE ... AS SELECT query to materialize a sample of an underlying table and then rewrite the original query to refer to the newly created table sample.
If you wish, you can write additional queries to materialize samples for other tables.
When you specify SAMPLE, Oracle automatically uses cost-based optimization. Rule-based optimization is not supported by this clause.
--------------------------------------------------------------------------------
Caution:
The use of statistically incorrect assumptions when using this feature can lead to incorrect or undesirable results.
--------------------------------------------------------------------------------

PostgreSQL also provides the sampling function as shown below.

  • BERNOULLI: corresponding to Oracle SAMPLE(), row-based sampling
  • SYSTEM: corresponding to Oracle SAMPLE BLOCK(), block-based sampling
  • REPEATABLE: corresponding to Oracle SEED, which is a sampling seed

Test Data

Consider the following as test data.

postgres=# create table test(id int primary key, username text, phonenum text, addr text, pwd text, crt_time timestamp);  
CREATE TABLE

postgres=# insert into test select id, 'test_'||id, 13900000000+(random()*90000000)::int, '中国杭州xxxxxxxxxxxxxxxxxx'||random(), md5(random()::text), clock_timestamp() from generate_series(1,10000000) t(id);
INSERT 0 10000000

postgres=# select * from test limit 10;
id | username | phonenum | addr | pwd | crt_time
----+----------+-------------+---------------------------------------------+----------------------------------+----------------------------
1 | test_1 | 13950521974 | 中国杭州xxxxxxxxxxxxxxxxxx0.953363882377744 | 885723a5f4938808235c5debaab473ec | 2017-06-02 15:05:55.465132
2 | test_2 | 13975998000 | 中国杭州xxxxxxxxxxxxxxxxxx0.91321265604347 | 7ea01dc02c0fbc965f38d1bf12b303eb | 2017-06-02 15:05:55.46534
3 | test_3 | 13922255548 | 中国杭州xxxxxxxxxxxxxxxxxx0.846756176557392 | 7c2992bdc69312cbb3bb135dd2b98491 | 2017-06-02 15:05:55.46535
4 | test_4 | 13985121895 | 中国杭州xxxxxxxxxxxxxxxxxx0.639280265197158 | 202e32f0f0e3fe669c00678f7acd2485 | 2017-06-02 15:05:55.465355
5 | test_5 | 13982757650 | 中国杭州xxxxxxxxxxxxxxxxxx0.501174578908831 | b6a42fc1ebe9326ad81a81a5896a5c6c | 2017-06-02 15:05:55.465359
6 | test_6 | 13903699864 | 中国杭州xxxxxxxxxxxxxxxxxx0.193029860965908 | f6bc06e5cda459d09141a2c93f317cf2 | 2017-06-02 15:05:55.465363
7 | test_7 | 13929797532 | 中国杭州xxxxxxxxxxxxxxxxxx0.192601112183183 | 75c12a3f14c7ef3e558cef79d84a7e8e | 2017-06-02 15:05:55.465368
8 | test_8 | 13961108182 | 中国杭州xxxxxxxxxxxxxxxxxx0.900682372972369 | 5df33d15cf7726f2fb57df3ed913b306 | 2017-06-02 15:05:55.465371
9 | test_9 | 13978455210 | 中国杭州xxxxxxxxxxxxxxxxxx0.87795089604333 | cbe233f00cdd3c61c67415c1f8691846 | 2017-06-02 15:05:55.465375
10 | test_10 | 13957044022 | 中国杭州xxxxxxxxxxxxxxxxxx0.410478914622217 | cdf2f98b0ff5a973efaca6a82625e283 | 2017-06-02 15:05:55.465379
(10 rows)

Sampling

For efficient sampling in versions earlier than 9.5, see Data Sampling in PostgreSQL.

In version 9.5 and later, use the TABLESAMPLE syntax for sampling (note that the sampling filter is used before the where condition filter).

The syntax is as follows, refer to this page for more details.

TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]  

sampling_method指采样方法

argument指参数,例如采样比例。

REPEATABLE(seed) 指采样随机种子,如果种子一样,那么多次采样请求得到的结果是一样的。如果忽略REPEATABLE则每次都是使用新的seed值,得到不同的结果。

Example 1) BERNOULLI (Percentage) Sampling

Scan the full table to return the sampling result according to the percentage of sampling parameters.

postgres=# select * from test TABLESAMPLE bernoulli (1);  
id | username | phonenum | addr | pwd | crt_time
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------
110 | test_110 | 13967004360 | 中国杭州xxxxxxxxxxxxxxxxxx0.417577873915434 | 437e5c29e12cbafa0563332909436d68 | 2017-06-02 15:05:55.46585
128 | test_128 | 13901119801 | 中国杭州xxxxxxxxxxxxxxxxxx0.63212554808706 | 973dba4b35057d44997eb4744eea691b | 2017-06-02 15:05:55.465938
251 | test_251 | 13916668924 | 中国杭州xxxxxxxxxxxxxxxxxx0.0558807463385165 | 71217eedce421bd0f475c0e4e6eb32a9 | 2017-06-02 15:05:55.466423
252 | test_252 | 13981440056 | 中国杭州xxxxxxxxxxxxxxxxxx0.457073447294533 | 6649c37c0f0287637a4cb80d84b6bde0 | 2017-06-02 15:05:55.466426
423 | test_423 | 13982447202 | 中国杭州xxxxxxxxxxxxxxxxxx0.816960731055588 | 11a8d6d1374cf7565877def6a147f544 | 2017-06-02 15:05:55.46717
......

Example 2) SYSTEM (Percentage) Sampling

Perform block-based sampling to return the sampling result according to the percentage of sampling parameters (all records in the sampled data block are returned). Therefore, the dispersion of SYSTEM is lower than that of BERNOULLI, but the efficiency is much higher.

postgres=# select * from test TABLESAMPLE system (1);  
id | username | phonenum | addr | pwd | crt_time
---------+--------------+-------------+------------------------------------------------+----------------------------------+----------------------------
6986 | test_6986 | 13921391589 | 中国杭州xxxxxxxxxxxxxxxxxx0.874497607816011 | e6a5d695aca17de0f6489d740750c758 | 2017-06-02 15:05:55.495697
6987 | test_6987 | 13954425190 | 中国杭州xxxxxxxxxxxxxxxxxx0.374216149561107 | 813fffbf1ee7157c459839987aa7f4b0 | 2017-06-02 15:05:55.495721
6988 | test_6988 | 13901878095 | 中国杭州xxxxxxxxxxxxxxxxxx0.624850326217711 | 5056caaad5e076f82b8caec9d02169f6 | 2017-06-02 15:05:55.495725
6989 | test_6989 | 13940504557 | 中国杭州xxxxxxxxxxxxxxxxxx0.705925882328302 | a5b4062086a3261740c82774616e64ee | 2017-06-02 15:05:55.495729
6990 | test_6990 | 13987358496 | 中国杭州xxxxxxxxxxxxxxxxxx0.981084300205112 | 6ba0b6c9d484e6fb90181dc86cb6598f | 2017-06-02 15:05:55.495734
6991 | test_6991 | 13948658183 | 中国杭州xxxxxxxxxxxxxxxxxx0.6592857837677 | 9a0eadd056eeb6e3c1e2b984777cdf6b | 2017-06-02 15:05:55.495738
6992 | test_6992 | 13934074866 | 中国杭州xxxxxxxxxxxxxxxxxx0.232706854119897 | 84f6649beac3b78a3a1afeb9c3aabccd | 2017-06-02 15:05:55.495741
......

To customize a sampling method visit this website.

Desensitization

Many desensitization methods are available for the many different scenarios in which users must desensitize data. Common examples include:

1) Use asterisks to hide the content in the middle of a string but keep the original length.
2) Use asterisks to hide the content in the middle of a string but do not keep the original length.
3) Return the encrypted value.

In all cases, the desensitization operation converts the original value to the target value. PostgreSQL allows using functions to implement such conversion. For different requirements, just write different conversion logic.

For example, use asterisks to hide the content in the middle of a string, with only the first two and the last one characters of the string displayed.

select id, substring(username,1,2)||'******'||substring(username,length(username),1),   
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),
substring(addr,1,2)||'******'||substring(addr, length(addr),1),
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),
crt_time
from test
TABLESAMPLE bernoulli (1);

id | ?column? | ?column? | ?column? | ?column? | crt_time
---------+-----------+-----------+-------------+-----------+----------------------------
69 | te******9 | 13******5 | 中国******9 | c0******2 | 2017-06-02 15:32:26.261624
297 | te******7 | 13******2 | 中国******1 | d9******6 | 2017-06-02 15:32:26.262558
330 | te******0 | 13******5 | 中国******3 | bd******0 | 2017-06-02 15:32:26.262677
335 | te******5 | 13******5 | 中国******6 | 08******f | 2017-06-02 15:32:26.262721
416 | te******6 | 13******6 | 中国******2 | b3******d | 2017-06-02 15:32:26.26312
460 | te******0 | 13******4 | 中国******8 | e5******f | 2017-06-02 15:32:26.26332
479 | te******9 | 13******1 | 中国******1 | 1d******4 | 2017-06-02 15:32:26.263393
485 | te******5 | 13******0 | 中国******3 | a3******8 | 2017-06-02 15:32:26.263418
692 | te******2 | 13******9 | 中国******4 | 69******8 | 2017-06-02 15:32:26.264326
1087 | te******7 | 13******9 | 中国******3 | 8e******5 | 2017-06-02 15:32:26.266091
1088 | te******8 | 13******8 | 中国******7 | 37******e | 2017-06-02 15:32:26.266095
1116 | te******6 | 13******8 | 中国******2 | 4c******3 | 2017-06-02 15:32:26.266235
1210 | te******0 | 13******4 | 中国******8 | 49******c | 2017-06-02 15:32:26.266671
......

For a more complex conversion, write a PostgreSQL UDF to change the field values.

There are also many methods to extract sampling results on other platforms, such as copying to StdOut or ETL tools.

Consider the example below.

psql test -c "copy (select id, substring(username,1,2)||'******'||substring(username,length(username),1),   
substring(phonenum,1,2)||'******'||substring(phonenum, length(phonenum),1),
substring(addr,1,2)||'******'||substring(addr, length(addr),1),
substring(pwd,1,2)||'******'||substring(pwd, length(pwd),1),
crt_time
from test
TABLESAMPLE bernoulli (1)
) to stdout" > ./sample_test.log


less sample_test.log
54 te******4 13******4 中国******3 52******b 2017-06-02 15:32:26.261451
58 te******8 13******6 中国******3 23******a 2017-06-02 15:32:26.261584
305 te******5 13******6 中国******9 c0******4 2017-06-02 15:32:26.262587
399 te******9 13******5 中国******4 71******7 2017-06-02 15:32:26.26298
421 te******1 13******0 中国******4 21******3 2017-06-02 15:32:26.263139
677 te******7 13******5 中国******5 e2******7 2017-06-02 15:32:26.264269
874 te******4 13******9 中国******2 a6******9 2017-06-02 15:32:26.265159

References

Original Source:

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store