Support for Encryption, Full-text Search, and Fuzzy Query of Multi-language Strings

By Digoal

Background

It is easy to implement full-text searches and fuzzy queries in PostgreSQL. The pg_trgm plugin allows you to use indexes to speed up fuzzy queries (fuzzy prefix, suffix, or middle-string query), similarity queries, and regexp queries. You can also use word-break plugins to break down characters in multiple languages, for example, to break down Chinese characters you can use pg_jieba, pg_scws, and zhparser.

However, what if you want your database to support encrypted storage and perform fuzzy queries or full-text searches on data before it is encrypted? This article addresses these challenges in detail in the following sections.

Data Encryption

There are several types of data encryption and we recommend that you choose a proper encryption type depending on your actual scenario.

Reversible Encryption

Reversible encryption (for example, pgcrypto) supports encryption methods like symmetric encryption. and allows you to store data in an encrypted form. However, it is not easy to create an index on encrypted data.

Fortunately, PostgreSQL supports expression indexes. It means you can create indexes on decrypted expressions to implement acceleration. However, note that expressions of indexes change to the form prior to the encryption. Therefore, avoid this encryption type to reduce the risk.

Irreversible Encryption

Comparatively, it is even more difficult to create indexes on data that uses the irreversible encryption (for example, MD5). However, creating indexes is executable for this encryption type and is very secure.

Transparent Encryption

Transparent encryption requires the transformation of kernels. Data is stored in encrypted form and encryption/decryption is automatically performed during the input/output process.

Full-text Search and Fuzzy Query Example for Reversibly Encrypted Data

The following example illustrates how to implement full-text search and fuzzy queries for reversibly encrypted data.

Step 1. Create the crypto plugin.

Step 2. Create an immutable key function by using the superuser account.

If it is a C function, you can hide it better. Also, you can store the key in an external table such as another database, LDAP or another network service that is accessible through the PostgreSQL fdw interface. It is significant to note that the user can obtain the key only after the corresponding query permission is granted.

Step 3. Hide the code of the key function as shown below.

Step 4. Next, set access to the key function.

Step 5. Create a test table.

Step 6. Now create an expression (decrypted) index.

Step 7. Write data as shown in the example below.

Step 8. Now, perform a query as shown in the example below.

Query the data using the index.

Step 9. Support full-text searches and fuzzy queries. Create an index based on the expression- enc(info, f1(1), ‘cipher-algo=bf, compress-algo=2, compress-level=9’) to implement full-text searches and fuzzy queries.

Full-text Search and Fuzzy Query Example for Irreversibly Encrypted Data

To support full-text searches and fuzzy queries for the irreversible encryption type, you need to convert strings to tokens, irreversibly encrypt tokens and store them into arrays.

While executing queries, convert the query string to tokens, irreversibly encrypt these tokens and retrieve inclusive or overlapping data in arrays by using the index. This will speed up the query and ensure the highest security.

Consider the following example table.

1. Write

Step 1. Use a language-specific full-text search plugin to tokenize strings for performing a full-text search. Obtain the tokens of words and phrases (tsvector) and convert tsvector to arrays.

Step 2. Convert content strings to tokens (double-character tokens).

Step 3. Convert content strings to tokens (single-character tokens). Use regexp_split_to_array to obtain single-character arrays.

Step 4. Now, encrypt tokens as shown below.

Step 5. Store encrypted arrays.

Step 6. Create an index on arrays using GIN.

2. Query

Step 1. Convert the query string to tokens and encrypt tokens.

Step 2. Now, perform the first level of filtering.

Step 3. Perform the second level of filtering by using the power of CPUs. The first level of filtering uses the GIN index and the second level filters require CPUs to perform operations. This ensures both high efficiency and security.

Summary

(1) Transparent encryption is of course the best. However, implementing transparent encryption requires the transformation of PostgreSQL kernels.

  • Data Breach: Secure
  • Database Attack: Secure

(2) Irreversible encryption features high security, but can exponentially increase storage usage.

  • Data Breach: Secure
  • Database Attack: Secure

(3) Reversible encryption has an average level of security. The security level is based on the prerequisite that secure key functions are used (not plaintext keys). However, even if secure key functions are used, the plaintext can still be stolen as a result of database attacks.

  • Data Breach: Insecure
  • Database Attack: Limited Security.

Data is not secured when the superuser or owner’s permission is obtained without authorization.

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store