Using PostgreSQL to Create an Efficient Search Engine

Prefix fuzzy query:

select * from tbl where col like 'ab%';

or

select * from tbl where col ~ '^ab';
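For these left-anchored patterns to use an index when the column's collation is not C, one common approach is a b-tree index with the text_pattern_ops operator class (a minimal sketch; the index name is made up):

create index idx_tbl_col_prefix on tbl (col text_pattern_ops);

With this index in place, both the like 'ab%' form and the ~ '^ab' form can be served by an index range scan.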
Suffix fuzzy query:

select * from tbl where col like '%ab';

or

select * from tbl where col ~ 'ab$';

A suffix fuzzy query can be rewritten to take advantage of an index on reverse(col):

select * from tbl where reverse(col) like 'ba%';

or

select * from tbl where reverse(col) ~ '^ba';
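The point of the rewrite is that reverse() turns a suffix match into a prefix match, which an expression index can then serve (again a sketch, with a made-up index name):

create index idx_tbl_col_suffix on tbl (reverse(col) text_pattern_ops);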
Infix fuzzy query (a pattern anchored on neither side):

select * from tbl where col like '%ab%';

or

select * from tbl where col ~ 'ab';
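Patterns anchored on neither side cannot use the tricks above. The usual tool here is a trigram index from the pg_trgm extension, sketched below; note that very short patterns extract few trigrams and may still be slow:

create extension if not exists pg_trgm;
create index idx_tbl_col_trgm on tbl using gin (col gin_trgm_ops);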
Full-text search query:

select * from tbl where tsvector_col @@ 'postgres & china | digoal:A' order by ts_rank(tsvector_col, 'postgres & china | digoal:A') limit xx;

We will discuss the specific syntax later.
Regexp query:

select * from tbl where col ~ '^a[0-9]{1,5}\ +digoal$';
Similarity query:

select * from tbl order by similarity(col, 'postgre') desc limit 10;
Ad hoc query (a combination of any of several fields):

select * from tbl where a=? and b=? or c=? and d=? or e between ? and ? and f in (?);

Full-text search

The core features of full-text search are: dictionaries, word-breaking syntax, search syntax, sorting algorithms, search efficiency mechanisms, and hit-highlighting, each of which will be discussed below in detail.

Dictionaries

One important feature of dictionaries is that the word base allows for word breaking. By default, word breaking is supported and works automatically for English, and it works pretty well. This is relatively easy to understand, so we won't go into too much detail here.
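As a quick illustration, the built-in english configuration removes stop words and stems the remaining tokens (a minimal example; the output is from a stock configuration):

postgres=# SELECT to_tsvector('english', 'The quick brown foxes jumped');
             to_tsvector
--------------------------------------
 'brown':3 'fox':4 'jump':5 'quick':2
(1 row)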

Word-Break Syntax

Now that we've discussed dictionaries and how to add word-break support for your particular language, let's look more closely at how word breaking works.

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');

  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
For example, given a synonym dictionary file synonym_sample.syn with the following contents:

postgres        pgsql
postgresql      pgsql
postgre         pgsql
gogle           googl
indices         index*
mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');  
mydb=# SELECT ts_lexize('syn','indices');
ts_lexize
-----------
{index}
(1 row)

mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst','indices');
to_tsvector
-------------
'index':1
(1 row)

mydb=# SELECT to_tsquery('tst','indices');
to_tsquery
------------
'index':*
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector;
tsvector
---------------------------------
'are' 'indexes' 'useful' 'very'
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
?column?
----------
t
(1 row)
ALTER TEXT SEARCH CONFIGURATION tsconfig_name
ADD MAPPING FOR token_type_1 WITH dictionary_1, dictionary_2, dictionary_3;

If this tsconfig is used to convert text into a tsvector, a token of token type 1 will first be matched against dictionary 1. If it matches, the lexemes produced by dictionary 1 are stored; if it does not match, the token is passed on to dictionary 2, and so on down the chain.
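As a minimal sketch of this fallthrough behavior, we can chain the syn dictionary created above with english_stem (the configuration name chain is made up for this example): 'indices' is recognized by syn, while 'stars' falls through and is stemmed.

mydb=# CREATE TEXT SEARCH CONFIGURATION chain (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION chain ALTER MAPPING FOR asciiword WITH syn, english_stem;
mydb=# SELECT to_tsvector('chain', 'indices stars');
    to_tsvector
--------------------
 'index':1 'star':2
(1 row)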

Search syntax

When it comes to search syntax, there are some important concepts to cover. One of these is that a tsquery stores lexemes that are to be searched for, and supports the operators AND, OR, NOT, and FOLLOWED BY. For example, to_tsquery creates a tsquery value from query text, which must consist of single tokens separated by the tsquery operators:

& (AND), | (OR), ! (NOT), <-> (FOLLOWED BY), and <N> (FOLLOWED BY at a distance of exactly N).

In the tsvector below, the lexeme 'c' has two positions, either of which can be used when matching distances.
postgres=# select to_tsvector('a b c c');
to_tsvector
---------------------
'a':1 'b':2 'c':3,4
(1 row)
Adjacent:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <-> b');
?column?
----------
t
(1 row)
Adjacent means that the two positions differ by exactly 1; <-> is equivalent to <1>:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <1> b');
?column?
----------
t
(1 row)
When the distance is 2, the two positions must differ by exactly 2:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> c');
?column?
----------
t
(1 row)
When the distance is 3, the two positions must differ by exactly 3:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <3> c');
?column?
----------
t
(1 row)
Here 'b' is only at distance 1 from 'a', so a distance of 2 does not match:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> b');
?column?
----------
f
(1 row)
Weight labels and prefix matching can also be attached to query terms:

SELECT to_tsquery('english', 'Fat | Rats:AB');
to_tsquery
------------------
'fat' | 'rat':AB
SELECT to_tsquery('supern:*A & star:A*B');  
to_tsquery
--------------------------
'supern':*A & 'star':*AB
With a thesaurus dictionary that rewrites the phrase 'supernovae stars' to 'sn':

SELECT to_tsquery('''supernovae stars'' & !crab');
to_tsquery
---------------
'sn' & !'crab'
select * from tbl where $tsvector_col @@ $tsquery;

Sorting Algorithms

When it comes down to sorting algorithms, full-text searching usually involves two aspects: matching and ranking.

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
The optional weights argument specifies how heavily to weigh words labeled with each weight category, in the order:

{D-weight, C-weight, B-weight, A-weight}

If no weights are provided, these defaults are used:

{0.1, 0.2, 0.4, 1.0}
The integer normalization option controls how a document's length affects its rank. Several behaviors can be combined with |:

  • 0 (the default) ignores the document length
  • 1 divides the rank by 1 + the logarithm of the document length
  • 2 divides the rank by the document length
  • 4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
  • 8 divides the rank by the number of unique words in document
  • 16 divides the rank by 1 + the logarithm of the number of unique words in document
  • 32 divides the rank by itself + 1
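To see how the weights array interacts with labeled lexemes, here is a minimal sketch (it uses setweight, covered later in this article, to mark a hypothetical title as A and a body as D):

SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}',
               setweight(to_tsvector('english', 'PostgreSQL full text search'), 'A') ||
               setweight(to_tsvector('english', 'searching text in a database'), 'D'),
               to_tsquery('english', 'search'));

With these weights, the match in the A-labeled title contributes ten times as much to the rank as the same match in the D-labeled body.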
SELECT title, ts_rank_cd(textsearch, query) AS rank  
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
title | rank
-----------------------------------------------+----------
Neutrinos in the Sun | 3.1
The Sudbury Neutrino Detector | 2.4
A MACHO View of Galactic Dark Matter | 2.01317
Hot Gas and Dark Matter | 1.91171
The Virgo Cluster: Hot Plasma and Dark Matter | 1.90953
Rafting for Solar Neutrinos | 1.9
NGC 4650A: Strange Galaxy and Dark Matter | 1.85774
Hot Gas and Dark Matter | 1.6123
Ice Fishing for Cosmic Neutrinos | 1.6
Weak Lensing Distorts the Universe | 0.818218
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank  
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
title | rank
-----------------------------------------------+-------------------
Neutrinos in the Sun | 0.756097569485493
The Sudbury Neutrino Detector | 0.705882361190954
A MACHO View of Galactic Dark Matter | 0.668123210574724
Hot Gas and Dark Matter | 0.65655958650282
The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
Rafting for Solar Neutrinos | 0.655172410958162
NGC 4650A: Strange Galaxy and Dark Matter | 0.650072921219637
Hot Gas and Dark Matter | 0.617195790024749
Ice Fishing for Cosmic Neutrinos | 0.615384618911517
Weak Lensing Distorts the Universe | 0.450010798361481

Search Efficiency Mechanisms

When it comes to efficiency, there are two mechanisms worth measuring. The first is write efficiency: for example, randomly pick 64 words from a lexicon of 5 million words to form each string, and write those strings into a table carrying a GIN full-text index. The second is query efficiency: run full-text searches against 100 million such text records.
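The benchmark scripts themselves are not reproduced here, but a minimal sketch of such a setup might look like the following (table, column, and index names are made up; 8-character md5 substrings stand in for words drawn from the lexicon):

create table test_fts (id serial primary key, vec tsvector);

-- each document is a string of 64 random "words"
insert into test_fts (vec)
select to_tsvector('simple', string_agg(substr(md5(random()::text), 1, 8), ' '))
from generate_series(1, 10000) as doc(d), generate_series(1, 64) as word(w)
group by d;

-- a GIN index accelerates the @@ full-text search operator
create index idx_test_fts_vec on test_fts using gin (vec);

select * from test_fts where vec @@ to_tsquery('simple', 'someword') limit 10;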

Special features

HTML Highlighting Feature

Of the special features, first there is the HTML highlighting feature, which can highlight matching text. It is as follows:

ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text
SELECT ts_headline('english',  
'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('query & similarity'));

ts_headline
------------------------------------------------------------
containing given <b>query</b> terms
and return them in order of their <b>similarity</b> to the
<b>query</b>.

SELECT ts_headline('english',
'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('query & similarity'),
'StartSel = <, StopSel = >');
ts_headline
-------------------------------------------------------
containing given <query> terms
and return them in order of their <similarity> to the
<query>.

Generate Document Statistics Feature

Next, there is the document statistics feature. The sqlquery argument must return a tsvector column; ts_stat then reports which lexemes occur in that column, how many documents (tsvectors) each lexeme occurs in, and how many times each lexeme occurs in total. An optional weights argument restricts the counts to occurrences carrying one of the given weights, as in the second example below:

ts_stat(sqlquery text, [ weights text, ]  
OUT word text, OUT ndoc integer,
OUT nentry integer) returns setof record
word text - the value of a lexeme  

ndoc integer - number of documents (tsvectors) the word occurred in

nentry integer - total number of occurrences of the word
SELECT * FROM ts_stat('SELECT vector FROM apod')  
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')  
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

Setting Document Structure Feature

Next, there’s the setting document structure feature, which set which document structure a tsvector belongs to, which would, of course, be one of the following structures: title, author, abstract, or body. Consider the following example:

1.
setweight(vector tsvector, weight "char")

assign weight to each element of vector

setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A')
'cat':3A 'fat':2A,4A 'rat':5A

2.
setweight(vector tsvector, weight "char", lexemes text[])

assign weight to elements of vector that are listed in lexemes

setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}')
'cat':3A 'fat':2,4 'rat':5A

Debugging Text Feature

Last, there is a feature for debugging text. ts_debug shows how a document is broken into tokens and which dictionary, if any, recognizes each token; see the PostgreSQL documentation for more information. Consider the following example.

ts_debug([ config regconfig, ] document text,  
OUT alias text,
OUT description text,
OUT token text,
OUT dictionaries regdictionary[],
OUT dictionary regdictionary,
OUT lexemes text[])
returns setof record
alias text - short name of the token type

description text - description of the token type

token text - text of the token

dictionaries regdictionary[] - the dictionaries selected by the configuration for this token type

dictionary regdictionary - the dictionary that recognized the token, or NULL if none did

lexemes text[] - the lexeme(s) produced by the dictionary that recognized the token,
or NULL if none did; an empty array ({}) means it was recognized as a stop word
A related debugging function, ts_lexize, applies a single dictionary to a single token:

ts_lexize(dict regdictionary, token text) returns text[]
SELECT ts_lexize('english_stem', 'stars');  
ts_lexize
-----------
{star}

SELECT ts_lexize('english_stem', 'a');
ts_lexize
-----------
{}

Limitations

Now that we have looked at the core features of full-text search as well as some special features, it's time to consider this query type's limitations. First, the length of each lexeme must be less than 2 kilobytes. Second, the length of a tsvector (lexemes plus positions) must be less than 1 megabyte. Consider the following error related to the latter limit:

postgres=# select length(to_tsvector(string_agg(md5(random()::text), ' '))) from generate_series(1,100000);  
ERROR: 54000: string is too long for tsvector (3624424 bytes, max 1048575 bytes)
LOCATION: make_tsvector, to_tsany.c:185

Other Query Types: Fuzzy, Regexp, and Similarity queries

Fuzzy queries, regexp queries, and similarity queries are all forms of text matching. In PostgreSQL, the pg_trgm extension together with GIN (or GiST) indexes can be used to speed up all three types of searches.
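As a sketch of the similarity case (the query shown at the start of this article), a GiST trigram index supports ordering by the trigram distance operator <->, which the planner can execute as an index-driven top-N scan; the index name is made up:

create extension if not exists pg_trgm;
create index idx_tbl_col_sim on tbl using gist (col gist_trgm_ops);

-- equivalent to ordering by similarity(col, 'postgre') desc
select * from tbl order by col <-> 'postgre' limit 10;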

Ad Hoc Searches

Ad hoc searches are searches based on a combination of several different fields. For example, if a table has N columns and any of them may appear in the query criteria, the number of possible combinations of criteria explodes (there are 2^N possible subsets of columns). PostgreSQL provides several methods to implement efficient searching in this kind of scenario:

  • Partition tables, which can prune the search when queries filter on the partition fields.
  • Bloom filter indexes, which support equality searches on any combination of fields through a single lossy filter (see the sketch after this list).
  • GIN multi-column composite indexes with BITMAP scans, which support searches on any combination of the indexed fields and filter at the block level.
  • Multiple single-column indexes combined by BITMAP scans, where the database automatically chooses between index scans and bitmap scans based on cost; searching on any combination of fields is supported, again with filtering at the block level.
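A minimal sketch of the bloom-filter approach, assuming columns a through f have types the bloom extension can index by default (int4 or text); the index name is made up:

create extension if not exists bloom;
create index idx_tbl_bloom on tbl using bloom (a, b, c, d, e, f);

-- any combination of equality predicates can use the same lossy index
select * from tbl where a = 1 and d = 5;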

References

1. Full-text search
https://www.postgresql.org/docs/10/static/textsearch.html
https://www.postgresql.org/docs/10/static/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
