Using PostgreSQL to Create an Efficient Search Engine

select * from tbl where col like 'ab%';  

or

select * from tbl where col ~ '^ab';
select * from tbl where col like '%ab';  

or

select * from tbl where col ~ 'ab$';

Writing

select * from tbl where reverse(col) like 'ba%';

Or

select * from tbl where reverse(col) ~ '^ba';
select * from tbl where col like '%ab%';  

or

select * from tbl where col ~ 'ab';
select * from tbl where tsvector_col @@ 'postgres & china | digoal:A' order by ts_rank(tsvector_col, 'postgres & china | digoal:A') limit xx;  We will discuss the specific syntax later.
select * from tbl where col ~ '^a[0-9]{1,5}\ +digoal$';
select * from tbl order by similarity(col, 'postgre') desc limit 10;
select * from tbl where a=? and b=? or c=? and d=? or e between ? and ? and f in (?);

Full-text search

Dictionaries

Word-Break Syntax

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');  
alias | description | token
----------+---------------+------------------------------
protocol | Protocol head | http://
url | URL | example.com/stuff/index.html
host | Host | example.com
url_path | URL path | /stuff/index.html
postgres        pgsql  
postgresql pgsql
postgre pgsql
gogle googl
indices index*
mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');  
mydb=# SELECT ts_lexize('syn','indices');
ts_lexize
-----------
{index}
(1 row)

mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst','indices');
to_tsvector
-------------
'index':1
(1 row)

mydb=# SELECT to_tsquery('tst','indices');
to_tsquery
------------
'index':*
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector;
tsvector
---------------------------------
'are' 'indexes' 'useful' 'very'
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
?column?
----------
t
(1 row)
ALTER TEXT SEARCH CONFIGURATION tsconfig name  
ADD MAPPING FOR token type 1 WITH dictionary 1, dictionary 2, dictionary 3;

If the tsconfig is used to convert the text into a tsvector, Token Type 1 will first match against Dictionary 1. If they match successfully, lexemes in Dictionary 1 will be stored. If they do not match, the search proceeds to Dictionary 2.

Search syntax

& (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY) and <?> (How far away is it?),
c There are two positions both of which can be used when matching distances.
postgres=# select to_tsvector('a b c c');
to_tsvector
---------------------
'a':1 'b':2 'c':3,4
(1 row)
Adjacent
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <-> b');
?column?
----------
t
(1 row)
Adjacent, means that, when the positions are subtracted by each other, they are equal to one.
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <1> b');
?column?
----------
t
(1 row)
When the distance is 2, then, when the positions are subtracted by each other, they are equal to 2.
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> c');
?column?
----------
t
(1 row)
When the distance is 3, then, when the positions are subtracted by each other, they are equal to 3.
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <3> c');
?column?
----------
t
(1 row)
When the distance is 2, then, when the positions are subtracted by each other, they are equal to 2.
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> b');
?column?
----------
f
(1 row)
SELECT to_tsquery('english', 'Fat | Rats:AB');  
to_tsquery
------------------
'fat' | 'rat':AB
SELECT to_tsquery('supern:*A & star:A*B');  
to_tsquery
--------------------------
'supern':*A & 'star':*AB
SELECT to_tsquery('''supernovae stars'' & !crab');  
to_tsquery
---------------
'sn' & !'crab'
select * from tbl where $tsvector_col @@ $tsquery;

Sorting Algorithms

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4
{D-weight, C-weight, B-weight, A-weight}
{0.1, 0.2, 0.4, 1.0}
0 (the default) ignores the document length  

1 divides the rank by 1 + the logarithm of the document length

2 divides the rank by the document length

4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)

8 divides the rank by the number of unique words in document

16 divides the rank by 1 + the logarithm of the number of unique words in document

32 divides the rank by itself + 1
SELECT title, ts_rank_cd(textsearch, query) AS rank  
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
title | rank
-----------------------------------------------+----------
Neutrinos in the Sun | 3.1
The Sudbury Neutrino Detector | 2.4
A MACHO View of Galactic Dark Matter | 2.01317
Hot Gas and Dark Matter | 1.91171
The Virgo Cluster: Hot Plasma and Dark Matter | 1.90953
Rafting for Solar Neutrinos | 1.9
NGC 4650A: Strange Galaxy and Dark Matter | 1.85774
Hot Gas and Dark Matter | 1.6123
Ice Fishing for Cosmic Neutrinos | 1.6
Weak Lensing Distorts the Universe | 0.818218
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank  
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
title | rank
-----------------------------------------------+-------------------
Neutrinos in the Sun | 0.756097569485493
The Sudbury Neutrino Detector | 0.705882361190954
A MACHO View of Galactic Dark Matter | 0.668123210574724
Hot Gas and Dark Matter | 0.65655958650282
The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
Rafting for Solar Neutrinos | 0.655172410958162
NGC 4650A: Strange Galaxy and Dark Matter | 0.650072921219637
Hot Gas and Dark Matter | 0.617195790024749
Ice Fishing for Cosmic Neutrinos | 0.615384618911517
Weak Lensing Distorts the Universe | 0.450010798361481

Search Efficiency Mechanisms

Special features

HTML Highlighting Feature

ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text
SELECT ts_headline('english',  
'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('query & similarity'));

ts_headline
------------------------------------------------------------
containing given <b>query</b> terms
and return them in order of their <b>similarity</b> to the
<b>query</b>.

SELECT ts_headline('english',
'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('query & similarity'),
'StartSel = <, StopSel = >');
ts_headline
-------------------------------------------------------
containing given <query> terms
and return them in order of their <similarity> to the
<query>.

Generate Document Statistics Feature

ts_stat(sqlquery text, [ weights text, ]  
OUT word text, OUT ndoc integer,
OUT nentry integer) returns setof record
word text - the value of a lexeme  

ndoc integer - number of documents (tsvectors) the word occurred in

nentry integer - total number of occurrences of the word
SELECT * FROM ts_stat('SELECT vector FROM apod')  
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')  
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

Setting Document Structure Feature

1、  
setweight(vector tsvector, weight "char")

assign weight to each element of vector

setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A')
'cat':3A 'fat':2A,4A 'rat':5A

2、
setweight(vector tsvector, weight "char", lexemes text[])

assign weight to elements of vector that are listed in lexemes

setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}')
'cat':3A 'fat':2,4 'rat':5A

Debugging Text Feature

ts_debug([ config regconfig, ] document text,  
OUT alias text,
OUT description text,
OUT token text,
OUT dictionaries regdictionary[],
OUT dictionary regdictionary,
OUT lexemes text[])
returns setof record
alias text - short name of the token type

description text - description of the token type

token text - text of the token

dictionaries regdictionary[] - the dictionaries selected by the configuration for this token type

dictionary regdictionary - the dictionary that recognized the token, or NULL if none did

lexemes text[] - the lexeme(s) produced by the dictionary that recognized the token,
or NULL if none did; an empty array ({}) means it was recognized as a stop word
ts_lexize(dict regdictionary, token text) returns text[]
SELECT ts_lexize('english_stem', 'stars');  
ts_lexize
-----------
{star}

SELECT ts_lexize('english_stem', 'a');
ts_lexize
-----------
{}

Limitations

postgres=# select length(to_tsvector(string_agg(md5(random()::text), ' '))) from generate_series(1,100000);  
ERROR: 54000: string is too long for tsvector (3624424 bytes, max 1048575 bytes)
LOCATION: make_tsvector, to_tsany.c:185

Other Query Types: Fuzzy, Regexp, and Similarity queries

Ad Hoc Searches

  • Partition tables, which can implement convergence when a search is based on several various fields.
  • Bloom filter indexes, which support equivalent searches in any combination of fields. Similar to these are lossy filters.
  • GIN multi-column composite indexes and BITMAP scans. For these, searching based on any combinations is supported, and filtering can be implemented at the block level.
  • Multiple single-column indexes and BITMAP scans. For these, databases are automatically optimized, then, the option of whether to choose index scans or bitmap scans is determined based on the cost evaluation. Next, searching based on any combination of fields is supported. Filtering at the block level is also supported.

References

Original Source

--

--

--

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

AliORC: A Combination of MaxCompute and Apache ORC

Using rails config in gem

Highly Effective Kubernetes Deployments with GitOps

Starting with OpenShift

A fresher’s perspective: Jupyter Notebook or Desktop IDE?

SQL Notebooks — The EASIEST Way to Build and Visualise a Conversion Funnel

Packing — Puzzle for Interview Rounds

What Is Kubernetes? — An Introduction To Container Orchestration Tool

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

More from Medium

Using Search Template — ElasticSearch

Using Debezium Source Connector and JDBC Sink with Kafka Connect on AWS RDS PostgreSQL

How to Connect Elastic Sink Connector with Kafka

Apache Kafka Installation on Mac