Location Matching Filtering Syntax for PostgreSQL Full-Text Searches

Alibaba Cloud
5 min readOct 16, 2019

--

By Digoal.

Full-text search feature of PostgreSQL offers a rich array of functions. Besides the open parser and dictionary, it also support built-in and also some extended ranking algorithms.

In fact, location filtering is also an important aspect involved in search queries. Consider the term “Fast and Furious”, for example, it contains location information based on how the words are segmented. Therefore, when conducting a search, we should mind the distance between the words “Fast” and “Furious” so to be able to improve the accuracy of the search.

The distance search syntax of PostgreSQL is as follows:

select * from tbl where ts @@ 'Speed <Distance Value> Passion'::tsquery;  

For example:

select * from tbl where ts @@ 'Speed <1> Passion'::tsquery;

In the above example, the first line is the general template, and the last line is an example. The distance value of 1 indicates the relative distance between the terms ‘speed’ and ‘passion’.

Now let’s consider a more complex example. For this example, let’s look at two long phrases, and see how this system can analysis and work with these long, dense phrases.

Example Application

For this you’ll want to create a test table.

postgres=# create table ts_test (id int, info text, ts tsvector);  
CREATE TABLE

After that, you’ll want to write test data. Consider this data below:

postgres=# insert into ts_test values (1, 'Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights', to_tsvector('scwscfg', 'Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights'));  
INSERT 0 1
postgres=# insert into ts_test values (1, 'The Speed and Passion of Film 8 at the Box Office', to_tsvector('scwscfg', 'The Speed and Passion of Film 8 at the Box Office'));
INSERT 0 1

Next, view the segmentation results. In these results, you’ll see that the location information has been included in the segmentation.

postgres=# select * from ts_test;  
id | info | ts
----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------
1 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'quality service':9 'CO.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'high':10 'Passion':1 'lights':11 'Speed':4
1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
(2 rows)

Now you’ll want to write more interference data.

postgres=# insert into ts_test select 2, 'Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights', to_tsvector('scwscfg', 'Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights') from generate_series(1,1000000);  
INSERT 0 1000000

After this, create an index plug-in RUM, which supports location search.

postgres=# create extension rum;  
CREATE EXTENSION

Create a RUM segmentation index.

postgres=# CREATE INDEX rumidx ON ts_test USING rum (ts rum_tsvector_ops);  
CREATE INDEX

Next, you’ll want to check the query results. The first query does not include the location, so we can see that all records have been matched.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ts_test where ts @@ 'Speed & Passion'::tsquery;  
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using rumidx on public.ts_test (cost=9.60..41.24 rows=25 width=68) (actual time=223.602..456.956 rows=1000002 loops=1)
Output: id, info, ts
Index Cond: (ts_test.ts @@ '''Speed'' & ''Passion'''::tsquery)
Buffers: shared hit=38132
Planning time: 0.226 ms
Execution time: 541.545 ms
(6 rows)

The second query includes a location, and only matches one record, which is the “Fast and Furious” we want.

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ts_test where ts @@ 'Speed <1> Passion'::tsquery;  
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Index Scan using rumidx on public.ts_test (cost=9.60..41.24 rows=25 width=68) (actual time=270.675..270.677 rows=1 loops=1)
Output: id, info, ts
Index Cond: (ts_test.ts @@ '''Speed'' <-> ''Passion'''::tsquery)
Buffers: shared hit=1095
Planning time: 0.111 ms
Execution time: 273.840 ms
(6 rows)

PostgreSQL segmenter is relatively simple to use. Simply follow the instructions given below.

With a Distance Value Search

postgres=# select * from ts_test where ts @@ 'Speed <1> Passion'::tsquery;
id | info | ts
----+---------------------------+--------------------------------------------
1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
(1 row)

Without a Distance Value Search

postgres=# select * from ts_test where ts @@ 'Speed & Passion'::tsquery limit 5;
id | info | ts
----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------
1 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Tax':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'passion':1 'light':11 'Speed':4
2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
2 | Passion, Innovation, Persistence, and Speed—Beijing North Venture Taxi CO., LTD.'s quality service highlights | 'Quality Service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
(5 rows)

Last, if the distance is a range, we can also use user-defined function (UDF) to implement filtering.

-- Create a UDF that retrieves the distance internal between two words.
create or replace function get_lexeme_pos_range(tsvector, text, text) returns int4range as
$$
declare
a_pos int[];
b_pos int[];
m1 int;
m2 int;
begin
-- All the positions where the first word appears.
select positions into a_pos from (select * from unnest($1)) t where lexeme=$2;
-- All the positions where the second word appears.
select positions into b_pos from (select * from unnest($1)) t where lexeme=$3;
-- The position of where the two words appear, find the interval.
-- Return the range type.
select min(abs(t1.pos-t2.pos)), max(abs(t1.pos-t2.pos)) into m1,m2 from
(select unnest(a_pos) pos) t1 cross join
(select unnest(b_pos) pos) t2;
return int4range(m1,m2+1);
end;
$$
language plpgsql strict immutable;
-- Test, find the distance of 1 to 2 (excluding 2).postgres=# select get_lexeme_pos_range(ts, 'Speed', 'Passion'), * from ts_test where ts @@ tsquery 'Speed & Passion' and get_lexeme_pos_range(ts, 'Speed', 'Passion') && int4range(1,2) limit 1;
get_lexeme_pos_range | id | info | ts
----------------------+----+---------------------------+--------------------------------------------
[1,2) | 1 | The Speed and Passion of Film 8 at the Box Office | 'Passion':3 'Film':1 'at the':5 'Box Office':4 'Speed':2
(1 row)
Time: 0.713 ms-- Test, take the distance of 2 to 5 (excluding 5)postgres=# select get_lexeme_pos_range(ts, 'Speed', 'Passion'), * from ts_test where ts @@ tsquery 'Speed & Passion' and get_lexeme_pos_range(ts, 'Speed', 'Passion') && int4range(2,5) limit 1;
get_lexeme_pos_range | id | info | ts
----------------------+----+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------
[3,4) | 1 | Passion, Innovation, Persistence, Speed—Beijing North Venture Taxi Co., LTD.'s quality service highlights | 'quality service':9 'Co.':8 'Taxi':7 'Venture':6 'Innovation':2 'Beijing':5 'Persistence':3 'High':10 'Passion':1 'light':11 'Speed':4
(1 row)
Time: 0.682 ms

Original Source

--

--

Alibaba Cloud
Alibaba Cloud

Written by Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

No responses yet