Using the Built-In and Custom Ranking Algorithms of PostgreSQL for Full-Text Searches

By Digoal.

In my previous article in this series, Using PostgreSQL to Create an Efficient Search Engine , specifically in the section “Sorting Algorithms”, I previously discussed the PostgreSQL ranking algorithm. We will be discussing these algorithms and other related topics more in this article. Also, in that same previous article, we also discussed how tsvector divides the document into four levels: title, author, abstract, and content. For these four levels, you can set the corresponding weight for ranking computing. And, you can also set the correction mask of ranking.

However, there are limitations here. Four levels is far from meeting all business needs. So the question remains, is there a way for a more-refined ranking algorithm that can implement more levels in PostgreSQL? Is it possible to customize ranking, that is?

Also, consider the following real-world example. In the e-commerce industry, you can store tags for each shop, and each tag can have a corresponding coefficient. It is reasonable to say that the four levels system is insufficient here. Rather, a system based on these tags and the corresponding coefficients is more suitable. Coefficients can be dynamically adjusted. During the search, some of the shop’s tags are hit, and millions of shops may be searched, but finally, the way to rank them should be computed based on the weight because 10,000 of them can be obtained. In this case, the way to refine the ranking is shown.

In this article, we will explore how you can refine the ranking of shops from the e-commerce example above through two different methods, using tsvector and using a multi-dimensional array, to customize rankings.

Method 1: Using Tsvector

The first method is to use Tsvector. For this example, first look at the shop tag table:

For the tags field, we will be using a user-defined function (UDF) index. That is, we’ll make our own custom function. For our user-defined function, an array to store tags or a tsvector index can be used. If the tsvector is used, it will be our best choice then to use the PostgreSQL full-text search syntax, which, of course, contains things like Contain, Notcontain, and Distance, as we discussed in the previous article. Consider the following example:

Now consider the tag weight table:

Next, there’s a UDF to convert the text to the tag array and the tsvector:

Now, you’ll want to create a tsvector expression index

Below is the UDF for fetching the weights of the hit tags

If not hit, NULL is returned.

Then, when it comes to Full-text search, I recommend that you reference my article Using PostgreSQL to Create an Efficient Search Engine .

After which, let’s look at Refined ranking. The ranking may be obtained a ranking value based on the algorithm by combining the “hit tags, scores, and weights of the tags”. The score of the hit tag is obtained based on what you found for the weights of the hit tags, and the corresponding weight value is obtained from tbl_weight based on the hit tag.

The algorithm is encapsulated into the UDF, and finally the Ranking is obtained. The UDF content of the ranking algorithm is omitted here. Of course, though, make sure to compile the corresponding algorithm based on your own situation or business needs. What follows is some reference code:

For the process to follow, you can delete the tags and corresponding scores by using the regexp_replace function. Then, append tags and corresponding scores with the concat function. After that, then, modify the element score with the regexp_replace function. Note that all of the above can be operated using regular expressions.

Method 2: Using a Multi-Dimensional Array

For this second method, let’s use a multi-dimensional array. In reality, using arrays to store tags and weights is actually much easier to program than using tsvector, as we did in the previous example.

For this method, you’ll want to first understand some array functions before doing anything else.

Next, you’ll want take a look at the shop tag table:

And also the tag weight table:

For the array queries for Contain, Notcontain, and Intersection, see this page.

After all of that, you’ll want to create a refined ranking algorithm, which is similar to the one we used in method 1. A user-defined function (UDF) can be customized for this purpose. Overall, by using this method, you can simplify your overall development workload and increase the efficiency of the resulting system without using regular expressions.

For more information, I recommend that you check out this page.

Original Source

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store