Shopping Guide System Optimization: Application of E-Commerce Content Deduplication and Filtering

Background

Rapidly increasing volume of published content poses content optimization challenges for online sites. For instance, a trending event may be reported via many media, a good article may be reproduced by various readers, or a product/group of products may be pushed or promoted by many advertisement platforms or shopping guide platforms. This leads to a scenario where shopping guide websites, news media, technical forums, and search engines may have many similar or even duplicate articles or images.

These are minor issues if interests are not involved. However, problems may escalate if the vested interests of the parties involved are at stake.

For example, in an e-commerce scenario, shopping guide websites receive recommendation articles from various social groups. Such recommendation articles introduce many products and provide information about using these products. If an article covers most products in a manner similar to an already published shopping guide article, then usually such an article is not published to avoid a clash of interests.

Therefore, it is critical for websites to avoid publishing any similar or duplicate content. Usually, websites employ a machine review mechanism or assign personnel to review recommendation content to prevent similar or duplicate articles. However, manual review suffers serious bottlenecks, while machine review faces asynchronous batch review issues as every new article needs to be compared against all historical articles. So, the important question that arises is whether there are any technologies that support real-time content review by comparing each new article with all the existing ones?

The answer would be Yes, PostgreSQL technologies are ideal to address such challenges. This article gives you a walkthrough on how PostgreSQL technologies help to optimize e-Commerce content.

Real-time Deduplication: An Example

This section describes a content deduplication case to help you understand the concept better. On a shopping guide platform, every article recommends several products and give their basic introduction. For example, you may be visiting a shopping guide platform and be viewing an article that recommends toys.

Similarly, based on historical recommendations, there may be tens of millions to hundreds of millions of shopping guide articles. Further, each article may recommend dozens of products. Therefore, there may be tens of millions of recommended products. Also, hot products, such as bestsellers, are normally recommended in many articles. Moreover, some sellers may also pay fees to improve the frequency of recommendations generated for their products.

Let’s consider the following example to understand how to use the PostgreSQL smlar plug-in to deduplicate shopping guide articles in a timely manner. For more information about the smlar plug-in, you can check this document.

Now, assume that 1/5 of products in the shopping guide article library are hot products. Create a test database with 60 million shopping guide articles that involve 10 million products. Each article introduces 11 to 50 products. Products with IDs set to 1 to 500,000 are hot products, which are recommended by 10 million articles.

Step 1. Create the smlar plug-in.

Step 2. Create a test table.

Step 3. Insert 50 million records to the table. The requirements for this are as follows:
The int8 value range is from 1 to 10 million. Considering that there are 10 million recommended products.
The length of the int8[ ] array is from 11 to 50. As each article contains 11 to 50 products
The records are evenly distributed.

The following commands show the test data insertion function. Each time the function is called, 40 records are inserted.

Use pgbench to call the preceding function. 50 million test data records will be generated.

Step 4. Generate the recommendation data of 10 million hot products. For example, products with IDs set to 1 to 500,000 are hot products, which are recommended by 10 million articles.

Modify the preceding function.

Use pgbench to call the preceding function to generate 10 million test data records.

Data for 60 million recommendation articles are generated.

Step 5. Now, create a GIN index and use the OPS provided by the smlar plug-in.

Step 6. Next, implement the similarity calculation algorithms. The smlar plug-in supports several default algorithms and custom formulas. Note that the following parameters indicate the results after array element deduplication.

N. i: number of intersected elements
N. a: number of elements in the first array
N. b: number of elements in the second array
The default algorithms are as follows:

Step 6.1. Cosine: N.i/sqrt (N.a x N.b)

Step 6.2. Overlap: N.i

Step 6.3. TF-IDF: It is complex. For more information, refer to my blog post (in Chinese): Application of PostgreSQL Together with Cosine and Linear Correlation Algorithms in the Text, Image, and Array Similarity Fields (2) — smlar Plug-in Introduction

The configuration method is shown as below.

Refer to the following custom formulas to understand their usage.

Step 7. Next, you need to test performance. If an article on a shopping guide website recommends 10 products and 9 of those are the same as mentioned in another shopping guide article, this article is regarded as a duplicate article and will be rejected. You can also set the threshold for the number of the same products in different articles based on your own requirements.

If 10 recommended products in a shopping guide article occurs in another 10 articles with one product in each article, the article will be approved.

The preceding example uses the overlap formula to determine the similarity.

You can retrieve some existing records and create test data to determine the similarity.

Step 7.1 Use functions to simplify SQL statements.

Step 7.2 Test scenario: forty common products.

High similarity with 4 ms response time

Low similarity with 4 ms response time

Non-exist with 4 ms response time

Step 7.3 Test Scenario: ten hot products and thirty common products.

High similarity with 6 ms response time

Low similarity with 6 ms response time

Non-exist with 6 ms response time

Step 7.4. Test scenario: forty hot products.

High similarity with 15 ms response time

Low similarity with 15 ms response time

Non-exist with 15 ms response time

Step 7.5 Conduct a stress test where there are 35 common products, 5 hot products, and the overlap value is 35.

Use pgbench to conduct a stress test.

The stress test result shows that TPS is about 9,400, and CPU resources are used up.

Step 8. You can also use various other test queries, such as.

GIN Index Optimization

A Generalized Inverted (GIN) index is an element expansion index. Following steps describes the GIN index optimization process:

Step 1. Creation Optimization: To accelerate index creation, set a large maintenance_work_mem value while creating a GIN index.

Step 2. Update, Insertion, and Deletion Optimization: Multiple GIN entries need to be updated when an array is inserted or some records are updated or deleted. As a result, the GIN index is updated frequently. PostgreSQL designed a pending list to asynchronously update pending list information to a GIN index for reducing GIN index updates and improving data manipulation language (DML) efficiency. For more detailed information, refer to this article or read this blog

Step 3. Query Optimization: The pending list may affect query efficiency because it is not a tree structure. It is recommended that you balance the DML efficiency and query. Alternatively, if you find that the speed is slow, run the vacuum to merge the pending list to the tree.

GIN Search Principles

Data Retrieval Process (Example: smlar.type=overlap)

A GIN index creates a B-tree with the product IDs as the keys. The key values are heap table line numbers (CTIDs) that compose a posting list or posting tree. When a shopping guide article is received, the database searches for B-tree keys based on product IDs involved in the shopping guide article. If a key matches a product ID, the CNT value is increased by 1. If the CNT value is smaller than the smlar.threshold value, null is returned and no next-layer search is performed. However, if the CNT value is greater than or equal to the threshold, the next-layer search is performed.

The GIN index supports the common bitmap scan search, which is completed in the following two phases:

  • Bitmap Index Scan: In this phase, obtain the heap page IDs corresponding to the CTIDs from the posting list or posting tree corresponding to the keys. For example, obtain 1 and 1000 from (1,1), (1,2), (1,5), and (1000,32). 1 and 1,000 are mapped to a count (CNT), respectively. For example, CNT[1] = 1 and CNT[1000] = 1. Traverse all keys corresponding to product IDs involved in the shopping guide articles. When a key is hit, the CNT corresponding to the heap page ID, namely CNT[heap_page_id], increases by 1. For example, if product IDs involved in the shopping guide article are 1, 3, 101, 198, 798, and 10009, their CNTs are as follows: CNT[1] = 2, CNT[35] = 1, CNT[49] = 3, CNT[101] = 4, CNT[109] = 2, CNT[1000] = 2.

Filter heap page IDs that do not meet requirements based on smlar.threshold. For example, if smlar.threshold is 3, heap page IDs 49 and 101 meet requirements. Generate a bitmap where each bit corresponds to a heap table page. Therefore, the bitmap length depends on the number of pages in the heap table. Set bits corresponding to heap page IDs 49 and 101 that meet requirements to 1 and others to 0. This initiates the bitmap heap scan phase.

  • Bitmap Heap Scan: In this phase, extract all records from heap pages (that is, pages 49 and 101) whose bits are set to 1 in the bitmap. Then, use the CPU to recheck them based on the search criteria provided in the SQL statement.

Verification

Initiate verification process by following the steps listed below.
Step 1. Set the intersection mode to overlap.

Step 2. Set the threshold to 1. Prior to the heap page bitmap generation, heap page IDs that do not meet requirements are filtered off based on the threshold.

Step 2.1. View “Heap Blocks: exact = 4011”. It indicates that there are 4,011 heap page IDs that meet requirements.

Step 2.2. Now, set the threshold to 2 and verify as shown below.

Step 2.3 The following commands provide another example.

Step 2.4 Now, when the threshold is 4, no bit in the bitmap is set to 1 in the bitmap index scan phase. Therefore, in the bitmap heap scan phase, the number of scanned pages is 0.

It is simple to query the CTID or heap page ID.

Bitmap Scan Reference

The efficiency is verified before. If there are 60 million shopping guide articles (including 50 million for common products and 10 million for hot products) and 40 products (including 5 hot products and 35 common products), the TPS for real-time similarity determination is 10,000.

GIN or GiST: Which Is Better?

The GIN and GiST implementations of the smlar plug-in will be introduced later.

Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.