Using HybridDB/PostgreSQL to Find the Best Product Combination for Marketing

Background

When you think about product combinations, you usually think of products that are closely related in form and function. In China, for instance, stuffed buns, soybean milk, and tea eggs are common breakfast items, so it is no surprise when a supermarket promotes them together.

But you have probably also heard of the famous correlation between beer and diapers. Why would such seemingly unrelated items be one of the best product combinations?

We can answer questions like this from accumulated order data. The question of how to use PostgreSQL to find the best product combination was raised by a friend at a PostgreSQL community event on April 8.

In fact, this problem can also be solved with a dedicated recommendation database such as RecDB, which supports a variety of recommendation algorithms.

However, this article does not use RecDB. Instead, it relies on traditional statistics to reach a conclusion.

Limitation of the Statistical Method Described in This Article

The statistical method covered in this article can only calculate the best combination of directly related items (items that appear in the same order).

To calculate combinations of indirectly associated items (for example, if user A bought item 1 and item 2 and user B bought item 2 and item 3, then item 1 and item 3 have an indirect relationship), you need a recommendation algorithm in RecDB or a similar graph search method.
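
To make the distinction concrete, here is a minimal Python sketch of the example above. It is not the article's SQL and not RecDB's algorithm; the two orders and the one-hop neighbor search are assumptions chosen only to show why counting within orders never sees the pair (1, 3).

```python
from itertools import combinations

# Hypothetical orders from the example: user A bought items 1 and 2,
# user B bought items 2 and 3.
orders = [[1, 2], [2, 3]]

# Direct relationships: pairs that appear together in at least one order.
direct = set()
for order in orders:
    direct.update(combinations(sorted(set(order)), 2))

# Build an adjacency map from the direct pairs.
neighbors = {}
for a, b in direct:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Indirect relationships: pairs that never share an order
# but are both linked to a common item.
indirect = set()
for linked in neighbors.values():
    for a, b in combinations(sorted(linked), 2):
        if (a, b) not in direct:
            indirect.add((a, b))

print(sorted(direct))    # [(1, 2), (2, 3)]
print(sorted(indirect))  # [(1, 3)]
```

The per-order counting used in the rest of this article only ever produces the `direct` set.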

Scenario Simulation

Assume that there are 100,000 item IDs in total, and simulate a batch of order or shopping cart records, each containing 5–10 items. Create around 11 million of these records.
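
The shape of this simulated data can be sketched in a few lines of Python. This is only an illustration, not the PL/pgSQL function or pgbench script used in the article; the record count is scaled down, and the names `ITEM_COUNT`, `NUM_ORDERS`, and `make_order` are hypothetical. Items within one order are assumed to be distinct.

```python
import random

ITEM_COUNT = 100_000  # total number of item IDs, as in the scenario
NUM_ORDERS = 1_000    # scaled down from ~11 million for a quick local run (assumption)

def make_order(rng):
    """Return one order: 5 to 10 distinct item IDs in the range 1..ITEM_COUNT."""
    size = rng.randint(5, 10)  # randint is inclusive on both ends
    return rng.sample(range(1, ITEM_COUNT + 1), size)

rng = random.Random(42)  # fixed seed so the run is repeatable
orders = [make_order(rng) for _ in range(NUM_ORDERS)]

print(len(orders))
print(orders[0])
```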

Create a table

Construct Data

Create a function that inserts data into the buy table (array of 5–10 items)

Use pgbench to generate 11 million records

Confirm that the data has been successfully written

GIN Indexes

Split Functions

The purpose of splitting is to break the item array of an order into smaller sets. For example, an order containing five items can be split into 10 (4+3+2+1) sets of 2 items.
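
The count of 10 is the number of 2-item combinations of 5 items. As a quick check, the following Python sketch enumerates them with the standard library; the function name `split_into_pairs` is hypothetical and this is not the article's PL/pgSQL implementation.

```python
from itertools import combinations

def split_into_pairs(items):
    """Return every unordered 2-item set that can be drawn from one order."""
    return list(combinations(items, 2))

order = [1, 2, 3, 4, 5]  # a hypothetical five-item order
pairs = split_into_pairs(order)

print(len(pairs))  # 10
print(pairs)
```

Item 1 pairs with four later items, item 2 with three, and so on, which gives 4+3+2+1.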

Split the order into the following sets

Create a function to complete the split work

Use a recursive query for recombination

Example

Function 1 That Returns a Specified Number of Sets

Assume that the array doesn’t include duplicate elements

Function 2 That Returns All Sets

Function 3 That Returns a Specified Number of Sets and That Only Displays Sets Containing Certain Elements (For Example, Arrays That Contain the Bread ID)
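
The original function body is not reproduced here. As a rough illustration only, the sketch below reads the heading as "return all combinations of a given size that include a required element"; that reading, the function name `combos_containing`, and the sample item IDs other than 6537 are assumptions. It also assumes the array has no duplicate elements.

```python
from itertools import combinations

def combos_containing(items, size, required):
    """Return all combinations of `size` items from `items` that include `required`."""
    if required not in items:
        return []
    others = [x for x in items if x != required]
    # Pick the remaining size-1 items from the other elements, then add the required one.
    return [tuple(sorted((required,) + rest)) for rest in combinations(others, size - 1)]

order = [6537, 1, 2, 3]  # hypothetical order that contains the bread ID
print(combos_containing(order, 2, 6537))  # [(1, 6537), (2, 6537), (3, 6537)]
```

Fixing the required element first avoids generating combinations that would be filtered out afterwards.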

Find the One Best Combination Item for a Single Item

For example, find the best combination item for bread.

Assume that the item ID of bread is 6537.

Split order data into sets and find the sets that have the highest occurrence of this item ID.
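
The counting step can be sketched in Python as follows. The orders are a tiny hypothetical data set, and `best_partners` is an illustrative name rather than anything from the article; the article itself does this work in SQL over the order table.

```python
from collections import Counter

BREAD = 6537  # item ID of bread, as assumed in the text

# Tiny hypothetical order set for illustration only.
orders = [
    [6537, 11, 42],
    [6537, 42, 7],
    [42, 7, 99],
    [6537, 42],
]

def best_partners(orders, item, top=1):
    """Count how often each other item shares an order with `item`."""
    counts = Counter()
    for order in orders:
        items = set(order)
        if item in items:
            counts.update(items - {item})
    return counts.most_common(top)

print(best_partners(orders, BREAD))  # [(42, 3)]
```

Counting the other items in each order that contains bread is equivalent to counting the 2-item sets that include the bread ID.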

Find the Two Best Combination Items for a Single Item

For example, find the two best combination items for bread.
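
In terms of sets, this means counting 3-item sets that contain the bread ID, or equivalently, pairs of other items in orders that contain bread. A minimal Python sketch under the same assumptions as before (hypothetical orders, illustrative function name `best_partner_pairs`):

```python
from collections import Counter
from itertools import combinations

BREAD = 6537  # item ID of bread, as assumed in the text

# Tiny hypothetical order set for illustration only.
orders = [
    [6537, 11, 42, 7],
    [6537, 42, 7],
    [42, 7, 99],
    [6537, 42, 7, 5],
]

def best_partner_pairs(orders, item, top=1):
    """Count how often each pair of other items shares an order with `item`."""
    counts = Counter()
    for order in orders:
        items = set(order)
        if item in items:
            counts.update(combinations(sorted(items - {item}), 2))
    return counts.most_common(top)

print(best_partner_pairs(orders, BREAD))  # [((7, 42), 3)]
```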

Find the One Best Combination Item throughout the Network

This may take a long time.
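
The reason is the volume of intermediate data: every order is split into all of its 2-item sets, so an order with 5 to 10 items contributes between 10 and 45 pairs before any grouping happens. The following Python sketch shows the computation on a tiny hypothetical data set; `best_pairs` is an illustrative name, not the article's SQL.

```python
from collections import Counter
from itertools import combinations

# Tiny hypothetical order set for illustration only.
orders = [
    [1, 2, 3],
    [2, 3, 4],
    [2, 3],
    [1, 4],
]

def best_pairs(orders, top=1):
    """Count every 2-item set across all orders and return the most frequent ones."""
    counts = Counter()
    for order in orders:
        # Sort so that (2, 3) and (3, 2) are counted as the same pair.
        counts.update(combinations(sorted(set(order)), 2))
    return counts.most_common(top)

print(best_pairs(orders))  # [((2, 3), 3)]
```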

Find the N Best Combination Items throughout the Network

This may take a long time.
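
Reading "N" as the number of items in each combination (an assumption about the heading), the pair count generalizes by making the set size a parameter. The sketch below is illustrative Python with hypothetical orders and a hypothetical function name `top_combinations`; the number of sets per order grows quickly with the size, which is why the runtime grows as well.

```python
from collections import Counter
from itertools import combinations

# Tiny hypothetical order set for illustration only.
orders = [
    [1, 2, 3, 4],
    [1, 2, 3],
    [2, 3, 4],
    [1, 2, 3, 5],
]

def top_combinations(orders, size, top=1):
    """Count every `size`-item set across all orders and return the most frequent ones."""
    counts = Counter()
    for order in orders:
        counts.update(combinations(sorted(set(order)), size))
    return counts.most_common(top)

print(top_combinations(orders, size=3, top=2))  # [((1, 2, 3), 3), ((2, 3, 4), 2)]
```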

Summary

1. This case doesn’t require sophisticated techniques. It only splits arrays by recommendation level and counts the number of occurrences.

The following database features are used in this case:

1.1. Support for the array type

1.2. Server-side programming in PL/pgSQL

1.3. Index-based retrieval of arrays that contain a specific element

1.4. The MPP distributed database architecture for better computation performance. For more information, refer to Alibaba Cloud HybridDB for PostgreSQL.

2. Note that the statistical method in this article has a limitation: it can only calculate the best combination of directly related items (items that appear in the same order). To calculate combinations of indirectly associated items, you need a recommendation algorithm in RecDB or a similar graph search method.

3. Alibaba Cloud HybridDB for PostgreSQL provides an MPP architecture that supports horizontal scaling, which makes it well suited to OLAP scenarios. For example, the “group by” operation, which is used many times in this case, can run significantly faster.

4. CPU-based multi-core parallel computing was added in PostgreSQL 9.6 and can significantly improve performance in OLAP scenarios, including the “group by” operations used in this case.

References

https://github.com/DataSystemsLab/recdb-postgresql

https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html

Original Source

https://www.alibabacloud.com/blog/using-hybriddbpostgresql-to-find-the-best-product-combination-for-marketing_594994?spm=a2c41.13103612.0.0
