Build a Personalized Recommendation System on Alibaba Cloud in Three Steps

Step up the digitalization of your business with Alibaba Cloud 2020 Double 11 Big Sale! Get new user coupons and explore over 16 free trials, 30+ bestselling products, and 6+ solutions for all your needs!

By Hanchao

Background

Overview

Figure 1. The overall framework of recommendation algorithms

1) Use the vector library in AnalyticDB for PostgreSQL to obtain user feature vectors: The system analyzes historical browsing data of users, constructs user profiles, and builds user preference models to obtain user feature vectors. Specifically, the system obtains details about the news read by users from their browsing logs and extracts keywords from each piece of the news to establish user profiles. If a user reads several pieces of National Basketball Association (NBA) playoff news, which contain keywords such as NBA, basketball, superstar, and sports, it indicates that the user is an NBA fan. The system uses the feature_extractor function in AnalyticDB for PostgreSQL to convert these keywords to vectors and import the vectors into the vector library of AnalyticDB for PostgreSQL to obtain the feature vector of the user.
2) Push news based on the vector library and the logistic regression prediction model of AnalyticDB for PostgreSQL: The system uses the vector library of AnalyticDB for PostgreSQL to retrieve the first 500 pieces of unread news which may interest the user. Then, it extracts the publish time and click-through rates (CTRs) of the 500 pieces of news and pushes news to the user based on the logistic regression prediction model. This model is obtained based on the browsing records of the user.

The feature_extractor function in AnalyticDB for PostgreSQL uses the Bidirectional Encoder Representations from Transformers (BERT) model. This model is trained based on a large number of corpora. It contains semantic information and has a higher query precision than the term frequency-inverse document frequency (TFIDF) algorithm.

Schema Design

Figure 2. Schema of tables in AnalyticDB for PostgreSQL in a personalized recommendation system

Let’s look at these three tables.

The news table stores information about news, including the news ID (news_id), publication time (create_time), title (title), body (content), total number of clicks (click_times), and number of clicks within two hours (two_hour_click_times). The personalized news recommendation system extracts keywords from the news title and the body and converts the keywords into a vector (news_vector). When data is inserted into the news table, the system automatically converts keywords into vectors and inserts the vectors and other news information into the news table.

CREATE TABLE news (
news_id bigint,
create_time timestamp,
title varchar(100),
content varchar(200),
keywords varchar(50),
click_times bigint,
two_hour_click_times bigint,
news_vector real[],
primary key (news_id)
) distributed by (news_id);

The browses_history table stores information about news read by users, including the news ID (news_id), user ID (person_id), and reading time of the news (browse_time).

CREATE TABLE browses_history (
browse_id bigint,
news_id bigint,
person_id bigint,
browse_time timestamp,
primary key (browse_id)
) distributed by (browse_id);

The person table stores user information, including the user ID (person_id), age (age), and star class (star).

CREATE TABLE person(
person_id bigint,
age bigint,
star float,
primary key (person_id)
) distributed by (person_id);

Implementing a Personalized Recommendation System

1) Extract News Feature Vectors

select feature_extractor('text', 'ADB For PG is very good!') ;

The following figure shows a piece of news. The personalized news recommendation system stores the news information to the news table in two steps:

1) Extract news keywords. AnalyticDB for PostgreSQL does not support the extract_tags function. Instead, call the (jieba.analyse.extract_tags(title + content, 3)) function in the jieba NLP system to extract keywords.
2) Execute the INSERT statement to store news information, including the keywords and news feature vectors, to the news table.

insert into news(news_id, create_time, title, content, 
keywords, click_times,two_hour_click_times)
values(1, now(),'South Korea military: North Korea launched a UFO eastward in the north Pyeongan Province','According to South Korea's Joint Staff Headquarters, at about 16:30 p.m local time today, North Korea launched a UFO eastward in the north Pyeongan Province.', 'South Korea, North Korea, UFO', 123, 3);

2) Extract User Feature Vectors

Based on the news browsing logs of the users, their browsing keywords can be obtained easily. For example, by executing the SELECT statement, the return result shows that the value of person_id is 9527.

select keywords    
from Person p, Browses_History bh, News n
where p.person_id = bh.person_id and bh.news_id = n.news_id and p.person_id = 9527;

2.2) Convert Browsing Keywords to User Feature Vectors.

The system extracts all browsing keywords of users in the same way. For example, the user whose person_id is 9527 reads the news with the following keywords: NBA sports, finals, Miami Heat, and Houston Rockets. The system then uses the feature_extractor function to convert the keywords to vectors.

select feature_extractor('text', 'NBA sports,finals,Miami Heat,Houston Rockets'));

3) Provide News Recommendations Based on User Feature Vectors

select news_id, title, content, (extract(epoch from (now()-create_time)) * w1 + click_times/extract(epoch from (now()-create_time)) * w2 + two_hour_click_times/extract(epoch from (now()-create_time)) * w3 + ann_distance * w4) as rank_score
from (select *, l2_distance(news_vector, feature_extractor('textf', 'NBA sports,finals,Miami Heat,Houston Rockets')) as ann_distance from news order by ann_distance desc limit 500) S
order by rank_score desc;

The parameters are described as follows:

  • ann_distance: indicates the degree of correlation between the user and the news.
  • create_time: indicates the time when the news was published.
  • click_times/(now()-create_time): indicates the total CTR of the news.
  • two_hour_click_times/(now()-create_time): indicates the CTR of the news within the last two hours.
  • w1, w2, w3, and w4: indicate the weights of all attributes during the learning of the logistic regression model.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.