PCC Social Media “Like” Scenario — Database Design and Performance Stress Testing

By Digoal

Background

The Performance Challenge Championship (PCC) is an event organized by ArchNotes. After learning the rules of the competition, I found that PostgreSQL is very suitable for this scenario. I reproduced the scenario as-is and implemented it with PG, but how does it perform?

The competition is described as follows (page in Chinese): https://github.com/archnotes/PCC

Implement a Like Feature Similar to Facebook's:

  • The Like operation can be performed on an object (a feed, an article, or a URL), but a user may not like the same object twice; an error code is returned when the Like operation is attempted a second time on the same object.
  • An isLike interface is available, which returns whether the object specified by the parameter has been liked by the current user.
  • The Like count of an object must be visible.
  • The list of users who liked an object (similar to QQ Zone) must be visible.
  • In the list above, friends (the social list) are displayed first.
  • Data volume: 10 million new Like objects per day, and 300,000 Like-counter queries per second.

Data Volume

  • There are 100 million users, each user has 1–10,000 friends, and a single object receives 1–100,000 likes.
  • The competition dataset (in plain text format) is provided, and participants need to import it into their own database.

Test Dataset Format Definition

Download the test data: https://github.com/archnotes/PCC/tree/master/data (non-stress testing data)

User Data Format

uid is a uint64; there are 100 million data entries.

Format of User Friend Data

uid and friend_id are uint64 values, and only two-way friend relations exist. Friend counts usually follow a long-tailed distribution. For a friend relation on the scale of 100 million users × 1,000 friends: 90% of users have 100 or fewer friends, 8% have 300–1,000 friends, and 2% have 1,000–10,000 friends.

Like List Data Format of an Object

oid and uid are uint64 values. There are 200 million objects, each with 1–100,000 likes.

Database Design

The relation between users is following (or being followed) or mutual following.

The relation between a user and an object is liking or not liking.

The design separates detailed data from statistical data. Statistical data exists so that relations and counts (followed users, likes) can be queried more quickly.

Detailed data can be recorded in logs or in the database. Statistical data (relations, counts, like lists) is written into the database in a streaming manner.

Relation Design

Structure Design
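
A minimal sketch of the aggregate shapes this design implies; the table and column names (t_follow, t_like, fids, uids) are illustrative, not the author's originals. In the demo below, these shapes can be realized as continuous views rather than plain tables.

    -- Follow relation: one row per user, friends kept in an array plus a counter.
    CREATE TABLE t_follow (
      uid  int8 PRIMARY KEY,   -- the user
      fids int8[],             -- users this user follows (friends list)
      cnt  int8                -- number of users followed
    );

    -- Like relation: one row per object, likers kept in an array plus a counter.
    CREATE TABLE t_like (
      oid  int8 PRIMARY KEY,   -- the liked object
      uids int8[],             -- users who liked the object
      cnt  int8                -- number of likes
    );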

Query Implementation

1. Users that the user follows.

2. Users that follow the user. Not involved in this scenario (a reverse relation table can be created if necessary).

3. Objects that the user likes. Not involved in this scenario (a reverse relation table can be created if necessary).

4. Users that like the object.

5. The number of times the object has been liked.

6. Which users who like the object are also my friends? (SQL sketches for queries 1, 4, 5, and 6 follow this list.)
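
Assuming the illustrative t_follow/t_like shapes above, the queries might look like the following:

    -- 1. Users that the user follows
    SELECT fids FROM t_follow WHERE uid = $1;

    -- 4. Users that like the object
    SELECT uids FROM t_like WHERE oid = $1;

    -- 5. The number of times the object has been liked
    SELECT cnt FROM t_like WHERE oid = $1;

    -- 6. Users who like the object and are also my friends:
    --    intersect my friends array with the object's likers array.
    SELECT ARRAY(SELECT unnest(f.fids) INTERSECT SELECT unnest(l.uids))
    FROM t_follow f, t_like l
    WHERE f.uid = $1 AND l.oid = $2;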

Demo

Create a stream. The “Follow” behavior is written to the stream and also to the detail data (optional).
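
A sketch, assuming PipelineDB (the stream and table names are illustrative):

    -- Stream of "follow" events (PipelineDB 0.9.x syntax; in the 1.x
    -- extension a stream is created as a foreign table instead).
    CREATE STREAM s_follow (uid int8, fid int8);

    -- Optional: also keep the detail data.
    CREATE TABLE t_follow_detail (
      uid      int8,                     -- the follower
      fid      int8,                     -- the user being followed
      crt_time timestamp DEFAULT now()   -- when the follow happened
    );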

Create a continuous view that computes real-time statistics from the “Follow” behavior.
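
For example (cv_follow is an illustrative name):

    -- Continuous view: per-user friends array and count, kept up to date
    -- in real time as events arrive on the stream.
    CREATE CONTINUOUS VIEW cv_follow AS
      SELECT uid, array_agg(fid) AS fids, count(*) AS cnt
      FROM s_follow
      GROUP BY uid;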

Activate StreamCompute
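
Continuous views are generally active by default; depending on the PipelineDB version, they can also be toggled explicitly, for example:

    -- PipelineDB 1.x extension function (name per the 1.x docs).
    SELECT pipelinedb.activate('cv_follow');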

The Follow operation function determines whether the user is already followed. If so, an exception is returned; otherwise, the follow is recorded. (This logic could also live in the application, but that would require several round trips to the database, which is not good.)
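
A sketch of such a function, assuming the detail table above plus a unique constraint to reject duplicates (all names illustrative):

    -- A duplicate follow must fail fast, so enforce uniqueness.
    CREATE UNIQUE INDEX ux_follow_detail ON t_follow_detail (uid, fid);

    CREATE OR REPLACE FUNCTION f_follow(i_uid int8, i_fid int8)
    RETURNS void AS $$
    BEGIN
      -- Write the detail record; the unique index rejects a second follow.
      INSERT INTO t_follow_detail (uid, fid) VALUES (i_uid, i_fid);
      -- Feed the stream so the continuous view is updated in real time.
      INSERT INTO s_follow (uid, fid) VALUES (i_uid, i_fid);
    EXCEPTION WHEN unique_violation THEN
      RAISE EXCEPTION 'user % already follows user %', i_uid, i_fid;
    END;
    $$ LANGUAGE plpgsql;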

The function can be adjusted based on actual needs. For example, if you need the aggregated array returned after a like, just query the continuous view.

Test

Stress Testing 1

1. User ID range

1–100 million

2. Article ID range

1–200 million

3. Hot article ID range

A total of 200 million articles, using a Gaussian distribution for the LIKE operation. Article IDs in the middle 2.0/xx of the range, centered on the peak of the bell curve, cover 95% of the occurrence probability; article IDs in the middle 1.0/xx cover 67%.

The closer an article ID is to the peak of the curve (article ID = 100 million), the higher its probability of occurrence.

The larger the value of xx, the sharper the curve, that is, the fewer hot articles.

4. Random users like random articles

5. Random users like hot articles

First, Generate Basic Data Based on the Above Requirements

Basic data includes the stress testing script, the liked articles, and article IDs generated using a Gaussian distribution. After prolonged stress testing, the per-article like counts present a Gaussian distribution, and the article IDs at the peak of the curve are liked the most.

xx is set to 10.0, meaning the article IDs in the middle 20% of the range, centered on the peak of the bell curve, cover 95% of the occurrence probability, and the article IDs in the middle 10% cover 67%.

The larger the value of xx, the higher the probability of the article IDs at the top of the curve.
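
A sketch of the pgbench script, assuming a like function built like f_follow above (the f_like function and the file name like.sql are illustrative; random_gaussian is pgbench's built-in, available since PostgreSQL 9.6):

    -- like.sql: random users like Gaussian-distributed hot articles
    \set uid random(1, 100000000)
    \set oid random_gaussian(1, 200000000, 10.0)
    SELECT f_like(:oid, :uid);

It might be run with something like: pgbench -M prepared -n -r -f like.sql -c 256 -j 256 -T 300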

Stress testing with 256 connections results in 177,000 LIKE requests per second.

Number of articles after staged stress testing

The results are in line with expectations, and the stress testing can be continued. (Or we can choose the exponential distribution for testing)

CPU usage is as follows when no optimization is performed

Continuous stress testing of the Like operation generates like data for 200 million articles; then stress testing 2 is carried out.

Generate User Relation Data

1. User ID range

1–100 million

2. User friend distribution

90% of users have 100 or fewer friends, 8% have 300–1,000 friends, and 2% have 1,000–10,000 friends.

Generate 90% of user relations

Generate 8% of user relations

Generate 2% of user relations
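
The three generation steps above might be sketched in SQL as follows, using the illustrative t_follow table. Only the 90% bucket is shown; for brevity the sketch ignores the two-way symmetry the spec requires, and in practice the insert would be batched:

    -- 90% of users: 100 or fewer friends (uids 1..90,000,000 here).
    INSERT INTO t_follow
    SELECT uid, fids, array_length(fids, 1)
    FROM (
      SELECT uid,
             ARRAY(SELECT (1 + random() * 99999999)::int8
                   FROM generate_series(1, 1 + (uid % 100))) AS fids
      FROM generate_series(1, 90000000) AS uid
    ) t;
    -- The 8% (300–1,000 friends) and 2% (1,000–10,000 friends) buckets
    -- follow the same pattern over their own uid ranges.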

Finally, relations for 100 million users are generated, occupying 123 GB of table space and 2.7 GB of index space.

Stress Testing 2

1. Who likes the article?

2. How many times has the article been liked?

3. Which of my friends like the article?

Stress testing script 1: Who likes the article?

Stress testing script 2: How many times has the article been liked?

Stress testing script 3: Which of my friends like the article?
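
Sketches of the three scripts, assuming the illustrative structures above; each would be its own pgbench script file (pgbench 9.6+ accepts multi-line SQL and comments):

    -- script 1: who likes the article?
    \set oid random(1, 200000000)
    SELECT uids FROM t_like WHERE oid = :oid;

    -- script 2: how many times has the article been liked?
    \set oid random(1, 200000000)
    SELECT cnt FROM t_like WHERE oid = :oid;

    -- script 3: which of my friends like the article?
    \set uid random(1, 100000000)
    \set oid random(1, 200000000)
    SELECT ARRAY(SELECT unnest(f.fids) INTERSECT SELECT unnest(l.uids))
    FROM t_follow f, t_like l
    WHERE f.uid = :uid AND l.oid = :oid;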

Stress testing result 1: Who likes the article? The result, 1.01 million/s, is in line with expectations.

Stress testing result 2: How many times has the article been liked? The result is 1.04 million/s.

Stress testing result 3: Which of my friends like the article? The result is 648,000/s.

Optimization Methods

1. The longer the array, the larger the space occupied by a record. Using TOAST sliced (out-of-line) storage can effectively improve the efficiency of querying the non-array fields (see the sketch after this list).

2. Profiling for targeted optimization.

3. A compression interface to reduce the CPU consumption of PGLZ compression:

https://commitfest.postgresql.org/21/1294/
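
For item 1, out-of-line TOAST storage can be forced with standard PostgreSQL DDL; for example, on the illustrative t_like table:

    -- Store the potentially long likers array out of line and uncompressed,
    -- so queries that touch only oid/cnt do not pay for the array (and no
    -- CPU is spent on PGLZ compression, cf. item 3).
    ALTER TABLE t_like ALTER COLUMN uids SET STORAGE EXTERNAL;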

Summary

The most common operations in Weibo and Facebook:

1. Follow a user or like a message or a post.

This is the write path; it requires fast writes, and the result (liked or followed) must be visible immediately after writing.

2. Query the friends list

To query quickly, the fastest method is a PK query, but one user may follow many users, and fetching many individual records is obviously slower.

An array can be used to store the friends list.

However, you must consider the write speed when using an array to store the list.

Therefore, it is best to use StreamCompute aggregation, because PG has a StreamCompute plug-in that can perform stream computation inside the database.

3. Query the list of friends to be followed

This is the reverse friend relation, which also requires fast queries and uses the same method as the forward relation.

4. Query the number of times articles (posts) have been liked, users who like the articles (posts), and which of them are my friends.

First, for the number of times articles (posts) have been liked, only a counter is actually needed. To make queries fast, it must be a stored value rather than a COUNT(*) aggregation computed at query time.

For the users who like the articles (posts), an array is likewise used for storage to improve query speed, with PG's built-in StreamCompute performing the aggregation.

Determining which of those users are my friends is very simple: just intersect two arrays, the friends list and the list of users who like the article (post).
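
For example, with two int8[] literals:

    -- Intersecting a friends list with a likers list.
    SELECT ARRAY(SELECT unnest(ARRAY[1,2,3,4]::int8[])
                 INTERSECT
                 SELECT unnest(ARRAY[3,4,5]::int8[]));
    -- returns {3,4} (element order is not guaranteed)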

StreamCompute in PG solves the problem of real-time writing combined with real-time aggregation. And because the data is aggregated in real time, the various query requirements are easily met.

Performance indicators (not optimized) obtained after tests:

1. Like posts (articles)

177,000 likes/s, which is expected to be optimized to 300,000 likes/s.

2. Who likes the articles?

1,016,000/s

3. How many times has the article been liked?

1,041,000/s

4. Which of my friends like the article?

648,000/s

5. Hardware:

(an x86 server priced at about 100,000 RMB, with 12 × 8 TB SATA hard drives and one SSD used as bcache)

The built-in StreamCompute feature in the database is great.

Original Source

https://www.alibabacloud.com/blog/pcc-social-media-like-scenario---database-design-and-performance-stress-testing_595042?spm=a2c41.13112404.0.0
