Double Eleven Technology Series: Flash Sales Optimization on PostgreSQL

Alibaba Cloud
8 min readOct 30, 2018

11.11 The Biggest Deals of the Year. 40% OFF on selected cloud servers with a free 100 GB data transfer! Click here to learn more.

Flash sales has always been indispensable for e-commerce as well as for other online marketing efforts. For Alibaba, flash sales during the Double Eleven online shopping festival (Singles’ Day) on November 11 is a necessity. However, the scale of Alibaba’s Double Eleven poses significant challenges towards the servers and databases that support Alibaba’s e-commerce platform.

Due to the competitive nature of flash sales, it’s not unusual to see customers coming up with innovative but unfair ways to gain an upper hand over other customers. For example, on every November 11, Alibaba’s transaction platforms receive heavy stress from bots, scripts, and external plug-ins for automated rush purchasing systems.

To cope with these challenges, the team at Alibaba has come up with many ways to optimize flash sales. For example, databases may adopt the queuing mechanism, asynchronous messages, or transaction merging to optimize transactions. In this article, I will introduce a flash sales optimization measure called advisory locks (AD LOCK) for PostgreSQL.

Flash Sales Typical Scenario

Although you may already be very familiar with flash sales, I still want to introduce some relevant ideas to make this article complete.

For example, let’s imagine that there is one set of iPhone priced USD $1 for sale. This iPhone can be regarded as a record. When the flash sales starts, the customer who successfully buys this iPhone (updates the lock of this record) can win.

The challenge to the database is that the database must concurrently handle many update requests on only one record. Only one or a few requests can be successful, and the others fail or cannot update the record.

For example, if there are 100 iPhones for flash sales and 1 million customers participate in the flash sales, the minimum granularity for the database is a row lock. When one customer is updating this record, the other 999,999 customers are waiting. But ideally, the database should only process the 100 lucky winners in the flash sales. It is meaningless to process the requests from the customers who will not be receiving the iPhone as this will only waste resources.

Traditionally, a flag bit is used to indicate whether this record has been updated or how many times (how many iPhones) the record has been updated.

update tbl set xxx=xxx,upd_cnt=upd_cnt+1 where id=pk and upd_cnt+1<=5;   -- Assume that five iPhones are for flash sales.

However, this method has the following defect:

The customer obtaining the lock may be successful or fail in handling this record, or may handle the record for a long time (because the database responds slowly). Before the transaction is finished, other sessions have to wait. Waiting is unnecessary and a waste of time for customers who did not obtain the lock.

Commonly Used Methods to Coping with Flash Sales

Generally, for update, nowait can be used to avoid waiting. That is, if a customer does not obtain the lock, the customer does not need to wait.

begin;
select 1 from tbl where id=pk for update nowait; -- If the customer does not obtain the lock, an error is returned, and the transaction is rolled back.
update tbl set xxx=xxx,upd_cnt=upd_cnt+1 where id=pk and upd_cnt+1<=5;
end;

This method shortens the waiting time, because an error is returned when the lock is not obtained.

Another method is request merging. That is, multiple update requests are merged into one update request. This method requires a change to kernel and will damage ACID, because if the merged request fails, all the merged update requests fail. (This is different from group update. Group update will not damage the ACID.)

Next, let’s learn about ad lock.

What Is Ad Lock?

Ad lock is a user-oriented lightweight lock. Its object is an integer. The lock types include transaction, session, shared, and exclusive.

In a single database, a lock can be obtained with a unique integer value. If the integer value is repeated, a new lock can be added using TRY. FALSE is returned immediately when no lock is obtained.

https://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS

NameReturn TypeDescriptionpg_advisory_lock(key bigint)voidObtain exclusive session level advisory lockpg_advisory_lock(key1 int, key2 int)voidObtain exclusive session level advisory lockpg_advisory_lock_shared(key bigint)voidObtain shared session level advisory lockpg_advisory_lock_shared(key1 int, key2 int)voidObtain shared session level advisory lockpg_advisory_unlock(key bigint)booleanRelease an exclusive session level advisory lockpg_advisory_unlock(key1 int, key2 int)booleanRelease an exclusive session level advisory lockpg_advisory_unlock_all()void Releaseall session level advisory locks held by the current sessionpg_advisory_unlock_shared(key bigint)booleanRelease a shared session level advisory lockpg_advisory_unlock_shared(key1 int, key2 int)booleanRelease a shared session level advisory lockpg_advisory_xact_lock(key bigint)voidObtain exclusive transaction level advisory lockpg_advisory_xact_lock(key1 int, key2 int)voidObtain exclusive transaction level advisory lockpg_advisory_xact_lock_shared(key bigint)voidObtain shared transaction level advisory lockpg_advisory_xact_lock_shared(key1 int, key2 int)voidObtain shared transaction level advisory lockpg_try_advisory_lock(key bigint)booleanObtain exclusive session level advisory lock if availablepg_try_advisory_lock(key1 int, key2 int)booleanObtain exclusive session level advisory lock if availablepg_try_advisory_lock_shared(key bigint)booleanObtain shared session level advisory lock if availablepg_try_advisory_lock_shared(key1 int, key2 int)booleanObtain shared session level advisory lock if availablepg_try_advisory_xact_lock(key bigint)booleanObtain exclusive transaction level advisory lock if availablepg_try_advisory_xact_lock(key1 int, key2 int)booleanObtain exclusive transaction level advisory lock if availablepg_try_advisory_xact_lock_shared(key bigint)booleanObtain shared transaction level advisory lock if availablepg_try_advisory_xact_lock_shared(key1 int, key2 int)booleanObtain shared transaction level advisory lock if available

Generally, the minimum lock (opened to customers) supported by database is row lock. Compared with LWLOCK and SPINLOCK, row lock is heavy. Therefore, row lock is often a bottleneck in flash sales, including the lock waiting time.

Purpose of Ad Lock

In addition to supporting flash sales, ad lock has the following functions:

  1. Concurrent security check
  2. UPSERT in recursive call
  3. Ensuring atomic operation in service logic design

Performance of Ad Lock

Ad lock is lightweight. It does not need to access data or execute much code, so it has a high efficiency.

A 32-core, 64-thread computer can handle 1.31 million requests per second.

vi test.sql
\set id random(1,100000000)
select pg_try_advisory_xact_lock(:id);
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 96 -j 96 -T 100transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 96
number of threads: 96
duration: 100 s
number of transactions actually processed: 131516823
latency average = 0.072 ms
latency stddev = 0.070 ms
tps = 1314529.211060 (including connections establishing)
tps = 1315395.309707 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 \set id random(1,100000000)
0.074 select pg_try_advisory_xact_lock(:id);

Example of Using Ad Lock in Flash Sales

A product has a unique ID in the database. We can lock the ID (if the ID is repeated in different tables, add offset or use other measures to avoid conflict).

The row is locked and updated only when the ID is successfully locked. This avoids ineffective row lock waiting and redundant query code.

The QPS of ad lock in handling concurrent update requests for one record can reach 391,000/s. The state of product will be changed to sold out quickly, avoiding a waste of database resources.

create table test(id int primary key, crt_time timestamp);
insert into test values (1);
vi test.sql
update test set crt_time=now() where id=1 and pg_try_advisory_xact_lock(1);
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 100transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 100 s
number of transactions actually processed: 39104368
latency average = 0.163 ms
latency stddev = 0.216 ms
tps = 391012.743072 (including connections establishing)
tps = 391175.983419 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.163 update test set crt_time=now() where id=1 and pg_try_advisory_xact_lock(1);

Now, the database host has 66.2% available CPU resource.

top - 13:12:43 up 51 days, 18:41,  2 users,  load average: 1.12, 0.97, 0.78
Tasks: 1463 total, 28 running, 1435 sleeping, 0 stopped, 0 zombie
Cpu(s): 24.5%us, 9.3%sy, 0.0%ni, 66.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 529321832k total, 235226420k used, 294095412k free, 903076k buffers
Swap: 0k total, 0k used, 0k free, 62067636k cached

Comparison with Traditional Method

Typically, select for update nowait is used to avoid waiting.

begin;
select 1 from tbl where id=pk for update nowait; -- If the customer does not obtain the lock, an error is returned, and the transaction is rolled back.
update tbl set xxx=xxx,upd_cnt=upd_cnt+1 where id=pk and upd_cnt+1<=5;
end;

PG can use the do clauses to merge the preceding commands into one block.

The traditional method can handle 86,000 requests per second.

vi test.sql
do language plpgsql
$$ declare begin with t as (select * from test where id=1 for update nowait) update test set crt_time=now() from t where t.id=test.id; exception when others then return; end; $$;pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 100transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 64
number of threads: 64
duration: 100 s
number of transactions actually processed: 8591222
latency average = 0.744 ms
latency stddev = 0.713 ms
tps = 85888.823884 (including connections establishing)
tps = 85924.666940 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.744 do language plpgsql
$$ declare begin with t as (select * from test where id=1 for update nowait) update test set crt_time=now() from t where t.id=test.id; exception when others then return; end; $$;

There is 54.5% available CPU resource.

top - 13:13:48 up 51 days, 18:42,  2 users,  load average: 8.14, 2.69, 1.37
Tasks: 1464 total, 21 running, 1442 sleeping, 0 stopped, 1 zombie
Cpu(s): 41.7%us, 3.8%sy, 0.0%ni, 54.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 529321832k total, 235256052k used, 294065780k free, 903176k buffers
Swap: 0k total, 0k used, 0k free, 62068308k cached

Real Deduction Throughput of a Single Product

When testing the deduction throughput of a product, you can find that the QPS is high. However, the actual deduction throughput is lower because an error is returned if no lock is obtained.

postgres=# create table upd(id int primary key, cnt int8);
postgres=# insert into upd values(1,0);
vi t0.sql
update upd set cnt=cnt-1 where id=1 and pg_try_advisory_xact_lock(1);
\sleep 10 us
....
vi t7.sql
update upd set cnt=cnt-1 where id=1 and pg_try_advisory_xact_lock(1);
\sleep 80 us
pgbench -M prepared -n -r -P 1 -f t0.sql -f t1.sql -f t2.sql -f t3.sql -f t4.sql -f t5.sql -f t6.sql -f t7.sql -c 64 -j 64 -T 100
postgres=# select * from upd;
id | cnt
----+---------
1 | -611249
(1 row)

The inventory of a single product is deducted 6112.49 times per second. Generally, the inventory of the products for flash sales is within 1000; otherwise, customers do not need to scramble for the products.

Therefore, this value can meet requirement.

Real Deduction Throughput of All Products on the Platform

Assume 10 million products are available on the platform. Test the total deduction throughput using this method.

postgres=# create table upd(id int primary key, cnt int8);
postgres=# insert into upd select generate_series(1,10000000), 0;
vi test.sql
\set id random(1,10000000)
update upd set cnt=cnt-1 where id=:id and pg_try_advisory_xact_lock(:id);
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -T 100postgres=# select sum(cnt) from upd;
sum
-----------
-27233112
(1 row)

Check the cnt for the real deduction statistics. On the entire platform, 272331.12 products are deducted per second. That is, 270,000 products are sold every second.

Advantages of Ad Lock Compared with Other Flash Sales Optimization Methods

Ad lock reduces CPU utilization and shortens waiting time to the minimum. According to the test case in this article, a record can be updated 391,000 times per second. However, the traditional method can reach only 86,000/s.

Furthermore, ad lock does not damage the ACID. A request affects a single transaction, and does not affect other transactions. In contrast, the merging method damages the ACID. If the merging fails, all merged request will fail.

To learn more about PostgreSQL on Alibaba Cloud, visit www.alibabacloud.com/product/apsaradb-for-rds-postgresql

Reference:https://www.alibabacloud.com/blog/double-eleven-technology-series%3A-flash-sales-optimization-on-postgresql_594096?spm=a2c41.12185884.0.0

--

--

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com