How to Ensure a Pleasant Chat between Hundreds of Millions of Buyers and Sellers during Double 11


By Alibaba Cloud Storage


Ten months ago, the sudden pandemic outbreak created urgent, large-scale demand for online office work, which posed significant challenges to the cost, maintenance, and stability of DingTalk's existing storage systems. After extensive research, DingTalk chose Tablestore from the Alibaba Cloud Storage Team as its unified storage platform. Through close cooperation between the DingTalk Team and the Tablestore Team, all of the message synchronization data and full message data of the DingTalk IM system and the multi-tenant PaaS were migrated. During the 2020 Double 11 Shopping Festival, Tablestore fully supported the Instant Messaging (IM) services of Alibaba Group for the first time, ensuring smooth communication between hundreds of millions of buyers and sellers.

This article describes the IM storage architecture of DingTalk and the work done by the Tablestore Team to meet the migration requirements for stability, functionality, and performance.

1. The Message Systems of Tablestore and DingTalk

DingTalk Message Systems

2. Upgrading of Storage Architecture

The preceding figure shows the original architecture of the IM system. In this system, a message is stored twice after it is sent:

  1. Full Message Storage: Step 4 in the figure shows the message stored in DB for persistence. This message is mainly used to call the system to display the homepage information of the app. Full messages are permanently retained.
  2. Synchronous Protocol Storage: As shown in Step 5, the message is stored in Tablestore and then read, merged, and pushed to users. The data of synchronous protocol storage is retained for a specified number of days.

After the architecture upgrade, full message storage no longer depends on DB, and all data is written into Tablestore. IM system storage now depends only on Tablestore, which brings the following benefits:

  1. Low Costs: The architecture of Tablestore based on the LSM storage engine is superior in terms of costs compared with the architecture of database and table sharding. The Erasure Code (EC) enabled for data storage, hot and cold separated storage, and the pay-as-you-go billing method of cloud services can reduce the costs of DingTalk. Based on the current billing methods, migrating storage data to Tablestore could save DingTalk over 60% on storage costs.
  2. Strong Elasticity and Scalability: Any number of machines can be scaled out as needed, and the scale-out speed is fast without affecting the business. After the machines are prepared (including system cloning), the scale out can be completed in minutes. After handling the peak cluster traffic, container service allows for rapid scaling in to save resources.
  3. Zero O&M: Tablestore is a fully managed cloud service that does not require users to undertake any O&M work. Tablestore can ensure uninterrupted services during O&M. Tablestore uses a schema-free architecture, so users do not have to worry about the table structure adjustment caused by changes in business requirements.

3. Stability Assurance

3.1. Disaster Recovery Based on Primary and Secondary Clusters

To avoid service unavailability caused by a fault in a single data center, Tablestore supports disaster recovery based on primary and secondary clusters. The primary and secondary clusters are two independent clusters. Data is replicated asynchronously in the background, so this disaster recovery method cannot ensure strong data consistency, and the secondary cluster lags behind the primary by several seconds. When the primary cluster fails in a single data center, all traffic is switched to the secondary cluster, which takes over to keep the service available.

3.2. Highly Consistent Disaster Recovery Provided by 3AZ

Although primary/secondary disaster recovery can handle the failure of a single data center, it cannot achieve strong data consistency when the primary cluster goes down unexpectedly, and the switchover process requires coordination with the business (tolerating data inconsistency for a time and patching data afterward). The three availability zone (3AZ) disaster recovery mode solves these problems.

The 3AZ disaster recovery method deploys the system across three physical data centers. When data is written, the Apsara Distributed File System evenly distributes the three copies of the data across the three data centers. This ensures that if any one data center fails, a complete set of data remains available in the other two, so no data is lost. Tablestore returns a successful write only after three copies have been written to disks in the three data centers. Therefore, the 3AZ architecture ensures strong data consistency. When one of the data centers fails, Tablestore schedules the partitions to the other two data centers to continue to provide services. The 3AZ deployment mode has the following advantages over the primary/secondary clusters:

  1. Low Costs: The three copies in the underlying Apsara Distributed File System are evenly distributed across the three data centers, whereas primary/secondary clusters require an additional copy of the data. EC can be used later to further reduce costs.
  2. Strong consistency of data between data centers (RPO = 0).
  3. Smoother Switching: Since the data is strongly consistent, the business does not need additional processing logic during switching.
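
The difference between the two disaster recovery modes can be illustrated with a toy model. The sketch below is not Tablestore code: the class names, zone names, and in-memory dictionaries are hypothetical stand-ins. It only models the two properties described above, namely that asynchronous replication can lose recently acknowledged writes on failover, while a write acknowledged only after all three zones persist it survives the loss of any one zone. Write handling during an outage is not modeled.

```python
class PrimarySecondary:
    """Toy model of asynchronous primary/secondary replication."""

    def __init__(self):
        self.primary, self.secondary, self.backlog = {}, {}, []

    def write(self, key, value):
        self.primary[key] = value
        self.backlog.append((key, value))  # shipped later, in the background
        return True                        # acknowledged before replication

    def replicate(self):
        while self.backlog:
            key, value = self.backlog.pop(0)
            self.secondary[key] = value

    def failover_read(self, key):
        # After the primary is lost, only replicated data is visible.
        return self.secondary.get(key)


class Zone:
    """One data center holding a copy of every acknowledged write."""

    def __init__(self, name):
        self.name, self.alive, self.disk = name, True, {}


class ThreeAZStore:
    """Toy model of a write that is acknowledged after three zone copies."""

    def __init__(self):
        # Hypothetical zone names, for illustration only.
        self.zones = [Zone("az-a"), Zone("az-b"), Zone("az-c")]

    def write(self, key, value):
        for zone in self.zones:
            zone.disk[key] = value  # persist one copy per zone
        return True                 # acknowledge only after all copies exist

    def read(self, key):
        for zone in self.zones:
            if zone.alive and key in zone.disk:
                return zone.disk[key]
        raise KeyError(key)
```

In this model, a read after failover in `PrimarySecondary` returns nothing until `replicate()` has run, which corresponds to the replication delay, while `ThreeAZStore.read` succeeds no matter which single zone is marked as failed (RPO = 0).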

3.3. Load Balancing System

During system operation, design or service issues often lead to unbalanced data access and distribution, such as data skew and local hotspot partitioning. There are two big problems to be solved:

  1. How can these problems be discovered automatically?
  2. How can these problems be handled automatically after discovery?

Automation is emphasized here mainly because Tablestore, as a multi-tenant system, runs a large number of instances and tables. Relying on people for feedback and processing is not only costly but also impractical at this scale. To solve this problem, Tablestore has designed its own load balancing system. There are three main points:

  1. It collects detailed statistics on partitions for subsequent analysis.
  2. It computes split points based on traffic and the discovery of hotspots.
  3. It manages partition distribution mainly through multi-dimensional, fine-grained grouping.

On this basis, the load balancing system has also developed dozens of strategies to solve various types of hotspot issues.

This load balancing system can automatically identify and process 99% of the hotspot issues that occur in clusters, locating and resolving them within minutes. The system is currently running in all domains.
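
The article does not describe the actual algorithms, so the following is only a minimal sketch of the first two points under stated assumptions: per-key request counts are available for a range partition, a partition counts as hot when its traffic exceeds a multiple of the cluster mean, and the split point is the key that divides the traffic most evenly. The function names and the `factor` threshold are hypothetical.

```python
def is_hotspot(partition_traffic, all_traffic, factor=3.0):
    """Flag a partition whose traffic exceeds `factor` times the cluster mean."""
    mean = sum(all_traffic) / len(all_traffic)
    return partition_traffic > factor * mean


def choose_split_key(key_traffic):
    """Pick the key at which a range partition should be split.

    key_traffic: list of (key, request_count) pairs sorted by key.
    Returns the first key of the right-hand partition, chosen so that the
    left-hand side carries as close to half of the traffic as possible.
    Returns None if the partition holds fewer than two keys.
    """
    total = sum(count for _, count in key_traffic)
    best_key, best_gap, running = None, None, 0
    # Splitting before index i puts key_traffic[:i] on the left.
    for i in range(1, len(key_traffic)):
        running += key_traffic[i - 1][1]
        gap = abs(total / 2 - running)
        if best_gap is None or gap < best_gap:
            best_key, best_gap = key_traffic[i][0], gap
    return best_key
```

Splitting by traffic rather than by data size matters for skewed workloads: a small key range can carry most of the requests, and a size-based split would leave the hotspot on one side.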

3.4. Perfect Traffic Control System

1. Instance-Level and Table-Level Global Traffic Control: This mechanism handles traffic that exceeds expectations and prevents cluster traffic from exceeding its limit. As shown in the following figure, the traffic control architecture is on the left, and the traffic control model is on the right. Access traffic can be controlled at the instance level and at the table level. If instance-level traffic exceeds the upper limit, all requests of the instance are affected. You can also set a maximum traffic limit for each table so that a single table exceeding its limit does not affect other tables. Based on this system, Tablestore can precisely control the traffic of specific instances, tables, and operations, provide multi-dimensional personalized control measures, and guarantee the Service Level Agreement (SLA) of services. During the early stages of the pandemic, the cluster traffic of DingTalk exceeded the upper limit many times, and global traffic control solved this problem.

2. Active Partition Traffic Control: Tablestore partitions data by range. When business data is skewed, read and write hotspots often occur on individual partitions. Active partition traffic control mainly addresses the single-partition hotspot problem, preventing a hot partition from affecting other partitions. Access information about each partition is collected during the process, including but not limited to access traffic, the number of rows, and the number of cells.

The traffic control system implements a multi-level defense and guarantees service stability.
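
The instance-level and table-level model can be sketched with two layers of token buckets. This is an illustrative assumption, not the Tablestore implementation: the article does not name the rate-limiting algorithm, and the class names, rates, and the explicit `now` parameter (used instead of a wall clock to keep the example deterministic) are all hypothetical.

```python
class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = 0.0

    def available(self, now, cost=1):
        # Refill according to the time elapsed since the last check.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        return self.tokens >= cost

    def consume(self, cost=1):
        self.tokens -= cost


class InstanceLimiter:
    """Instance-level limit plus optional per-table limits."""

    def __init__(self, rate, capacity):
        self.instance = TokenBucket(rate, capacity)
        self.tables = {}

    def set_table_limit(self, table, rate, capacity):
        self.tables[table] = TokenBucket(rate, capacity)

    def allow(self, table, now, cost=1):
        buckets = [self.instance]
        if table in self.tables:
            buckets.append(self.tables[table])
        # Admit only if every applicable level has budget, then charge all.
        if all(bucket.available(now, cost) for bucket in buckets):
            for bucket in buckets:
                bucket.consume(cost)
            return True
        return False
```

With a table-level limit in place, a table that exhausts its own bucket is rejected before it can drain the instance budget, so other tables in the same instance keep being served. Once the instance bucket itself is empty, every table is rejected, which matches the behavior described for the instance-level limit.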

4. Extreme Performance Optimization

In the DingTalk scenarios, after a series of optimizations, under the same SLA and hardware resources, the read/write throughput improved by more than three times, and the read/write latency was reduced by 85%. The improvement in read/write performance also saves machine resources and reduces business costs. The following is a brief introduction to the main performance work.

4.1. Storage and Network Optimization

  1. Tablestore Proxy: As the unified access layer for user requests, it authenticates requests, verifies their validity, and forwards them. After all the preceding checks are performed, the system forwards the requests to the Table engine layer.
  2. Table Engine: The Table engine uses an LSM tree as its processing model to provide distributed table capabilities.
  3. Apsara Distributed File System: It uses a distributed file system designed to persist data during the data processing of the Table engine.

In the request links, both the first-hop network (Tablestore proxy to Table engine) and the second-hop network (Table engine to Apsara Distributed File System) were based on the kernel TCP network framework, so data was copied multiple times, consuming system CPU each time. For storage network optimization, Luna, a new-generation user-space Remote Procedure Call (RPC) framework, is used for data transmission on the first hop, reducing the number of kernel data copies and the system CPU consumption. On the second hop, for data persistence, the Remote Direct Memory Access (RDMA) network library provided by the Apsara Distributed File System is used to further reduce data transmission latency.

As the storage base, Apsara Distributed File System 2.0 is architected and engineered for next-generation networks and storage hardware and software, which lets it benefit from advances in both. It greatly improves performance on the data storage links, with read/write latency at the 100-microsecond level. Building on this optimized storage base, the Table engine has upgraded its data storage path and connected its I/O path to Apsara Distributed File System 2.0.

4.2. Iterator Optimization

  1. Multiple-versions of the data column
  2. Schema-free of columns, which is a wide table model
  3. Flexible deletion semantics, such as row deletion, column multi-version deletion, and specific column version deletion
  4. Expiration of Time To Live (TTL) data

During the implementation of each strategy, to achieve fast iteration, the iterator system was implemented in the form of a volcano model (as shown in the left half of the figure). One problem with this method is that the iterator nesting gets deeper as the strategies get more complex. Even if a strategy is not used, its iterator is still added to the system. The dynamic polymorphism in this process results in high function call costs. From the perspective of R&D, it is difficult to form a unified view, which also hinders follow-up optimization.

To optimize the data iteration performance, the context information required by each strategy is assembled into a strategy class, and the policy-related iterators are flattened from a global perspective. Thus, the performance loss of a nested iterator level in reading links is reduced. In multi-channel merging, the multi-column primary key jump capability provided by the column storage is used to reduce the comparison costs of multi-column merging and sorting, improving data reading performance.
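
The idea of flattening can be shown with a small sketch. It is not the Tablestore iterator code: it assumes only two strategies (TTL expiration and a maximum version count), a hypothetical cell layout of `(row, column, timestamp, value)`, and input sorted by row, column, and then newest timestamp first. The nested version chains one generator per strategy, while the flat version carries the strategy context in one object and applies every check in a single loop.

```python
from dataclasses import dataclass


def ttl_iter(cells, now, ttl):
    """Volcano layer 1: drop cells older than the TTL."""
    for cell in cells:
        if now - cell[2] <= ttl:
            yield cell


def max_versions_iter(cells, max_versions):
    """Volcano layer 2: keep only the newest versions of each column."""
    current, seen = None, 0
    for cell in cells:
        column = (cell[0], cell[1])
        if column != current:
            current, seen = column, 0
        seen += 1
        if seen <= max_versions:
            yield cell


def volcano_scan(cells, now, ttl, max_versions):
    # Each strategy wraps the previous one, so every cell crosses every layer.
    return list(max_versions_iter(ttl_iter(cells, now, ttl), max_versions))


@dataclass
class ScanPolicy:
    """Context for all strategies, gathered in one place."""
    now: int
    ttl: int
    max_versions: int


def flat_scan(cells, policy):
    # One loop applies all strategies, with no nested iterator calls.
    out, current, seen = [], None, 0
    for cell in cells:
        if policy.now - cell[2] > policy.ttl:
            continue
        column = (cell[0], cell[1])
        if column != current:
            current, seen = column, 0
        seen += 1
        if seen <= policy.max_versions:
            out.append(cell)
    return out
```

Both functions return the same cells. The flat form trades the convenience of composing independent layers for fewer calls per cell and a single place to reason about how the strategies interact.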

4.3. Lock Optimization

  1. In Implicit Transactions: The exclusive lock is converted into a read/write lock. Only requests that require transaction assurance are serialized through the write lock. Requests on regular paths that do not require transactions run in parallel under read locks.
  2. In the Read Path: The least recently used (LRU) function of the cache also used locks to protect its critical section. Under highly concurrent reads, the LRU strategy can easily cause lock contention. Therefore, a lock-free LRU cache policy is implemented, which improves the read performance.
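
The first point can be sketched with a minimal read/write lock. This is an illustration of the general technique, not the Tablestore implementation: the `ReadWriteLock` class and the `run_request` helper are hypothetical, and this simple version does not protect writers from starvation under a constant stream of readers.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers, or one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def run_request(lock, needs_transaction, action):
    """Serialize transactional requests; let all others share the lock."""
    if needs_transaction:
        lock.acquire_write()
        try:
            return action()
        finally:
            lock.release_write()
    lock.acquire_read()
    try:
        return action()
    finally:
        lock.release_read()
```

Requests that call `run_request` with `needs_transaction=False` never block one another. They wait only while a transactional request holds the write side.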

5. Practice-Based Functional Innovation

5.1. Primary Key Auto-Increment +1 Function

  1. If the user sends a message and the application server receives the message, the application server writes the message to Tablestore. If the message is successfully written, the auto-increment ID of the message will be returned.
  2. The application server queries all messages between the two push IDs based on the message ID that has been pushed last time. Then, the server pushes the queried message to the user.

One special scenario is that the user has only one unpushed message. If the returned message ID is guaranteed not only to increase but to increase by exactly one each time, then after the application server writes a message, it can compare the returned ID with the ID pushed last time. If the two differ by exactly one, the new message is the only unpushed message, so the server does not need to query Tablestore and can push the message to the user directly.

With the introduction of the “primary key auto-increment +1 function,” the number of application-side reads has been reduced by more than 40%, and the CPU consumption of both the Tablestore servers and the business servers has been reduced. As a result, the business costs for DingTalk have also been reduced.
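
The push logic on the application server can be sketched as follows. The `MessageTable` class is an in-memory stand-in for a single user's message queue, and all names are hypothetical. The only property it borrows from the article is that each write returns an ID exactly one greater than the previous one.

```python
class MessageTable:
    """In-memory stand-in for a table whose IDs increase by exactly one."""

    def __init__(self):
        self.rows = []        # rows[i] holds the message with ID i + 1
        self.range_reads = 0  # counts queries that the shortcut can avoid

    def put(self, body):
        self.rows.append(body)
        return len(self.rows)  # the auto-increment ID of the new message

    def get_range(self, after_id, up_to_id):
        """Return messages with IDs in (after_id, up_to_id]."""
        self.range_reads += 1
        return self.rows[after_id:up_to_id]


def push_after_write(table, body, last_pushed_id):
    """Write a message and return (messages_to_push, new_last_pushed_id)."""
    new_id = table.put(body)
    if new_id == last_pushed_id + 1:
        # Nothing lies between the last push and this write: skip the read.
        return [body], new_id
    # Earlier messages are still unpushed: fetch the whole gap.
    return table.get_range(last_pushed_id, new_id), new_id
```

The shortcut is only safe because the increment is exactly one. With IDs that merely increase, a difference between the two IDs would not reveal whether other unpushed messages exist in between, and the range query would always be required.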

5.2. Local Secondary Index (LSI)

  • Scenarios

In the scenario below, users can directly find all of the messages that mention them.

For the business side, a user message table stores all of the messages of the user. In this table, an LSI table of the latest messages that mention the user is created. Then, the user can scan the LSI table to obtain all the messages that mention the user. The two table structures are shown in the following figures:

The Main Message Table
The LSI Table
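
Because the figures are not reproduced here, the following sketch uses a hypothetical schema to show the relationship between the two tables: the main table is keyed by `(user_id, msg_id)`, and the index is keyed by `(user_id, mentions_user, msg_id)`, sharing the user ID as its first key column. Both structures live in memory and are updated together on every write, which is a simplification of how an index row is derived from a main-table row.

```python
import bisect


class MessageTableWithLSI:
    def __init__(self):
        self.main = {}   # (user_id, msg_id) -> row
        self.index = []  # sorted (user_id, mentions_user, msg_id) keys

    def put(self, user_id, msg_id, text, mentions_user):
        key = (user_id, msg_id)
        old = self.main.get(key)
        if old is not None:
            # The row is being overwritten: remove its stale index entry.
            self.index.remove((user_id, old["mentions_user"], msg_id))
        self.main[key] = {"text": text, "mentions_user": mentions_user}
        bisect.insort(self.index, (user_id, mentions_user, msg_id))

    def mentions(self, user_id):
        """Scan the index range of rows that mention the user."""
        start = bisect.bisect_left(self.index, (user_id, True))
        result = []
        for uid, flag, msg_id in self.index[start:]:
            if uid != user_id or not flag:
                break
            result.append(self.main[(uid, msg_id)]["text"])
        return result
```

Finding the mentions is a contiguous range scan over the index rather than a filter over every message of the user, and an index entry is visible as soon as the corresponding write returns.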
  • Building Index of Incremental Data

For the building of incremental data, we examined each module of the full link and optimized performance aggressively. It takes about 10 μs to build one index row:

  1. By analyzing the schema of the table and comparing it with the data written by the current user, unnecessary reads before writes are avoided.
  2. By recording the original value before data is updated in the log, the amount of I/O data written to logs after the index is successfully built is reduced.
  3. By batch indexing, all index rows of the index table are computed in one building process.
  4. By computing before the log is entered, the multi-thread parallel capability is fully utilized.
  • Building an Index of the Existing Data

Building an index on existing data is a challenge in the industry because an LSI requires consistency: once a row is written successfully, it can be read from the index table. At the same time, we want to minimize the impact on online writes while the index on existing data is being built, since writes cannot be stopped.

As shown in the following figure, traditional databases do not solve this problem properly. During the switch from building the index on existing data to building it on incremental data, tables must be locked, which blocks user writes and suspends the service. DynamoDB does not support building an LSI on an existing table, including one that already holds data.

The Method for the Building Index of Incremental Data of Traditional Databases
The Method for Building an Index of Existing Data in Tablestore

Tablestore uses a feature from its engine to skillfully solve this problem. It builds indexes for existing data and incremental data, and it can build the LSIs on the existing data without stopping the online writing. In addition, after the existing data indexes are built, the entire indexing process ends without affecting the business layer.

