How to Ensure a Pleasant Chat between Hundreds of Millions of Buyers and Sellers during Double 11
Step up the digitalization of your business during the Alibaba Cloud 2020 Double 11 Big Sale! Get new user coupons and explore over 16 free trials, 30+ bestselling products, and 6+ solutions for all your needs!
Double 11 Global Shopping Festival has come and gone. This year’s Double 11 had hundreds of millions of buyers and sellers. The success of Double 11 was only possible with the strong support from the Alibaba Cloud Product Technical Team. Tablestore, of Alibaba Cloud Storage Team, also provided support.
Ten months ago, DingTalk faced the rigid online office requirements caused by the sudden outbreak, posing significant challenges to the cost, maintenance, and stability of existing storage systems. After a lot of research, DingTalk chose Tablestore from the Alibaba Cloud Storage Team as the unified storage platform. Through the close cooperation between the DingTalk Team and the Tablestore Team, all of the message synchronization data and full message data of the IM system of DingTalk and Multi-tenant PaaS was migrated. During the 2020 Double 11 Shopping Festival, Tablestore provided full support for the Instant Messaging (IM) of Alibaba Group for the first time to ensure smoother communication between hundreds of millions of buyers and sellers.
This article describes the IM storage architecture of DingTalk and a series of works done by Tablestore to meet migration requirements in terms of stability, functionality, and performance.
1. The Message Systems of Tablestore and DingTalk
Alibaba Cloud Tablestore is an online data platform that can store, search, and analyze large amounts of structured data. It can scale up to PB-level storage for a single table and provide a service capacity of hundreds of millions of queries per second (QPS). Tablestore was developed by Alibaba Cloud Intelligence. It features strong performance, low costs, easy scalability, full-hosting, high reliability, high availability, hierarchical storage of cold and hot data, and flexible computing and analysis. It has rich practical experience and accumulation, especially in metadata, big data, monitoring, message, Internet of Things, trace tracing, and other types of applications. The following figure shows the product architecture of Tablestore.
DingTalk Message Systems
The DingTalk message systems include the DingTalk IM system and the IMPaaS system. The DingTalk IM system carries the IM services. As a multi-tenant IM platform, IMPaaS can be used by multiple teams within the Alibaba Group. Currently, it has supported major IM services in Alibaba Group, such as chat on DingTalk, the customer service chat on Taobao Mobile, Tmall, Qian Niu, Eleme, and AMAP. The two systems have the same architecture, so they will not be differentiated in this article. For convenience, they are described as the IM systems. The IM system is responsible for receiving, storing, and forwarding individual and group messages. It also supports session management, read notification, and status synchronization. The IM system is responsible for end-to-end communication.
2. Upgrading of Storage Architecture
The IM system involves storage systems in the architecture, including DB (InnoDB, X-Engine) and Tablestore. The DB is used to store sessions and messages. Tablestore is used for the synchronization protocol that is responsible for the message synchronization push. During the outbreak, the number of DingTalk messages increased rapidly. This made the DingTalk Team realize that using the traditional relational database engine to store streaming messages had great limitations in terms of write performance, scalability, and storage costs. The distributed NoSQL with the log-structured merge (LSM) architecture is a better choice.
The preceding figure shows the original architecture of the IM system. In this system, a message is stored twice after it is sent:
- Full Message Storage: Step 4 in the figure shows the message stored in DB for persistence. This message is mainly used to call the system to display the homepage information of the app. Full messages are permanently retained.
- Synchronous Protocol Storage: As shown in Step 5, the message is stored in Tablestore and then read, merged, and pushed to users. The data of synchronous protocol storage is retained for a specified number of days.
After the architecture upgrades, the dependency on DB is removed from full message storage, and all data is written into Tablestore. After the upgrading, the IM system storage depends only on Tablestore, bringing the following benefits:
- Low Costs: The architecture of Tablestore based on the LSM storage engine is superior in terms of costs compared with the architecture of database and table sharding. The Erasure Code (EC) enabled for data storage, hot and cold separated storage, and the pay-as-you-go billing method of cloud services can reduce the costs of DingTalk. Based on the current billing methods, migrating storage data to Tablestore could save DingTalk over 60% on storage costs.
- Strong Elasticity and Scalability: Any number of machines can be scaled out as needed, and the scale-out speed is fast without affecting the business. After the machines are prepared (including system cloning), the scale out can be completed in minutes. After handling the peak cluster traffic, container service allows for rapid scaling in to save resources.
- Zero O&M: Tablestore is a fully managed cloud service that does not require users to undertake any O&M work. Tablestore can ensure uninterrupted services during O&M. Tablestore uses a schema-free architecture, so users do not have to worry about the table structure adjustment caused by changes in business requirements.
3. Stability Assurance
Stability is the foundation of everything and the biggest concern for selecting a business. In a complex distributed system, every component may have problems, ranging from broken optical cable in the data center to Bit Flip in the network interface controller (NIC). Therefore, every missed problem may cause catastrophic failure and trigger serious public opinion risks. The core of the stability effort is to fully consider these possibilities and provide fault tolerance as much as possible. As a basic product of Alibaba Cloud Intelligence, Tablestore is deployed in all service areas of Alibaba Cloud worldwide, ensuring its stability. While providing services for DingTalk, it has been specially enhanced in stability.
3.1. Disaster Recovery Based on Primary and Secondary Clusters
To avoid service unavailability caused by a fault on a single data center, Tablestore supports disaster recovery based on primary and secondary clusters. Primary and secondary clusters are two independent clusters. Data is replicated in the asynchronous background mode. This disaster recovery method cannot ensure strong data consistency and causes a delay of several seconds. When the primary cluster fails in a single data center, all traffic is switched to the secondary cluster, which provides services to ensure service availability.
3.2. Highly Consistent Disaster Recovery Provided by 3AZ
Although primary/secondary disaster recovery can solve the disaster recovery problem of a single data center, strong data consistency cannot be achieved when the primary cluster is down unexpectedly and the switchover process requires coordination with the business (tolerating data inconsistency for a time and subsequent data patching.) The three available zone (3AZ) disaster recovery modes can solve these problems.
The 3AZ disaster recovery deployment method is to deploy the system on three physical data centers. When writing data, the Apsara Distributed File System evenly distributes the three copies of the data across the three data centers. This can ensure that when the failure occurs in any data center, all of the data in the other two data centers must be available to prevent any data loss. Each time Tablestore returns a successful data write, three copies are written to disks in three data centers. Therefore, the 3AZ architecture ensures strong data consistency. When one of the data centers fails, Tablestore schedules the partitions to the other two data centers to continue to provide services. The 3AZ deployment mode has the following advantages over the primary/secondary clusters:
- Low Costs: Three copies of the underlying Apsara Distributed File System are evenly distributed in three data centers, while one copy of data is required for the primary/secondary clusters. EC can be used later to reduce costs.
- Strong consistency of data between data centers (RPO = 0).
- Smoother Switching: Since the data is strongly consistent, the business does not need additional processing logic during switching.
3.3. Load Balancing System
In Tablestore, table storage is logically divided into partitions for high-scalability. The table is partitioned (range partitions) based on the partition key range, and these partitions are scheduled to different machines to provide services. To achieve optimal throughput and performance of the entire system, you need to balance the load of each machine in the system. In other words, the access volume and data volume distributed on each machine need to be balanced.
During system operation, design or service issues often lead to unbalanced data access and distribution, such as data skew and local hotspot partitioning. There are two big problems to be solved:
- How can these problems be discovered automatically?
- How can these problems be handled automatically after discovery?
The emphasis on automation here is mainly because Tablestore, as a multi-tenant system, runs a large number of instances and tables. If it relies on people for feedback and processing, it is not only costly but also has no operability. To solve this problem, Tablestore has designed its own load balancing system. There are three main points:
- It collects detailed statistics on partitions for subsequent analysis.
- It computes split nodes based on traffic and the discovery of hotspots.
- Multi-dimensional fine-grained grouping is mainly used to manage partition distribution.
On this basis, the load balancing system has also developed dozens of strategies to solve various types of hot issues.
This load balancing system can automatically identify and process 99% of hot issues that occur in clusters. It provides the capability to automatically resolve problems in minutes and locate and process problems in minutes. The system is currently running in all domains.
3.4. Perfect Traffic Control System
Tablestore provides a complete set of traffic control systems, which are listed below:
1. Instance-Level and Table-Level Global Traffic Control: The purpose is to solve the problem when the traffic exceeds the expectation and prevents the cluster traffic from exceeding the limit. As shown in the following figure, the traffic control architecture is on the left, and the traffic control model is on the right, where access traffic can be controlled at the instance and table levels, respectively. If the instance-level traffic exceeds the upper limit, all of the requests of the instance will be affected. You can also set the maximum table traffic at the table level to avoid affecting other tables when the traffic volume of a single table exceeds the upper limit. Based on this system, Tablestore can precisely control the traffic of the specific instances, tables, and operations, provide multi-dimensional personalized control measures, and guarantee the Service Level Agreement (SLA) of services. During the early stages of the pandemic, the cluster traffic of DingTalk exceeded the upper limit many times, and the global traffic control solved this problem.
2. Active Partition Traffic Control: Tablestore partitions data by range. When business data is skewed, reading and writing the partition hotspots often occur. The active partition traffic control mainly solves the single-partition hotspot problem to prevent single-partition hotspots from affecting other partitions. Access information about each partition is counted during the process, including but not limited to access traffic, the number of rows, and the number of cells.
The traffic control system implements a multi-level defense and guarantees service stability.
4. Extreme Performance Optimization
Tablestore is a typical architecture that separates storage and computing resources. Tablestore has many dependent components and complex request links. Therefore, performance optimization cannot be limited to a specific module. It is necessary to have a global perspective to optimize the components on the request links to achieve better full-link performance.
In the DingTalk scenarios, after a series of optimization, under the same SLA and hardware resources, the read/write throughput was improved by more than three times, and the read/write latency was reduced by 85%. The improvement in read/write performance also saves more machine resources and reduces business costs. The following is a brief introduction to the main performance work.
4.1 Storage and Network Optimization
On the core datalinks of Tablestore, there are three main system components:
- Tablestore Proxy: As the unified access layer for user requests, it authenticates requests, verifies their validity, and forwards them. After all the preceding checks are performed, the system forwards the requests to the Table engine layer.
- Table Engine: The Table engine uses an LSM tree as its processing model to provide distributed table capabilities.
- Apsara Distributed File System: It uses a distributed file system designed to persist data during the data processing of the Table engine.
In the request links, the first-hop network (Tablestore proxy to Table engine) and the second-hop network (Table engine to Apsara Distributed File System) are based on the kernel tcp network framework. Data will be copied, and the system CPU will be consumed multiple times. In terms of storage network optimization, the Luna, the new generation of user-state Remote Procedure Call (RPC) framework, is used as the data transmission mode for the first hop, reducing the data copy times of the kernel and the system CPU consumption. In the second-hop of data persistence, the Remote Direct Memory Access (RDMA) network library provided by the Apsara Distributed File System is used to further reduce data transmission latency.
As a storage base, the Apsara Distributed File System 2.0 is applied to the next-generation networks, storage software, and hardware for architecture design and engineering optimization. It releases the dividend of the development of software and hardware technology. It achieves great performance improvements on the data storage links, and the read/write latency has reached 100 microsecond-level. Based on the optimized architecture of the storage base, the Table engine has upgraded its data storage path, and has been connected to the Apsara Distributed File System 2.0 to connect to the I/O path.
4.2. Iterator Optimization
The table model of Tablestore supports a wide range of data storage semantics:
- Multiple-versions of the data column
- Schema-free of columns, which is a wide table model
- Flexible deletion semantics, such as row deletion, column multi-version deletion, and specific column version deletion
- Expiration of Time To Live (TTL) data
During the implementation of each strategy, to achieve fast iteration, the iterator system is implemented in the form of a volcano model (as shown in the figure on the left half.) One problem caused by this method is that the iteration level gets deeper as the strategy gets more complex. Even if the strategy is not used, it will be added to the system. The dynamic polymorphism in this process results in the high costs of function call. From the perspective of R&D, it is difficult to form a unified perspective, which also brings obstacles to the follow-up optimization.
To optimize the data iteration performance, the context information required by each strategy is assembled into a strategy class, and the policy-related iterators are flattened from a global perspective. Thus, the performance loss of a nested iterator level in reading links is reduced. In multi-channel merging, the multi-column primary key jump capability provided by the column storage is used to reduce the comparison costs of multi-column merging and sorting, improving data reading performance.
4.3. Lock Optimization
In addition to storage format improvement, one of the major problems in memory processing is the performance loss caused by lock competition. Locks are ubiquitous in all modules of the system. Apart from explicit locks for synchronization or concurrent critical sections, there are also optimistic lock performance problems for atomic operations that are easily ignored, such as the global metric counters and the Compare and Swap (CAS) instruction conflicts. The memory allocation of small objects may also cause a serious lock conflict in the underlying memory allocator. The lock structure has been optimized in the following aspects:
- In Implicit Transactions: The exclusive lock is converted into a read/write lock. Only requests that require transaction assurance are serialized through a write lock. Requests that do not require transactions on regular paths are parallelized through maximum read locks.
- In the Read Path: Locks are also introduced into the least recently used (LRU) function of cache for critical section protection. In high concurrent reading, the LRU strategy can easily cause lock competition. Therefore, a lock-free LRU cache policy is implemented, which improves the read performance.
5. Practice-Based Functional Innovation
As a NoSQL product developed by Alibaba, Tablestore gives full play to its advantages in the cooperation with DingTalk and develops multiple new features in combination with DingTalk services, further reducing CPU consumption, optimizing performance, and upgrading services without affecting the business.
5.1. Primary Key Auto-Increment +1 Function
The DingTalk synchronization protocol uses the primary key auto-increment function of Tablestore. This function returns an auto-increment ID after a user successfully writes a row of data. A brief introduction of how DingTalk uses the auto-increment ID to realize synchronous message push is listed below:
- If the user sends a message and the application server receives the message, the application server writes the message to Tablestore. If the message is successfully written, the auto-increment ID of the message will be returned.
- The application server queries all messages between the two push IDs based on the message ID that has been pushed last time. Then, the server pushes the queried message to the user.
One special scenario is that the user has only one unpushed message. If the returned message ID is guaranteed to be auto-incremented, it also ensures that the number is automatically increased by one. Then, after the application server writes a message, if it finds that the returned ID is the only one that is different from the ID pushed last time, it does not need to query from Tablestore and can directly push the message to the user.
With the introduction of the “primary key auto-increment +1 function,” the number of application-side reads has been reduced by more than 40%, and the CPU consumption of Tablestore servers and your own business servers has been reduced. Therefore, the business costs for DingTalk have also been reduced.
5.2. Local Secondary Index (LSI)
Prior to this, Tablestore already had the asynchronous global secondary index (GSI) feature. However, in the DingTalk scenarios, most queries are based on the data of a user. Indexes only need to take effect on the data of a user. To address this, we have developed a highly consistent LSI feature and built dynamic indexes for existing data.
In the scenario below, users can directly find all of the messages that mention them.
For the business side, a user message table stores all of the messages of the user. In this table, an LSI table of the latest messages that mention the user is created. Then, the user can scan the LSI table to obtain all the messages that mention the user. The two table structures are shown in the following figures:
- Building Index of Incremental Data
For the building of incremental data, we examined each module of the full link and made extreme performance optimization. It takes about 10us to create a row of an index:
- By analyzing the schema of the table and comparing it with the data written by the current user, it can save unnecessary reading before writing.
- By recording the original value before data is updated to the log, the amount of I/O data written in logs after the index is successfully built.
- By batch indexing, all index rows of the index table are computed in one building process.
- By computing before the log is entered, the multi-thread parallel capability is fully utilized.
- Building an Index of the Existing Data
The building of existing data is a challenge in the industry since LSI requires consistency. If an index is written successfully, it can be read from the index table. At the same time, during the building index of existing data, we hope to minimize the impact on online writing since writing can’t be stopped.
As shown in the following figure, traditional databases do not solve this problem properly. During the conversion from the building of existing data to incremental data, tables need to be locked, which forbids user writing, resulting in service suspension. DynamoDB does not support the construction of LSIs, including the existing data on the table.
Tablestore uses a feature from its engine to skillfully solve this problem. It builds indexes for existing data and incremental data, and it can build the LSIs on the existing data without stopping the online writing. In addition, after the existing data indexes are built, the entire indexing process ends without affecting the business layer.
The process of digital transformation is not “changing the engine while driving the aircraft,” but “changing the engine when the aircraft is accelerating.” During rapid business development, ways to ensure the smooth and efficient upgrading of the storage system and the whole business system will be a problem that every Chief Information Officer (CIO) needs to solve.