Five Ways Alibaba Cloud OSS Helped DingTalk Cope with Traffic Peaks


By Alibaba Cloud Storage

“Backed by its capabilities in elastic scaling, cross-province disaster recovery, multi-tenant management, and transfer acceleration, Object Storage Service (OSS) has helped DingTalk switch write-in regions with one click, distribute businesses to multiple regions, and increase the speed of cross-region concurrent image processing and document preview by tenfold during the prevention and control period of the coronavirus disease (COVID-19).” — Jin Xi, DingTalk senior technical expert

Just as rustling leaves do not announce a coming storm, early news of a virus did not foretell an epidemic. COVID-19, which quickly spread around the world, has affected many offline businesses. To cope with the epidemic, government organizations, enterprises, and schools have implemented work-from-home policies, making video conferencing, online education, and Internet-based data analysis the most pressing needs of the moment. For example, 200 million office workers from 10 million enterprises and organizations have been working collaboratively on the DingTalk platform.

With such a huge amount of traffic and incremental data, DingTalk relies on Alibaba Cloud Object Storage Service (OSS) to quickly scale storage capacity while ensuring business continuity and multi-tenant data isolation. The following five Q&As explain how OSS helps DingTalk cope with traffic peaks.

Intelligent and Highly Accessible Cloud-based Storage

Question: Why is OSS used?

Answer: OSS is a storage product designed for Internet businesses.

As an online collaboration platform, DingTalk needs cloud storage for live streaming, Ding Drive, and image processing. By treating storage as a service from the outset, OSS offers a new, epoch-making solution for storing large-scale, fast-growing Internet data. Taking the Internet data storage overlooked by traditional storage products as its breakthrough point, OSS was built for the Internet and mobile Internet (3G, 4G, and 5G) and designed for scenarios such as storing web pages, videos, images, audio, and documents. Technically, OSS exposes content over HTTP/HTTPS through an S3-compatible interface or the native OSS API, accessible from both the Internet and the mobile Internet. It provides a globally shared data pool for applications and is well suited as the underlying platform for Internet applications such as short videos, images, and music. OSS stores and provides fast access to massive volumes of data; on top of it, you can build a unified data analysis platform and exploit the value of your data, making storage smarter.

Question: Why is Tablestore used?

Answer: Alibaba Cloud Tablestore (OTS) is a structured big-data storage platform built on shared storage that offers high performance, low cost, scalability, and full management. It supports efficient computing and analysis of Internet and IoT data.

OTS carries traffic for many Alibaba departments and business lines, including Alibaba Cloud, Cainiao, Fliggy, Ant Financial, Tmall, DingTalk, IoT, and AI Labs. Its key characteristics:

- It has stood the test of the Double 11 Shopping Festival for many years.
- It separates compute and storage in both the technical architecture and the product layer.
- It is built on Alibaba Cloud's in-house distributed multi-model database.
- Deeply integrated with computing engines, it has an extensive computing ecosystem.

Against the backdrop of industries embracing the Internet, infrastructure cloudification requires a new Internet-based technical system. With the digital transformation of enterprises and organizations and the rapid development of the Industrial Internet, a technology upgrade path better suited to Internet scenarios is needed. Alibaba Cloud has been rooted in the Internet since its founding; by continually tackling the challenges of the Internet era, it has built the underlying paradigms and technical context of the digital economy era.

Scalability of OSS

Question: How can OSS, a stateful product with massive volumes of data stored, be quickly scaled to meet DingTalk business needs?

Answer: Region-level quick resource scheduling.

OSS splits every object into metadata and data and stores them separately, which enables quick resource scheduling at the region level. When a user has burst bandwidth or queries-per-second (QPS) requirements that the cluster holding the old data cannot meet, OSS schedules new data written to a bucket, by prefix or by proportion, across different clusters or zones in the region to quickly satisfy the demand.
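The scheduling idea can be sketched as follows. This is a hypothetical illustration, not the real OSS internals: cluster names and weights are invented, and a hash of the object key stands in for OSS's actual placement logic.

```python
import hashlib

# Illustrative sketch of region-level write scheduling: new objects are
# distributed across clusters in a region by key prefix hash, honoring a
# configured proportion (weight) per cluster.
CLUSTERS = ["cluster-a", "cluster-b", "cluster-c"]
WEIGHTS = [1, 2, 2]  # e.g. steer more new writes to the newer clusters

def pick_cluster(object_key: str) -> str:
    """Deterministically map an object key to a cluster, honoring weights."""
    slots = [c for c, w in zip(CLUSTERS, WEIGHTS) for _ in range(w)]
    digest = int(hashlib.md5(object_key.encode()).hexdigest(), 16)
    return slots[digest % len(slots)]
```

Because the mapping is deterministic per key, repeated writes to the same object land on the same cluster, while new keys spread out in proportion to the weights.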

Does separating metadata from data hurt the efficiency of users' ListObject operations or break strong consistency? No. OSS keeps the metadata of all objects in a bucket in the same cluster, ensuring high-performance, strongly consistent ListObject operations. Users also need not worry that a single cluster cannot hold the metadata of massive numbers of objects: the efficient KV index layer of OSS scales metadata processing horizontally. In actual production, a single bucket already holds more than 1 trillion objects.
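The point can be made concrete with a toy model (illustrative only, not the OSS implementation): a bucket's metadata lives in a single sorted index, while the data bytes may sit in different clusters. Listing walks only the index, so results stay ordered and consistent regardless of where the data was scheduled.

```python
import bisect

class Bucket:
    """Toy bucket: sorted metadata index in one place, data location per key."""

    def __init__(self):
        self._keys = []   # sorted object keys (the "meta" index)
        self._meta = {}   # key -> (cluster, size)

    def put_object(self, key, cluster, size):
        if key not in self._meta:
            bisect.insort(self._keys, key)   # keep the index sorted
        self._meta[key] = (cluster, size)

    def list_objects(self, prefix=""):
        """Ordered listing served entirely from the metadata index."""
        i = bisect.bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append(self._keys[i])
            i += 1
        return out
```

Listing is sorted and prefix-filtered no matter which cluster holds each object's data.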

Another issue region-level resource scheduling must solve is the user's access pattern. Newly written data can be scheduled quickly at the region level, but the access pattern, and the ratio of reads hitting new versus old data, determines how many requests actually land on the new cluster. OSS can migrate old data between clusters and zones transparently to users, but when the existing data volume is large, migration takes a long time, so read bandwidth cannot be rescheduled flexibly that way. OSSBrain resolves this problem: its multidimensional user profiling quickly determines a user's access characteristics and predicts how read traffic will shift when writes are redirected, enabling better resource scheduling. For example, a DingTalk service needed to support several times its previous access bandwidth during the epidemic. OSSBrain's analysis showed that more than 90% of the data read by the service had been written within the previous 30 minutes. Based on this characteristic, OSS quickly distributed newly written data across multiple clusters to meet the service's bandwidth requirements.
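The kind of access-pattern analysis attributed to OSSBrain can be sketched as a simple metric (a hypothetical simplification, not the real profiling system): given write timestamps and a read log, measure what share of reads hit data written within the last 30 minutes. A high share means redirecting new writes to fresh clusters also redirects most read traffic.

```python
THRESHOLD_SECONDS = 30 * 60  # "recently written" window from the article

def recent_read_ratio(write_time, reads):
    """write_time: {key: epoch_seconds}; reads: [(key, epoch_seconds), ...].

    Returns the fraction of reads that hit data written within the window.
    """
    if not reads:
        return 0.0
    recent = sum(
        1 for key, t in reads
        if t - write_time.get(key, float("-inf")) <= THRESHOLD_SECONDS
    )
    return recent / len(reads)
```

For the DingTalk service above, this ratio exceeding 0.9 is what justified scheduling only the new writes rather than migrating old data.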

Disaster Recovery Capabilities

Question: Disaster recovery is critical to infrastructure. How did DingTalk quickly build a disaster recovery system across four provinces on top of the OSS disaster recovery capabilities and ensure continuous business operation?

Answer: Cross-region replication and image-based back-to-origin of OSS ensure region-level disaster recovery.

OSS cross-region replication lets users quickly build cloud storage services with region-level disaster recovery. The following figure shows three data centers built on OSS. Users write data to the primary data center, and OSS cross-region replication continuously synchronizes it to the secondary data centers. With the OSS image-based back-to-origin feature, all three data centers can serve reads. In an actual deployment, the architecture must be adjusted so that objects are never overwritten, or so that eventual consistency of objects is acceptable.
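The deployment described above can be sketched in miniature (an illustrative model with invented names, not the OSS feature itself): writes go to the primary region and replicate asynchronously, and a read served by a secondary region falls back to the primary, mirroring image-based back-to-origin, when the object has not replicated yet.

```python
class Region:
    """Toy regional store with optional back-to-origin fallback."""

    def __init__(self, name, origin=None):
        self.name, self.origin, self.store = name, origin, {}

    def put(self, key, data):
        self.store[key] = data

    def get(self, key):
        if key in self.store:
            return self.store[key]
        if self.origin is not None:        # back-to-origin fetch
            data = self.origin.get(key)
            if data is not None:
                self.store[key] = data     # cache the fetched copy locally
            return data
        return None

def replicate(src, dst, keys):
    """Cross-region replication: copy selected objects from src to dst."""
    for k in keys:
        dst.store[k] = src.store[k]
```

A secondary region can thus serve a read even before replication catches up, at the cost of one cross-region fetch.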

OTS also supports a replication mechanism: if the same changes are applied to two databases in the same order, both reach the same final state. Real-time data consumption channels provide unified full and incremental data synchronization to meet disaster recovery requirements.

During the epidemic, facing a sharp increase in resource demand, DingTalk deployed OSS in four regions: China (Zhangjiakou-Beijing Winter Olympics), China (Shenzhen), China (Shanghai), and China (Chengdu). If OSS fails in one region, OSS in the other regions can take over all service requirements.

Performance Assurance through Multi-Tenant Segregation

Question: As a shared cloud service by nature, OSS serves many gaming, live-streaming, and education customers in China in addition to DingTalk. During the epidemic, the resource demands of many customers increased sharply. How does OSS ensure that customers sharing its massive resource pool do not affect each other?

Answer: Continuous segregation of online multi-tenant data.

OSS is a shared service by nature: hundreds of thousands of customers share its massive resource pools, so isolating resources between tenants is especially important. Thanks to the always-on bandwidth, QPS, and CPU QoS capabilities of OSS, burst access from live-streaming and education customers during the epidemic did not affect other tenants. The following figure shows the read-bandwidth monitoring chart of a customer with high bandwidth requirements; OSS continuously held the customer's bandwidth at the preset value.
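Holding a tenant's bandwidth at a preset value is classically done with a token bucket; the sketch below shows the idea, with illustrative rates rather than anything OSS-specific: each tenant's requests are capped at a configured byte rate, so a burst from one tenant cannot crowd out others sharing the pool.

```python
class TokenBucket:
    """Per-tenant rate limiter: tokens refill at `rate` bytes/second."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def allow(self, nbytes, now):
        """Return True if a request of nbytes may proceed at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False
```

Requests beyond the budget are rejected (or, in a real system, queued), which is why the monitored bandwidth curve stays flat at the preset value during a burst.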

High Speed Access to Files and Documents

Question: During the epidemic, many enterprises work across multiple locations and even across oceans. How does DingTalk ensure fast document sharing and preview?

Answer: OSS transfer acceleration.

Backed by data centers distributed worldwide, OSS resolves and routes a user's request for a DingTalk storage space (bucket) to the nearest data center, then uses optimal networks and protocols to reach the bucket, implementing transfer acceleration.

OSS transfer acceleration speeds up DingTalk file uploads and downloads and is especially valuable for uploading large files (at the GB or TB level).
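The routing idea behind transfer acceleration can be sketched as nearest-access-point selection (a toy model: the cities, access-point names, and latencies below are invented for illustration): the client is directed to the access point with the lowest measured latency, which then carries traffic toward the bucket's region over the provider's backbone.

```python
# client city -> {access point: measured round-trip latency in ms}
LATENCY_MS = {
    "Singapore": {"ap-southeast": 12, "cn-hangzhou": 80, "us-west": 190},
    "Frankfurt": {"eu-central": 9, "cn-hangzhou": 160, "us-west": 110},
}

def nearest_access_point(city):
    """Pick the access point with the lowest latency for this client."""
    table = LATENCY_MS[city]
    return min(table, key=table.get)
```

In the real service, this selection happens at DNS resolution time, so the client simply connects to the accelerated endpoint and is routed transparently.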

With the preceding five capabilities, OSS has provided stable and secure infrastructure services for DingTalk and other customers during the epidemic, and digital services such as cloud computing have played an important role in fighting it. In the era of the Industrial Internet, 5G, and AI, online collaborative work, as represented by DingTalk, will enter a fast lane of development; this is an inevitable industry trend. After the epidemic subsides, DingTalk will remain a booster for efficient enterprise collaboration, and cloud computing will be important fuel for that booster.

Bonus: How does DingTalk ensure the online synchronization of 1 billion messages every second?

DingTalk is an IM application, and its core is the messaging system, which must handle both message synchronization and data storage. To ensure that the receiving end processes all messages completely and in order, each message a user sends must be assigned a sequence number (SN) or ID, and these SNs or IDs must increase monotonically. A common architecture requires a global auto-increment ID generator, a queue service, and a historical message storage service.

To guarantee strictly increasing global SNs or IDs, data must be locked at runtime. If the volume of messages sent to one user is large, messages to other users in the same queue may be blocked, causing latency.

Tablestore (OTS) provides a lightweight message model that stores massive numbers of queues with unlimited topics and supports auto-increment columns. By simplifying the components of the original architecture and unifying message synchronization and storage, OTS yields a lightweight system that is easy to operate and maintain. Auto-increment operations in the new architecture are processed inside OTS, ensuring serial message writes without the previous restriction of writing through a single global queue. The resulting service supports concurrent reads and writes at millions of TPS and PB-level message storage.
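The contrast with a global ID generator can be sketched as follows (an illustrative in-memory model, not the Tablestore API): each conversation queue gets its own auto-increment sequence, so assigning SNs for one busy queue never blocks writes to other queues, while each receiver still sees its own messages in strict order.

```python
import itertools
from collections import defaultdict

class MessageStore:
    """Toy message store with a per-queue auto-increment sequence column."""

    def __init__(self):
        self._counters = defaultdict(lambda: itertools.count(1))
        self._queues = defaultdict(list)   # queue_id -> [(sn, payload)]

    def append(self, queue_id, payload):
        sn = next(self._counters[queue_id])   # per-queue, no global lock
        self._queues[queue_id].append((sn, payload))
        return sn

    def read_since(self, queue_id, last_sn):
        """Receiver pulls everything after the last SN it has processed."""
        return [(sn, p) for sn, p in self._queues[queue_id] if sn > last_sn]
```

Ordering is guaranteed per queue, which is all an IM receiver needs, and independent queues scale out without contention.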

How can burst access peaks and hotspots be processed?

The range partition policy provides the solution: tables are partitioned by ranges of their partition-key values, and partitions are scheduled evenly across hosts based on partition size and access frequency. For example, when the backend detects an access hotspot in a chat group, the system splits the hot partition on a host and re-schedules the resulting partitions onto multiple hosts, achieving automatic load balancing and meeting the business's high-concurrency requirements.
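Hotspot splitting can be sketched like this (a toy model: integer key ranges, a made-up load threshold, and a midpoint split stand in for the real scheduler's heuristics): any partition whose load exceeds the threshold is split at its midpoint key, and the halves can then be placed on different hosts.

```python
HOT_THRESHOLD = 1000  # requests per monitoring interval (illustrative)

def split_hot_partitions(partitions):
    """partitions: list of (start_key, end_key, load). Returns the new list,
    with each hot partition split at its midpoint key."""
    out = []
    for start, end, load in partitions:
        if load > HOT_THRESHOLD and end - start > 1:
            mid = (start + end) // 2
            out.append((start, mid, load // 2))        # assume load splits evenly
            out.append((mid, end, load - load // 2))
        else:
            out.append((start, end, load))
    return out
```

Run periodically, this converges toward partitions whose load fits on a single host, which is the automatic load balancing the article describes.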

While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity.

