Technical Systems Behind Alibaba Cloud OSS’s Leading SLA

Alibaba Cloud
12 min read · Aug 24, 2020

1) Overview

In June 2020, Alibaba Cloud Object Storage Service (OSS) tightened the availability guarantee in its service level agreement (SLA) tenfold, cutting the permitted error rate to one tenth of its previous level. We were able to do this because of the technical expertise we have accumulated over more than a decade. Our new availability guarantee marks a global first and is 10 to 20 times higher than those of other cloud vendors, as shown in the following figure:

For Standard Zone-Redundant Storage (ZRS) in OSS, the availability guaranteed in the SLA was increased from 99.95% to 99.995%. This means we will pay compensation if server errors are returned for more than five out of every 100,000 requests.

2) SLA for OSS

2.1 Common Availability Metric (Annual Failure Time)

Annual failure time is a common way to describe availability in the industry. Data centers are assigned different levels, T1 to T4, which have the following availability metrics:

  • T1 data centers have a service availability of 99.671% and an average annual failure time of 28.8 hours.
  • T2 data centers have a service availability of 99.741% and an average annual failure time of 22 hours.
  • T3 data centers have a service availability of 99.982% and an average annual failure time of 1.6 hours.
  • T4 data centers have a service availability of 99.995% and an average annual failure time of 0.4 hours.

The availability of a network service is usually expressed through its unavailability duration. For example, 99.999% availability corresponds to an annual failure time of about five minutes. The following table describes the annual failure time for each availability level.
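The conversion between an availability percentage and annual downtime can be checked directly. A minimal sketch, assuming the usual 365-day-year convention:

```python
# Convert an availability percentage into the maximum annual downtime,
# assuming a 365-day year (the usual industry convention).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def annual_downtime_minutes(availability_pct: float) -> float:
    """Maximum annual failure time implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.95, 99.99, 99.999):
    print(f"{pct}% -> {annual_downtime_minutes(pct):.1f} min/year")
```

At 99.999%, this yields roughly 5.3 minutes of downtime per year, matching the "about five minutes" figure above.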

For an instance-based cloud service that provides compute instances, such as Alibaba Cloud Elastic Compute Service (ECS), service availability is directly tied to instance uptime. Therefore, its availability is also defined by the annual failure time.

2.2 SLA Metrics for OSS

As a cloud-based resource access service, OSS provides serverless API calls instead of instances. Therefore, we cannot calculate the service availability of OSS by the annual failure time. Alibaba Cloud OSS uses the error rate (the number of failed requests as a proportion of the total number of requests) to calculate the service availability.

2.2.1 Error Rate Per Five Minutes

Error rate per five minutes = Number of failed requests in five minutes/Total number of valid requests in five minutes × 100%

Using a longer time interval to calculate a request error rate makes cloud services look better. This is because a longer interval includes more requests, usually resulting in a lower error rate. Therefore, we calculate the error rate every five minutes to hold ourselves to a higher standard. Redundancy is the key to the design of a high-availability system. Five minutes is the typical troubleshooting time for machines in the industry because it enables quick machine restoration and a lower system error rate.
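The effect of the measurement interval is easy to demonstrate: the same burst of failures produces a much smaller error rate when averaged over a longer window. A sketch with hypothetical request counts:

```python
def error_rate(failed: int, total: int) -> float:
    """Request error rate for one measurement window, as a percentage."""
    return 0.0 if total == 0 else failed / total * 100

# Hypothetical numbers: a 5-minute burst of 500 failures out of 10,000
# requests, inside an otherwise clean hour of 120,000 requests.
five_min = error_rate(500, 10_000)    # 5.0% in the bad window
one_hour = error_rate(500, 120_000)   # ~0.42% over the whole hour
print(five_min, one_hour)
```

Measuring in five-minute windows surfaces the 5% spike that an hourly average would dilute by more than an order of magnitude.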

2.2.2 Service Availability Based on Error Rates Per Five Minutes in a Service Cycle

Service availability = (1 − Σ Error rates per five minutes in a service cycle / Total number of five-minute periods in the service cycle) × 100%

OSS charges monthly fees, so the service cycle is one calendar month. The service availability in a month is obtained by summing the error rates per five minutes in this service cycle, dividing the sum by the total number of five-minute periods in the service cycle (30 × 24 × 60/5 = 8640, for a month containing 30 days), and then subtracting the average error rate from 1.

According to this formula, if the error rate per five minutes is too high, the service availability declines. Therefore, improving the request success rate per five minutes is crucial to increasing availability.
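The monthly formula above can be sketched in a few lines; the per-window error rates in the example are hypothetical:

```python
# Monthly service availability from per-five-minute error rates, following
# the formula above. A 30-day month has 30 * 24 * 60 / 5 = 8,640 windows.
def monthly_availability(error_rates_pct, total_windows=8_640):
    """Availability (%) = (1 - sum(window error rates) / windows) * 100."""
    return (1 - sum(r / 100 for r in error_rates_pct) / total_windows) * 100

# Illustration: three imperfect windows in an otherwise clean month.
rates = [1.0, 0.5, 2.0]
print(f"{monthly_availability(rates):.4f}%")
```

Windows with no errors contribute nothing to the sum, so a month with only a few mildly degraded windows still yields very high availability.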

2.2.3 Comparison of Annual Failure Time and Request Error Rate Approaches

If the annual failure time is 26 minutes, the resulting service availability is 99.995%. Now assume we use the request error rate approach for OSS, with the error rate calculated every five minutes. A 26-minute outage fully covers at least five five-minute windows, so the error rate is 100% for those windows; assume all other windows are error-free. The maximum availability is then calculated as 1 − 5 × 100%/8640 = 1 − 0.058% = 99.942%. Therefore, the request error rate approach for OSS is stricter.
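The two approaches can be compared numerically for the same 26-minute full outage:

```python
# Compare the two availability approaches for a 26-minute full outage.
WINDOWS_PER_MONTH = 30 * 24 * 60 // 5  # 8,640 five-minute windows

# Annual-failure-time view: 26 minutes of downtime in a 365-day year.
uptime_view = (1 - 26 / (365 * 24 * 60)) * 100

# Per-window error-rate view: at least 5 windows at a 100% error rate
# within one 30-day service cycle.
error_rate_view = (1 - 5 * 1.0 / WINDOWS_PER_MONTH) * 100

print(f"uptime view:     {uptime_view:.3f}%")      # ~99.995%
print(f"error-rate view: {error_rate_view:.3f}%")  # ~99.942%
```

The same incident costs roughly 0.005% under the annual-downtime view but about 0.058% under the per-window error-rate view, which is why the latter is the stricter standard.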

2.2.4 SLA Commitment for OSS

Using the preceding formula, we can calculate the actual availability for one calendar month. According to the SLA of OSS, if the service availability requirement is not met, we will pay the promised compensation to improve service availability for customers.

2.3 SLA Comparison of Different Object Storage Services

An analysis of the SLAs of cloud vendors such as AWS, Azure, GCS, Alibaba Cloud, Tencent Cloud, and Huawei Cloud shows that the availability provided by Alibaba Cloud OSS is 10 to 20 times higher than its competitors. In addition, Alibaba Cloud OSS adopts the most rigorous approach of error rates per five minutes. This exemplifies our “customer first” philosophy. Among the foregoing vendors, one public cloud manufacturer with roots in the traditional storage industry still calculates availability by available time, just as for traditional offline storage.

3) OSS Availability System Construction

Alibaba Cloud OSS is a cloud service built on more than a decade of R&D. With availability as a core design goal, we formulated the following availability system.

3.1 LRS and ZRS Architecture

Alibaba Cloud OSS provides Locally Redundant Storage (LRS), which is deployed within a single zone, and ZRS, which is deployed across three zones. These storage types share the same logical architecture, which mainly comprises the following modules: Apsara Name Service and Distributed Lock Synchronization System, Apsara Distributed File System, OSS metadata (Youchao distributed key-value (KV) indexing), OSS servers, and network load balancing.

In terms of the physical architecture, ZRS offers disaster recovery at the data center level by distributing replicas of user data to multiple zones in a single region. In this way, when a data center is unavailable due to a fire, typhoon, flood, power failure, or network disconnection, we can still provide services with high consistency. In Alibaba Cloud OSS, neither service interruption nor data loss occurs during failover. This meets the strict demands of zero recovery time objectives (RTOs) and zero recovery point objectives (RPOs) for critical service systems. ZRS can provide 99.9999999999% data durability and 99.995% service availability.

3.2 IDC Redundancy Design

To achieve higher availability, we need a sound redundancy design at the physical layer. The following technologies are used:

  • Distance and latency design of ZRS in multiple zones: When we deploy a public cloud, we follow the design rules of Internet Data Centers (IDCs) and network architectures as well as the zone location standards of Alibaba Cloud. In particular, we emphasize latency and distances to conform to the design requirements of the multiple zones in OSS.
  • Redundancy for power supplies and cooling systems: As a cloud service deployed across multiple regions, OSS must cope with natural disasters, power supply problems, and air conditioner failures every year. This requires us to incorporate dual-channel mains supply, a diesel generator power supply, and continuous cooling capability in the design of IDCs.
  • Network redundancy: As a public cloud service, OSS needs to provide access for the Internet and Virtual Private Cloud (VPC) networks, in addition to internal network connections of the distributed system. Therefore, we need to properly design redundancy for both purposes.
  • External networks: Redundancy for Internet access is implemented by using Border Gateway Protocol (BGP) and static bandwidth to access multiple operators through the Internet. Redundancy for VPC access is implemented through Alibaba Cloud.
  • Internal networks: OSS is a distributed storage service where multiple servers are connected by an internal network. Specifically, redundancy is achieved by using a layered design for intra-cabinet switches, inter-cabinet switches, and inter-data center switches in an IDC. In this way, even if a network device fails, the system can still run normally.
  • Servers: OSS uses a distributed hybrid server family to maximize cost-effectiveness for users. Based on the requirements of distributed systems and software-defined storage, we use general-purpose servers (commodity servers) with redundant network interfaces rather than customized hardware with dual-control redundancy in traditional storage arrays.

3.3 Distributed System Design

3.3.1 Apsara Name Service and Distributed Lock Synchronization System

Established as one of the core modules at the underlying layer of Apsara in 2009, the system provides services including consistency, distributed locking, and message notifications. In performance, scalability, and O&M, it is superior to open-source software (such as ZooKeeper and etcd) with similar functions.

Apsara Name Service and Distributed Lock Synchronization System has a two-layer architecture: the backend modules maintain consistency, while the frontend hosts distribute client traffic.

  • The frontend hosts perform load balancing through virtual IP addresses (VIPs) and maintain persistent connections with clients. This ensures that client requests are evenly distributed to the backend and hides backend switching from clients. The frontend also delivers message notifications efficiently.
  • The backend consists of multiple Paxos groups, each made up of several servers running a consensus protocol at its core. Every resource (such as a file or lock) requested by a client is arbitrated by the Paxos group it is homed to, and the Paxos distributed consensus protocol synchronizes state to ensure the consistency and persistence of resources. Multiple Paxos groups provide better scalability.

Together, multi-VIP redundancy, transparent frontend switching, and redundant Paxos groups for consensus arbitration enable fast failover and high availability for coordination.
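One way such a system can deterministically home each resource to a backend Paxos group is stable hashing. The sketch below is illustrative only, not the actual Apsara implementation; the group count and path names are assumptions:

```python
import hashlib

# Illustrative sketch: each resource (file or lock path) is
# deterministically homed to one backend Paxos group, so every client and
# frontend host agrees on which group arbitrates it.
NUM_PAXOS_GROUPS = 4  # assumed group count for the example

def home_group(resource_path: str, groups: int = NUM_PAXOS_GROUPS) -> int:
    """Stable hash of the resource path to a Paxos group index."""
    digest = hashlib.sha256(resource_path.encode()).digest()
    return int.from_bytes(digest[:8], "big") % groups

# Any caller computes the same homing for the same resource.
print(home_group("/locks/table-42"))
```

Because the mapping depends only on the resource path, no coordination is needed to agree on the arbitrating group, and adding groups only remaps a bounded share of resources.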

3.3.2 Apsara Distributed File System

As a second-generation distributed storage system developed by Alibaba, Apsara Distributed File System 2.0 has reached its full potential in performance, scale, and cost-effectiveness and can be accessed in more ways. It further enhances the automation and intelligence of system deployment and O&M, while inheriting the high reliability, high availability, and strong data consistency of Apsara Distributed File System 1.0.

3.3.3 Fast Switching of Youchao Distributed KV Metadata in OSS

The Youchao distributed KV system provides distributed KV metadata for OSS. One of the earliest systems developed by Alibaba Cloud, it has accumulated years of large-scale cluster experience serving OSS. In 2014, it incorporated multi-instance redundancy by dividing KV pairs into partition groups composed of multiple replicas. Within a partition group, a leader node is elected using a consensus protocol to serve external requests. When the leader node fails or a network partition occurs, a new leader is quickly elected to take over the partition's services. This improves the availability of OSS metadata, as shown in the following figure.

3.3.4 OSS QoS

The service layer of OSS focuses on data organization and function implementation. Because the underlying Apsara Distributed File System and Youchao provide the distributed capabilities, the OSS service layer is designed in a stateless manner, so failover can be performed quickly and availability improves. However, due to the multi-tenant nature of OSS, quality of service (QoS) monitoring and isolation are the keys to ensuring availability for tenants.
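A common building block for this kind of multi-tenant isolation is a per-tenant token bucket; the sketch below is a generic illustration (the actual OSS QoS mechanism is not documented here), with the rate and burst values chosen arbitrarily:

```python
import time

# Illustrative per-tenant token-bucket admission control: each tenant gets
# its own request budget, so one tenant's burst cannot exhaust capacity
# shared with others.
class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per tenant

def admit(tenant: str, rate: float = 100.0, burst: float = 200.0) -> bool:
    bucket = buckets.setdefault(tenant, TokenBucket(rate, burst))
    return bucket.allow()
```

Requests from a tenant whose bucket is empty are rejected (or deprioritized) without affecting other tenants' budgets.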

3.3.5 Network Load Balancing

OSS is subject to a large number of access requests. Therefore, load balancing is implemented at the access layer. In load balancing, VIPs are bound to provide high-availability services and connect to frontend clusters of OSS. This enables quick failover upon the failure of any module to ensure availability. OSS offers high-traffic and high-performance access based on load balancing expertise from the Alibaba Cloud network product team.

3.4 Security Protection

Due to its HTTP- and HTTPS-based data access services, OSS is prone to attacks from the Internet and VPC networks, such as distributed denial-of-service (DDoS) attacks. Protection against attacks is crucial to ensuring OSS availability. One purpose of cyberattacks is to compromise the services of OSS, which reduces overall service availability.

Hackers can attack OSS by trying to use up OSS bandwidth (bandwidth congestion) or exhaust OSS computing resources (resource exhaustion). Potential attacks include network traffic-based attacks (L3 and L4 DDoS attacks) for bandwidth congestion, and L4 CC attacks (connection resources) and L7 attacks (application resources) for resource exhaustion. The following table classifies potential attack types.

3.5 Intelligent O&M

3.5.1 Storage Operations and Maintenance System

Storage Operations and Maintenance System, a management and control platform of Alibaba Cloud OSS, is intended for internal development, O&M, and operation users. It has been made available for five major services: OSS, File Storage NAS, Tablestore, Log Service (SLS), and Function Compute. For these services, it provides features such as real-time data monitoring, intelligent O&M management, rapid alert response, and security auditing. It also strives to empower security services with accuracy, efficiency, and intelligence.

To better manage OSS availability metrics and improve O&M capabilities, Storage Operations and Maintenance System is designed for monitoring and alerting, analysis and diagnosis, and problem resolution based on fault identification, location, and troubleshooting. It also provides monthly SLA management to monitor a monthly list of underperforming SLA metrics and determine the reasons for this underperformance. This allows us to continuously improve our SLA metrics.

3.5.2 OSS Brain

OSS Brain is a smart O&M platform that aims to leverage data and algorithms to ensure OSS stability and enable online O&M and operation. It analyzes online data to provide intelligent decision-making services, including machine isolation, active online warning, user profiling, anomaly detection, resource scheduling, and user isolation. It implements agile intelligent O&M and fast error isolation to improve availability.

3.6 High-Availability Solution

OSS is a regional service and may be unavailable due to a regional fault. To offer higher service availability, OSS provides a high-availability solution for active geo-redundancy, as shown below:

  • Data is written only into a primary region to facilitate development. The data is replicated to a secondary region by using OSS’s cross-region replication feature so that all the data is available in the secondary region.
  • Data can be read from the nearby region to reduce latency. Because data is written only to the primary region and replicated asynchronously, replication may still be in progress when a user reads from the secondary region. In that case, the read pulls from the origin, fetching the data from the primary region.

This enables quick failover in different regional fault scenarios, offering an RPO in seconds and ensuring service application continuity.

  • When the secondary region is out of service, the upper-layer services quickly fail over to the other two regions, with traffic evenly distributed between them. In this way, services are restored immediately.
  • If the primary region is out of service, a new primary region (for example, Region 2) is selected and cross-region replication from Region 2 to Region 3 is enabled. In this way, write requests can be switched to the new primary region, and read requests can be switched to the other region. OSS version control, together with the guarantee that no new data is written to the failed region, ensures data consistency upon failover from the primary region.
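The read-with-fallback path described above can be sketched as follows. `fetch`, `NotReplicatedYet`, and the region names are hypothetical stand-ins for an OSS GET request and its failure mode, not real OSS APIs:

```python
# Illustrative sketch of the read path under active geo-redundancy: read
# from the nearby secondary region first, and fall back to the primary
# region if asynchronous replication has not yet delivered the object.
class NotReplicatedYet(Exception):
    pass

def fetch(region: str, key: str, store) -> bytes:
    # Stand-in for an OSS GET against one region's bucket.
    if key not in store[region]:
        raise NotReplicatedYet(f"{key} not yet in {region}")
    return store[region][key]

def read_with_fallback(key: str, store,
                       secondary="region-2", primary="region-1") -> bytes:
    try:
        return fetch(secondary, key, store)  # low-latency nearby read
    except NotReplicatedYet:
        return fetch(primary, key, store)    # pull from the origin

# The object has been written to the primary but not yet replicated.
store = {"region-1": {"photo.jpg": b"\xff\xd8"}, "region-2": {}}
print(read_with_fallback("photo.jpg", store))  # falls back to region-1
```

Once replication catches up, the same call is served entirely from the secondary region.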

3.7 Management Mechanism

OSS also provides the following management mechanisms to improve service availability:

  • Inventory management: Public cloud services are asset-heavy. Self-management of the supply chain inventory, intelligent prediction of resource demands, and on-demand service provisioning are the foundations of service availability.
  • Resource usage management: OSS is a cloud storage service that monitors the usage of capacity, bandwidth, queries per second (QPS) capacity, and other resources to perform dynamic and intelligent scheduling and improve system availability.
  • Culture of stability: We hold ourselves to high stability standards in development, design, testing, and O&M.

We provide services in support of Double 11, serving millions of users. Coping with the Double 11 traffic peaks for many years, OSS has continuously improved its product architecture, features, and stability. Furthermore, OSS successfully serves millions of users from Alibaba Cloud’s public cloud service system, handling the loads from various industries. Based on our years of experience, we have formulated a mechanism for continuous availability improvement.

4) Future Work on Service Availability of OSS

Although OSS has tightened its SLA availability guarantee tenfold, we must continue to improve availability in scenarios such as abnormal upgrades, super hotspots, and high-frequency attacks.

Learn more about Alibaba Cloud Object Storage Service by visiting the product page.
