How Alibaba Overcame Challenges to Build a Service Mesh Solution for Double 11

Deployment Architecture

Before getting into the main topic, let’s first look at the deployment architecture of Alibaba’s core applications for the annual Double 11 shopping event. The core applications are shown in the graphic below. This article focuses on the Mesh implementation for the remote procedure call (RPC) protocol between Service A and Service B.

Challenges

Our selected core applications for Double 11, which were all implemented in Java, faced the following challenges:

1. Implementing Mesh on Applications Without Upgrading the Relevant SDK

When we decided to implement Mesh in the core applications for Double 11, the RPC SDK version that our Java applications depended on had already been finalized. There was no time to develop and roll out a new RPC SDK version adapted to Mesh. The team therefore had to figure out how to implement the Service Mesh in each application without upgrading the related SDK.

2. Supporting Complex Service Governance Functions for E-Commerce Business in a Short Window of Time

Routing

The computing and networking scenarios in Alibaba’s e-commerce business call for a wide range of routing features. In addition to unitization, environment isolation, and other routing policies, we also need to route services based on the method name, call parameters, and application name of RPC requests. Alibaba’s internal Java RPC framework supports these routing policies by running an embedded Groovy script. Users configure a Groovy routing script from a template in the O&M console. When the SDK initiates a call, this script is executed to apply the routing policy.
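To make the idea of routing on the method name, call parameters, and application name more concrete, here is a minimal Python sketch. It is not Alibaba’s Groovy routing script or its RPC framework API. Every name in it (RpcRequest, RouteRule, select_group, the application and group names) is a hypothetical chosen for illustration, and the first-match-wins ordering is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RpcRequest:
    """Hypothetical view of an outgoing RPC call as seen by a routing rule."""
    caller_app: str          # application name of the caller
    method: str              # RPC method name
    args: Tuple = ()         # call parameters


@dataclass(frozen=True)
class RouteRule:
    """A named predicate plus the provider group it routes to (illustrative)."""
    name: str
    matches: Callable[[RpcRequest], bool]
    target_group: str


def select_group(request: RpcRequest, rules: List[RouteRule],
                 default_group: str = "default") -> str:
    """Return the target group of the first matching rule (assumed semantics)."""
    for rule in rules:
        if rule.matches(request):
            return rule.target_group
    return default_group


def _large_order_id(request: RpcRequest) -> bool:
    # Route on a call parameter: the first argument is assumed to be an order id.
    return (request.method == "queryOrder"
            and len(request.args) > 0
            and isinstance(request.args[0], int)
            and request.args[0] >= 1_000_000)


# Hypothetical rule set: one rule per routing dimension named in the text.
RULES = [
    RouteRule("by-application", lambda r: r.caller_app == "trade-center", "gray"),
    RouteRule("by-parameter", _large_order_id, "vip"),
    RouteRule("by-method", lambda r: r.method == "createOrder", "write-pool"),
]
```

With these rules, a call from the hypothetical trade-center application goes to the gray group regardless of method, while other callers fall through to the parameter and method rules and finally to the default group.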

Throttling

For performance reasons, the Service Mesh solution implemented at Alibaba does not use the Mixer component in Istio. Instead, throttling relies on Sentinel, a component widely used within Alibaba and by several other companies in China. This approach not only works with open-source Sentinel, but also reduces migration costs for Alibaba’s internal users, because it is directly compatible with their existing throttling configurations. To facilitate Mesh integration, multiple teams jointly developed a C++ version of Sentinel. The entire throttling function is implemented through Envoy’s Filter mechanism. In Envoy, a Filter is an independent functional module that processes requests. We built a Sentinel Filter on top of the Dubbo protocol, and every request is processed by it. The configuration information required for throttling is obtained from Nacos through Pilot and delivered to Envoy through the xDS protocol.
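The shape of this design, a filter that checks each request against rules pushed from a control plane, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not use the Envoy Filter API, the C++ Sentinel library, or xDS. The class and function names are hypothetical, the fixed one-second window is a simplification rather than Sentinel’s actual algorithm, and update_rules merely stands in for a configuration push.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Request:
    """Hypothetical RPC request seen by the filter chain."""
    service: str
    method: str


class QpsLimitFilter:
    """Illustrative throttling filter: fixed one-second window per resource."""

    def __init__(self, limits: Dict[str, int],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limits = dict(limits)      # resource name -> max requests per second
        self._clock = clock              # injectable so the sketch is testable
        self._windows: Dict[str, Tuple[int, int]] = {}  # resource -> (second, count)

    def update_rules(self, limits: Dict[str, int]) -> None:
        """Stand-in for a rule push from the control plane (assumption)."""
        self._limits = dict(limits)

    def on_request(self, request: Request) -> bool:
        """Return True to pass the request on, False to reject it."""
        resource = f"{request.service}.{request.method}"
        limit = self._limits.get(resource)
        if limit is None:
            return True                  # no rule configured for this resource
        current = int(self._clock())
        start, count = self._windows.get(resource, (current, 0))
        if start != current:
            start, count = current, 0    # a new one-second window begins
        if count >= limit:
            self._windows[resource] = (start, count)
            return False
        self._windows[resource] = (start, count + 1)
        return True


def run_filter_chain(filters: List[QpsLimitFilter], request: Request) -> bool:
    """Pass the request through each filter in order; stop at the first rejection."""
    return all(f.on_request(request) for f in filters)
```

Keeping the rules in a plain dictionary that can be swapped at runtime mirrors the idea that throttling configuration is delivered to the proxy rather than compiled into it, so limits can change without redeploying the application.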

3. Handling Excessive Envoy Resource Overhead

One of the core problems Envoy was designed to solve is the observability of services. For this reason, Envoy has embedded a large number of statistics from the very beginning to make services easier to observe.

4. Decoupling Business and Infrastructure to Upgrade Infrastructure Without Affecting Business

One of the core advantages of a Service Mesh is that it completely decouples infrastructure from business logic so that the two can evolve independently. To realize this advantage, the Sidecar needs a hot upgrade capability so that business traffic is not interrupted during an upgrade. This poses a major challenge to the solution design and technical implementation.

Performance Data

Publishing performance data without care can lead to controversy and misunderstandings, because many variables affect the results in different scenarios. For example, concurrency, queries per second (QPS), and payload size all have a critical impact on the final numbers. Envoy has never officially published data like that listed in this article, because its author, Matt Klein, was worried about causing misunderstandings. It is worth noting that, due to time constraints, our current Service Mesh implementation is not optimal and does not represent our final solution. For example, two routing problems exist on the consumer side. We chose to share this information to show our progress and current status.

Outlook

With the emergence of cloud-native technology, Alibaba has committed to building a future-oriented infrastructure based on cloud-native technologies. In developing this infrastructure, we started out by working with several open-source products, and we have since produced several open-source solutions ourselves. Open source has been and continues to be our focus. By contributing our own solutions to the larger community, we hope to make both our technology and cloud-native technologies in general more widely used.

Alibaba Cloud
