Sentinel Go 1.0 Release: High Availability Flow Control Middleware for Double 11

By Zhao Yihao (Suhe), director of the Sentinel open source project

Released by Alibaba Developer

Introduction

For example, in a production environment, you may have encountered a variety of unstable situations, such as:

• During big promotions, the instantaneous peak traffic may cause the system to exceed the maximum load limit, and the system crashes, resulting in the user unable to place an order.

• The traffic of unexpected hotspot items broke through the cache, and database was defeated, accounting for normal traffic.

• The caller is dragged down by an unstable third-party service, the thread pool is full, and calls accumulate, resulting in the whole call link stuck.

These unstable scenarios may lead to serious consequences, but we are likely to ignore high-availability protection related to traffic and dependency. You may want to ask: How can we prevent the impact of these unstable factors? How do we implement high-availability protection for traffic? How can we ensure that our services are “stable”? At this time, we must turn to Sentinel, a high-availability protection middleware used in Alibaba’s Double 11. During the Tmall Double 11 Shopping Festival this year, Sentinel perfectly ensured the stability of tens of thousands of Alibaba’s services during peak traffic, and the Sentinel Go 1.0 GA release was officially announced recently. Let’s take a look at the core scenarios of Sentinel Go and what the community has explored in the cloud native area.

Sentinel Introduction

Earlier this year, the Sentinel community announced the launch of Sentinel Go, which will provide native support for high-availability protection and fault tolerance for microservices and basic components of the Go language. This marks a new step for Sentinel towards diversification and cloud native mode. In the past six months, the community has released nearly 10 versions, gradually aligning the core high-availability protection and fault tolerance capabilities. Meanwhile, the community is continuously expanding the open-source ecosystem by working with dubbo-go, and other open-source communities such as Ant Financial Modular Open Smart Network (MOSN) community.

Recently, Sentinel Go 1.0 GA version was officially released, which marks the official start of the production phase. Sentinel Go 1.0 is aligned with the high-availability protection and fault tolerance capabilities of Java, including traffic throttling, traffic shaping, concurrency control, circuit breaking, system adaptive protection, and hotspot protection. In addition, this version covers the mainstream open-source ecosystems, providing adapters of common microservice frameworks such as Gin, gRPC, go-micro, and dubbo-go, and providing extended support for dynamic data sources, such as etcd, Nacos, and Consul. Sentinel Go is evolving in the cloud native direction. In V 1.0, some cloud native aspects have been explored, including Kubernetes Custom Resource Definition (CRD) data source, Kubernetes Horizontal Pod Autoscaler (HPA), etc. For the Sentinel Go version, the expected throttling scenario is not limited to the microservice application itself. The Go language ecosystem accounts for a high proportion of cloud native basic components. However, these cloud native components often lack fine-grained and adaptive protection and fault tolerance mechanisms. In this case, you can combine some component extension mechanisms and use Sentinel Go to protect the stability of your own business.

The bottom layer of Sentinel uses the high-performance sliding window to count metrics within seconds, and combines the token bucket, leaky bucket, and adaptive flow control algorithm to demonstrate the core high-availability protection capability.

So, how can we use Sentinel Go to ensure the stability of our microservices? Let us look at several typical application scenarios.

Core Scenarios of High-Availability Protection

1. Traffic Throttling and Allocation

In the scenario of web portal or service provider, the service provider must be protected from being overwhelmed by traffic peaks. At this time, the traffic throttling is often performed based on the service capabilities of the service provider or based on a specific service consumer. We can evaluate the affordability of the core interfaces based on the pre-stress testing and configure the traffic throttling rules of queries per second (QPS) mode. When the number of requests per second exceeds the set threshold, excess requests are automatically rejected.

The following is a simplest configuration example of Sentinel traffic throttling rules:

_, err = flow.LoadRules([]*flow.Rule{
{
Resource: "some-service", // Tracking point resource name
Count: 10, // If the threshold is 10, it is the second-level dimension statistics by default, which indicates that the request cannot exceed 10 times per second.
ControlBehavior: flow.Reject, // The control effect is direct rejection, the time interval between requests is not controlled, and the requests are not queued
},
})

2. Warm-Up Traffic Throttling

3. Concurrency Control and Circuit Breaking

Modern microservice architectures are distributed and consist of many services. Different services call each other, forming complex call chains. The preceding problems may be magnified in call chains. If a link in the complex chain is unstable, a cascade may eventually cause the unavailability of the entire chain. Sentinel Go provides the following capabilities to prevent unavailability caused by unstable factors such as slow calls:

Concurrency control (isolation module): As a lightweight isolation method, it controls the concurrent number of some calls (that is. the number of ongoing calls) to prevent too many slow calls from crowding out normal calls.

Circuit breaking (circuit breaker module): It automatically breaks the circuit of unstable weak dependency calls. It temporarily cuts off the unstable calls and avoids the overall crash caused by local instability.

Sentinel Go’s fault tolerance feature is based on the circuit breaker mode, which temporarily cuts off service calls, waits for a while and tries again, when an unstable factor occurs in a service (such as a long response time and a higher error rate). This not only prevents impacts on unstable services, but also protects service callers from crashing. Sentinel supports two break policies: Based on response time (slow call ratio) and based on error (error ratio/number of errors), it can effectively protect against various unstable scenarios.

Note that the circuit breaker mode is generally applicable to weak dependency calls, that is, circuit breaking does not affect the main service process, and developers need to design the fallback logic and return value after circuit breaking. In addition, even when service caller use the circuit breaking mechanism, we still need to configure the request timeout time on the HTTP or Remote Procedure Call (RPC) client for additional protection.

4. Hotspot Protection

Let’s look at another example. When we want to restrict the frequency at which each user calls an API, it is not a good idea to use the API name and userId as the tracking point resource name. In this scenario, we can use WithArgs(xxx) to transmit the userId as a parameter to the API tracking point and configure hotspot rules to restrict the call frequency of each user. Sentinel also supports independent traffic throttling of specific values for more precise traffic throttling. Just as other rules, hotspot traffic throttling rules can be dynamically configured based on dynamic data sources.

The RPC framework integration modules, such as Dubbo and gRPC, provided by Sentinel Go will automatically carry the list of parameters that RPC calls in the tracking points. We can configure hotspot traffic throttling rules for corresponding parameter positions. Note that if you need to configure the traffic throttling of a specific value, due to the type system restrictions, only basic and string types are supported currently.

Sentinel Go hotspot traffic throttling is implemented based on the cache termination mechanism and token bucket mechanism. Sentinel uses a cache replacement mechanism , such as Least Recently Used (LRU), Least Frequently Used (LFU), or Adaptive Replacement Cache (ARC) policy to identify hotspot parameters and the token bucket mechanism to control the PV quantity of each hotspot parameter. The current Sentinel Go version uses the LRU strategy to collect hotspot parameters, and some contributors in the community have submitted a Project Review (PR) to optimize the elimination mechanism. In subsequent versions, the community will introduce more cache elimination mechanisms to meet requirements in different scenarios.

5. Adaptive System Protection

The adaptive system protection strategy of Sentinel draws lessons from the idea of the TCP Bottleneck Bandwidth and Round-trip propagation time (BBR) algorithm, and combines several dimensions of monitoring indicators, such as the system load, CPU usage, portal QPS of services, response time, and concurrency. The adaptive system protection strategy balances the inbound system traffic with the system load, maximizing system stability while maximizing throughput. System rules can be used as a backup protection policy for the entire service to ensure that the service does not fail. This method is effective in CPU-intensive scenarios. At the same time, the community is also continuously improving the effect and applicable scenarios of adaptive traffic throttling in combination with automatic control theory and reinforcement learning. In future versions, the community will also introduce more experimental adaptive traffic throttling policies to meet more availability scenarios.

Cloud Native Exploration

1. Kubernetes CRD Data-Source

apiVersion: datasource.sentinel.io/v1alpha1
kind: FlowRules
metadata:
name: foo-sentinel-flow-rules
spec:
rules:
- resource: simple-resource
threshold: 500
- resource: something-to-smooth
threshold: 100
controlBehavior: Throttling
maxQueueingTimeMs: 500
- resource: something-to-warmup
threshold: 200
tokenCalculateStrategy: WarmUp
controlBehavior: Reject
warmUpPeriodSec: 30
warmUpColdFactor: 3

Kubernetes CRD data-source module link: https://github.com/sentinel-group/sentinel-go-datasource-k8s-crd

In the future, the community will further improve the definition of CRD files and discuss the abstraction of standards related to high-availability protection with other communities.

2. Service Mesh

3. Kubernetes HPA Based on Sentinel Metrics

Of course, the Sentinel-based elastic solution is not a panacea. It is only applicable to some specific scenarios, such as Serverless services with fast startup. The elastic solution cannot solve the stability problem for services that start slowly or in scenarios that are not affected by the capacity of the service. For example, the capacity of the dependent database is insufficient. The elastic solution may even aggravate the deterioration of services.

Let’s Start Hacking

In the development of this version, we have two new committers — @ sanxu0325 and @ luckyxiaoqiang, who bring the warm-up traffic throttling mode, Nacos dynamic data sources and a series of function improvements and performance optimizations. They are very active in helping the community answer questions and review code. Congratulations! In the future versions, the community will continue to explore and evolve towards the cloud native and adaptive intelligent directions. Everyone is welcome to join the contributor group to participate in the future evolution of Sentinel and create infinite possibilities. We encourage any form of contribution, including but not limited to:

• Bug fixes
• New features/improvements
• Dashboards
• Documents/websites
• Test cases

Developers can select interesting issues from the good first issue list on GitHub to participate in the discussion and make contributions. We will focus on the developers who actively participate in the contribution, and the core contributors will be nominated as committers to lead the development of the community together. You are welcome to exchange your questions and suggestions through GitHub issue, Gitter, or DingTalk group (group number: 30150716). Now start hacking!

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.