ChaosBlade x SkyWalking: High Availability Microservices Practices
By Ye Fei (GitHub Account: @tiny-x), an open-source community enthusiast and ChaosBlade committer, participating in promoting the ecosystem construction of chaos engineering in ChaosBlade.
Preface
In a distributed system architecture, it is difficult to assess the impact of a single fault on the entire system. There are many kinds of service components, the dependencies between services are intricate, and request call chains are long. Imperfect basic services, such as monitoring alarms and log records, can also make fault response and troubleshooting difficult. Therefore, it is a great challenge to build a highly available distributed system.
Chaos Engineering (CE) was developed specifically to overcome these challenges. Through injecting faults into the system in a controllable range or environment, developers can observe the system behavior and locate the system defects. By doing so, the ability and confidence to deal with chaos in a distributed system due to unexpected conditions are developed. Thus, the stability and high availability of the system are continuously improved.
Implementing Chaos Engineering involves formulating a chaos experiment plan, defining steady-state metrics, making assumptions about the system's fault-tolerant behavior, executing the chaos experiment, checking the steady-state metrics, and so on. The whole process therefore requires a reliable, easy-to-use, and scenario-rich chaos experiment tool to inject faults. It also requires complete distributed tracing and system monitoring tools to trigger the emergency response and early warning plans, locate faults quickly, and observe system metrics throughout the process.
In this article, we introduce the chaos experiment tool ChaosBlade and the distributed system monitoring tool SkyWalking. We also discuss high-availability microservices practices with ChaosBlade and SkyWalking through a microservices case.
Tool Introduction
ChaosBlade
ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments. It provides extensive fault scenarios to help distributed systems improve fault tolerance and recoverability. It can also implement the injection of underlying-layer faults, and ensure business continuity during migration to the cloud or Cloud Native system. ChaosBlade is easy-to-use, non-invasive, and highly scalable. It can continuously improve system stability and availability through fault injection within a controllable range or environment.
ChaosBlade is easy to use and supports a wide range of experiment scenarios, including the following:
- Basic resources: Experimental scenarios such as CPU, memory, network, disk, and process.
- Java applications: Databases, caches, messages, JVMs, and microservices. Any method can be specified for fault injection.
- C++ application: Scenarios like injecting delay, variables, and tampered returned values to specified methods or rows of code.
- Docker container: Experimental scenarios such as disabling of containers, or CPU, memory, network, disk, and process in containers.
- Cloud Native platform: Experimental scenarios on Kubernetes nodes such as CPU, memory, network, disk, and process; Pod network and Pod disabling scenarios; and the container scenarios listed above.
ChaosBlade encapsulates scenarios into individual projects according to the domains. This not only realizes the scenario standardization in the domain, but also facilitates horizontal and vertical scaling of scenarios. By following the chaos experiment model, chaosblade CLI is called in a unified way.
SkyWalking
SkyWalking is an open source APM system that includes monitoring, tracing, and diagnosis for distributed systems in Cloud Native architecture.
The core features are as follows:
- Analysis of services, service instances, and endpoint metrics
- Root cause analysis
- Service topology analysis
- Analysis of services, service instances, and endpoint dependencies
- Slow service and endpoint detection
- Performance optimization
- Distributed tracing and context propagation
- Detection of database access metrics and slow database access statements (including SQL statements)
- Alerts
Tool Installation and Usage
ChaosBlade is easy to install and use. All ChaosBlade scenarios are called in a unified way through the chaosblade CLI. After you download and decompress the corresponding tar package, use the blade executable file to perform chaos experiments. To download the tar package, see: https://github.com/chaosblade-io/chaosblade/releases
ChaosBlade Installation
The actual environment in this article is Linux-AMD64. Download the latest chaosblade-linux-amd64.tar.gz package and perform the following steps:
## Download
wget https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
## Decompress
tar -zxf chaosblade-0.9.0-linux-amd64.tar.gz
## Set environment variables
export PATH=$PATH:$(pwd)/chaosblade-0.9.0/
## Test
blade -h
ChaosBlade Usage
After the installation, only the blade executable file is needed to create chaos experiments for all supported scenarios. First, run blade -h to list the available commands. After selecting a subcommand, append -h at each level to see complete usage examples and a detailed description of each parameter:
1) How to Use the Blade
Run blade -h to check which commands are supported:
An easy to use and powerful chaos engineering experiment toolkit

Usage:
  blade [command]

Available Commands:
  create      Create a chaos engineering experiment
  destroy     Destroy a chaos experiment
  ...
2) Create Experiment Scenarios
For example, to create a full-load CPU scenario, run blade create cpu fullload -h to check the scenario parameters, then select the relevant parameters and run the command:
Create chaos engineering experiments with CPU load

Usage:
  blade create cpu fullload

Aliases:
  fullload, fl, load

Examples:
  # Create a CPU full load experiment
  blade create cpu load

  #Specifies two random kernel's full load
  blade create cpu load --cpu-percent 60 --cpu-count 2
  ...

Flags:
      --blade-release string   Blade release package,use this flag when the channel is ssh
      --channel string         Select the channel for execution, and you can now select SSH
      --climb-time string      durations(s) to climb
      --cpu-count string       Cpu count
      --cpu-list string        CPUs in which to allow burning (0-3 or 1,3)
      --cpu-percent string     percent of burn CPU (0-100)
  ...
3) Recover from the Experiment
ChaosBlade supports three methods to end an experiment and recover the system:
- After an experiment is successfully created, ChaosBlade returns a UID. Run blade destroy followed by that UID to recover from the experiment.
- If no UID is available, run blade destroy with the target and action (such as "blade destroy cpu fullload").
- Add the "--timeout 10" parameter when creating an experiment. The experiment is automatically destroyed after running for ten seconds. The parameter also accepts a duration expression, such as "--timeout 30m" for 30 minutes.
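The first recovery method can be scripted. The following is a minimal sketch, assuming that blade create prints a one-line JSON response whose "result" field carries the UID; the extract_uid helper is a hypothetical name introduced here, not part of ChaosBlade.

```shell
# Sketch: pull the experiment UID out of a `blade create` response.
# Assumption: the response is one-line JSON with the UID in the "result" field.
extract_uid() {
  # Print the value of "result", or nothing when the field is absent.
  printf '%s\n' "$1" | sed -n 's/.*"result":"\([^"]*\)".*/\1/p'
}

# Usage (needs blade on PATH, so it is left commented out):
# response=$(blade create cpu fullload --timeout 60)
# uid=$(extract_uid "$response")
# [ -n "$uid" ] && blade destroy "$uid"
```

Combining an explicit destroy with --timeout gives a fallback: if the script dies before destroy runs, the experiment still ends on its own.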
SkyWalking Installation and Usage
For more information about SkyWalking installation and usage, see: https://github.com/apache/skywalking/tree/v8.1.0/docs
With the tools deployed, the following cases show how to build a highly available microservices system. Through fault injection and observation of system behavior, ChaosBlade and SkyWalking help locate problems and discover system defects.
Case on Application Fault Tolerance
In this case, a microservice application is deployed for the experiments, and ApacheBench (ab) is used to simulate system requests. The application's services include the frontend, shopping cart, recommendation service, product, order, and so on. The components used include Spring Boot, Nacos, MySQL, Redis, Lettuce, and Dubbo, most of which ChaosBlade supports. ChaosBlade is used to inject faults to verify the application's fault tolerance capability. SkyWalking is used for application monitoring and problem locating.
Case Environment
- Linux-AMD64, version CentOS-7.x
- JDK 1.8
- chaosblade-0.9.0. Download address: https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
- skywalking-apm-8.1.0. Download address: https://www.apache.org/dyn/closer.cgi/skywalking/8.1.0/apache-skywalking-apm-8.1.0.tar.gz
Application Topology
In the overall architecture of the application, the frontend calls the cart and product services through Dubbo as strong dependencies.
Chaos Experiment Steps
- Develop a chaos experiment plan
- Define system steady metrics
- Make assumptions about system fault tolerance behavior
- Run chaos experiment
- Check steady metrics
- Record and recover from the chaos experiment
- Fix the problems
- Automate continuous verification
Next, a chaos experiment is performed with ChaosBlade according to these steps.
Case 1
1) Scenario
Create a chaos experiment plan: calls to a downstream service are delayed. Use ApacheBench (ab) to simulate normal access to the cart interface, with a concurrency of 2 and 10,000 requests in total.
ab -n 10000 -c 2 http://127.0.0.1:8083/cart
2) Monitoring metrics
Define system steady metrics, and select /cart endpoint in the SkyWalking console. The steady metrics are as follows:
- The average response time (RT) is around 15 ms.
- P99 metric is within 20 ms.
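The steady-state thresholds can also be cross-checked from the client side. The sketch below parses the text report that ab prints and compares values against a limit. The helper names are hypothetical, and ab measures latency at the client, so its numbers will not match the SkyWalking server-side metrics exactly.

```shell
# Sketch: read steady-state numbers from an ab report on stdin.
# Assumption: the report uses ab's standard text layout.

# Mean time per request in ms (the first "Time per request" line).
ab_mean_ms() {
  awk '/^Time per request:/ { print $4; exit }'
}

# 99th percentile in ms, from the percentile table at the end of the report.
ab_p99_ms() {
  awk '$1 == "99%" { print $2; exit }'
}

# Succeeds when value ($1) is less than or equal to limit ($2).
within_limit() {
  awk -v v="$1" -v max="$2" 'BEGIN { exit !(v + 0 <= max + 0) }'
}

# Usage (needs ab and the demo application, so it is left commented out):
# report=$(ab -n 10000 -c 2 http://127.0.0.1:8083/cart)
# p99=$(printf '%s\n' "$report" | ab_p99_ms)
# within_limit "$p99" 20 || echo "P99 of ${p99} ms is outside the steady state"
```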
3) Expectation Assumption
- A call timeout is set, so that client requests are not blocked for a long time.
- A circuit-breaking/service degradation policy is configured.
4) Chaos Experiment
The ChaosBlade installation and basic usage are described in the previous section. Now, ChaosBlade injects a latency fault (30 seconds of delay) into calls to the downstream Dubbo cart service. Run blade create dubbo delay -h to check how to inject a Dubbo call delay:
Dubbo interface to do delay experiments, support provider and consumer

Usage:
  blade create dubbo delay

Examples:
  # Invoke com.alibaba.demo.HelloService.hello() service, do delay 3 seconds experiment
  blade create dubbo delay --time 3000 --service com.alibaba.demo.HelloService --methodname hello --consumer

Flags:
      --appname string          The consumer or provider application name
      --consumer                To tag consumer role experiment.
      --effect-count string     The count of chaos experiment in effect
      --effect-percent string   The percent of chaos experiment in effect
      --group string            The service group
  -h, --help                    help for delay
      --methodname string       The method name
      --offset string           delay offset for the time
      --override                only for java now, uninstall java agent
      --pid string              The process id
      --process string          Application process name
      --provider                To tag provider experiment
      --service string          The service interface
      --time string             delay time (required)
      --timeout string          set timeout for experiment in seconds
      --version string          the service version

Global Flags:
  -d, --debug        Set client to DEBUG mode
      --uid string   Set Uid for the experiment, adapt to docker
According to the example and the parameter descriptions, the latency fault (30 seconds) needs to be injected on the client (consumer) side of the upstream service. With SkyWalking, the Dubbo service information on the trace can be found easily. Search for traces with the endpoint /cart first, and find the Dubbo service, as shown in the following figure:
- Trace search
- Receive detailed information about the protocol.
Click the span to check the detailed span information of the Dubbo service. The URL of the Dubbo service provides the parameters that ChaosBlade needs to inject the delay on the upstream (consumer) side. The final parameters are:
- --time 30000: 30 s of delay
- --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService: Service interface
- --methodname viewCart: Service method
- --process frontend: Java process name
- --consumer: Inject on the Dubbo consumer (client) side
Run the command to inject the fault:
blade create dubbo delay --time 30000 --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService --methodname viewCart --process frontend --consumer
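Because a 30-second delay on a live call path is disruptive, it helps to tie the fault's lifetime to the load test. The sketch below injects a fault, runs a command, then destroys the fault and returns the command's exit status. The with_fault helper and the BLADE variable are hypothetical names, and the sketch assumes blade create prints one-line JSON with the UID in "result".

```shell
# Sketch: run a command while a fault is active, then destroy the fault.
# Assumptions: BLADE names the blade executable; `blade create` prints
# one-line JSON with the experiment UID in the "result" field.
BLADE=${BLADE:-blade}

with_fault() {
  # $1: arguments for `blade create` as one string (split on whitespace).
  # Remaining arguments: the command to run while the fault is active.
  fault_args=$1; shift
  response=$($BLADE create $fault_args) || { echo "inject failed: $response" >&2; return 1; }
  uid=$(printf '%s\n' "$response" | sed -n 's/.*"result":"\([^"]*\)".*/\1/p')
  [ -n "$uid" ] || { echo "no UID in response: $response" >&2; return 1; }
  "$@"; status=$?
  # Destroy the experiment whether or not the command succeeded.
  $BLADE destroy "$uid" >/dev/null
  return $status
}

# Usage (needs blade, ab, and the demo application, so it is left commented out):
# with_fault "dubbo delay --time 30000 --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService --methodname viewCart --process frontend --consumer --timeout 600" \
#   ab -n 10000 -c 2 http://127.0.0.1:8083/cart
```

The --timeout value in the usage line is an arbitrary backstop for the case where the script is killed before destroy runs.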
5) Monitoring Metrics
After fault injection, check the system metrics in SkyWalking:
- The average RT is about 2,000 ms, and the P99 metric is about 2,000 ms.
- Calls to the /cart interface report an error indicating that the "com.alibabacloud.hipstershop.cartserviceapi.service.CartService" service is abnormal.
- A timeout error occurs. The timeout period is 2,000 ms.
Conclusion: The upstream service is configured with a call timeout but has no circuit-breaking policy, so the result does not match the expectation.
6) Problem Fix
Configure a circuit-breaking/service degradation policy.
Case 2
1) Scenario
While running, the Dubbo service provider fails to access the registry. A network packet loss fault (100%) is injected on the registry port.
2) Monitoring Metrics
Define system steady metrics and select service endpoints in the SkyWalking console. The steady metrics are as follows:
- The “com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart” service is normal.
3) Expectation Assumption
Upstream and downstream services will not be affected.
4) Chaos Experiment
Inject a packet loss fault (100%) on the registry port. In this case, Nacos is used as the registry for Dubbo. The default port is 8848, and the network interface card (NIC) is eth0. The command parameters are as follows:
- --interface eth0: NIC
- --percent 100: 100% packet loss rate
- --local-port 8848: Local port 8848
Run the command to inject the fault:
blade create network loss --interface eth0 --percent 100 --local-port 8848
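Before concluding that the services tolerate the registry outage, it is worth confirming that the fault actually took effect, because an experiment that silently failed to inject looks the same as a pass. The sketch below is one way to probe the port from another host. It assumes bash (for /dev/tcp) and the coreutils timeout command are available; port_reachable and NACOS_HOST are hypothetical names.

```shell
# Sketch: succeed when a TCP connection to host ($1) and port ($2) opens
# within the given number of seconds ($3, default 3).
# Assumptions: bash (for /dev/tcp) and coreutils `timeout` are installed.
port_reachable() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage (NACOS_HOST is a placeholder for the registry address):
# if port_reachable "$NACOS_HOST" 8848; then
#   echo "registry still reachable: the fault is not in effect"
# else
#   echo "registry unreachable: the fault is in effect"
# fi
```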
5) Monitoring Metrics
Select the service endpoint in the SkyWalking console after fault injection. Steady metrics are as follows:
- The “com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart” service is normal.
Conclusion: The service is weakly dependent on the registry and the service itself has a local cache, which is in line with the expected assumption.
If the application is deployed in a Kubernetes cluster, the horizontal scaling capability of the registry can also be verified, because ChaosBlade supports Kubernetes cluster scenarios.
Simple Practice
The cases above tested whether the service is configured with timeout and circuit-breaking policies, and verified that Dubbo depends only weakly on the registry and that the service keeps a local cache. Want to try this on your own system? ChaosBlade provides a wide range of experiment scenarios. It supports basic resources and applications, and it is also a powerful tool for Cloud Native platforms. It offers detailed parameters to keep the blast radius of a fault as small as possible, which makes it easy to get started.
Talking on paper is not enough, so here is one more simple case to practice on. Application development often involves relational databases. When application traffic increases rapidly, bottlenecks often appear on the database side and produce many slow SQL statements. Without a slow SQL alert, it is difficult to find the offending SQL to optimize, so slow SQL alerting is very important. To verify whether an application has this capability, use ChaosBlade to inject a MySQL slow SQL fault. Run blade create mysql delay -h to see how to inject a MySQL call delay:
Mysql delay experiment

Usage:
  blade create mysql delay

Examples:
  # Do a delay 2s experiment for mysql client connection port=3306 INSERT statement
  blade create mysql delay --time 2000 --sqltype select --port 3306

Flags:
      --database string         The database name which used
      --effect-count string     The count of chaos experiment in effect
      --effect-percent string   The percent of chaos experiment in effect
  -h, --help                    help for
      --host string             The database host
      --offset string           delay offset for the time
      --override                only for java now, uninstall java agent
      --pid string              The process id
      --port string             The database port which used
      --process string          Application process name
      --sqltype string          The sql type, for example, select, update and so on.
      --table string            The first table name in sql.
      --time string             delay time (required)
      --timeout string          set timeout for experiment in seconds

Global Flags:
  -d, --debug        Set client to DEBUG mode
      --uid string   Set Uid for the experiment, adapt to docker
As shown, ChaosBlade provides a complete example and supports fine-grained parameters, such as SQL types and table names. Try injecting a 10 s delay into "select" operations on connections to port 3306. Does your application raise an alert when the traffic hits?
blade create mysql delay --time 10000 --sqltype select --port 3306
Command parameter explanation:
- --time 10000: 10 s of delay
- --sqltype select: Only SELECT statements are affected.
- --port 3306: Only connections to port 3306 are affected.
Summary
This article describes the application of Chaos Engineering in complex distributed architectures. It also shows how chaos experiments with ChaosBlade and SkyWalking help analyze and optimize a system based on its behavior under faults, so that the stability and high availability of the system improve continuously. ChaosBlade supports basic resources and applications, and it also serves as a useful tool on Cloud Native platforms. You are welcome to use it.
The ChaosBlade project is available at this address: https://github.com/chaosblade-io/chaosblade. You are welcome to join us and work together! Visit the following page for the Contribution Guide.