ChaosBlade x SkyWalking: High Availability Microservices Practices

By Ye Fei (GitHub account: @tiny-x), an open-source community enthusiast and ChaosBlade committer who helps promote the chaos engineering ecosystem around ChaosBlade.


Chaos Engineering (CE) was developed to overcome the challenges of operating distributed systems. By injecting faults into a system within a controlled scope or environment, developers can observe system behavior and locate defects. This builds the ability, and the confidence, to handle the chaos that unexpected conditions cause in a distributed system, so its stability and high availability improve continuously.

The implementation process of Chaos Engineering is: formulate a chaos experiment plan, define steady-state metrics, make assumptions about the system's fault-tolerant behavior, execute the chaos experiment, check the steady-state metrics, and so on. The whole process therefore requires a reliable, easy-to-use chaos experiment tool with rich scenarios to inject faults. Complete distributed tracing and system monitoring tools are also required: they trigger emergency response and early-warning schemes, help locate the fault quickly, and let you observe the system's metrics throughout the process.
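As a rough illustration, the process above can be sketched as a loop in code. The metric source and fault-injection hooks below are stand-ins, not real ChaosBlade or SkyWalking APIs:

```python
# A minimal sketch of the chaos-experiment loop described above.
# get_steady_metric, inject_fault, and destroy_fault are placeholders:
# in practice they would query a monitoring system and shell out to `blade`.

def run_chaos_experiment(get_steady_metric, inject_fault, destroy_fault,
                         threshold_ms=20):
    """Check the steady state, inject a fault, observe, then always recover."""
    baseline = get_steady_metric()
    assert baseline <= threshold_ms, "system is not steady before the experiment"

    uid = inject_fault()          # e.g. shell out to `blade create ...`
    try:
        observed = get_steady_metric()
        # The hypothesis holds if the steady metric stays within bounds.
        return observed <= threshold_ms
    finally:
        destroy_fault(uid)        # always recover the experiment afterwards

# Stubbed demo: a fault that pushes latency from 15 ms to 2000 ms.
state = {"delay": 0}
hypothesis_held = run_chaos_experiment(
    get_steady_metric=lambda: 15 + state["delay"],
    inject_fault=lambda: state.update(delay=1985) or "uid-123",
    destroy_fault=lambda uid: state.update(delay=0),
)
print(hypothesis_held)  # the fault broke the steady metric, so False
```

The `finally` block mirrors an important practice: the experiment is recovered even when the hypothesis fails.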

In this article, we will introduce the chaos experiment tool ChaosBlade and the distributed system monitoring tool SkyWalking, and then walk through high-availability microservices practices with both tools using a microservices case.

Tool Introduction


ChaosBlade is easy to use and supports a wide range of experiment scenarios, including the following:

  • Basic resources: experiment scenarios for CPU, memory, network, disk, and processes.
  • Java applications: databases, caches, messages, the JVM, and microservices; faults can be injected into any specified method.
  • C++ applications: scenarios such as injecting delays, modifying variables, and tampering with return values for specified methods or lines of code.
  • Docker containers: scenarios such as killing containers, or CPU, memory, network, disk, and process faults inside containers.
  • Cloud-native platforms: Kubernetes scenarios such as CPU, memory, network, disk, and process faults; Pod network faults and Pod deletion; plus the container scenarios above.

ChaosBlade encapsulates scenarios into separate projects by domain. This standardizes the scenarios within each domain and also makes it easy to scale scenarios horizontally and vertically. Because every scenario follows the chaos experiment model, all of them are invoked in a unified way through the chaosblade CLI.
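The unified invocation can be pictured roughly as follows: every scenario is a target plus an action plus matcher flags, rendered into one `blade create` command. The field names here paraphrase the chaos experiment model; check the ChaosBlade documentation for the exact terms.

```python
# Sketch of the chaos experiment model behind the unified CLI:
# a target (what to attack), an action (the fault), and matchers
# (flags that narrow the blast radius).

from dataclasses import dataclass, field

@dataclass
class Experiment:
    target: str                     # e.g. "cpu", "network", "dubbo", "mysql"
    action: str                     # e.g. "fullload", "loss", "delay"
    matchers: dict = field(default_factory=dict)

    def to_command(self) -> str:
        flags = " ".join(f"--{k} {v}" for k, v in self.matchers.items())
        return f"blade create {self.target} {self.action} {flags}".strip()

exp = Experiment("cpu", "fullload", {"cpu-percent": 60, "cpu-count": 2})
print(exp.to_command())
# blade create cpu fullload --cpu-percent 60 --cpu-count 2
```

Because all scenarios share this shape, new domains can be added without changing how the CLI is driven.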


SkyWalking's core features are as follows:

  • Analysis of services, service instances, and endpoint metrics
  • Root cause analysis
  • Service topology analysis
  • Analysis of services, service instances, and endpoint dependencies
  • Slow service and endpoint detection
  • Performance optimization
  • Distributed tracing and context propagation
  • Detection of database access metrics and slow database access statements (including SQL statements)
  • Alerts

Tool Installation and Usage

ChaosBlade Installation

## Download
## Decompress
tar -zxf chaosblade-0.9.0-linux-amd64.tar.gz
## Set environment variables
export PATH=$PATH:chaosblade-0.9.0/
## Test
blade -h

ChaosBlade Usage

1) How to Use the Blade

The blade -h command can be executed to check which commands are supported:

An easy to use and powerful chaos engineering experiment toolkit

Usage:
  blade [command]

Available Commands:
  create      Create a chaos engineering experiment
  destroy     Destroy a chaos experiment

2) Create Experiment Scenarios

For example, to create a full-load CPU scenario, execute the blade create cpu fullload -h command to check the scenario-specific parameters, then select the relevant parameters and run the command:

Create chaos engineering experiments with CPU load

Usage:
  blade create cpu fullload

Aliases:
  fullload, fl, load

Examples:
# Create a CPU full load experiment
blade create cpu load

# Specify full load on two random cores
blade create cpu load --cpu-percent 60 --cpu-count 2

Flags:
  --blade-release string   Blade release package, use this flag when the channel is ssh
  --channel string         Select the channel for execution, and you can now select SSH
  --climb-time string      duration(s) to climb
  --cpu-count string       Cpu count
  --cpu-list string        CPUs in which to allow burning (0-3 or 1,3)
  --cpu-percent string     percent of burn CPU (0-100)

3) Recover the Experiment

ChaosBlade supports three ways to recover an experiment:

  • After an experiment is created successfully, ChaosBlade returns a UID. Execute the blade destroy UID command to recover the experiment.
  • If no UID is at hand, execute blade destroy target action (for example, blade destroy cpu fullload).
  • Add the --timeout 10 parameter when creating the experiment, and it recovers automatically after running for ten seconds. The parameter also accepts an expression, such as --timeout 30m for 30 minutes.
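The first method can be scripted: a successful `blade create` prints a JSON result containing the experiment UID, which can be saved for the later `blade destroy`. The exact JSON shape below is an assumption based on typical ChaosBlade output; verify it against your installed version.

```python
# Extract the experiment UID from the JSON that `blade create` prints,
# so the experiment can later be recovered with `blade destroy <UID>`.
# NOTE: the sample JSON is illustrative, not captured from a real run.

import json

def extract_uid(blade_output: str) -> str:
    data = json.loads(blade_output)
    if not data.get("success"):
        raise RuntimeError(f"blade create failed: {data}")
    return data["result"]

sample = '{"code":200,"success":true,"result":"3162f48c2863f913"}'
uid = extract_uid(sample)
print(uid)  # 3162f48c2863f913
```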

SkyWalking Installation and Usage

After the tools are deployed, it's time to build a highly available microservices system with a case as reference. By injecting faults and observing system behavior, ChaosBlade and SkyWalking can locate problems and uncover system defects, helping build a highly available microservices system.

Case on Application Fault Tolerance

Case Environment

Application Topology

Chaos Experiment Steps

  • Define system steady metrics
  • Make assumptions about system fault tolerance behavior
  • Run chaos experiment
  • Check steady metrics
  • Record and resume chaos experiment
  • Fix the problems
  • Automated continuous verification

Next, a chaos experiment will be performed using ChaosBlade according to the steps.

Case 1

1) Experiment Plan

Create a chaos experiment plan: inject frequent delays into calls to the downstream cart service. Use the ab (Apache Bench) tool to simulate normal access to the /cart interface, starting two concurrent threads and issuing 10,000 requests in total:

ab -n 10000 -c 2

2) Monitoring Metrics

Define system steady metrics, and select /cart endpoint in the SkyWalking console. The steady metrics are as follows:

  • The average response time (RT) is around 15 ms.
  • P99 metric is within 20 ms.
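These steady metrics can be computed from the raw per-request response times (for example, the latencies an ab run reports). The sample data below is made up for illustration:

```python
# Compute the average RT and a nearest-rank P99 from latency samples.
# The latencies here are illustrative, not measured from the case system.

def percentile(samples, p):
    """Nearest-rank percentile: the value at the p-th percent rank."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [14, 15, 13, 16, 15, 14, 17, 15, 19, 15]
avg_rt = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)
print(f"avg RT = {avg_rt:.1f} ms, P99 = {p99} ms")
```

Monitoring systems such as SkyWalking compute these aggregates for you; the point is only that "steady" must be defined as concrete numbers before the fault is injected.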

3) Expectation Assumption

  • Set the timeout period for calls to avoid client request blocking for a long time.
  • Configure a circuit breaking/service degradation policy.

4) Chaos Experiment

The ChaosBlade installation and its basic usage were described in the previous section. Now, a latency fault (30 seconds of delay) is injected into the call to the downstream Dubbo cart service with ChaosBlade. Execute the blade create dubbo delay -h command to see how to create a Dubbo call delay:

Dubbo interface to do delay experiments, support provider and consumer

Usage:
  blade create dubbo delay

Examples:
# Invoke service, do delay 3 seconds experiment
blade create dubbo delay --time 3000 --service --methodname hello --consumer

Flags:
  --appname string          The consumer or provider application name
  --consumer                To tag consumer role experiment.
  --effect-count string     The count of chaos experiment in effect
  --effect-percent string   The percent of chaos experiment in effect
  --group string            The service group
  -h, --help                help for delay
  --methodname string       The method name
  --offset string           delay offset for the time
  --override                only for java now, uninstall java agent
  --pid string              The process id
  --process string          Application process name
  --provider                To tag provider experiment
  --service string          The service interface
  --time string             delay time (required)
  --timeout string          set timeout for experiment in seconds
  --version string          the service version

Global Flags:
  -d, --debug               Set client to DEBUG mode
  --uid string              Set Uid for the experiment, adapt to docker

According to the case and the parameter descriptions, the latency fault (30 seconds) needs to be injected on the upstream service's client side. With SkyWalking, the Dubbo service information on the trace is easy to find: search for the trace with the endpoint /cart and locate the Dubbo service, as shown in the following figure:

  • Trace search
  • Detailed information about the trace

Click in to check the detailed span information of the Dubbo service. Once the URL of the Dubbo service is obtained, all the parameters ChaosBlade needs to inject the delay on the upstream service are available. The final parameters are:

  • --time 30000: 30 s of delay
  • --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService: the service interface
  • --methodname viewCart: the service method
  • --process frontend: the Java process
  • --consumer: the experiment targets the Dubbo client side

Issue the command and inject the fault:

blade create dubbo delay --time 30000 --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService --methodname viewCart --process frontend --consumer

5) Monitoring Metrics

  • The average RT is about 2,000 ms, and the P99 metric is about 2,000 ms.
  • An error is reported when the /cart interface is called, indicating that the "com.alibabacloud.hipstershop.cartserviceapi.service.CartService" service is abnormal.
  • A timeout error occurs. The timeout period is 2,000 ms.

Conclusion: The upstream service is configured with a call timeout but has no circuit breaking policy, which does not match the expectation.

6) Problem Fix

Configure a circuit breaking/service degradation policy.
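To illustrate the fix, here is a minimal circuit-breaker sketch: after a run of failures the breaker opens and calls are degraded to a fallback instead of waiting on the slow downstream. In a real Dubbo setup this would be configured through a framework such as Sentinel or Hystrix rather than hand-rolled; the code below is only a sketch of the idea.

```python
# Minimal circuit breaker: open after N consecutive failures, then degrade.

class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    def call(self, remote, fallback):
        if self.failures >= self.failure_threshold:   # breaker open: degrade
            return fallback()
        try:
            result = remote()
            self.failures = 0                         # success closes the breaker
            return result
        except TimeoutError:
            self.failures += 1
            return fallback()

def slow_cart_service():
    # Stands in for the delayed viewCart call hitting the 2000 ms client timeout.
    raise TimeoutError("downstream delayed 30s, client timed out")

breaker = CircuitBreaker()
results = [breaker.call(slow_cart_service, lambda: "empty cart (degraded)")
           for _ in range(5)]
print(results.count("empty cart (degraded)"))  # all 5 calls degrade
```

After the first three timeouts the breaker opens, so the last two calls never touch the slow downstream at all, which is what keeps the /cart RT steady under this fault.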

Case 2

1) Experiment Plan

Simulate the Dubbo service provider failing to access the registry at runtime by injecting a network packet loss fault (100%) into the registry.

2) Monitoring Metrics

Define system steady metrics and select service endpoints in the SkyWalking console. The steady metrics are as follows:

  • The “com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart” service is normal.

3) Expectation Assumption

Upstream and Downstream services will not be affected.

4) Chaos Experiment

Inject packet loss fault (100%) in registry port. In this case, Nacos is used as the registry for Dubbo. The default port is 8848, and the network interface card (NIC) is eth0. The command parameters are as follows:

  • --interface eth0: the NIC
  • --percent 100: 100% packet loss rate
  • --local-port 8848: the local port

Issue the command and inject the fault:

blade create network loss --interface eth0 --percent 100 --local-port 8848

5) Monitoring Metrics

Select the service endpoint in the SkyWalking console after fault injection. Steady metrics are as follows:

  • The “com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart” service is normal.

Conclusion: The service depends only weakly on the registry and keeps a local cache, which matches the expected assumption.
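The weak dependency verified here can be sketched as follows: the consumer caches provider addresses locally, so lookups keep working when the registry is unreachable. The class and function names below are illustrative, not Dubbo or Nacos APIs.

```python
# Sketch of a registry client with a local cache fallback.

class CachedRegistryClient:
    def __init__(self, registry_lookup):
        self.registry_lookup = registry_lookup
        self.cache = {}

    def resolve(self, service):
        try:
            providers = self.registry_lookup(service)   # may hit packet loss
            self.cache[service] = providers             # refresh local cache
            return providers
        except ConnectionError:
            return self.cache[service]                  # fall back to cache

def registry_up(service):
    return ["10.0.0.1:20880"]

def registry_down(service):
    raise ConnectionError("100% packet loss to registry port 8848")

client = CachedRegistryClient(registry_up)
client.resolve("CartService")            # warms the cache while registry is up
client.registry_lookup = registry_down   # simulate the injected network fault
print(client.resolve("CartService"))     # still resolves, served from cache
```

This is exactly the behavior the chaos experiment confirmed: address resolution survives registry outages, though newly started consumers with an empty cache would not.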

Assume the application is now deployed in a Kubernetes cluster; the horizontal scaling capability of the registry can then also be verified, since ChaosBlade supports Kubernetes cluster scenarios as well.

Simple Practice

Talking on paper is not enough, so here is an additional simple case for everyone to practice. Applications often work with relational databases, and when traffic spikes, the bottleneck frequently appears on the database side, producing many slow SQL statements. Without slow-SQL alerts, it is hard to find the original SQL to optimize, so slow-SQL alerting is very important. To verify whether an application supports this capability, use ChaosBlade to inject a MySQL slow-SQL fault. Execute the blade create mysql delay -h command to see how to create a MySQL call delay:

Mysql delay experiment

Usage:
  blade create mysql delay

Examples:
# Do a delay 2s experiment for mysql client connection port=3306 SELECT statement
blade create mysql delay --time 2000 --sqltype select --port 3306

Flags:
  --database string         The database name which used
  --effect-count string     The count of chaos experiment in effect
  --effect-percent string   The percent of chaos experiment in effect
  -h, --help                help for delay
  --host string             The database host
  --offset string           delay offset for the time
  --override                only for java now, uninstall java agent
  --pid string              The process id
  --port string             The database port which used
  --process string          Application process name
  --sqltype string          The sql type, for example, select, update and so on.
  --table string            The first table name in sql.
  --time string             delay time (required)
  --timeout string          set timeout for experiment in seconds

Global Flags:
  -d, --debug               Set client to DEBUG mode
  --uid string              Set Uid for the experiment, adapt to docker

As shown, ChaosBlade provides a complete example and supports fine-grained parameters such as SQL types and table names. Try adding a 10 s delay to SELECT operations on connections to port 3306. Does your application raise an alert when the traffic comes in?

blade create mysql delay --time 10000 --sqltype select --port 3306

Command parameter explanation:

  • --time 10000: 10 s of delay
  • --sqltype select: only SELECT statements are affected
  • --port 3306: only connections to port 3306 are affected
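What this experiment verifies is whether the application has a slow-SQL alert hook of roughly the following shape: wrap query execution, record the elapsed time, and raise an alert when it crosses a threshold. The executor below is a stub standing in for a real MySQL call; names and threshold are illustrative.

```python
# Minimal slow-SQL detection: time each query and record an alert when
# it exceeds the threshold. fake_mysql stands in for a real driver call.

import time

SLOW_SQL_THRESHOLD_MS = 1000
alerts = []

def timed_query(executor, sql):
    start = time.monotonic()
    result = executor(sql)
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms >= SLOW_SQL_THRESHOLD_MS:
        alerts.append((sql, round(elapsed_ms)))   # ship to your alerting system
    return result

def fake_mysql(sql):
    # Simulate the injected fault: SELECT statements are delayed.
    time.sleep(1.2 if sql.lstrip().upper().startswith("SELECT") else 0)
    return []

timed_query(fake_mysql, "SELECT * FROM cart")           # delayed -> alert
timed_query(fake_mysql, "INSERT INTO cart VALUES (1)")  # fast -> no alert
print(len(alerts))  # 1
```

If the chaos experiment produces 10 s SELECTs and no alert fires, the application is missing exactly this kind of hook.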


The ChaosBlade project is open source. You are welcome to join us and work together! See the project's Contribution Guide to get started.
