ChaosBlade — An Open-Source Chaos Engineering Tool by Alibaba

The most effective way to reduce faults is to make them happen more often. By repeatedly reproducing faults within a controlled scope or environment, we can continually improve a system's fault tolerance and resilience.

How many steps does it take to run an effective chaos engineering experiment? Just two.

Step 1: Log on to ChaosBlade.

Step 2: Download the release version to build a tool for fault drills.
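Once the release is unpacked, a drill is driven through the `blade` command-line tool. The sketch below is a minimal illustration rather than part of the project: it only builds the argument lists for creating and destroying an experiment, so they can be handed to `subprocess.run`. The `blade create cpu fullload` and `blade destroy <uid>` forms follow the project's README. The `./blade` path, the helper names, and the exact shape of the JSON reply are assumptions.

```python
import json
import shlex

BLADE = "./blade"  # assumed path to the binary inside the unpacked release


def create_command(target, action, **flags):
    """Build the argv for `blade create <target> <action> [--flag value ...]`."""
    argv = [BLADE, "create", target, action]
    for name, value in flags.items():
        # Python keyword names use underscores; CLI flags use hyphens.
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


def destroy_command(uid):
    """Build the argv for `blade destroy <uid>`, which ends an experiment."""
    return [BLADE, "destroy", uid]


def parse_uid(output):
    """Extract the experiment UID from the CLI's JSON reply.

    Assumes a reply shaped like
    {"code": 200, "success": true, "result": "<uid>"}.
    """
    reply = json.loads(output)
    if not reply.get("success"):
        raise RuntimeError("blade reported a failure: " + output)
    return reply["result"]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    print(shlex.join(create_command("cpu", "fullload")))
    print(shlex.join(destroy_command("example-uid")))
```

Keeping the UID returned by the create step matters: it is what makes the fault controllable, because the same UID is passed to the destroy step to end the experiment.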

Service Stability through High Availability

Alibaba Cloud Performance Testing Service (PTS), for example, lets you build an end-to-end stress testing system, and the open source component Sentinel provides throttling and degradation. After six years of refinement and practice, including tens of thousands of online drills, Alibaba has condensed its ideas and practices in the fault drill field into a chaos engineering tool, ChaosBlade, and released it as an open source project.

To access the project and experience the demo, click here.

Introduction to ChaosBlade

ChaosBlade is released under the Apache License 2.0 and currently has two repositories: chaosblade and chaosblade-exec-jvm.

The chaosblade repository contains the command line interface (CLI), the basic resource experiments implemented in Golang, and the container-related chaos experiment executors. The chaosblade-exec-jvm repository is the ChaosBlade executor for chaos experiments on applications that run on Java virtual machines (JVMs).
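The split between one CLI and several executors can be pictured as a registry that maps an experiment target to the code that knows how to inject faults into it. The following is a rough sketch of that idea only. Every name in it (`register`, `dispatch`, the two sample executors) is hypothetical and does not mirror ChaosBlade's actual Go interfaces.

```python
from typing import Callable, Dict, Optional

# An executor takes an action name plus flags and returns a status message.
Executor = Callable[[str, Dict[str, str]], str]

_EXECUTORS: Dict[str, Executor] = {}


def register(target: str):
    """Decorator that registers an executor for one experiment target."""
    def wrap(fn: Executor) -> Executor:
        _EXECUTORS[target] = fn
        return fn
    return wrap


@register("cpu")
def cpu_executor(action: str, flags: Dict[str, str]) -> str:
    # Stand-in for a basic-resource executor shipped with the CLI.
    return f"os executor: cpu {action} {flags}"


@register("jvm")
def jvm_executor(action: str, flags: Dict[str, str]) -> str:
    # Stand-in for an executor maintained in a separate repository.
    return f"jvm executor: {action} {flags}"


def dispatch(target: str, action: str,
             flags: Optional[Dict[str, str]] = None) -> str:
    """Route an experiment to the executor registered for its target."""
    try:
        executor = _EXECUTORS[target]
    except KeyError:
        raise ValueError(f"no executor registered for target {target!r}") from None
    return executor(action, flags or {})


if __name__ == "__main__":
    print(dispatch("cpu", "fullload"))
    print(dispatch("jvm", "delay", {"time": "3000"}))
```

With this shape, supporting a new language or platform means registering one more executor, while the CLI that users learn stays the same.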

Later on, the ChaosBlade community will add chaos experiment executors for other languages such as C++ and Node.js.

Reasons for Making ChaosBlade Open Source

Before ChaosBlade was released, many outstanding open source chaos engineering tools already existed, each effectively addressing the problems of a particular field. However, many of them are unfriendly to beginners, and some support only a limited set of scenarios. This is why people find chaos engineering difficult to put into practice.

Drawing on years of experience in practicing chaos engineering, Alibaba Group has released the chaos engineering experiment tool ChaosBlade as an open source project for the following purposes:

  • To help more people understand and work on chaos engineering.
  • To streamline the process of chaos engineering.
  • To discover and refine more chaos engineering experiment scenarios through community contributions, and to jointly advance the field of chaos engineering.

Problems Solved by ChaosBlade

Measuring Fault Tolerance Capability of Microservices

Verifying the Reasonableness of Container Orchestration Configuration

Verifying the Robustness of the PaaS layer

Verifying the Timeliness of Monitoring Alarms

Verifying the Ability to Locate and Solve Problems upon Emergencies

Functions and Features of ChaosBlade

Extensive Scenarios

Simple to Use and Easy to Understand

Convenient Scenario Scaling

Evolution History of ChaosBlade

EOS is an earlier version of the fault drill platform. Faults are injected into the system through bytecode enhancement. EOS supports simulating common remote procedure call (RPC) faults, and managing strong and weak dependencies of microservices.
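Bytecode enhancement rewrites a method at runtime so that extra behavior, here a fault, runs around the original logic. Python has no JVM bytecode, so the sketch below uses runtime method wrapping as a stand-in for the same idea. It is an analogy under that assumption, not how EOS is implemented, and `InventoryClient`, `inject`, and `InjectedFault` are hypothetical names.

```python
import functools
import time


class InjectedFault(Exception):
    """Raised in place of the real call when an exception fault is active."""


def inject(cls, method_name, *, delay=0.0, error=None):
    """Wrap cls.method_name so each call first sleeps, then optionally raises.

    Returns a function that restores the original method.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def faulty(*args, **kwargs):
        if delay:
            time.sleep(delay)  # simulated slow RPC
        if error is not None:
            raise error        # simulated failed RPC
        return original(*args, **kwargs)

    setattr(cls, method_name, faulty)
    return lambda: setattr(cls, method_name, original)


class InventoryClient:
    """Hypothetical RPC client used only for this illustration."""

    def stock(self, item):
        return 7


if __name__ == "__main__":
    restore = inject(InventoryClient, "stock", error=InjectedFault("timeout"))
    try:
        InventoryClient().stock("book")
    except InjectedFault as exc:
        print("caller saw:", exc)
    finally:
        restore()  # always undo the fault so the drill stays controlled
    print("after restore:", InventoryClient().stock("book"))
```

The caller's code is untouched in both cases, which is what makes this style of injection suitable for checking how a service behaves when a dependency is slow or failing.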

MonkeyKing (2016–2018):

MonkeyKing is an upgraded version of the fault drill platform with a wider range of fault scenarios, such as resource-layer and container-layer scenarios. MonkeyKing supports large-scale drills in the production environment.

AHAS (September 2018–Present):

Alibaba Cloud Application High Availability Service (AHAS) integrates all functions of the fault drill platform and supports orchestrated drills and drill plug-ins. AHAS also integrates the architecture awareness feature and the throttling and degradation features.

ChaosBlade (March 2019):

ChaosBlade provides the underlying fault injection for the MonkeyKing platform. It defines a set of fault models by abstracting the underlying fault injection capabilities of the fault drill platform, and it is released as an open source project together with a user-friendly CLI tool to help cloud-native users run chaos engineering experiments.

Short-Term Planning

Function Iteration

  • Enhance the Kubernetes drill scenarios
  • Add support for applications written in other languages, such as C++ and Node.js

Community Construction

  • Architecture design
  • Module design
  • Code implementation
  • Bug fix
  • Demo
  • Document and website translation

About the Authors

Zhou Yang

Senior technical expert on the Alibaba Cloud AHAS team, with years of experience in research and development of stability products, architectural evolution, and support for normal operation and large-scale promotion campaigns. Zhou Yang is also the founder of the fault drill platform MonkeyKing, technical owner of the AHAS product, and proponent of chaos engineering.

Xiao Changjun (Qionggu)

Senior development engineer at Alibaba Cloud, with years of experience in application performance monitoring and chaos engineering. Xiao Changjun is a core developer of Alibaba Cloud AHAS and the owner of the ChaosBlade open source project.
