How to Locate Bottlenecks During Performance Tests and Address Occasional Timeouts?
By Qi Zhang (Sichu)
Are you still hunting for errors the tedious, traditional way, reading your code line by line?
Fret not! As the saying goes, “Arthas keeps you from worrying about troubleshooting.” Let’s talk about Arthas, the magical Java diagnostics tool.
What Is Arthas?
Arthas is an open-source online diagnostics tool for Java that works through an interactive command line and also supports web-based diagnostics. It provides comprehensive Tab auto-completion to make locating and diagnosing problems easier. Used by 99% of Alibaba developers, Arthas has earned 20,000 stars on GitHub since it was open-sourced more than a year ago. Rather than downloading it directly, I recommend using Arthas through Cloud Toolkit, an integrated development environment (IDE) plug-in, for one-click remote diagnostics.
Thanks to its powerful and comprehensive features, Arthas can do more than you might expect. It offers a quick, targeted answer to each of the following concerns:
- You want a global view for monitoring the running status of your system.
- You are wondering what occupies the CPU whenever the CPU usage rises.
- You want to figure out whether there is any deadlock or blocking in multiple threads that are running simultaneously.
- It takes a long time to run an app and you are not sure which part is time-consuming. You want a solution to monitor it.
- You do not know the specific JAR package from which a class was loaded. You want to figure out why your system throws various class-related exceptions.
- You are not sure why the execution of your modified code failed. You cannot remember whether you have committed the changes. You are not sure if you are using the right branch.
- A problem occurs and you cannot debug online. You are wondering whether you have to add logs to your app and publish it again.
- You want a solution to monitor the real-time running status of your JVM.
For more information about the commands and features of Arthas, see its Official Documentation.
In the following sections, I will list several scenarios that I came across recently.
Scenario 1: Locating the Performance Bottleneck During a Stress Test
Under normal load, requests to the server were handled without problems. During a stress test, however, all CPUs on the server approached 100% utilization even though the dependent services and databases had not hit a bottleneck. Why?
The jstack command shows the stack only at a single point in time, which makes it hard to locate the real problem.
The Arthas thread command shows information about the current threads and their stacks.
Use thread -n 3 -i 10000 to sample thread activity over a 10-second interval (10,000 ms) and print the stacks of the 3 busiest threads, which makes the problem much easier to locate. The cause I finally found was relatively simple: every log entry printed location information, including the class name, method name, and line number.
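To illustrate why sampling over an interval is more telling than a single jstack snapshot, here is a minimal sketch that ranks threads by the CPU time they consume between two snapshots, using the standard java.lang.management API. It only approximates the idea behind thread -n 3 -i 10000 and is not how Arthas is implemented; the class name BusiestThreads and the method sample are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: an approximation of interval-based sampling, not Arthas code.
public class BusiestThreads {

    /** Returns up to topN thread IDs, ordered by CPU time consumed during the interval. */
    public static List<Long> sample(int topN, long intervalMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // First snapshot: CPU time in nanoseconds per live thread (-1 if unavailable).
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }

        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag and use what we have
        }

        // Second snapshot: keep the delta for threads that were measurable in both snapshots.
        Map<Long, Long> delta = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long now = mx.getThreadCpuTime(id);
            Long then = before.get(id);
            if (now >= 0 && then != null && then >= 0) {
                delta.put(id, now - then);
            }
        }

        // Sort by consumed CPU time, busiest first, and keep the top N.
        List<Long> ids = new ArrayList<>(delta.keySet());
        ids.sort((a, b) -> Long.compare(delta.get(b), delta.get(a)));
        return new ArrayList<>(ids.subList(0, Math.min(topN, ids.size())));
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (long id : sample(3, 10_000)) {
            ThreadInfo info = mx.getThreadInfo(id, Integer.MAX_VALUE);
            if (info == null) {
                continue; // the thread exited after sampling
            }
            System.out.println(info.getThreadName());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

Unlike Arthas, this sketch only sees the JVM it runs in, so it would have to be embedded in the application under test.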
Location information such as the method name and line number is acquired dynamically, usually as follows: new Throwable() -> print the Throwable stack -> extract the topmost business code frame in the stack -> split the string to obtain the class, method, and line number.
Printing stacks this way, however, can cause significant performance loss.
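The following minimal sketch shows the kind of lookup described above and gives a rough feel for its cost. It is an assumption about what such logging code looks like, not the code from the application in question: the class name CallerLocation is hypothetical, and it reads StackTraceElement objects directly instead of printing the stack and splitting strings.

```java
// Hypothetical helper illustrating a Throwable-based caller lookup.
public class CallerLocation {

    /** Returns "class.method:line" for the first stack frame outside this helper class. */
    public static String locate() {
        // Capturing the stack trace is the expensive step: the JVM walks every frame.
        StackTraceElement[] stack = new Throwable().getStackTrace();
        for (StackTraceElement frame : stack) {
            if (!frame.getClassName().equals(CallerLocation.class.getName())) {
                return frame.getClassName() + "." + frame.getMethodName()
                        + ":" + frame.getLineNumber();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        int lookups = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < lookups; i++) {
            locate();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Timings vary by JVM and stack depth, so no expected value is given here.
        System.out.println(lookups + " lookups took " + elapsedMs + " ms");
    }
}
```

The deeper the call stack at the logging site, the more each lookup costs, which is why this pattern becomes visible under stress-test load.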
Scenario 2: Detecting Occasional Timeouts
There was a period when I kept running into occasional timeouts. However, the logs looked fine, and so did the EagleEye traces. No database operation or HSF call was particularly slow.
The elapsed time of the requests seemed quite normal according to statistics from the monitoring system. I could not find any abnormal response time (RT).
I suspected that logging might be the problem, but I had no evidence to support it.
I ran the trace command to monitor the elapsed time of each step, and used a conditional expression to print detailed output only when the elapsed time exceeded a threshold.
I then executed the command on one machine and waited. When the timeout recurred, the distribution of elapsed time was captured.
Based on the results from Arthas, I located the problem: log printing. Once I changed synchronous logs to asynchronous logs, the problem was solved.
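To show why moving from synchronous to asynchronous logging helps, here is a minimal sketch of the idea: the request thread only enqueues the message, and a background thread performs the slow write. This is an illustration under stated assumptions, not the fix applied in the actual application, which would normally rely on the asynchronous appender of its logging framework. The class name AsyncLogSketch and its methods are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of asynchronous logging with a bounded queue.
public class AsyncLogSketch implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final Thread worker;
    private volatile boolean running = true;

    /** sink stands in for the slow, blocking write (file, console, network). */
    public AsyncLogSketch(int capacity, Consumer<String> sink) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.worker = new Thread(() -> {
            try {
                // Keep draining until close() was called and the queue is empty.
                while (running || !queue.isEmpty()) {
                    String line = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (line != null) {
                        sink.accept(line);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-log-worker");
        worker.setDaemon(true);
        worker.start();
    }

    /** Never blocks the caller; returns false if the queue is full and the message is dropped. */
    public boolean log(String message) {
        return queue.offer(message);
    }

    /** Stops accepting work and waits for queued messages to be written. */
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        try (AsyncLogSketch logger = new AsyncLogSketch(1024, System.out::println)) {
            logger.log("request handled");
        }
    }
}
```

The trade-off in this sketch is that messages are dropped when the queue is full, so request latency no longer depends on how fast the log can be written.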
Scenario 3: Debug? What If It Is Generated with Dynamic Bytecode?
I once came across a problem where numbers in the output of JSON serialization were not quoted. I tried debugging and examining my code in various ways, and found that the serialization class was generated dynamically as bytecode by using ASM. I gave up on debugging entirely, as it was not effective in locating the problem.
Instead, the problem is easy to locate and troubleshoot by using the Arthas jad command to decompile the classes generated from dynamic bytecode, together with other commands such as watch.
The jad command decompiles a given loaded class back into source code.
You can also use the mc (Memory Compiler) and redefine commands to update code online. Go on, start exploring!
Do these capabilities make us omnipotent? No. Let’s take a look at the following scenario.
Scenario 4: Some Tricks
While troubleshooting, I found that logs were being written to the console, which caused significant performance loss. Was there a quick way to solve the problem without republishing the application?
Find the Corresponding Class
sc -d ch.qos.logback.core.ConsoleAppender
Obtain the Class Property Information and Find the Appender List
ognl -c 5f205aa '@org.slf4j.LoggerFactory@getLogger("root").aai.appenderList'
Delete Standard Output Appender
ognl -c 5f205aa '@org.slf4j.LoggerFactory@getLogger("root").aai.appenderList.remove(0)'
A Magical Tool: Flame Graphs
When troubleshooting performance problems, use flame graphs, a magical tool that clearly displays how much time each method consumed over a given period.
Get Started with Arthas
Approach 1: Use Cloud Toolkit for One-click Remote Diagnostics with Arthas
Released by Alibaba Cloud, Cloud Toolkit is a free local IDE plug-in that helps developers efficiently develop, test, diagnose, and deploy applications. Use the plug-in to deploy local applications with one click to any server or to the cloud, such as Elastic Compute Service (ECS), Enterprise Distributed Application Service (EDAS), Alibaba Cloud Container Service for Kubernetes (ACK), Alibaba Cloud Container Registry (ACR), and Mini Program Cloud. It also includes built-in tools such as Arthas Diagnostics, Dubbo, Terminal, File Upload, Function Compute, and MySQL Executor. In addition to IntelliJ IDEA, the mainstream version, versions for Eclipse, PyCharm, and Maven are also available.
We recommend using the IDEA plug-in to download Cloud Toolkit and use Arthas.
Approach 2: Directly Download Arthas
Download Arthas by using the URL: https://github.com/alibaba/arthas