By Bruce Wu
Performance diagnosis is a problem that software engineers frequently face and solve in their daily work. Today, user experience matters the most, and improving application performance can provide significant advantages. Java is one of the most popular programming languages, and its performance diagnosis has long attracted attention across the industry. Many factors can cause performance problems in Java applications, including thread control, disk I/O, database access, network I/O, and garbage collection (GC). You need an excellent performance diagnostic tool to find these problems. This article describes some common Java performance diagnostic tools, and highlights the basic principles and best practices of JProfiler, an excellent representative of these tools. The JProfiler version discussed in this article is
Brief Introduction to Java Performance Diagnostic Tools
Various performance diagnostic tools are available in the Java world, including simple command line tools, such as jmap and jstat, and comprehensive graphical diagnostic tools, such as JVisualvm and JProfiler. The following section briefly describes each of these tools.
Simple Command Line Tools
The Java Development Kit (JDK) offers many built-in command line tools. They can help you obtain information of the target Java virtual machine (JVM) from different aspects and different layers.
- jinfo — allows you to view and tune various parameters of the target JVM in real-time.
- jstack — allows you to obtain thread stack information of the target Java process, detect deadlocks, and locate infinite loops.
- jmap — allows you to obtain memory-related information of the target Java process, including the usage of different Java heaps, statistical information of objects in Java heaps, and loaded classes.
- jstat — is a light-weight versatile monitoring tool. It allows you to obtain various information about loaded classes, Just-In-Time (JIT) compilation, garbage collection, and memory usage of the target Java process.
- jcmd — is more comprehensive than jstat. It allows you to obtain various information about the performance statistics, Java Flight Recorder (JFR), memory usage, garbage collection, thread stacks, and JVM runtime of the target Java process.
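Much of what these command line tools report is also exposed programmatically through the standard java.lang.management API, which can be useful for quick in-process checks. A minimal sketch (illustrative only; the real tools read this data from outside the process):

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class JvmSelfInspection {
    public static void main(String[] args) {
        // Roughly what jstat -class reports: loaded class counts.
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        System.out.println("Loaded classes: " + cl.getLoadedClassCount());

        // Roughly what jstat -gcutil reports: heap usage.
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        System.out.println("Heap used (bytes): " + mem.getHeapMemoryUsage().getUsed());

        // Roughly what jstat -gc reports: GC counts per collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + " collections: " + gc.getCollectionCount());
        }

        // Roughly what jstack summarizes: live thread count.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("Live threads: " + threads.getThreadCount());
    }
}
```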
Comprehensive Graphical Diagnostic Tools
You can use any one, or a combination, of the above command line tools to obtain basic performance information about the target Java application. However, they have the following drawbacks:
- They cannot obtain method-level analytical data, such as the call relationships between methods and how often and how long each method is called. Such data is very important for identifying the performance bottlenecks of your application.
- You must log on to the host machine of the target Java application to use them, which is not very convenient.
- The analytical data is generated at a terminal, and the results are not intuitively displayed.
The following are several comprehensive graphical performance diagnostic tools:
JVisualvm is a built-in visual performance diagnostic tool provided by the JDK. It obtains analytical data of the target JVM, including CPU usage, memory usage, threads, heaps, and stacks, by using various methods such as JMX, jstatd, and Attach API. In addition, it can intuitively display the quantity and size of each object in Java heaps, the number of times that a Java method is called, and the duration that a Java method is executed.
JProfiler is a Java application performance diagnostic tool developed by ej-technologies. It focuses on addressing four important topics:
- Method calls — The analysis of method calls helps you to understand what your application is doing and find ways to improve its performance.
- Allocations — Through analyzing objects on the heap, reference chains, and garbage collection, this functionality enables you to fix memory leaks and optimize memory usage.
- Thread and lock — JProfiler provides multiple analysis views on threads and locks to help you discover multithreading problems.
- High-level subsystems — Many performance problems occur on a higher semantic level. For example, with Java Database Connectivity (JDBC) calls, you probably want to find out which SQL statement is the slowest. JProfiler supports integrated analysis of these subsystems.
Distributed Application Performance Diagnosis
If you only need to diagnose performance bottlenecks of standalone Java applications, the diagnostic tools described above are enough to meet your needs. However, as modern systems gradually evolve from standalone applications into distributed systems and microservices, these tools can no longer meet the requirements. Instead, we need the end-to-end tracing function of distributed tracing systems such as Jaeger, ARMS, and SkyWalking. A variety of distributed tracing systems are available in the market, but their implementation mechanisms are similar: they record tracing information through code instrumentation, transmit the recorded data to central processing systems through SDKs or agents, and provide query interfaces to display and analyze the results. For more information about the principles of distributed tracing systems, see the article titled OpenTracing Implementation of Jaeger.
Introduction to JProfiler
JProfiler consists of a JProfiler agent to collect analytical data from the target JVM, a JProfiler UI to visually analyze data, and command line utilities to provide various functions. The big picture of all important interactions between them is given as follows.
The JProfiler agent is implemented as a native library. You can load it at the startup of a JVM by using the parameter
-agentpath:<path to native library>, or while the application is running by using the JVM Attach mechanism. After the JProfiler agent is loaded, it sets up the JVM Tool Interface (JVMTI) environment to monitor all kinds of events generated by the JVM, such as thread creation and class loading. For example, when it detects a class loading event, the JProfiler agent inserts its own bytecode into these classes to perform its measurements.
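JProfiler's actual injected bytecode is internal to the product, but conceptually the transformation is equivalent to wrapping each method body with timing calls. A hand-written Java analog (illustrative only, not JProfiler code):

```java
public class TimingInstrumentationSketch {
    // Accumulated time and invocation count for one "instrumented" method.
    static long totalNanos = 0;
    static long invocations = 0;

    // The original business method.
    static int computeSum(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    // What the method conceptually looks like after instrumentation: timing
    // calls are wrapped around the original body. JProfiler does this at the
    // bytecode level, transparently, when the class is loaded.
    static int computeSumInstrumented(int n) {
        long start = System.nanoTime();
        try {
            return computeSum(n);
        } finally {
            totalNanos += System.nanoTime() - start;
            invocations++;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) computeSumInstrumented(10_000);
        System.out.println("calls=" + invocations + " avgNanos=" + (totalNanos / invocations));
    }
}
```

This also makes the overhead of instrumentation tangible: every call pays for two clock reads and two counter updates, which is why filters matter.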
The JProfiler UI is started separately and connects to the profiling agent through a socket. This means that it is actually irrelevant if the profiled JVM is running on the local machine or on a remote machine — the communication mechanism between the profiling agent and the JProfiler UI is always the same.
From the JProfiler UI, you can instruct the agent to record data, display the profiling data in the UI, and save snapshots to disk.
Command Line Tools
JProfiler provides a series of command line tools to implement different functions.
- jpcontroller — is used to control how the agent collects data. It sends instructions to the agent through the JProfiler MBean registered by the agent.
- jpenable — is used to load the agent to a running JVM.
- jpdump — is used to capture the heap snapshots of a running JVM.
- jpexport & jpcompare — are used to extract data from previously saved snapshots and create HTML reports.
JProfiler supports diagnosing the performance of both local and remote Java applications. If you need to collect and display analytical data of a remote JVM in real time, perform the following steps:
- Install JProfiler UI locally.
- Install JProfiler agent on the remote host machine, and load it to the target JVM.
- Connect JProfiler UI to the agent.
This section shows you how to use JProfiler to diagnose the performance of Alibaba Cloud LOG Java Producer (the “Producer”), a LogHub class library. If you encounter any performance problems with your application or when you use Producer, you can take similar measures to find the root cause. If you are not familiar with Producer, we recommend that you read this article first: Alibaba Cloud LOG Java Producer — A powerful tool to migrate logs to the cloud.
For sample code used in this section, see SamplePerformance.java.
Data Collection Mode
JProfiler provides two data collection methods: sampling and instrumentation.
- Sampling — suitable for scenarios that do not require high data collection accuracy. The advantage of this method is that it has low impact on system performance. The drawback is that it does not support some features, such as method-level statistics.
- Instrumentation — the complete data collection mode that supports high accuracy. The drawback is that it must analyze many classes and it has a relatively heavy impact on the application performance. To reduce the impact, you may want to use it with a filter.
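To illustrate the trade-off, here is a toy sampling profiler (an illustrative sketch, not how JProfiler is implemented). It periodically captures a worker thread's stack and tallies the top frame: no bytecode is modified, so the overhead is low, but it can only estimate time proportions, never exact call counts.

```java
import java.util.HashMap;
import java.util.Map;

public class ToySamplingProfiler {
    // Counts of which method was on top of the worker's stack per sample.
    static final Map<String, Integer> samples = new HashMap<>();
    static volatile double sink;

    public static void main(String[] args) {
        // A worker that spends most of its time in hotMethod().
        Thread worker = new Thread(ToySamplingProfiler::workload, "worker");
        worker.setDaemon(true);
        worker.start();

        // Sampling loop: capture the worker's stack every 2 ms and count
        // the top frame. The hottest method accumulates the most samples.
        try {
            for (int i = 0; i < 200; i++) {
                StackTraceElement[] stack = worker.getStackTrace();
                if (stack.length > 0) {
                    samples.merge(stack[0].getMethodName(), 1, Integer::sum);
                }
                Thread.sleep(2);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("Sample counts by top frame: " + samples);
    }

    static void workload() {
        while (true) sink = hotMethod();
    }

    static double hotMethod() {
        double x = 0;
        for (int i = 0; i < 100_000; i++) x += Math.sqrt(i);
        return x;
    }
}
```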
In this example, we need to obtain method-level statistical information, so we choose the instrumentation method. A filter is configured so that the agent records CPU data only for specific classes, such as
com.aliyun.openservices.log.Client.
Startup Modes of the Application
You can specify different parameters for the JProfiler agent to control the startup mode of the application.
- Wait for connection from the JProfiler GUI — The application is started only after the JProfiler GUI establishes a connection with the profiling agent and the profiling settings are transmitted. With this option, you can profile the startup phase of your application. The command that you can use to enable this option:
-agentpath:<path to native library>=port=8849.
- Start immediately, connect later with the JProfiler GUI — The JProfiler GUI establishes a connection with the profiling agent when needed, and transmits the profiling settings at that point. This option is flexible, but it cannot profile the startup phase of your application. The command that you can use to enable this option:
-agentpath:<path to native library>=port=8849,nowait.
- Profile offline, JProfiler cannot connect — You have to configure triggers that record data and save snapshots that can be opened with the JProfiler GUI later on. The command that you can use to enable this option:
-agentpath:<path to native library>=offline,id=xxx,config=/config.xml.
In the testing environment, we need to determine the performance of the application during the startup phase. So we use the default WAIT option here.
Use JProfiler to Diagnose the Application Performance
After completing the profiling settings, you can proceed with the performance diagnosis for Producer.
On the overview page, we can clearly view graphs (telemetries) about various metrics, such as memory, GC activities, classes, threads, and the CPU load.
We can make the following assumptions based on this telemetry:
- A large number of objects are generated while the application is running. These objects have very short lifecycles, and most of them are promptly reclaimed by the garbage collector, so they do not cause memory usage to grow continually.
- As expected, the number of loaded classes increases rapidly during the startup period, and stabilizes afterward.
- When the application is running, many threads are blocked. Extra attention must be paid to this issue.
- At the startup phase of the application, the CPU usage is high. We need to find out why.
The CPU views display the number of executions, the execution time, and the call relationships of each method in the application. They are helpful in locating the methods that have the most significant impact on the performance of your application.
The Call Tree view clearly shows the hierarchical call relationships between methods as a tree graph. In addition, JProfiler sorts sub-methods by their total execution time, which allows you to quickly locate key methods.
For Producer, the method
SendProducerBatchTask.run() takes most of the time. If you continue to look down the tree, you can find which of its sub-methods consume most of that time.
If you have many application methods, and many of your sub-methods are executed in a short interval, the Hot Spots view can help you quickly locate performance problems. This view can sort the methods based on various factors, such as their individual execution time, total execution time, average execution time, and number of calls. The individual execution time is equal to the total execution time of the method minus the total execution time of all sub-methods.
In this view, you can see that methods such as
SamplePerformance$1.run() take the most time to execute individually.
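The self-time calculation described above can be illustrated with a tiny call-tree sketch. The class below and its timing numbers are hypothetical, chosen only to show the arithmetic:

```java
import java.util.ArrayList;
import java.util.List;

public class SelfTimeDemo {
    // A minimal call-tree node: method name, total time, children.
    static class Node {
        final String method;
        final long totalMs;          // time including sub-methods
        final List<Node> children = new ArrayList<>();

        Node(String method, long totalMs) {
            this.method = method;
            this.totalMs = totalMs;
        }

        // Individual ("self") time = total time minus the total time of
        // all sub-methods -- the quantity the Hot Spots view ranks by.
        long selfMs() {
            long childTotal = 0;
            for (Node c : children) childTotal += c.totalMs;
            return totalMs - childTotal;
        }
    }

    public static void main(String[] args) {
        // Hypothetical numbers, for illustration only.
        Node run = new Node("SendProducerBatchTask.run()", 1000);
        Node putLogs = new Node("Client.PutLogs()", 900);
        run.children.add(putLogs);
        System.out.println(run.method + " self time: " + run.selfMs() + " ms");       // 100 ms
        System.out.println(putLogs.method + " self time: " + putLogs.selfMs() + " ms"); // 900 ms
    }
}
```

A method with a large total time but small self time is merely a caller of the real hot spot; a large self time marks the method whose own body needs optimizing.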
After you find the key methods, the Call Graph view presents all methods that directly interact with them. This is helpful in finding the solution to the problem and in developing the best performance optimization policy.
Here, we can see that most of the execution time of the method
Client.PutLogs() is spent on object serialization. Therefore, the key to performance optimization is to provide a more efficient serialization method.
The Live Memory views give you detailed information about memory allocation and usage, helping you determine whether there is a memory leak.
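A classic leak pattern that these views help catch is a static collection that only ever grows. A minimal sketch (hypothetical code, not from Producer):

```java
import java.util.ArrayList;
import java.util.List;

public class LeakSketch {
    // Classic leak pattern: a static collection that only ever grows.
    // In the All Objects view, the leaked type's instance count climbs
    // steadily and never drops, even after garbage collections.
    static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest(int id) {
        // Pretend each request caches a buffer and never evicts it.
        CACHE.add(new byte[1024]);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) handleRequest(i);
        // All 10,000 buffers are still strongly reachable, so the GC
        // cannot reclaim them and heap usage grows with every request.
        System.out.println("Retained buffers: " + CACHE.size());
    }
}
```

In the Live Memory views, the reference chain from such objects back to the static field is exactly what pinpoints the leaking code.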
The All Objects view shows the number and total size of various objects in the current heap. As you can see from the following figure, a large number of LogContent objects are created during the running process of the application.
Allocation Call Tree
The Allocation Call Tree view shows, as a tree diagram, the amount of memory allocated by each method. As you can see, methods such as
SendProducerBatchTask.run() are large memory consumers.
Allocation Hot Spots
If you have many methods, you can quickly find out which method has been assigned the most objects in the Allocation Hot Spots view.
The Thread History view intuitively shows the status of each thread at different time points.
Tasks run by different threads have different state features.
- The threads pool-1-thread-<M> periodically call the Producer.send() method to asynchronously send data. They kept running while the application was starting, but were mostly blocked afterwards. The cause of this phenomenon is that Producer sends data more slowly than the data is generated, and the cache size of each Producer instance is limited. During application startup, Producer had enough free memory to cache the data waiting to be sent, so pool-1-thread-<M> remained running for a while. This explains why the application had high CPU usage at startup. As time passed, the cache became fully occupied, and pool-1-thread-<M> had to wait until Producer released sufficient space. That is why a large number of threads were blocked.
- The thread aliyun-log-producer-0-mover detects expired batches and sends them to the IOThreadPool. Because data accumulates quickly, a producer batch is sent to the IOThreadPool by pool-1-thread-<M> as soon as the cached data size reaches the upper limit. Therefore, the mover thread remained idle most of the time.
- The threads aliyun-log-producer-0-io-thread-<N> send data from the IOThreadPool to your designated Logstore, and spend most of their time in the network I/O state.
- The thread aliyun-log-producer-0-success-batch-handler handles batches that were successfully sent to the Logstore. The callback is simple and takes very little time to execute. Therefore, the SuccessBatchHandler remained idle most of the time.
- The thread aliyun-log-producer-0-failure-batch-handler handles batches that failed to be sent to the Logstore. In our case, no data failed to be sent, so this thread remained idle all the time.
According to our analysis, the statuses of these threads are within our expectation.
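The blocking behavior described above can be reproduced with a minimal bounded-buffer sketch. The code is hypothetical, using java.util.concurrent rather than the actual Producer implementation (which limits total cached bytes, not item count):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedProducerSketch {
    static long elapsedMs;

    public static void main(String[] args) {
        // A small bounded buffer standing in for the Producer's cache.
        BlockingQueue<String> cache = new ArrayBlockingQueue<>(4);

        // Slow consumer: drains one item every 10 ms, like the I/O threads
        // that ship batches over the network.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    cache.take();
                    Thread.sleep(10);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Fast producer: the first few put() calls return immediately while
        // the cache has free space (the busy RUNNABLE phase at startup).
        // Once the cache fills, put() blocks until the consumer frees a
        // slot -- the same reason pool-1-thread-<M> shows as blocked later.
        try {
            long start = System.nanoTime();
            for (int i = 0; i < 20; i++) {
                cache.put("batch-" + i);
            }
            elapsedMs = (System.nanoTime() - start) / 1_000_000;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println("Producing 20 items took ~" + elapsedMs
                + " ms (mostly spent blocked on a full cache)");
    }
}
```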
Overhead Hot Spots Detected
After the application finishes running, JProfiler displays a dialog box showing frequently called methods with very short execution times. Next time, you can configure the JProfiler agent to ignore these methods to reduce JProfiler's impact on application performance.
Based on the JProfiler diagnosis, the application does not have any significant performance problems or memory leaks. The next step of optimization is to improve the serialization efficiency of objects.