Troubleshooting Common Java Performance Problems

Sep 2, 2020

By Chang Haiyun (Yida)

This article describes how to troubleshoot and fix common performance issues and faults that occur when using Java. It also gives several helpful practical methods.

Problem 1: High CPU Utilization

CPU utilization is an important metric for measuring how busy the system is. Generally, high CPU utilization is not a problem because it indicates that the system is continuously processing tasks. However, if the CPU utilization is so high that tasks are piling up and causing a high system load, it becomes dangerous to the system and requires troubleshooting. There is no standard metric value for safe CPU utilization because CPU utilization varies depending on whether your system is compute-intensive or I/O-intensive. Generally, a compute-intensive system has a higher CPU utilization and lower load. This is the opposite for an I/O-intensive system.

Causes and Troubleshooting

1) Frequent Full Garbage Collection (GC) or Young GC

Check the GC log.
Run jstat -gcutil pid to view memory usage and GC condition.
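
When attaching jstat to the process is inconvenient, similar GC counters can be read from inside the JVM. The following is a minimal sketch that uses the standard java.lang.management API. The class name GcStats and the output format are illustrative assumptions, not part of the procedure above.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcStats {
    /** Returns one line per collector: name, collection count, accumulated time in ms. */
    public static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            lines.add(String.format("%s count=%d timeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```

Calling snapshot() twice a few seconds apart and comparing the counts gives a rough view of how often each collector runs. Collector names depend on the GC algorithm the JVM is using.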

2) Abnormal code-related CPU consumption, such as the consumption caused by endless loops, MD5 operations, and other memory operations

3) Troubleshoot with Arthas

Run thread -n 5 to view the top 5 threads with the highest CPU utilization, including the stack. See the following section for details.

4) Troubleshoot with the jstack command

Run ps -ef | grep java to retrieve the Java process ID.
Run top -Hp pid to identify the thread with the highest CPU utilization.
Run printf '0x%x' tid to convert the thread ID to the hexadecimal format.
Run jstack pid | grep tid, using the hexadecimal thread ID (it appears as nid=0x... in the jstack output), to identify the thread stack.

Note: You can enter “1” while the top command is running to view the status of each CPU. We have seen a case in which a CPU was bound to middleware, causing a spike in CPU utilization.
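
The hexadecimal conversion in the jstack steps, and a rough in-process counterpart of "find the busiest thread", can be sketched in Java as follows. The class and method names are hypothetical. Note two assumptions: ThreadMXBean reports Java thread IDs, not the operating system thread IDs shown by top, and it reports CPU time accumulated since the thread started rather than the current utilization.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BusyThreads {
    /** Same conversion as printf '0x%x' tid: decimal ID to the hex form used in jstack's nid field. */
    public static String toHex(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    /** Returns the name of the live thread with the most accumulated CPU time, or null if unavailable. */
    public static String busiestThreadName() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return null;
        }
        long bestId = -1;
        long bestNanos = -1;
        for (long id : mx.getAllThreadIds()) {
            long nanos = mx.getThreadCpuTime(id); // -1 if the thread has already died
            if (nanos > bestNanos) {
                bestNanos = nanos;
                bestId = id;
            }
        }
        ThreadInfo info = bestId < 0 ? null : mx.getThreadInfo(bestId);
        return info == null ? null : info.getThreadName();
    }

    public static void main(String[] args) {
        System.out.println(toHex(6719));
        System.out.println(busiestThreadName());
    }
}
```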

Problem 2: High CPU Load

The CPU load refers to the number of active processes per unit time, including processes in running states (runnable and running) and uninterruptible states (waiting on an I/O lock or a kernel lock). The keywords here are “running states” and “uninterruptible states.” The running states can be understood through the six states of a Java thread, as shown in the following figure. After being initialized, a thread is in the new state. After being started, it enters the runnable state and waits for CPU scheduling. When the CPUs are busy, the number of processes in the runnable state keeps increasing. The uninterruptible states include waiting on a network I/O lock, a disk I/O lock, or a kernel lock, for example, when a thread is waiting to enter a synchronized block.

Causes and Troubleshooting

1) High CPU utilization with a large number of processes in the runnable state

To troubleshoot, see Problem 1.

2) High iowait value for pending I/O

Run vmstat to check for blocked processes.
Run jstack -l pid | grep BLOCKED to check the blocked thread stack.

3) Threads waiting for a kernel lock to be released, for example, threads blocked on a synchronized block

Run jstack -l pid | grep BLOCKED to check the blocked thread stack.
Use profiler to open the dump file of the thread stack and analyze the locking condition of threads.
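
As a complement to grepping a jstack dump, the distribution of thread states can also be counted inside the JVM. The following is a minimal sketch under the assumption that the code runs in the affected process. The class name ThreadStateSummary is hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {
    /** Counts live threads per Thread.State (NEW, RUNNABLE, BLOCKED, WAITING, TIMED_WAITING, TERMINATED). */
    public static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A large and growing BLOCKED count suggests lock contention.
        System.out.println(countByState());
    }
}
```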

Problem 3: Constant Full GC

Before we learn about the causes for a Full GC, let’s review Java Virtual Machine (JVM) memory.

Memory Model

New objects are placed in the Eden space. When the Eden space becomes full, it triggers a Minor GC and moves living objects to S0.

Later, when the Eden space becomes full again, it triggers another Minor GC and moves both the living objects in Eden and the objects in S0 to S1. As a result, either S0 or S1 is always empty.

This cycle repeats until S0 or S1 is about to be full. Objects inside the full space will be moved to the old generation. When the old generation also becomes full, a Full GC is triggered.

For versions earlier than JDK 1.7, Java class information, the constant pool, and static variables are stored in the permanent generation. The metadata and static variables of a class are loaded into the permanent generation when the class is loaded and are cleared when the class is unloaded. In JDK 1.8, the metaspace replaces the permanent generation and uses native memory. In addition, the constant pool and static variables are moved to the heap space. To some extent, this solves the Full GC problem that occurs when a large number of classes are generated or loaded during runtime, for example, during reflection, proxy, and Groovy operations.
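
To see how these spaces appear in a running JVM, the memory pools can be listed through the standard management API. This is a minimal sketch. The class name MemoryPools is hypothetical, and the pool names it prints (for example, an Eden, survivor, old generation, or Metaspace pool) vary with the JDK version and garbage collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

public class MemoryPools {
    /** Formats one pool as: [type] name used=<bytes> max=<bytes>. */
    public static String describe(MemoryPoolMXBean pool) {
        MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
        long used = usage == null ? -1 : usage.getUsed();
        long max = usage == null ? -1 : usage.getMax(); // -1 means no defined maximum
        return String.format("[%s] %s used=%d max=%d",
                pool.getType(), pool.getName(), used, max);
    }

    /** Number of pools that belong to the Java heap (young and old generation spaces). */
    public static long heapPoolCount() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(p -> p.getType() == MemoryType.HEAP)
                .count();
    }

    public static void main(String[] args) {
        ManagementFactory.getMemoryPoolMXBeans()
                .forEach(p -> System.out.println(describe(p)));
    }
}
```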

Garbage Collector

The young generation often uses the ParNew collector, which uses a copying algorithm and collects with multiple threads in parallel.

The old generation often uses the Concurrent Mark Sweep (CMS) collector. Its mark-sweep algorithm incurs memory fragmentation, and because collection is concurrent, user threads keep generating objects while the collector runs.

Key Common Parameters

CMSInitiatingOccupancyFraction indicates the old generation occupancy percentage at which a CMS collection of the old generation is triggered.
UseCMSCompactAtFullCollection indicates that the old generation memory is defragmented after a Full GC to avoid memory fragmentation.

Causes and Troubleshooting

1) Promotion Failed

Objects promoted from the survivor (S) space do not fit in the old generation, triggering a Full GC. If the Full GC fails, an out-of-memory (OOM) error is thrown.

The survivor space is too small and the objects enter the old generation too early.

Run jstat -gcutil pid 1000 to check the running condition of memory.
Run jinfo pid to check the SurvivorRatio parameter.

The capacity of memory is insufficient for allocating large objects.

Search for the “allocating large” keywords in the log.
Use profiler to view the memory status and the distribution of large objects.

The old generation contains a large number of objects.

Run jmap -histo pid | sort -n -r -k 2 | head -10 to retrieve the top 10 classes with the greatest number of instances.
Run jmap -histo pid | sort -n -r -k 3 | head -10 to retrieve the top 10 classes whose instances occupy the most memory in bytes.
In the heap dump file, use profiler to analyze the memory usage of different objects.

2) Concurrent Mode Failed

During a CMS GC, business threads run out of old generation memory while objects are being moved into the old generation. This is a common failure of concurrent collection, because user threads keep running while the collector works.

Potential Causes

1) The threshold for triggering a Full GC is set too high, so the old generation is already highly occupied when collection starts. Meanwhile, user threads keep generating objects during concurrent collection, which pushes the occupancy to the point where a Full GC is triggered.

Run the jinfo command to check that the value of the CMSInitiatingOccupancyFraction parameter ranges from 70 to 80.

2) Memory fragmentation occurs in the old generation.

Run the jinfo command to check that the UseCMSCompactAtFullCollection parameter is enabled, so that memory is defragmented after a Full GC.
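
The parameter checks done with jinfo can also be approximated from inside the process. The sketch below assumes a HotSpot-based JVM, because it relies on com.sun.management.HotSpotDiagnosticMXBean. The class name VmFlags is hypothetical. CMS-specific flags do not exist in recent JDK versions that have removed CMS, in which case the lookup returns an empty result.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

public class VmFlags {
    /** Returns the current value of a JVM flag, or empty if the flag or the MXBean is unavailable. */
    public static Optional<String> read(String flagName) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty();
            }
            return Optional.of(bean.getVMOption(flagName).getValue());
        } catch (IllegalArgumentException e) {
            // Thrown when this JVM does not know the flag (or the MXBean type).
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (String flag : new String[] {
                "SurvivorRatio", "CMSInitiatingOccupancyFraction", "UseCMSCompactAtFullCollection"}) {
            System.out.println(flag + " = " + read(flag).orElse("(not available in this JVM)"));
        }
    }
}
```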

Problem 4: Full Thread Pool

Take a Java thread pool with a bounded queue as an example. When a new task is submitted and fewer than corePoolSize threads are running, a new thread is created to process it. Once the number of running threads reaches corePoolSize, new tasks are queued until the queue becomes full. When the queue is full, additional threads are created to process tasks, but the number of threads does not exceed maximumPoolSize. When the task queue is full and the maximum number of threads is reached, ThreadPoolExecutor rejects further tasks.
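
The saturation behavior described above can be reproduced with a small experiment. This is a sketch, not production code: the class name PoolSaturationDemo and the pool sizes are illustrative assumptions, and it relies on the default rejection policy (AbortPolicy), which throws RejectedExecutionException.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSaturationDemo {
    /** Submits blocking tasks and returns how many the pool rejected. */
    public static int countRejections(int core, int max, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    // Each task blocks until released, simulating a slow downstream call.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // queue is full and maximumPoolSize is reached
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 1 core thread + 1 queued task + 1 extra thread = 3 accepted, the 4th is rejected.
        System.out.println(countRejections(1, 2, 1, 4));
    }
}
```

The pool accepts at most maximumPoolSize plus the queue capacity of tasks at once, which is why slow downstream calls fill it quickly.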

Causes and Troubleshooting

1) The downstream response time (RT) is high and the timeout period is inappropriate.

Business Monitoring
Sunfire
EagleEye

2) Slow SQL queries or database deadlock occurs.

Search for the keywords “Deadlock found when trying to get lock” in the log.
Use the jstack or zprofiler command to identify blocked threads.

3) Java code deadlock occurs.

Run jstack -l pid | grep -i -E 'BLOCKED|deadlock' to check for deadlocks.
In the thread dump file, use zprofiler to analyze blocked threads and locks.
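
Java-level deadlocks can also be detected programmatically with ThreadMXBean, which reports the same kind of information that jstack prints. The sketch below is illustrative only: the class name DeadlockProbe is hypothetical, and the demo method deliberately deadlocks two daemon threads to show a positive result.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockProbe {
    /** Returns the number of threads that are currently deadlocked (0 if none). */
    public static int deadlockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()          // monitors and java.util.concurrent locks
                : mx.findMonitorDeadlockedThreads();  // monitors only
        return ids == null ? 0 : ids.length;
    }

    /** Creates a deliberate two-thread deadlock and returns the detected thread count. */
    public static int demo() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        lockInOrder(a, b, bothHoldFirstLock);
        lockInOrder(b, a, bothHoldFirstLock);
        try {
            // Poll for up to about 5 seconds until the deadlock is visible.
            for (int i = 0; i < 100 && deadlockedThreadCount() < 2; i++) {
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return deadlockedThreadCount();
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch ready) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                ready.countDown();
                try {
                    ready.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // Never reached: each thread waits for the lock the other one holds.
                }
            }
        });
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + demo());
    }
}
```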

Problem 5: NoSuchMethodException

Causes and Troubleshooting

1) JAR Package Conflict

When Java loads all JAR packages under the same directory, the loading order fully depends on the operating system.

Run mvn dependency:tree and analyze the version of the JAR package with the error. If conflicted JAR package versions are found, always leave the one with the later version while removing the other.
Similarly, use the Arthas command sc -d ClassName or the JVM option -XX:+TraceClassLoading to check for class conflicts.
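
When neither tool is at hand, the location a class was loaded from, and whether several copies of it are on the classpath, can be checked in code. The following is a minimal sketch. The class name ClassOrigin and its methods are hypothetical helpers, and classes loaded by the bootstrap class loader report no code source.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.List;

public class ClassOrigin {
    /** Converts a class to its resource path, for example com/foo/Bar.class. */
    public static String resourceName(Class<?> c) {
        return c.getName().replace('.', '/') + ".class";
    }

    /** Returns the JAR or directory the class was actually loaded from. */
    public static String loadedFrom(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "bootstrap or unknown"
                : src.getLocation().toString();
    }

    /** Lists every copy of the class file visible to its class loader; more than one hints at a conflict. */
    public static List<URL> allCopies(Class<?> c) {
        ClassLoader loader = c.getClassLoader() != null
                ? c.getClassLoader()
                : ClassLoader.getSystemClassLoader();
        try {
            return Collections.list(loader.getResources(resourceName(c)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(loadedFrom(ClassOrigin.class));
        allCopies(ClassOrigin.class).forEach(System.out::println);
    }
}
```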

2) Same Classes

ClassNotFoundException
NoClassDefFoundError
ClassCastException

Frequently Used Tools

Common Commands

1) tail

-f: traces the file.

2) grep

-i: ignores case.
-v: performs reverse lookup.
-E: uses extended regular expressions, for example, grep -E 'pattern1|pattern2' filename.

3) pgm

-b: enables parallel processing.
-p: specifies the parallelism.
-A: enables askpass.

4) awk

-F: specifies the delimiter, for example, awk -F "|" '{print $1}' | sort -r | uniq -c.

5) sed

Time period matching: sed -n '/2020-03-02 10:00:00/,/2020-03-02 11:00:00/p' filename

Arthas

Arthas, Alibaba's open-source Java diagnostics tool, uses JavaAgent-based instrumentation to modify bytecode for Java application diagnosis.

Basic Functions

dashboard: shows a real-time dashboard that allows you to view information, such as threads, memory usage, and GC condition.
thread: shows JVM thread stack information, such as the top n busiest threads.
getstatic: retrieves static attribute values, for example, getstatic className attrName. It can be used to view the actual values of online toggles.
sc: retrieves the classes loaded to JVM. It can be used to check JAR package conflicts.
sm: retrieves method information of classes loaded in JVM.
jad: decompiles classes loaded in the JVM so you can check the code logic that is actually running when execution fails.
watch: collects the execution data of a method, including input parameters, return values, and exceptions.
watch xxxClass xxxMethod " {params, throwExp} " -e -x 2
watch xxxClass xxxMethod "{params,returnObj}" "params[0].sellerId.equals('189')" -x 2
watch xxxClass xxxMethod sendMsg '@com.taobao.eagleeye.EagleEye@getTraceId()'
trace: retrieves the internal call duration of a method and outputs the time consumed by each node. It can be used for performance analysis.
tt: records a method and plays it back.

Correcting Common Problems

1) The thread pool is full.

If the thread pool of the Remote Procedure Call (RPC) framework is full, limit the number of threads for interfaces with a high RT.
If the thread pool of an application is full, restart the application to temporarily alleviate the problem, but further correction action depends on the actual cause of the problem.

2) The CPU utilization and load are high.

Replace or restart the server to temporarily alleviate the problem, but further correction action depends on the actual cause of the problem.
Scale out if the load of the cluster is high and the traffic increases significantly, but further correction action depends on the actual cause of the problem.

3) The downstream RT is high.

Throttling
Degradation

4) Database Issues

For database deadlock, kill the problem process.
For slow SQL queries, perform SQL throttling.

The troubleshooting of online problems requires accumulated experience. To find the cause and eliminate the problem, you must understand the principles behind the problems. In addition, useful tools can help lower the threshold for troubleshooting and quick recovery.

The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.

Original Source:

Troubleshooting Common Java Performance Problems

Alibaba Clouder August 18, 2020 146 By Chang Haiyun (Yida) This article describes how to troubleshoot and fix common…
