Troubleshooting Common Java Performance Problems

By Chang Haiyun (Yida)

This article describes how to troubleshoot and fix common performance issues and faults that occur when using Java. It also gives several helpful practical methods.

Problem 1: High CPU Utilization

CPU utilization is an important metric for measuring how busy the system is. Generally, high CPU utilization is not a problem because it indicates that the system is continuously processing tasks. However, if the CPU utilization is so high that tasks are piling up and causing a high system load, it becomes dangerous to the system and requires troubleshooting. There is no standard metric value for safe CPU utilization because CPU utilization varies depending on whether your system is compute-intensive or I/O-intensive. Generally, a compute-intensive system has a higher CPU utilization and lower load. This is the opposite for an I/O-intensive system.

1) Frequent Full Garbage Collection (GC) or Young GC

  • Check the GC log.
  • Run jstat -gcutil pid to view memory usage and GC condition.
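
The GC counters can also be read from inside the application through the standard java.lang.management API. The following is a minimal sketch rather than part of the procedure above. The class name GcCounters is hypothetical, and the collector names in the output depend on which collectors the JVM is running.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcCounters {
    /** Returns one line per collector: name, collections so far, accumulated collection time. */
    public static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are cumulative since JVM start; -1 means the collector does not report them.
            lines.add(String.format("%s: count=%d timeMs=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```

Taking two snapshots a few seconds apart and comparing them shows how fast the counts grow. A rapidly rising count for the old generation collector suggests frequent Full GC.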

2) Abnormal CPU consumption caused by code, such as endless loops, MD5 operations, and other in-memory operations

3) Troubleshoot with Arthas

  • Run thread -n 5 to view the top 5 threads with the highest CPU utilization, including the stack. See the following section for details.

4) Troubleshoot with the jstack command

  • Run ps -ef | grep java to retrieve the Java process ID.
  • Run top -Hp pid to identify the thread with the highest CPU utilization.
  • Run printf '0x%x' tid to convert the thread ID to the hexadecimal format.
  • Run jstack pid | grep tid with the hexadecimal thread ID to locate the thread stack. The dump shows this ID in the nid field of each thread.

Note: While top is running, you can press “1” to view the status of each CPU. We have seen a case in which middleware was bound to a single CPU, causing a spike in the utilization of that CPU.
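
When attaching external tools is not convenient, a similar ranking can be approximated inside the JVM with ThreadMXBean. The sketch below is illustrative only, and the class name BusiestThreads is hypothetical. It ranks threads by CPU time accumulated since each thread started, not by current utilization as top does, and the Java thread IDs it uses are not the operating system thread IDs shown by top.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class BusiestThreads {
    /** Returns up to n live threads ordered by accumulated CPU time, highest first. */
    public static List<ThreadInfo> top(int n, int maxFrames) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<long[]> samples = new ArrayList<>(); // each entry: {threadId, cpuNanos}
        if (mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled()) {
            for (long id : mx.getAllThreadIds()) {
                long cpuNanos = mx.getThreadCpuTime(id); // -1 if the thread has already exited
                if (cpuNanos >= 0) {
                    samples.add(new long[] {id, cpuNanos});
                }
            }
        }
        samples.sort((a, b) -> Long.compare(b[1], a[1]));

        List<ThreadInfo> result = new ArrayList<>();
        for (long[] sample : samples) {
            if (result.size() >= n) {
                break;
            }
            ThreadInfo info = mx.getThreadInfo(sample[0], maxFrames);
            if (info != null) { // null if the thread exited in the meantime
                result.add(info);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (ThreadInfo info : top(5, 10)) {
            System.out.printf("\"%s\" state=%s%n", info.getThreadName(), info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```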

Problem 2: High CPU Load

The CPU load refers to the number of active processes per unit time, including processes in running states (runnable and running) and uninterruptible states (I/O lock and kernel lock). The keywords here are “running states” and “uninterruptible states.” The running states can be understood through the six states of a Java thread, as shown in the following figure. A thread is in the new state after it is initialized. After it is started, it enters the runnable state and waits for CPU scheduling. When the CPUs are busy, the number of processes in the runnable state keeps growing. The uninterruptible states include waiting on a network I/O lock, a disk I/O lock, or a kernel lock, for example, when a thread is waiting to enter a synchronized block.

[Figure: The six states of a Java thread]

1) High CPU utilization with a large number of processes in the runnable state

  • To troubleshoot, see Problem 1.

2) High iowait value for pending I/O

  • Run vmstat and check the b column for blocked processes.
  • Run jstack -l pid | grep BLOCKED to check the blocked thread stack.

3) Waiting for a kernel lock to be released, for example, when a thread is waiting to enter a synchronized block

  • Run jstack -l pid | grep BLOCKED to check the blocked thread stack.
  • Use a profiler to open the thread stack dump file and analyze how threads hold and wait for locks.
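
The same filter can be applied from inside the JVM. The following sketch, with the hypothetical class name BlockedThreads, uses ThreadMXBean.dumpAllThreads to list threads in the BLOCKED state together with the lock each one waits for and the thread that holds it. It assumes a JVM that supports monitor usage reporting, as HotSpot does.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

public class BlockedThreads {
    /** Returns the threads currently in the BLOCKED state, with lock details attached. */
    public static List<ThreadInfo> find() {
        List<ThreadInfo> blocked = new ArrayList<>();
        // true, true: include locked monitors and ownable synchronizers, similar to jstack -l.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                blocked.add(info);
            }
        }
        return blocked;
    }

    /** Formats one blocked thread with the lock it waits for and the owner of that lock. */
    public static String describe(ThreadInfo info) {
        return String.format("\"%s\" waits for %s held by \"%s\"",
                info.getThreadName(), info.getLockName(), info.getLockOwnerName());
    }

    public static void main(String[] args) {
        find().forEach(info -> System.out.println(describe(info)));
    }
}
```

Many threads that wait for the same lock owner usually point to one slow critical section rather than many independent problems.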

Problem 3: Constant Full GC

Before we learn about the causes for a Full GC, let’s review Java Virtual Machine (JVM) memory.

New objects are placed in the Eden space. When the Eden space becomes full, it triggers a Minor GC and moves living objects to S0.

Later, when the Eden space becomes full again, another Minor GC is triggered that moves the living objects from both Eden and S0 to S1. After each Minor GC, either S0 or S1 is empty.

This cycle repeats, with S0 and S1 swapping roles, until the survivor space in use is about to become full. Objects in that space are then moved to the old generation. When the old generation also becomes full, a Full GC is triggered.

For versions earlier than JDK 1.7, Java class information, the constant pool, and static variables are stored in the permanent generation. The metadata and static variables of a class are loaded into the permanent generation when the class is loaded and are cleared when the class is unloaded. In JDK 1.8, the metaspace replaces the permanent generation and uses native memory. In addition, the constant pool and static variables are moved to the heap space. To some extent, this solves the Full GC problem that occurs when a large number of classes are generated or loaded at runtime, for example, through reflection, proxies, and Groovy.
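
The layout described above can be inspected on a running JVM through its memory pools. The sketch below is illustrative, and the class name MemoryPools is hypothetical. Pool names vary by JVM version and collector. On a typical HotSpot JDK 1.8 or later, the Eden, survivor, and old generation pools are reported as HEAP and Metaspace as NON_HEAP.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class MemoryPools {
    /** Returns one line per memory pool: type, name, used size, and maximum size in KB. */
    public static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage == null) {
                continue;
            }
            long max = usage.getMax(); // -1 if no maximum is defined for this pool
            long maxKb = max < 0 ? -1 : max / 1024;
            lines.add(String.format("%s %s: usedKB=%d maxKB=%d",
                    pool.getType().name(), pool.getName(), usage.getUsed() / 1024, maxKb));
        }
        return lines;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```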

The young generation often uses the ParNew collector, which is based on a copying algorithm and collects with multiple threads in parallel.

The old generation often uses the Concurrent Mark Sweep (CMS) collector. Its mark-sweep algorithm incurs memory fragmentation, and because it collects concurrently with user threads, those threads keep generating objects while a collection is in progress.

  • CMSInitiatingOccupancyFraction indicates the old generation occupancy, as a percentage, at which a CMS collection of the old generation (called a Full GC in this article) is triggered.
  • UseCMSCompactAtFullCollection indicates that the old generation memory is defragmented after a Full GC to avoid memory fragmentation.

1) Promotion Failed

Objects promoted from the survivor space do not fit in the old generation, which triggers a Full GC. If the Full GC does not free enough space, an out-of-memory (OOM) error is thrown.

The survivor space is too small and the objects enter the old generation too early.

  • Run jstat -gcutil pid 1000 to sample memory usage and GC activity every 1,000 ms.
  • Run jinfo pid to check the SurvivorRatio parameter.

The capacity of memory is insufficient for allocating large objects.

  • Search for the “allocating large” keywords in the log.
  • Use a profiler to view the memory status and the distribution of large objects.

The old generation contains a large number of objects.

  • Run jmap -histo pid | sort -n -r -k 2 | head -10 to retrieve the top 10 classes with the greatest number of instances.
  • Run jmap -histo pid | sort -n -r -k 3 | head -10 to retrieve the top 10 classes whose instances occupy the most memory.
  • Use a profiler to open the heap dump file and analyze the memory usage of different objects.

2) Concurrent Mode Failed

During a CMS GC, a business thread runs out of memory when moving objects into the old generation. This is a common failure with concurrent collection.

The threshold for triggering a Full GC is set too high, so the old generation occupancy is already high when collection starts. Meanwhile, user threads keep generating objects during concurrent collection and exhaust the remaining space.

  • Run the jinfo command to check that the value of the CMSInitiatingOccupancyFraction parameter is between 70 and 80.

Memory fragmentation occurs in the old generation.

  • Run the jinfo command to check that the UseCMSCompactAtFullCollection parameter is enabled so that memory is defragmented after a Full GC.
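
The flags that the jinfo checks above look at can also be read from inside a HotSpot JVM. The sketch below uses com.sun.management.HotSpotDiagnosticMXBean, which is specific to HotSpot-based JDKs. The class name VmFlags is hypothetical. On JDKs that no longer include CMS, the two CMS flags are expected to be reported as unavailable.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class VmFlags {
    static final String UNAVAILABLE = "(not available on this JVM)";

    /** Returns the current value of a -XX flag, or a placeholder if the JVM does not know it. */
    public static String read(String flagName) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean != null) {
                return bean.getVMOption(flagName).getValue();
            }
        } catch (IllegalArgumentException e) {
            // Thrown for an unknown flag, or when this MXBean is not provided by the JVM.
        }
        return UNAVAILABLE;
    }

    public static void main(String[] args) {
        String[] flags = {
            "SurvivorRatio", "CMSInitiatingOccupancyFraction", "UseCMSCompactAtFullCollection"
        };
        for (String flag : flags) {
            System.out.println(flag + " = " + read(flag));
        }
    }
}
```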

Problem 4: Full Thread Pool

Take a Java thread pool with a bounded queue as an example. When a new task is submitted, if the number of running threads is less than corePoolSize, a new thread is created to process the task. If the number of running threads is equal to or greater than corePoolSize, new tasks are queued until the queue becomes full. When the queue is full, new threads are created to process newly submitted tasks, as long as the number of threads does not exceed maximumPoolSize. When the task queue is full and the maximum number of threads is reached, ThreadPoolExecutor rejects further tasks. With the default rejection policy, it throws RejectedExecutionException.
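
The sequence described above can be reproduced with a deliberately tiny pool. The following sketch assumes corePoolSize = 1, maximumPoolSize = 2, and a queue capacity of 1. These values and the class name BoundedPoolDemo are chosen only for illustration. Every task blocks on a latch so that no thread becomes free while tasks are being submitted.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPoolDemo {
    /** Submits four blocking tasks and records how the pool handles each one. */
    public static String run() {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 30, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        Runnable blocker = () -> {
            try {
                release.await(); // keep the worker busy until the demo is over
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        StringBuilder log = new StringBuilder();
        try {
            for (int i = 1; i <= 4; i++) {
                try {
                    pool.execute(blocker);
                    log.append("task").append(i)
                       .append(": threads=").append(pool.getPoolSize())
                       .append(" queued=").append(pool.getQueue().size()).append('\n');
                } catch (RejectedExecutionException e) {
                    // Default AbortPolicy: queue full and maximumPoolSize reached.
                    log.append("task").append(i).append(": rejected\n");
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.print(run());
    }
}
```

With these settings, the first task starts a core thread, the second is queued, the third starts a second thread because the queue is full, and the fourth is rejected.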

1) The downstream response time (RT) is high and the timeout period is inappropriate.

  • Business Monitoring
  • Sunfire
  • EagleEye

2) Slow SQL queries or database deadlocks occur.

  • Search for the keywords “Deadlock found when trying to get lock” in the log.
  • Use the jstack or zprofiler command to identify blocked threads.

3) Java code deadlock occurs.

  • Run jstack -l pid | grep -i -E 'BLOCKED|deadlock' to check for deadlock.
  • In the thread dump file, use zprofiler to analyze blocked threads and locks.
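
A deadlock between Java threads can also be detected programmatically with ThreadMXBean.findDeadlockedThreads. The sketch below is illustrative. The class name DeadlockCheck and the demo thread names are hypothetical, and startDeadlockedPair exists only to produce a deadlock to detect. It uses daemon threads so that the JVM can still exit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class DeadlockCheck {
    /** Returns the names of threads deadlocked on monitors or ownable synchronizers. */
    public static List<String> deadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
        List<String> names = new ArrayList<>();
        if (ids != null) {
            for (ThreadInfo info : mx.getThreadInfo(ids)) {
                if (info != null) {
                    names.add(info.getThreadName());
                }
            }
        }
        return names;
    }

    /** Demo only: starts two daemon threads that take the same two locks in opposite order. */
    public static void startDeadlockedPair() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startLocker("demo-1", a, b, bothHoldFirstLock);
        startLocker("demo-2", b, a, bothHoldFirstLock);
    }

    private static void startLocker(String name, Object first, Object second, CountDownLatch ready) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                ready.countDown();
                try {
                    ready.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) { // never acquired: the other thread holds it
                    System.out.println(name + " acquired both locks");
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit despite the deadlock
        t.start();
    }

    public static void main(String[] args) throws InterruptedException {
        startDeadlockedPair();
        Thread.sleep(500); // give both threads time to block
        System.out.println("Deadlocked: " + deadlockedThreadNames());
    }
}
```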

Problem 5: NoSuchMethodException

1) JAR Package Conflict

When Java loads all JAR packages under the same directory, the loading order fully depends on the operating system.

  • Run mvn dependency:tree and analyze the versions of the JAR package that causes the error. If conflicting versions are found, always keep the later version and exclude the others.
  • Similarly, run the Arthas command sc -d ClassName or start the JVM with -XX:+TraceClassLoading to check for class conflicts.
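
The same question, which JAR package a class was actually loaded from and whether other copies exist, can be answered with a few lines of Java. The sketch below is an illustration rather than a replacement for the tools above. The class name ClassOrigin is hypothetical, and it only inspects the class loader that loaded ClassOrigin itself, so results can differ in containers that use several class loaders.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.List;

public class ClassOrigin {
    /** Returns the JAR or directory a class was loaded from. */
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    /** Returns every copy of the class file that this class loader can see. */
    public static List<URL> copiesOnClasspath(String className) {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = ClassOrigin.class.getClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : ClassOrigin.class.getName();
        System.out.println("ClassOrigin loaded from: " + locationOf(ClassOrigin.class));
        // More than one URL here means the class exists in several places on the classpath.
        copiesOnClasspath(name).forEach(url -> System.out.println(name + " found at: " + url));
    }
}
```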

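2) Same Classes

When the same class is present in more than one JAR package, the conflict can also surface as one of the following errors: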

  • ClassNotFoundException
  • NoClassDefFoundError
  • ClassCastException

Frequently Used Tools

1) tail

  • -f: traces the file.

2) grep

  • -i: ignores case.
  • -v: inverts the match and selects non-matching lines.
  • -E: uses extended regular expressions, for example, grep -E 'pattern1|pattern2' filename.

3) pgm

  • -b: enables parallel processing.
  • -p: specifies the parallelism.
  • -A: enables askpass.

4) awk

  • -F: specifies the delimiter, for example, awk -F "|" '{print $1}' | sort -r | uniq -c.

5) sed

  • Time period matching: sed -n '/2020-03-02 10:00:00/,/2020-03-02 11:00:00/p' filename

6) Arthas

Arthas, Alibaba's open-source Java diagnostics tool, uses JavaAgent-based instrumentation to modify bytecode for Java application diagnosis.

  • dashboard: displays a real-time dashboard that shows information such as threads, memory usage, and GC condition.
  • thread: displays JVM thread stack information, such as the top n busiest threads.
  • getstatic: retrieves static attribute values, for example, getstatic className attrName. It can be used to view the actual values of online toggles.
  • sc: retrieves the classes loaded into the JVM. It can be used to check JAR package conflicts.
  • sm: retrieves method information of classes loaded into the JVM.
  • jad: decompiles classes loaded into the JVM so you can troubleshoot failures in code logic.
  • watch: collects the execution data of a method, including input parameters, return values, and exceptions.
  • Example: watch xxxClass xxxMethod "{params, throwExp}" -e -x 2
  • Example: watch xxxClass xxxMethod "{params,returnObj}" "params[0].sellerId.equals('189')" -x 2
  • Example: watch xxxClass xxxMethod sendMsg '@com.taobao.eagleeye.EagleEye@getTraceId()'
  • trace: retrieves the internal call duration of a method and outputs the time consumed by each node. It can be used for performance analysis.
  • tt: records the invocations of a method and replays them.

Correcting Common Problems

1) The thread pool is full.

  • If the thread pool of the Remote Procedure Call (RPC) framework is full, limit the number of threads for interfaces with a high RT.
  • If the thread pool of an application is full, restart the application to temporarily alleviate the problem. Further corrective action depends on the actual cause of the problem.

2) The CPU utilization and load are high.

  • Replace or restart the server to temporarily alleviate the problem. Further corrective action depends on the actual cause of the problem.
  • Scale out if the load of the cluster is high and the traffic has increased significantly. Further corrective action depends on the actual cause of the problem.

3) The downstream RT is high.

  • Throttling
  • Degradation

4) Database Issues

  • For database deadlock, kill the problem process.
  • For slow SQL queries, perform SQL throttling.

Troubleshooting online problems requires accumulated experience. To find the cause and eliminate the problem, you must understand the principles behind it. In addition, useful tools lower the barrier to troubleshooting and speed up recovery.

The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.
