Node.js Application Troubleshooting Manual — Outline and General Problem Metrics

Jul 4, 2019

By Yijun

Intro

Have you ever wanted to try Node.js for application development, but kept hearing that it is unsafe and unstable? Or have you wanted to expand the scope and influence of front-end services, but could not convince your technical leaders?

JavaScript has long since grown beyond its original “battlefield,” the browser. With the help of Node.js, it has extended its reach to servers, cross-platform PC client solutions, and other fields. At the same time, the JS runtime remains a black box for the vast majority of developers: they cannot see its running state, and they lack a good toolchain to support them when performance or memory problems occur.

Based on the Node.js Performance Platform, this manual introduces how to find, locate and solve the problems that may be encountered in the process of development and launch, so that readers can have more confidence in Node.js.

This manual belongs to the advanced part of Node.js development, so we hope that readers of this manual have the following basic skills:

Basic capability for Node.js application development
Understanding of common server performance metric parameters, such as CPU, Memory, Load, and number of opened files
Understanding of common database, cache, and other operations
Understanding of load balancing, and the multi-process model
Understanding of the basics and resource management of containers, if containers are used

This manual was first published on GitHub at https://github.com/aliyun-node/Node.js-Troubleshooting-Guide, and updates will be shared with the community at the same time.

Note: At the time of writing, the Node.js Performance Platform product is only available for domestic (Mainland China) accounts.

Metrics for Routine Troubleshooting

Many people feel overwhelmed the first time they encounter an online exception. As a preliminary, this section starts with the metrics commonly checked when a server behaves abnormally, to help you build a more intuitive approach to handling problems.

After all, if we do not know what is wrong with the system when facing online exceptions, then it will be impossible for us to locate the problem code further with the help of the Node.js Performance Platform.

Error Logs

When a problem occurs in our application, we first need to check its error log to see whether any errors thrown during this period have made the service unstable.

This information obviously varies from application to application. When a project is relatively large (with many ECS/Docker nodes), error logs need to be collected in a unified manner so that we can quickly locate problems as they occur. A simple unified log platform can be built from an Agent, a Collector, and a log service:

Message queuing (Kafka) is usually used as a buffer between the Collector and the Agent to reduce the load on both sides. ELK is a relatively mature log service.

With a unified log platform in place, when an error occurs in the application, we should first go to the log platform to view the current error logs, and alerts should be configured for errors that appear frequently. Working from the code segment that generated the error, we then need to backtrack carefully to confirm whether it is the root cause of the current service instability. The Node.js Performance Platform also implements a simple system of error log backtracking and alerting, which will be explained in more detail in the second part of this manual.
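As a rough illustration of alerting on frequently recurring errors, the sketch below counts identical error messages in a batch of log lines and returns those that reach a threshold. The log line format, the helper name findFrequentErrors, and the sample data are assumptions made for this example; they are not part of the Node.js Performance Platform or of any particular log service.

```javascript
// Hypothetical helper (not a platform API): count how often each error
// message appears and return the ones that reach an alert threshold.
// Assumed log line format: "<ISO timestamp> ERROR <message>"
function findFrequentErrors(logLines, threshold) {
  const counts = new Map();
  for (const line of logLines) {
    const match = /^\S+\s+ERROR\s+(.*)$/.exec(line);
    if (!match) continue; // skip lines that are not error entries
    const message = match[1].trim();
    counts.set(message, (counts.get(message) || 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= threshold)
    .sort((a, b) => b[1] - a[1]) // most frequent first
    .map(([message, count]) => ({ message, count }));
}

// Made-up sample data for illustration only.
const sampleLogLines = [
  '2019-07-04T10:00:00Z ERROR connect ECONNREFUSED 127.0.0.1:6379',
  '2019-07-04T10:00:01Z INFO request served',
  '2019-07-04T10:00:02Z ERROR connect ECONNREFUSED 127.0.0.1:6379',
  '2019-07-04T10:00:03Z ERROR unexpected token in JSON',
  '2019-07-04T10:00:04Z ERROR connect ECONNREFUSED 127.0.0.1:6379',
];

console.log(findFrequentErrors(sampleLogLines, 2));
```

In a real deployment this kind of aggregation would normally run inside the log platform itself; the function only shows the idea of grouping by message before deciding what to alert on.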

System Metrics

If the error logs show nothing suspicious, we should check whether the problem is caused by the server load, or by the Node.js application itself, reaching a limit. (The order in which you check the error logs and the system metrics in this section is not fixed; choose whichever to do first based on your own needs.) Some common system metrics to pay attention to are as follows:

CPU & Memory
Disk usage
I/O load
TCP connection status

The following sections describe each of these system metrics in turn.

1. CPU & Memory

Use the top command to observe the CPU and memory load of the Node.js process. In general, for a Node.js process with high CPU usage, we can use the CPU profiling tool provided by the Node.js Performance Platform to dump the current JavaScript execution state online and then find the hotspot code to optimize. This will be explained in more detail in the second part of this manual.
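Besides watching top from the outside, a process can sample its own CPU consumption. The following is a minimal sketch that uses the built-in process.cpuUsage() and process.hrtime.bigint(); the function names cpuPercent and sampleCpuPercent are chosen for this example. It is not how the Node.js Performance Platform collects its data.

```javascript
// Convert a CPU time delta (microseconds, as returned by process.cpuUsage)
// and an elapsed wall-clock time (microseconds) into a percentage.
// The value can exceed 100 when work runs on several threads.
function cpuPercent(deltaUsage, elapsedMicros) {
  return ((deltaUsage.user + deltaUsage.system) / elapsedMicros) * 100;
}

// Sample the current process over a short interval.
function sampleCpuPercent(intervalMs) {
  return new Promise((resolve) => {
    const startUsage = process.cpuUsage();
    const startTime = process.hrtime.bigint(); // nanoseconds
    setTimeout(() => {
      const delta = process.cpuUsage(startUsage); // usage since startUsage
      const elapsedMicros = Number(process.hrtime.bigint() - startTime) / 1000;
      resolve(cpuPercent(delta, elapsedMicros));
    }, intervalMs);
  });
}

sampleCpuPercent(200).then((percent) => {
  console.log(`CPU usage of this process: ${percent.toFixed(1)}%`);
});
```

A sustained high value from a sampler like this is a hint to take a CPU profile; it does not by itself show which code is hot.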

If memory usage is high, the usual cause is a memory leak (or an unexpected memory allocation that leads to memory overflow). Similarly, we can use the tools provided by the performance platform to dump the current JavaScript heap and analyze it against the service code, to find the logic that causes the leak.

Note that the performance platform can currently perform detailed analysis only on JS code. It does not yet provide a good way to analyze logic executed entirely in C++ extensions or in the underlying V8/libuv layers (these functions will be added later), or memory that is not allocated on the V8 heap. In practice, in most of the cases we have encountered, the errors were in the JS code that developers wrote. In other words, the relatively comprehensive system of “online dumping + service-oriented analysis” for JS currently provided by the performance platform can solve 95% or more of the problems that developers experience.
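To get a first impression of whether memory is held on the V8 heap or outside it, the values returned by the built-in process.memoryUsage() can be compared. The sketch below is only a rough aid: the helper name describeMemory is made up for this example, and the non-heap figure is an approximation (resident memory minus the V8 heap total), not an exact accounting.

```javascript
// Summarize process.memoryUsage() in MiB, rounded to one decimal place.
function describeMemory(usage) {
  const toMiB = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  return {
    rssMiB: toMiB(usage.rss),             // resident set size of the process
    heapTotalMiB: toMiB(usage.heapTotal), // memory reserved for the V8 heap
    heapUsedMiB: toMiB(usage.heapUsed),   // V8 heap currently in use
    externalMiB: toMiB(usage.external),   // C++ objects bound to JS objects
    // Rough estimate of resident memory that is not part of the V8 heap.
    nonHeapMiB: toMiB(usage.rss - usage.heapTotal),
  };
}

console.log(describeMemory(process.memoryUsage()));
```

If heapUsedMiB keeps growing, a heap dump of the JS side is the natural next step; if rssMiB grows while the heap stays flat, the memory is probably allocated outside the V8 heap, which is the case the paragraph above describes as not yet covered.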

2. Disk Usage

Use the df command to observe the current disk usage. A very common problem is that developers neglect monitoring and alerts for the server disk. When log files, core dumps, and other large files gradually fill the disk to 100%, the Node.js application may fail to run normally. The Node.js Performance Platform also provides disk monitoring, which will be explained in more detail in the second part of this manual.
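A simple disk check can be scripted by parsing the output of df. The sketch below assumes the POSIX output format produced by df -P (six columns, with the usage percentage in the fifth); the function name findFullFilesystems and the sample output are invented for this example.

```javascript
// Parse `df -P` output and return filesystems at or above a usage threshold.
// Real output could be obtained with, for example:
//   require('child_process').execFileSync('df', ['-P'], { encoding: 'utf8' })
function findFullFilesystems(dfOutput, thresholdPercent) {
  return dfOutput
    .trim()
    .split('\n')
    .slice(1) // drop the header row
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols.length >= 6)
    .map((cols) => ({
      filesystem: cols[0],
      usedPercent: parseInt(cols[4], 10), // "97%" -> 97
      mountPoint: cols.slice(5).join(' '),
    }))
    .filter((entry) => entry.usedPercent >= thresholdPercent);
}

// Made-up sample output for illustration only.
const sampleDfOutput = [
  'Filesystem     1024-blocks     Used Available Capacity Mounted on',
  '/dev/vda1         41151808 39917254   1234554      97% /',
  'tmpfs              4005300        0   4005300       0% /dev/shm',
].join('\n');

console.log(findFullFilesystems(sampleDfOutput, 90));
```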

3. I/O Load

Use top/iostat and cat /proc/${pid}/io to view the current I/O load. If the load is very high, the Node.js application may stall or crash.
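On Linux, /proc/${pid}/io holds cumulative counters, so a rate has to be derived from two samples. The sketch below parses the file content and computes read and write rates from the read_bytes and write_bytes fields; the function names parseProcIo and ioRate and the sample values are assumptions for this example.

```javascript
// Parse the text of /proc/<pid>/io ("key: value" per line) into an object.
// The real content could be read with, for example:
//   require('fs').readFileSync(`/proc/${pid}/io`, 'utf8')
// (reading another user's process may require elevated permissions).
function parseProcIo(text) {
  const counters = {};
  for (const line of text.trim().split('\n')) {
    const [key, value] = line.split(':');
    if (value !== undefined) counters[key.trim()] = Number(value.trim());
  }
  return counters;
}

// Derive bytes per second from two samples taken `seconds` apart.
function ioRate(before, after, seconds) {
  return {
    readBytesPerSec: (after.read_bytes - before.read_bytes) / seconds,
    writeBytesPerSec: (after.write_bytes - before.write_bytes) / seconds,
  };
}

// Made-up samples for illustration only.
const firstSample = parseProcIo('read_bytes: 0\nwrite_bytes: 1000\n');
const secondSample = parseProcIo('read_bytes: 4096\nwrite_bytes: 9192\n');

console.log(ioRate(firstSample, secondSample, 2));
```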

4. TCP Connection Status

The vast majority of Node.js applications are actually Web applications, and a socket connection is created for each user. Under some abnormal circumstances (such as a half-open connection attack, also known as a SYN flood, or unreasonable kernel parameter settings), a large number of connections in the TIME_WAIT state can accumulate on the server. A large backlog of TIME_WAIT connections can cause the Node.js application to crash, because the kernel cannot create a new TCP connection for a new request. We can use the command netstat -ant | awk '/^tcp/ {++S[$NF]} END {for(a in S) print (a,S[a])}' to confirm the problem.
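The awk one-liner above counts connections per TCP state by looking at the last column of every line that starts with tcp. The same logic can be sketched in JavaScript, for instance to feed a monitoring script; the function name countTcpStates and the sample output are invented for this example.

```javascript
// Count TCP connections per state from `netstat -ant` output.
// Mirrors the awk script: match lines starting with "tcp", key on the last field.
function countTcpStates(netstatOutput) {
  const counts = {};
  for (const line of netstatOutput.split('\n')) {
    if (!line.startsWith('tcp')) continue; // same filter as /^tcp/
    const fields = line.trim().split(/\s+/);
    const state = fields[fields.length - 1]; // awk's $NF
    counts[state] = (counts[state] || 0) + 1;
  }
  return counts;
}

// Made-up sample output for illustration only.
const sampleNetstatOutput = [
  'Active Internet connections (servers and established)',
  'Proto Recv-Q Send-Q Local Address     Foreign Address    State',
  'tcp        0      0 0.0.0.0:8080      0.0.0.0:*          LISTEN',
  'tcp        0      0 10.0.0.5:8080     10.0.0.9:51234     ESTABLISHED',
  'tcp        0      0 10.0.0.5:8080     10.0.0.9:51235     TIME_WAIT',
  'tcp        0      0 10.0.0.5:8080     10.0.0.9:51236     TIME_WAIT',
  'tcp6       0      0 :::8080           :::*               LISTEN',
].join('\n');

console.log(countTcpStates(sampleNetstatOutput));
// { LISTEN: 2, ESTABLISHED: 1, TIME_WAIT: 2 }
```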

Core Dump

Failures of online Node.js applications are often accompanied by process crashes. Thanks to daemon processes that perform self-checks and restarts, our services keep running, but we should not ignore these unexpected crashes: when traffic increases and causes server problems, or when someone with ulterior motives targets the service, the cluster becomes vulnerable.

In most cases, the error that causes a Node.js application to crash is not recorded in the error log file. Fortunately, the operating system kernel provides a mechanism to automatically generate a core dump file when an application crashes, so that developers can analyze and reconstruct the scene afterwards.

Core Dump

A core dump means that when an application unexpectedly crashes and terminates, the operating system automatically records the memory allocation information, program counter, stack pointer, and other key information at the moment of the crash in a core dump file. After obtaining the core dump file, we can analyze and diagnose the actual cause of the crash with tools such as MDB, GDB, and LLDB.

Generate the File

Currently, two ways are available to trigger the core dump to generate a dump file:

1. Set Kernel Parameters

Use ulimit -c unlimited to remove the kernel's limit on core file size. By default, Node.js does not trigger a core dump when a crash is caused by JS code, so we can add the --abort-on-uncaught-exception flag when starting the Node.js application. With this flag, an uncaught exception aborts the process, and the kernel generates a core dump automatically.
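It is easy to forget the flag in one of several start scripts, so an application can check at startup whether it was launched with it. The following is a minimal sketch: the function name abortOnUncaughtEnabled is made up, app.js in the comment is a hypothetical entry file, and the check only looks at process.execArgv and the NODE_OPTIONS environment variable.

```javascript
// Typical launch sequence described above (app.js is a hypothetical entry file):
//   ulimit -c unlimited
//   node --abort-on-uncaught-exception app.js

// Return true if the flag appears on the command line or in NODE_OPTIONS.
// Underscores are normalized because V8 flags accept both spellings.
function abortOnUncaughtEnabled(execArgv, nodeOptions) {
  const flag = '--abort-on-uncaught-exception';
  const normalize = (arg) => arg.replace(/_/g, '-');
  const fromEnv = (nodeOptions || '').split(/\s+/).filter(Boolean);
  return [...execArgv, ...fromEnv].map(normalize).includes(flag);
}

if (!abortOnUncaughtEnabled(process.execArgv, process.env.NODE_OPTIONS)) {
  console.warn(
    'Started without --abort-on-uncaught-exception: ' +
      'an uncaught JS exception will not produce a core dump.'
  );
}
```

The check says nothing about the core file size limit itself, which still has to be set in the shell or service configuration that starts the process.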

2. Call Manually

Generate the file manually by calling gcore <pid> (sudo permission may be required). The Node.js application keeps running after the dump is taken, so this method is generally used for live analysis, to locate problems when the Node.js process appears to be hung.

Note that both ways of generating core dump files carry some risk, because the files can be large. Be sure to monitor the server disk and raise an alert when problems occur.

After obtaining the core dump file generated by Node.js application, we can use the online core dump file analysis function provided by Node.js Performance Platform to analyze and locate the cause for the crash of the process. The specific usage will be explained in the second part of this manual.

Summary

This section gave a general overview of how to troubleshoot and locate failures in online Node.js applications, starting from several common server problems. It also serves as preliminary knowledge for the content that follows: understanding it will clarify why, in the practical cases later on, we chose to analyze certain key points (instead of others) in a detailed, service-oriented way.

In-depth analysis of the core dump is a very powerful technique that can help us solve most of the underlying failures of a Node.js application, because it can recover the JavaScript code that failed and the parameters that caused the problem.

Original Source

https://www.alibabacloud.com/blog/node-js-application-troubleshooting-manual---outline-and-general-problem-metrics_594969?spm=a2c41.13092453.0.0

Written by Alibaba Cloud