Node.js Application Troubleshooting Manual — Outline and General Problem Metrics

Intro

Have you wanted to try Node.js for application development, but kept hearing that it is unsafe and unstable? Have you wanted to expand the scope and influence of front-end services, but been unable to convince your technical leads?

This manual assumes the following background:

  • Basic Node.js application development skills
  • Familiarity with common server performance metrics, such as CPU, memory, load, and the number of open files
  • Familiarity with common database and cache operations
  • Familiarity with load balancing and the multi-process model
  • If containers are used, a basic understanding of containers and their resource management

Metrics for Routine Troubleshooting

Many people feel overwhelmed the first time they encounter an exception in a live (online) service. This section starts with the metrics commonly checked when a server misbehaves, to help you build a more systematic approach to handling problems.

Error Logs

When a problem occurs, first check the application's error log to see whether any errors thrown during the affected period made the service unstable.
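
As a rough illustration of narrowing the error log to the affected period, the sketch below keeps only ERROR lines inside a time window. The log format is an assumption made for this example (an ISO-8601 timestamp, then a level, then the message), and errorsInWindow is a hypothetical helper name; adapt both to the logger you actually use.

```javascript
// Sketch only. Assumed line format (not from the original article):
//   2019-06-01T10:15:30.000Z ERROR connect ECONNREFUSED
// Lines that do not match it, such as multi-line stack traces, are skipped.
function errorsInWindow(logText, from, to) {
  const start = new Date(from).getTime();
  const end = new Date(to).getTime();
  return logText.split('\n').filter((line) => {
    // First token is the timestamp, second token is the log level.
    const match = /^(\S+)\s+(\w+)\b/.exec(line);
    if (!match || match[2] !== 'ERROR') return false;
    const timestamp = Date.parse(match[1]);
    return !Number.isNaN(timestamp) && timestamp >= start && timestamp <= end;
  });
}

// Possible usage (the log path is hypothetical):
//   const text = require('node:fs').readFileSync('/var/log/app/error.log', 'utf8');
//   console.log(errorsInWindow(text, '2019-06-01T10:00:00Z', '2019-06-01T11:00:00Z'));
```

For large log files, reading the file line by line as a stream would be a better fit than loading it whole; the filter itself stays the same.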

System Metrics

If the error log shows nothing suspicious, check whether the problem is caused by the server load or by the Node.js application itself reaching a limit. (The order is not fixed: you can check the error log or the system metrics first, based on your own needs.) Common system metrics to watch are:

  • CPU & Memory
  • Disk usage
  • I/O load
  • TCP connection status

1. CPU & Memory

Use the top command to observe the CPU and memory load of the Node.js process. For a Node.js process with high CPU usage, the CPU Profiling tool provided by the Node.js Performance Platform can capture what JavaScript is running online, so that you can find the hotspot code and optimize it. This is explained in more detail in the second part of this manual.
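
The top command reports these figures from outside the process. As a complement, a Node.js process can sample its own CPU and memory usage with the built-in process.cpuUsage() and process.memoryUsage(). The sketch below is one possible way to do that; cpuPercent and sampleSelf are hypothetical helper names, and the percentage is relative to a single core.

```javascript
// Sketch only: self-sampling of CPU and memory from inside a Node.js process.

// Convert a process.cpuUsage() delta (microseconds) into a percentage of one core.
function cpuPercent(cpuDelta, elapsedMs) {
  const busyMs = (cpuDelta.user + cpuDelta.system) / 1000;
  return (busyMs / elapsedMs) * 100;
}

// Measure this process over intervalMs, then pass the result to callback.
function sampleSelf(intervalMs, callback) {
  const startCpu = process.cpuUsage();
  const startTime = process.hrtime.bigint();
  setTimeout(() => {
    const elapsedMs = Number(process.hrtime.bigint() - startTime) / 1e6;
    // Passing the earlier reading makes cpuUsage() return the difference.
    const cpu = cpuPercent(process.cpuUsage(startCpu), elapsedMs);
    const { rss, heapUsed } = process.memoryUsage();
    callback({
      cpuPercent: cpu,
      rssMB: rss / (1024 * 1024),
      heapUsedMB: heapUsed / (1024 * 1024),
    });
  }, intervalMs);
}

// Possible usage:
//   sampleSelf(1000, (stats) => console.log(stats));
```

A value near 100 for cpuPercent suggests the JavaScript thread is saturated, which is the situation where a CPU profile is most useful.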

2. Disk Usage

Use the df command to observe the current disk usage. Disk problems are very common because many developers neglect monitoring and alerting for the server disk. When log files, core dumps, and other large files gradually fill the disk to 100%, the Node.js application may fail to run normally. The Node.js Performance Platform also provides disk monitoring, which is explained in more detail in the second part of this manual.
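
To turn the df check into a simple alert, one option is to parse the output of df -P (the POSIX output format, one line per filesystem) and flag filesystems above a threshold. The sketch below assumes that format; fullFilesystems is a hypothetical helper name, and the 90% threshold in the usage comment is an arbitrary example.

```javascript
// Sketch only: find filesystems whose usage is at or above a threshold.
// Assumes `df -P` columns:
//   Filesystem 1024-blocks Used Available Capacity Mounted on
// Filesystem names that contain spaces are not handled.
function fullFilesystems(dfOutput, thresholdPercent) {
  return dfOutput
    .trim()
    .split('\n')
    .slice(1) // skip the header row
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols.length >= 6)
    .map((cols) => ({
      filesystem: cols[0],
      usedPercent: parseInt(cols[4], 10), // "100%" -> 100
      mount: cols.slice(5).join(' '), // mount points may contain spaces
    }))
    .filter((entry) => entry.usedPercent >= thresholdPercent);
}

// Possible usage:
//   const { execFileSync } = require('node:child_process');
//   const output = execFileSync('df', ['-P'], { encoding: 'utf8' });
//   console.log(fullFilesystems(output, 90));
```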

3. I/O Load

Use top or iostat, together with cat /proc/${pid}/io, to view the current I/O load. If the load is very high, the Node.js application may stall or even crash.
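
The counters in /proc/${pid}/io are cumulative, so a single reading says little about the current load; the difference between two readings does. The sketch below, which applies to Linux only, parses the file and derives read and write rates. parseProcIo and ioRate are hypothetical helper names.

```javascript
// Sketch only: parse /proc/<pid>/io (Linux) and compute byte rates.
// The file holds "name: value" lines such as read_bytes and write_bytes.
function parseProcIo(text) {
  const counters = {};
  for (const line of text.split('\n')) {
    const match = /^(\w+):\s+(\d+)$/.exec(line.trim());
    if (match) counters[match[1]] = Number(match[2]);
  }
  return counters;
}

// Rate of actual storage I/O between two readings taken elapsedSeconds apart.
function ioRate(previous, current, elapsedSeconds) {
  return {
    readBytesPerSec: (current.read_bytes - previous.read_bytes) / elapsedSeconds,
    writeBytesPerSec: (current.write_bytes - previous.write_bytes) / elapsedSeconds,
  };
}

// Possible usage (reading another process's file may require extra permissions):
//   const fs = require('node:fs');
//   const read = (pid) => parseProcIo(fs.readFileSync(`/proc/${pid}/io`, 'utf8'));
//   const before = read(process.pid);
//   setTimeout(() => console.log(ioRate(before, read(process.pid), 5)), 5000);
```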

4. TCP Connection Status

The vast majority of Node.js applications are web applications, and a socket connection is created for each user. Under some abnormal circumstances (such as a half-open connection attack, that is, a SYN flood, or unreasonable kernel parameter settings), a large number of connections in the TIME_WAIT state accumulate on the server. A large backlog of TIME_WAIT connections can bring down the Node.js application, because the kernel cannot create a new TCP connection for a new request. To confirm the problem, count the connections in each state with netstat -ant | awk '/^tcp/ {++S[$NF]} END {for (a in S) print a, S[a]}'.
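
The same per-state count can be produced inside a Node.js monitoring script. The sketch below mirrors the awk logic (keep lines that start with tcp, count by the last field); countTcpStates is a hypothetical helper name, and the input is assumed to be the text printed by netstat -ant.

```javascript
// Sketch only: count TCP connections by state from `netstat -ant` output.
function countTcpStates(netstatOutput) {
  const counts = {};
  for (const line of netstatOutput.split('\n')) {
    // Matches both "tcp" and "tcp6" rows, and skips the header lines.
    if (!line.startsWith('tcp')) continue;
    const fields = line.trim().split(/\s+/);
    const state = fields[fields.length - 1];
    counts[state] = (counts[state] || 0) + 1;
  }
  return counts;
}

// Possible usage (requires netstat to be installed on the host):
//   const { execFileSync } = require('node:child_process');
//   const output = execFileSync('netstat', ['-ant'], { encoding: 'utf8' });
//   console.log(countTcpStates(output));
```

A TIME_WAIT count that dwarfs ESTABLISHED is the pattern this check is meant to reveal.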

Core Dump

Failures of online Node.js applications are often accompanied by process crashes. Thanks to daemon processes that health-check and restart them, our services keep running, but we should not ignore these unexpected crashes: when a traffic increase causes server problems, or when someone with malicious intent attacks the service, the cluster becomes vulnerable.

What Is a Core Dump

A core dump is a file that the operating system writes automatically when an application crashes and terminates unexpectedly. It records the memory allocation information, program counter, stack pointer, and other key state at the moment of the crash. With the core dump file, we can analyze and diagnose the actual cause of the crash using tools such as MDB, GDB, and LLDB.

Generate the File

Currently, there are two ways to generate a core dump file:

1. Set Kernel Parameters

Use ulimit -c unlimited to lift the kernel's limit on core file size. By default, Node.js does not trigger a core dump when a crash is caused by JavaScript code, so add the --abort-on-uncaught-exception flag when starting the Node.js application. The process then aborts on an uncaught exception, and the kernel writes a core dump automatically.
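
Both settings are easy to forget in a deployment script, so it can help to verify them at startup. The sketch below is one way to do that on Linux: it reads the "Max core file size" row of /proc/self/limits and looks for the flag in process.execArgv. coreDumpReadiness is a hypothetical helper name, and the check only sees the flag when it is passed on the node command line in exactly this spelling.

```javascript
// Sketch only (Linux): report whether this process is set up to produce core dumps.
// limitsText is the content of /proc/self/limits; execArgv is process.execArgv.
function coreDumpReadiness(limitsText, execArgv) {
  const label = 'Max core file size';
  const row = limitsText.split('\n').find((line) => line.startsWith(label));
  // The first column after the label is the soft limit: a number or "unlimited".
  const softLimit = row ? row.slice(label.length).trim().split(/\s+/)[0] : null;
  return {
    coreSoftLimit: softLimit,
    coreFilesEnabled: softLimit !== null && softLimit !== '0',
    abortOnUncaughtException: execArgv.includes('--abort-on-uncaught-exception'),
  };
}

// Possible usage:
//   const fs = require('node:fs');
//   const status = coreDumpReadiness(
//     fs.readFileSync('/proc/self/limits', 'utf8'),
//     process.execArgv
//   );
//   if (!status.coreFilesEnabled || !status.abortOnUncaughtException) {
//     console.warn('Core dumps will not be written on uncaught exceptions', status);
//   }
```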

2. Call Manually

Call gcore <pid> to generate the file manually (sudo permission may be required). The Node.js application keeps running, so this method is generally used for live inspection, to locate problems when the Node.js process appears to be hung.

Summary

This section gave a general overview of how to troubleshoot and locate failures in online Node.js applications, starting from several common server problems. It is also background knowledge for the content that follows: understanding it clarifies why the practical cases later in the manual analyze certain key points, and not others, in detail.

Original Source

https://www.alibabacloud.com/blog/node-js-application-troubleshooting-manual---outline-and-general-problem-metrics_594969?spm=a2c41.13092453.0.0

Alibaba Cloud
