Diagnosing ECS Faults with Serial Port Logs
Elastic Compute Service (ECS) is an elastically scalable computing service that helps you reduce IT costs and improve maintenance efficiency, so that you can concentrate on core business innovation. Alibaba Cloud conforms to strict Internet Data Center (IDC) standards, server access standards, and maintenance standards to provide reliable data, a highly available cloud computing infrastructure, and highly available ECS instances.
However, you may still encounter system kernel crashes when using ECS instances. These crashes, caused by incorrect operating system configuration, program overloading, or other reasons, can lead to problems such as system inaccessibility, abnormal reboots, or boot failures. To find the root cause and prevent such problems in the future, maintenance engineers need to check and analyze system logs. In these situations, however, the affected ECS instance often cannot be reached over SSH, which makes it very difficult to locate the cause of the fault. Fortunately, this is no longer a problem: Alibaba Cloud provides one-click features for viewing system logs and capturing instance screenshots, so maintenance engineers can conveniently diagnose system faults and exceptions.
Why Do You Need System Serial Port Logs?
When ECS instances go down, reboot abnormally, or fail to boot, maintenance engineers need to locate the root cause of the problem, resolve it, and prevent it from happening again.
Faults that affect the stability of ECS instances fall into two main types:
- Faults of the hardware infrastructure and software environment on which ECS instances run
- Faults of the operating system environment on which ECS instances run
For the first type of fault, Alibaba Cloud provides system event information that helps you understand the impact of such faults on ECS instances. For more information about ECS system events, see System Events. For the second type of fault, you need to check operating system logs to find the problem. These faults include operating system kernel bugs, incorrect system configuration, and program overloading.
Once the Linux operating system has been configured to do so, it prints boot logs and information about faults and exceptions through the server's serial port. For physical servers, maintenance engineers use Intelligent Platform Management Interface (IPMI) out-of-band interfaces to obtain the logs that the operating system prints through the serial port. Maintenance engineers need the same logs to diagnose faults and exceptions on ECS instances, so the ECS system serial port logs are important for maintenance and diagnosis.
What Is in System Serial Port Logs?
The system uses serial ports to print two types of logs, namely, system boot logs and system kernel fault or exception logs.
- When the Linux operating system boots, it prints logs that record the boot process. The boot information, including information about the system architecture, CPU, RAM, mounted hardware, and software startup, is stored by the system kernel in the ring buffer. Such information helps the system administrator check whether the system started properly and whether preset application programs started along with the system.
- When a kernel fault or exception occurs, the system prints log information based on the console log level, which is the first value of the kernel parameter kernel.printk (set to 4 by default). Only messages whose level is numerically lower than the console log level, that is, more severe, are printed to the console. A kernel panic occurs when the operating system detects an internal critical error that it cannot safely handle. The routine that handles kernel panics in the operating system kernel is usually designed to print error information to the serial port console for debugging. It then waits for the system to reboot automatically or be rebooted manually. The technical information printed by this routine is often used to help system administrators or software developers diagnose problems.
| Log level | Name | Description |
| --- | --- | --- |
| 0 | KERN_EMERG | The system is unusable. |
| 1 | KERN_ALERT | Actions that must be taken care of immediately. |
| 2 | KERN_CRIT | Critical conditions. |
| 3 | KERN_ERR | Noncritical error conditions. |
| 4 | KERN_WARNING | Warning conditions that should be taken care of. |
| 5 | KERN_NOTICE | Normal, but significant events. |
| 6 | KERN_INFO | Informational messages that require no action. |
| 7 | KERN_DEBUG | Kernel debugging messages, output by the kernel if the developer enabled them. |
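To make the relationship between kernel.printk and the log levels concrete, the following is a minimal Python sketch. It assumes the usual Linux behavior that the first field of kernel.printk is the console log level and that only messages with a numerically lower level reach the console. The function names are hypothetical and chosen for illustration only.

```python
from typing import List

# Standard kernel log level names, indexed by numeric level (0 to 7).
LEVEL_NAMES = [
    "KERN_EMERG", "KERN_ALERT", "KERN_CRIT", "KERN_ERR",
    "KERN_WARNING", "KERN_NOTICE", "KERN_INFO", "KERN_DEBUG",
]


def parse_console_loglevel(printk_value: str) -> int:
    """Return the console log level, i.e. the first field of kernel.printk.

    The value is a whitespace-separated list of four integers,
    for example "4 4 1 7".
    """
    return int(printk_value.split()[0])


def levels_printed_to_console(console_loglevel: int) -> List[str]:
    """List the levels that are severe enough to be printed to the console.

    Assumption: a message is printed only if its level is numerically
    lower than the console log level.
    """
    return [
        name for level, name in enumerate(LEVEL_NAMES)
        if level < console_loglevel
    ]
```

On a Linux instance, the current value can be read from /proc/sys/kernel/printk and passed to parse_console_loglevel. Under these assumptions, a console log level of 4 means that messages at levels 0 through 3 appear in the serial port log, while warnings and less severe messages do not.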
How Do I Use System Serial Port Logs?
On the ECS console, you can obtain the system logs of the ECS instances in the running state through the following operations in the instance list or on the instance details page.
- Log on to the ECS console.
- Click Instance in the left-side navigation pane.
- Select Area.
- Find the Operation menu of the instance to troubleshoot.
- Choose More > Maintenance and diagnosis > Obtain instance system logs to view the logs.
- Alternatively, you can click an instance to access the Instance details page, and choose More > Obtain instance system logs to view the logs.
The following figure shows the boot logs generated when the system has booted successfully.
The following figure shows the error information generated when a kernel error occurs.
Maintenance engineers can use the log information to determine whether the system boot was successful and diagnose operating system exceptions and faults.
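As an illustration of this kind of diagnosis, the sketch below scans serial port log text for strings that the Linux kernel commonly prints when it hits a serious error. The marker list is an assumption, not an exhaustive or official set, and the function name is hypothetical.

```python
from typing import List, Tuple

# Substrings commonly seen in Linux kernel fault output.
# Assumption: this list is illustrative and not exhaustive.
FAULT_MARKERS = ("Kernel panic", "Oops:", "BUG:", "Call Trace:")


def find_kernel_faults(log_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs for lines containing a fault marker.

    Line numbers are 1-based and refer to positions in log_text.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in FAULT_MARKERS):
            hits.append((number, line.strip()))
    return hits
```

An empty result suggests that none of these markers were printed. It does not prove that the boot succeeded, so the full log should still be reviewed.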
The system may also push exception information to a display (such as a blue screen on Windows) at certain times. However, ECS instances are not connected to physical displays, so you can obtain instance screenshots to view the information pushed to the display when the exception occurred and analyze the problem accordingly.
You can also call the open API operations GetInstanceConsoleOutput to obtain the logs and GetInstanceScreenshot to obtain screenshots. Both results are returned in Base64-encoded format.
Note that you can check the system logs and screenshots only of instances in the running state.
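Because both API operations return Base64-encoded data, the result must be decoded before it can be read. The following minimal Python sketch shows only that decoding step, using the standard library. It assumes you have already extracted the Base64 string from the API response; the function names are hypothetical, and treating the log as UTF-8 text is an assumption.

```python
import base64


def decode_console_output(encoded: str) -> str:
    """Decode a Base64-encoded serial port log into readable text.

    Assumption: the log is UTF-8 text. Bytes that cannot be decoded are
    replaced rather than raising an error, because console output can
    contain stray binary data.
    """
    raw = base64.b64decode(encoded)
    return raw.decode("utf-8", errors="replace")


def decode_screenshot(encoded: str) -> bytes:
    """Decode a Base64-encoded screenshot into raw image bytes.

    The caller can write the bytes to a file opened in binary mode.
    """
    return base64.b64decode(encoded)
```

The decoded log text can then be searched or saved like any other log file, and the screenshot bytes can be written to disk and opened in an image viewer.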
Alibaba Cloud ECS provides proactive maintenance and system events to help you learn in advance how infrastructure faults and exceptions will affect ECS operation. This allows maintenance engineers to detect faults and exceptions in time and take preventive measures to protect ongoing services. Moreover, the diagnostic log function introduced today helps maintenance engineers find the root causes of service-interrupting instance exceptions caused by internal operating system errors, and prevent such problems from recurring.
The Alibaba Cloud ECS team is on a never-ending mission to make ECS better for our customers. We will be launching more maintenance tools and capabilities in the near future to give you a more reliable and visible ECS.