Windows Networking Troubleshooting 3: Network Bugs triggered


Recently, an interesting bug emerged. Some users reported that their servers kept opening connections to Alibaba Cloud services until all ports were occupied, causing network interruptions and service connection failures, and leaving local ports unreachable when tested with Telnet.

By product design, this should not normally happen, but the users' netstat output did show that most of the connections were established with Alibaba Cloud servers.

The servers are all non-Alibaba Cloud instances, so we could not log on to them directly for troubleshooting. We therefore prepared the following troubleshooting plan.

Troubleshooting + Analysis

1. A local port that cannot be reached with Telnet is usually a symptom of dynamic port exhaustion. By default, the dynamic port range on Windows Server 2008 and later is 49152–65535, a total of 16384 ports. Under normal load these cannot be used up, so when exhaustion does occur it is usually accompanied by a port leak.
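To check how close a server is to exhausting this range, the netstat output can be summarized by connection state. The following is a minimal sketch, assuming the usual column layout of `netstat -ano`; the sample rows, addresses, and the `summarize` helper are hypothetical and only for illustration.

```python
from collections import Counter

# Default dynamic port range on Windows Server 2008 and later.
DYNAMIC_START = 49152
DYNAMIC_END = 65535

# Hypothetical sample of `netstat -ano` output; real output will differ.
SAMPLE = """\
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       820
  TCP    10.0.0.5:49201         203.0.113.10:443       TIME_WAIT       0
  TCP    10.0.0.5:49202         203.0.113.10:443       TIME_WAIT       0
  TCP    10.0.0.5:50010         10.0.0.9:3306          ESTABLISHED     1536
  UDP    0.0.0.0:123            *:*                                    900
"""


def summarize(netstat_text):
    """Return (TCP state counts, number of TCP rows using a dynamic local port)."""
    states = Counter()
    dynamic_in_use = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows have five columns: Proto, Local, Foreign, State, PID.
        if len(fields) < 5 or fields[0] != "TCP":
            continue
        # rsplit keeps this working for bracketed IPv6 addresses such as [::]:135.
        local_port = int(fields[1].rsplit(":", 1)[1])
        states[fields[3]] += 1
        if DYNAMIC_START <= local_port <= DYNAMIC_END:
            dynamic_in_use += 1
    return states, dynamic_in_use


states, dynamic_in_use = summarize(SAMPLE)
total_dynamic = DYNAMIC_END - DYNAMIC_START + 1
print(states.most_common())
print(f"{dynamic_in_use} of {total_dynamic} dynamic ports in use")
```

On a real server, the text would come from a saved `netstat -ano` capture rather than from the embedded sample.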

One way to verify this is to open more dynamic ports through this command: netsh int ipv4 set dynamicport tcp start=10000 num=50000.
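Before running the command, it can help to check which ports a given start/num pair actually covers. The sketch below assumes the limits described in Microsoft's documentation for the dynamic port range (minimum start port 1025, minimum range of 255 ports, maximum end port 65535); the `dynamic_port_range` helper is hypothetical, and the limits should be verified against the target Windows version.

```python
def dynamic_port_range(start, num):
    """Return (first, last) port covered by netsh `start=` and `num=` values."""
    # Assumed limits; verify them against the target Windows version.
    min_start, min_num, max_port = 1025, 255, 65535
    last = start + num - 1
    if start < min_start or num < min_num or last > max_port:
        raise ValueError(f"invalid range: start={start}, num={num}")
    return start, last


default_range = dynamic_port_range(49152, 16384)   # default configuration
expanded_range = dynamic_port_range(10000, 50000)  # values from the command above
print(default_range)   # (49152, 65535)
print(expanded_range)  # (10000, 59999)
```

With start=10000 and num=50000, the range ends at port 59999, so it is roughly three times larger than the default but no longer includes ports 60000 and above.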

The netsh command modifies the configuration of the kernel TCP/IP stack directly through the NSI interface, so the change takes effect immediately.

2. To confirm whether ports are leaking, it is generally necessary to capture a dump and analyze the cause of the leak. However, we have dealt with many bugs of this type, and most of them were caused by third-party drivers. We therefore recommend first checking whether any third-party software is installed. If so, try uninstalling it and see whether the problem goes away.

In this case, the user reported that no third-party software was installed.

3. Because most of the dynamic ports in the system are in the TIME_WAIT status, you can try to shorten the TIME_WAIT release time through the TcpTimedWaitDelay registry value (under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters).

4. The users questioned why their servers were accessing Alibaba Cloud IP addresses at all. Most of these connections go to port 443, so packet captures alone cannot prove which application opened them. Instead, the ETL Trace method provided by Windows is required to record the calls that applications make to network components such as TCPIP.

From the collected logs, the Process ID corresponding to the request is 0x0600 = 1536.

Looking up PID 1536 in Process Explorer shows that it belongs to a Java.exe process.

5. Step 4 showed that the connection requests to Alibaba Cloud server 120.55.35.x were triggered by CloudMonitor.

An operating system without TcpTimedWaitDelay configured uses the default TIME_WAIT timeout of 2MSL (120s). For nearly 15000 ports to be in the TIME_WAIT status at the same time, CloudMonitor would theoretically have to open at least 125 connections per second (15000 / 120). However, neither the CloudMonitor log nor the earlier ETL Trace log showed this type of behavior. The CloudMonitor program only establishes a connection every few seconds to submit “metrics”.
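The rate estimate can be reproduced with a few lines of arithmetic. This is a rough sketch: the 5-second reporting interval is an assumed stand-in for "every few seconds", not a measured value.

```python
TIME_WAIT_SECONDS = 120        # default 2MSL timeout described above
ports_in_time_wait = 15000     # approximate count observed on the server

# Connections per second needed to keep that many ports in TIME_WAIT.
required_rate = ports_in_time_wait / TIME_WAIT_SECONDS
print(required_rate)  # 125.0

# Assumed reporting interval: one connection every 5 seconds.
assumed_interval_seconds = 5
expected_time_wait = TIME_WAIT_SECONDS // assumed_interval_seconds
print(expected_time_wait)  # 24
```

Under that assumption, a healthy system would hold only a few dozen ports in TIME_WAIT for this program at any moment, far below the roughly 15000 observed.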

6. The bug is quite strange. With all other possibilities excluded, the only one left is that the Windows system does not release TIME_WAIT ports after 2MSL, and subsequent troubleshooting confirmed this. Unfortunately, we could not find the specific cause in the ETL Trace log, but an offhand remark from the user gave us a hint:

“This server has been running for a long time without any bugs.”

How long exactly had it been running? Through Eventlog 6013, we confirmed that the server had been running for 43225197s, just over 500 days. This caught our attention.
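The uptime value reported by the event can be converted to days with the standard library:

```python
from datetime import timedelta

# Uptime in seconds, as reported by Eventlog 6013 in this case.
uptime = timedelta(seconds=43225197)
print(uptime)       # 500 days, 6:59:57
print(uptime.days)  # 500
```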

This matches a known Windows issue: “All the TCP/IP ports that are in a TIME_WAIT status are not closed after 497 days from system startup in Windows Vista, in Windows 7, in Windows Server 2008 and in Windows Server 2008 R2.”

With that, the bug was fully analyzed and resolved.

Cause Analysis

Let me briefly explain the cause of this bug. In a Windows system, the time at which a TCP endpoint in the TIME_WAIT status is released is calculated as the current system running time plus the TIMEOUT. In a system without the patch installed, this value is kept in a 32-bit register whose maximum value is 0xFFFFFFFF, and the value stored is “time * 0n100” (0n100 is decimal 100, so time is counted in hundredths of a second). In other words, the bug can occur once the system has been running for more than “0xFFFFFFFF / 0n100” seconds, which is roughly 497 days.
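The 497-day figure follows directly from that arithmetic. The sketch below works it out and then uses a deliberately simplified toy model to show why a wrapped 32-bit counter can leave a TIME_WAIT deadline unreachable. The `ticks32` and `is_expired` helpers are hypothetical and are not the actual Windows implementation.

```python
TICKS_PER_SECOND = 100   # the register stores time * 100
MAX_U32 = 0xFFFFFFFF     # largest value a 32-bit register can hold

# Uptime at which the 32-bit tick counter overflows.
wrap_seconds = MAX_U32 / TICKS_PER_SECOND
wrap_days = wrap_seconds / 86400
print(round(wrap_days, 1))  # 497.1


def ticks32(uptime_seconds):
    """Tick count as a 32-bit register would hold it (wraps on overflow)."""
    return (uptime_seconds * TICKS_PER_SECOND) & MAX_U32


uptime_seconds = 43225197                      # uptime from Eventlog 6013
full_ticks = uptime_seconds * TICKS_PER_SECOND
print(full_ticks > MAX_U32)                    # True
print(ticks32(uptime_seconds))                 # 27552404


def is_expired(now_ticks32, deadline_ticks):
    """Toy check: a TIME_WAIT endpoint is released once now >= deadline."""
    return now_ticks32 >= deadline_ticks


# Toy model: the deadline is computed without truncation, while "now" wraps.
deadline = full_ticks + 120 * TICKS_PER_SECOND
later = ticks32(uptime_seconds + 3600)         # one hour after the deadline was set
print(is_expired(later, deadline))             # False
```

In this toy model, any deadline computed after the overflow point is larger than anything the wrapped counter can reach, which matches the symptom of TIME_WAIT ports never being released.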

