Windows Networking Troubleshooting 3: Network Bugs triggered

Cause

Recently, an interesting bug has emerged. Some users report that their servers have been connecting to Alibaba Cloud services, occupying all ports and causing network interruptions, service connection failures, and making the local Telnet port unreachable.

From the perspective of product design, this situation is generally not possible, but the Netstat results of the users do show that most of the connections are established with Alibaba Cloud servers.

Image for post
Image for post

The servers are all non-Alibaba Cloud instances, so it is not possible to log on the servers directly for troubleshooting. Therefore, a troubleshooting solution is prepared.

Troubleshooting + Analysis

1. The bug causing the local Telnet port to be unreachable is generally caused by the exhaustion of dynamic ports. By default, the range of ports used by systems after Windows Server 2008 is 49152–65536, a total of 16384 ports. In general, they can not be used up. If this situation occurs, it is often accompanied by a port leak.

netsh int ipv4 show dyn tcp
Protocol tcp Dynamic Port Range
Start Port: 49152
Number of Ports: 16384

One way to verify this is to open more dynamic ports through this command: netsh int ipv4 set dynamicport tcp start=10000 num=50000.

The netsh command directly calls the TCPIP method for operating Windows Kernel through the NSI interface to modify the system configuration, so the command takes effect immediately.

2. To confirm whether the port is leaking, it is generally necessary to capture the dump to analyze the cause of the port leak. However, we have actually dealt with a lot of these types of bug, most of which are caused by third-party drivers. Therefore, it is recommended to check whether any third-party software is installed first. If so, try uninstalling the software, and then see if the problem is solved.

In this case, the user reported that no third-party software was installed.

3. Considering that most of the dynamic ports in the system are in the TIME_WAIT status, you can try to shorten the TIME_WAIT release time through the TcpTimedWaitDelay registry.

https://docs.microsoft.com/en-us/biztalk/technical-guides/settings-that-can-be-modified-to-improve-network-performance

Key: HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters
Value: TcpTimedWaitDelay
Data Type: REG_DWORD
Range: 30-300 (decimal)
Default value: 0x78 (120 decimal)
Recommended value: 30
The registry configuration takes effect only after a restart.

4. Users have always had doubts about the IP addresses of Alibaba Cloud servers accessed by their servers. However, most of them are accessed through port 443, so it is hard to prove exactly which application generated the connection by capturing packets. Therefore, the ETL Trace method provided by Windows is required to obtain the calling information of the application to Network APIs, such as TCPIP.

netsh trace start provider={2F07E2EE-15DB-40F1-90EF-9D7BA282188A} keywords=0xffffffffffffffff level=0xff provider={E53C6823-7BB8-44BB-90DC-3F86090D48A6} keywords=0xffffffffffffffff level=0xff provider={7D44233D-3055-4B9C-BA64-0D47CA40A232} keywords=0xffffffffffffffff level=0xff provider={50B3E73C-9370-461D-BB9F-26F32D68887D} keywords=0xffffffffffffffff level=0xff provider={43D1A55C-76D6-4F7E-995C-64C711E5CAFE} keywords=0xffffffffffffffff level=0xff maxSize=500MB fileMode=circular persistent=no overwrite=yes report=yes correlation=yes traceFile=c:NetworkTrace.etl capture=yes packettruncatebytes=128

From the collected logs, the Process ID corresponding to the request is 0x0600 = 1536.

16767 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.978 [Microsoft-Windows-TCPIP]TCP: endpoint 0xFFFFFA80C4CF0CF0 (Family=IPV6 , PID=1536) created with status = STATUS_SUCCESS. 
16768 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]socket: 1: Process 0xFFFFFA80C3FB0860 (0x600), Endpoint 0xFFFFFA80C259FE50, Family 0, Type 0, Protocol 0, Seq 1013, Status STATUS_SUCCESS
16769 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x1, Seq 11002, Status STATUS_SUCCESS
16770 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address ::, Seq 7010, Status STATUS_SUCCESS
16771 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 acquired port number 10010.
16772 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) bound.
16773 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::]:10010, Seq 7022, Status STATUS_SUCCESS
16774 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]connect: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::ffff:120.55.35.39]:443, Seq 5023, Status STATUS_SUCCESS
16775 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosedState to SynSentState , SndNxt = 0.
16776 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 attempted to acquire weak reference on port number 10010 inherited from endpoint 0xFFFFFA80C4CF0CF0. Successful = TRUE .
16777 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) requested to connect.
16779 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 is going to output SYN with ISN = 2438595996, RcvWnd = 8192, RcvWndScale = 8.
16780 [3]0600.061C::? 2018? -? 12? -? 24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect proceeding.
16789 [7]0000.0000::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from SynSentState to EstablishedState , SndNxt = 2438595997.
16790 [7]0000.0000::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1. ThSeq = 2246241665.
16791 [7]0000.0000::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect completed. PID = 1536.
16793 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]connect: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 5024, Status STATUS_SUCCESS
16794 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x0, Seq 11002, Status STATUS_SUCCESS
16795 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.009 [Microsoft-Windows-TCPIP]TCP: Option TCP_OFFLOAD_NOT_PREFERRED is going to be set for connection 0xFFFFFA80C3956B20.
16796 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3047, Status STATUS_SUCCESS
16797 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3056, Status STATUS_SUCCESS
16798 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.019 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 send posted posted 279 bytes at 2438595997.
16807 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3051, Status STATUS_SUCCESS
16815 [6]0000.0000::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3024, Status STATUS_SUCCESS
16818 [7]3ECC. 2FC8::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2C56830, Length 1452, Seq 9000
16819 [7]3ECC. 2FC8::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246241665.
16820 [7]3ECC. 2FC8::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246241665.
16823 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4115, Status STATUS_SUCCESS
16824 [6]0000.0000::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2D36F50, Length 1452, Seq 9000
16825 [3]0600.061C::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4116, Status STATUS_SUCCESS
16826 [6]0000.0000::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246243117.
16827 [6]0000.0000::? 2018? -? 12? -? 24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246243117.
...
17644 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from EstablishedState to FinWait1State , SndNxt = 2438604647.
17650 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2002, Status STATUS_SUCCESS
17651 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) closed.
17652 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 released port number 10010. WeakReference = FALSE .
17653 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2003, Status STATUS_SUCCESS
17654 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2000, Status STATUS_SUCCESS
17655 [4]0600. A53C::? 2018? -? 12? -? 24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2001, Status STATUS_SUCCESS
17664 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from FinWait1State to ClosingState , SndNxt = 2438604648.
17665 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 delivering FIN. RcvNxt = 2246247426.
17666 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.549 [Microsoft-Windows-Winsock-AFD]disconnect indicated: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 12001
17671 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosingState to TimeWaitState , SndNxt = 2438604648.
17673 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) disconnect completed.
17675 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) close issued.
17676 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) shutdown initiated (The transport connection was aborted by the local system.). PID = 1536.
17677 [7]0000.0000::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from TimeWaitState to ClosedState , SndNxt = 2438604648.
17678 [2]0004.0058::? 2018? -? 12? -? 24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C3956B20 released port number 10010. WeakReference = TRUE .

When PID is 1536, the corresponding Java.exe can be found through Process Explorer.

Image for post
Image for post

5. In step 4, we have found that the connection request to Alibaba Cloud server 120.55.35.x is triggered by CloudMonitor.

Considering that the operating system without TcpTimedWaitDelay configured uses 2MSL (120s) TIME_WAIT timeout by default, if nearly 15000 ports of a system are in the TIME_WAIT status, theoretically, CloudMonitor needs to generate at least 125 connections per second. However, this type of behavior wasn’t found in the CloudMonitor log or the previous ETL Trace log. The CloudMonitor program regularly establishes connections every few seconds to submit “metrics”.

41931919752 INFO 2018-12-24 17:57:17.748 [ricGatherServiceHttp] metric submission completed, time consumed: 78 ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]
41931935181 INFO 2018-12-24 17:57:33.177 [ricGatherServiceHttp] metric submission completed, time consumed: 47 ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]
41931950625 INFO 2018-12-24 17:57:48.621 [ricGatherServiceHttp] metric submission completed, time consumed: 47 ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]

6. The bug is quite strange. After all other possibilities are excluded, the only possibility left is that the windows system does not release the TIME_WAIT port after 2MSL, and subsequent troubleshooting confirms this. Unfortunately, we could not find the specific cause from the ETL Trace log, but the user reminds us with an unintentional sentence,

“This server has been running for a long time without any bugs.”

How long has it been running specifically? Through the Eventlog 6013, we confirmed that the server had been running for 43225197s, nearly 500 days. This caught our attention.

All the TCP/IP ports that are in a TIME_WAIT status are not closed after 497 days from system startup in Windows Vista, in Windows 7, in Windows Server 2008 and in Windows Server 2008 R2

https://support.microsoft.com/en-us/help/2553549/all-the-tcp-ip-ports-that-are-in-a-time-wait-status-are-not-closed-aft

So far, the bug has been thoroughly analyzed and solved.

Cause Analysis

Let me briefly explain the cause of this bug. In a Windows system, the time to release TCP endpoints in the TIME_WAIT status is determined by the current system running time plus the TIMEOUT. In a system without patches installed, the maximum value that can be stored in the 32-bit register is 0xFFFFFFFF, and the value stored in the register is “time * 0n100”, that is to say, bugs may occur after the system runs for more than “0XFFFFFFFF/0n100” seconds.

0xFFFFFFFF/(0n86400 * 0n100)= 497.1026961805...

Reference:https://www.alibabacloud.com/blog/windows-networking-troubleshooting-3-network-bugs-triggered_594839?spm=a2c41.12911071.0.0

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store