Windows Networking Troubleshooting 1: Not Receiving Data Packets at NIC

It is often difficult to troubleshoot problems involving sending and receiving packets. This series of articles focuses on network troubleshooting. It summarizes troubleshooting procedures and analysis results, and gradually builds up a complete, in-depth analysis of the NDIS (Network Driver Interface Specification) framework and QEMU Virtio netkvm.

Problem

When files are uploaded over FTP to a Windows Server machine, network exceptions occur during transmission and the file transfer fails. Although the server's network recovers on its own after a while, the problem often recurs during subsequent upload operations.

Collect Basic Information

Although the problem seems very clear, we need to understand more details:

1. The condition that triggers the network exceptions: On another client, use FTP to upload files.

2. Network exception diagnosis:

  • Run ping 127.0.0.1 on this machine; the result is normal. This indicates that the TCP/IP protocol stack is working normally and that the problem lies in the underlying drivers. If this command fails, we generally use netsh.exe to reset TCP/IP and Winsock.
  • This is a classic network host. When the problem occurs, the network interface controller (NIC) for the intranet is working correctly. Specifically, pinging this Windows Server from machines on the same intranet segment succeeds, and pinging the intranet DNS through the intranet NIC also succeeds.
  • When the problem occurs, pinging the gateway configured on the Internet NIC from this machine fails. The second and third checks together indicate that the NIC driver itself is working as expected. The problem may lie in the Internet NIC or its corresponding NDIS Miniport, including the NDIS_MINIPORT_BLOCK structure maintained by NDIS.sys and the _ADAPTER structure maintained by the NIC driver.
  • While we ping the gateway, the output of arp -a shows that the mapping between the gateway's IP address and its MAC address is incomplete. This indicates that sending data packets has triggered an ARP request but that no response has been received.
  • Following up on the ARP result, we check the NIC device and find that its status is normal. However, the NIC properties (Network Connection -> NIC Properties) show that the count of sent packets keeps increasing while the count of received packets does not. This indirectly confirms that the ARP request is sent properly but no response is received.
  • Finally, we disable and then re-enable the NIC, and the problem disappears.

3. Based on the previous network exception diagnosis, we roughly know that this problem has something to do with ARP. However, after adding a static ARP entry (arp.exe -s gw.ipv4.address ee-ff-ff-ff-ff-ff), we find that we still cannot ping the gateway, which means that ARP is not the only cause.
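The reasoning behind these checks can be summarized as a small decision function. This is only a sketch of the logic described above: the field names and result labels are hypothetical, and the rule for the case where both NICs fail is an assumption rather than something observed in this incident.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Results of the manual checks (field names are illustrative)."""
    loopback_ping_ok: bool       # ping 127.0.0.1
    intranet_ping_ok: bool       # ping through the intranet NIC
    gateway_ping_ok: bool        # ping the Internet NIC's gateway
    gateway_arp_complete: bool   # arp -a shows a MAC for the gateway
    rx_counter_increasing: bool  # received-packet count grows


def triage(obs: Observations) -> str:
    """Map the observations to the most likely area to investigate."""
    if not obs.loopback_ping_ok:
        # The stack itself is suspect: reset TCP/IP and Winsock.
        return "tcpip-stack"
    if obs.gateway_ping_ok:
        return "no-fault-observed"
    if not obs.intranet_ping_ok:
        # Assumption: both NICs failing points at the shared driver.
        return "nic-driver"
    if not obs.gateway_arp_complete and not obs.rx_counter_increasing:
        # Requests leave the NIC but nothing comes back in.
        return "internet-nic-receive-path"
    return "inconclusive"


# The situation described in this article.
case = Observations(True, True, False, False, False)
print(triage(case))
```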

Locate the Problem

To locate the problem more accurately, we perform packet capture and analysis on both the Windows machine and its host. Unfortunately, the captured packets were only printed on the screen, and no packet content was saved. Screenshots of packet captures will be provided when similar scenarios come up later.

Here is the result of the packet capture analysis. After the static ARP entry is added, the capture on the VIF interface shows that the ICMP Echo Request generated by ping leaves through the VIF interface and that the ICMP Echo Reply arrives back at it. The reply therefore reaches the host side, but Windows never receives it.

Based on the preceding tests and diagnosis, we conclude that the problem lies in the underlying driver. We can capture a memory dump of the Windows machine directly from the NC and analyze it.

Tips

Windows has had an integrated packet capture capability since Windows Server 2008 R2. This capability is implemented in NDIS.sys and works with the ETW mechanism on Windows to make troubleshooting easier and more convenient. To enable it, simply run the following command:

netsh trace start capture=yes

After the problem recurs, run the following command:

netsh trace stop

Captured log files are written into the temp directory of the current user. After we run the stop command, Windows will print the path to the log file in cmd.exe.

The file that this command captures can be opened with Microsoft Network Monitor 3.4 or Microsoft Message Analyzer. Currently Wireshark cannot recognize this file.

ETW (Event Tracing for Windows) is a good tool to troubleshoot OS component behaviors. Windows provides some ready-to-use scenarios and providers, which we can view by using netsh trace show scenarios and netsh trace show providers. Perhaps I will write a separate article to describe these available scenarios and providers later. Regarding networking, I list some providers related to NDIS, TCP/IP, Afd, and Winsock here. These providers are applicable for deep analysis of system network behaviors in general scenarios. The command is as follows.

netsh trace start provider={2F07E2EE-15DB-40F1-90EF-9D7BA282188A} keywords=0xffffffffffffffff level=0xff provider={E53C6823-7BB8-44BB-90DC-3F86090D48A6} keywords=0xffffffffffffffff level=0xff provider={7D44233D-3055-4B9C-BA64-0D47CA40A232} keywords=0xffffffffffffffff level=0xff provider={50B3E73C-9370-461D-BB9F-26F32D68887D} keywords=0xffffffffffffffff level=0xff provider={43D1A55C-76D6-4F7E-995C-64C711E5CAFE} keywords=0xffffffffffffffff level=0xff maxSize=500MB fileMode=circular persistent=no overwrite=yes report=yes correlation=yes traceFile=c:\NetworkTrace.etl capture=yes packettruncatebytes=128 IPv4.Address=<ipv4.address.for.filtering>
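Because this command is long and easy to mistype, it can help to assemble it from a list of provider GUIDs. The following is a minimal sketch: the helper name and its parameters are hypothetical, the GUIDs and option values are copied from the command above, and the filter address is a documentation placeholder.

```python
# Provider GUIDs copied verbatim from the command above.
PROVIDER_GUIDS = [
    "2F07E2EE-15DB-40F1-90EF-9D7BA282188A",
    "E53C6823-7BB8-44BB-90DC-3F86090D48A6",
    "7D44233D-3055-4B9C-BA64-0D47CA40A232",
    "50B3E73C-9370-461D-BB9F-26F32D68887D",
    "43D1A55C-76D6-4F7E-995C-64C711E5CAFE",
]


def build_trace_command(guids, trace_file=r"c:\NetworkTrace.etl",
                        ipv4_filter=None):
    """Return the netsh trace start command line as a single string."""
    parts = ["netsh", "trace", "start"]
    for guid in guids:
        # Each provider is followed by its own keywords and level.
        parts += [f"provider={{{guid}}}",
                  "keywords=0xffffffffffffffff",
                  "level=0xff"]
    # Global options, with the same values as in the article's command.
    parts += ["maxSize=500MB", "fileMode=circular", "persistent=no",
              "overwrite=yes", "report=yes", "correlation=yes",
              f"traceFile={trace_file}", "capture=yes",
              "packettruncatebytes=128"]
    if ipv4_filter is not None:
        parts.append(f"IPv4.Address={ipv4_filter}")
    return " ".join(parts)


# 192.0.2.10 is a placeholder address used only for illustration.
command = build_trace_command(PROVIDER_GUIDS, ipv4_filter="192.0.2.10")
print(command)
```

The resulting string can be pasted into an elevated cmd.exe window; the sketch only builds the text and does not run netsh itself.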

To learn more about the meaning of the command, see the netsh trace capture help.

Memory Dump Analysis

First, check the status of the NIC Miniport. It shows no exceptions. If a NIC is in an abnormal state, we would typically see general indicators such as a pending OID request or a pending reset. In that case, we can try upgrading the NIC driver.

Next, check the send path, that is, the information about pending send requests. In older versions of Windows, we can look at the reference count on mopen. Each time a request is sent, the tcpip.sys driver increases the reference count of the mopen that binds TCP/IP to the Miniport. After a message is sent, the NIC invokes the callback function in the tcpip.sys driver to release the reference.

In versions later than Windows Server 2008 R2, the reference count on mopen is no longer used for this purpose. The count of send requests is instead recorded in the Provider_Rundown_Protection of tcpip.sys, so that a send request can be issued and completed on different CPUs. The rundown structure keeps a separate count for each CPU, and these per-CPU counts are added together to determine whether all pending NBLs have been sent.
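The per-CPU counting idea can be illustrated with a simplified model. This sketch is not the actual layout of the rundown structure in tcpip.sys; the class and method names are hypothetical and only show why the counts must be summed across CPUs.

```python
class PerCpuRundown:
    """Toy model of a per-CPU pending-request counter."""

    def __init__(self, cpu_count):
        # One signed counter per CPU; a slot may go negative.
        self.counts = [0] * cpu_count

    def acquire(self, cpu):
        """A send request is issued on the given CPU."""
        self.counts[cpu] += 1

    def release(self, cpu):
        """A send completion runs on the given CPU."""
        self.counts[cpu] -= 1

    def pending(self):
        """Only the sum over all CPUs is meaningful."""
        return sum(self.counts)


rundown = PerCpuRundown(cpu_count=2)
rundown.acquire(0)   # two sends issued on CPU 0
rundown.acquire(0)
rundown.release(1)   # one completion handled on CPU 1
print(rundown.counts, rundown.pending())
```

A single slot says little on its own, because a request acquired on one CPU may be released on another. This is why the per-CPU values have to be added up when reading them from a dump.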

References

Next, look for status information about send and receive requests in other counters. We locate the statistics kept by Virtio netkvm and confirm that both sending and receiving look normal.

After further checking the data structure, we find that the NetReceiveBuffers list is empty and that NetNofReceiveBuffers is 0. This can happen when the NIC driver has no buffers available, which interrupts packet reception. The netkvm driver offers different options for this situation. For example, we can have the NIC disabled to prevent packet reception when the buffer is full.

Next, check the ParaNdis_ProcessRxPath function and the virtqueue_get_buf function to confirm that the ring buffer is full.

The Virtio netkvm code shows that the buffers on the NetReceiveBuffersWaiting list (a LIST_ENTRY) are the ones currently held by the Windows NDIS framework driver. When NDIS has finished with a buffer, it invokes the ReturnPacketHandler that the netkvm driver has registered, ParaNdis5_ReturnPacket, which releases the buffer and puts it back on NetReceiveBuffers.

At this point, the central issue is why NDIS does not invoke the callback function. Windows mainly depends on the reference count of the NET_BUFFER_LIST data structure to recycle buffers that belong to the NIC. Each time a component uses the buffer, its reference count is increased by 1. When that component has finished with the buffer, it invokes Dereference to release its reference. The callback function is invoked only when the reference count of the buffer drops to 0.
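The interaction between the two lists and the reference count can be modeled in a few lines. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, and the real netkvm and NDIS code paths are considerably more involved.

```python
from collections import deque


class RxBuffer:
    def __init__(self, ident):
        self.ident = ident
        self.refcount = 0


class RxPool:
    """Toy model of a receive buffer pool with a waiting list."""

    def __init__(self, count):
        # Plays the role of NetReceiveBuffers.
        self.free = deque(RxBuffer(i) for i in range(count))
        # Plays the role of NetReceiveBuffersWaiting.
        self.waiting = []

    def indicate(self):
        """Hand a free buffer to the upper layers, or None if empty."""
        if not self.free:
            return None  # no buffers left: reception stalls
        buf = self.free.popleft()
        buf.refcount = 1
        self.waiting.append(buf)
        return buf

    def reference(self, buf):
        """Another component takes a reference on the buffer."""
        buf.refcount += 1

    def dereference(self, buf):
        """Drop a reference; return the buffer once the count is 0."""
        buf.refcount -= 1
        if buf.refcount == 0:
            self._return_packet(buf)

    def _return_packet(self, buf):
        # Stands in for the driver's return-packet callback.
        self.waiting.remove(buf)
        self.free.append(buf)


pool = RxPool(count=2)
first = pool.indicate()
second = pool.indicate()
stalled = pool.indicate()   # pool exhausted, nothing to hand out
pool.reference(first)       # an extra holder keeps the buffer alive
pool.dereference(first)     # count is still 1, so nothing is returned
print(stalled, len(pool.free), len(pool.waiting))
```

As long as a single holder never drops its reference, the buffer stays on the waiting list. If that happens for every buffer, the free list remains empty, which matches the state seen in the dump.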

We can troubleshoot this problem by enumerating all unreleased buffers and printing the network packet structure. For example:

!list "-t _LIST_ENTRY.Flink -e -x \"dt netkvm!IONetDescriptor @$extret; dt ndis!_NDIS_PACKET poi(@$extret+0x40) Private.; dt _MDL poi(poi(@$extret+0x40)+8); db poi(poi(poi(@$extret+0x40)+8)+0x18) L0x50\" 0xfffffadf`37fa26d0"

Note: 0x0bda = 3034 is the data port used in the FTP Pasv mode.
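The conversion in the note can be checked with a short sketch. The TCP header stores the source and destination ports as two big-endian 16-bit values, and an FTP PASV reply encodes the data port as two decimal bytes (p1, p2) with port = p1 * 256 + p2. The function names and the sample source port below are hypothetical.

```python
import struct


def tcp_ports(tcp_header: bytes):
    """Return (source_port, destination_port) from a raw TCP header."""
    # Both ports are unsigned 16-bit integers in network byte order.
    return struct.unpack("!HH", tcp_header[:4])


def pasv_port(p1: int, p2: int) -> int:
    """Decode the data port from the last two numbers of a PASV reply."""
    return p1 * 256 + p2


# First four bytes of a hypothetical TCP header: source port 0xc001,
# destination port 0x0bda.
src, dst = tcp_ports(bytes.fromhex("c0010bda"))
print(src, dst, pasv_port(0x0b, 0xda))
```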

In this case, the Serv-U FTP server does not process the packets it has received, so the receive buffers fill up. This problem can be solved by using the built-in IIS FTP service in Windows instead.

Conclusion

Based on the previous analysis, the possible causes and the corresponding remedies are generally as follows:

1. The NDIS driver on Windows itself is not properly releasing the buffer. We recommend that you install the latest ndis.sys patch.

2. Other third-party drivers hold incorrect references to the buffer, so the reference count never drops to 0. We recommend that you uninstall third-party drivers and keep the operating system clean.

3. Messages are not processed.

  • Possible cause 1: After packets are received, they need to be indicated to the upper-layer application through the messaging mechanism. Slow indication or a deadlock may cause problems.
  • Possible cause 2: The application is blocked on a critical section, which prevents it from performing the recv operation. Any issue in tcpip.sys, afd.sys, Winsock, or the application itself can cause network problems.

It is generally recommended to upgrade the tcpip.sys, afd.sys, and winsock components, and replace the current application with other software.
