Windows Networking Troubleshooting 7: Network Connectivity Debugging (TrackNblOwner Principle)

This article walks through debugging a problem with sending and receiving network data packets.

Background

Recently, a Windows Server 2012 R2 machine intermittently lost network connectivity after a user program was started.

On the surface, the server could not be pinged. A packet capture showed that the ICMP Echo Request was actually received and that the server returned an ICMP Echo Reply at first, but after a while no further replies appeared. At first glance, this looked like an operating system problem. There was a second symptom as well: during troubleshooting we tried disabling and re-enabling the network interface controller (NIC), and after the NIC was disabled the system never responded to the "enable" operation.

Analysis

In our experience, when disabling and re-enabling a NIC fails, the most likely cause is that packets handed down for sending have never been completed, so the system's PnP manager keeps waiting for the reference count on the NIC to be released. If a system dump is captured at this point, it will generally show a call stack like the following.

[Figure: call stack from the system dump]

With the private symbols for tcpip.sys, it is easy to check whether any packets are still pending by inspecting the rundown lock of tcpip.
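The reasoning above can be illustrated with a toy model, written in Python for brevity. It is only a sketch of the reference-counting idea, not the actual tcpip.sys or NDIS implementation; the class and method names (RundownRef, acquire, release, begin_rundown) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RundownRef:
    """Toy model of a reference count that a disable operation must drain."""
    outstanding: int = 0          # packets sent but not yet completed
    rundown_started: bool = False

    def acquire(self) -> bool:
        # New sends are refused once teardown has begun.
        if self.rundown_started:
            return False
        self.outstanding += 1
        return True

    def release(self) -> None:
        # Called from the send-complete path.
        assert self.outstanding > 0
        self.outstanding -= 1

    def begin_rundown(self) -> bool:
        """Start teardown; return True only if nothing is outstanding."""
        self.rundown_started = True
        return self.outstanding == 0


nic = RundownRef()
nic.acquire()                      # packet 1 handed to the NIC
nic.acquire()                      # packet 2 handed to the NIC
nic.release()                      # packet 1 completed; packet 2 never completes
disable_finished = nic.begin_rundown()
print("disable finished:", disable_finished)   # disable finished: False
```

As long as one send is never completed, the count never reaches zero, which matches the symptom of a disable that never finishes and an enable that is never answered.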

[Figure: normal data packet transmission and invocation of the completion routine]

Even once we know that the system has pending packets, how do we find out what is holding them? This is where NDIS TrackNblOwner comes in; it is available in Windows versions later than Windows 7.

TrackNblOwner Principle

There isn't much about TrackNblOwner online, but the following three Microsoft blog posts give an introduction:
https://blogs.msdn.microsoft.com/ndis/2013/12/27/ndiskd-nbl-log/
https://blogs.msdn.microsoft.com/ndis/2013/12/20/ndiskd-pendingnbls/
https://blogs.msdn.microsoft.com/ntdebugging/2013/03/29/debugging-a-network-connectivity-issue-tracknblowner-to-the-rescue/

These posts explain how to use TrackNblOwner, but they don't cover its principles in depth. Here is a brief introduction to how it works:

TrackNblOwner is enabled through the registry value TrackNblOwner under HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters, which is read when the NDIS driver is loaded.

On systems configured for kernel debugging, it can also be enabled and disabled directly after attaching with WinDbg.

Therefore, by setting a read/write access breakpoint on TrackNblOwner, we can see exactly how the operating system behaves once TrackNblOwner takes effect.

The disassembly makes the behavior easy to follow. Every NDIS API involved in NBL handling checks whether TrackNblOwner is enabled. If it is, then before the corresponding driver processes the NBL, a pointer to the current Filter/Protocol/Miniport driver is stored in the NblCurrentOwner entry of the NBL's NetBufferListInfo.
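The stamping behavior described above can be sketched as a toy model in Python. This is not NDIS code: the Nbl class, the hand_to_driver function, and the tracking parameter are hypothetical stand-ins for the NET_BUFFER_LIST, the NDIS entry points, and the TrackNblOwner setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Nbl:
    """Toy NET_BUFFER_LIST: a payload plus one 'current owner' slot."""
    payload: str
    current_owner: Optional[str] = None


def hand_to_driver(nbl: Nbl, driver: str, tracking: bool = True) -> None:
    # With tracking on, record the owner before the driver sees the NBL.
    if tracking:
        nbl.current_owner = driver
    # ... the driver would process the NBL here ...


send_path = ("protocol", "filter", "miniport")   # top to bottom

tracked = Nbl("ICMP Echo Reply")
for layer in send_path:
    hand_to_driver(tracked, layer)

untracked = Nbl("ICMP Echo Reply")
for layer in send_path:
    hand_to_driver(untracked, layer, tracking=False)

print(tracked.current_owner)     # miniport
print(untracked.current_owner)   # None
```

If the NBL stalls at some layer, the slot still names the last driver it was handed to, which is what makes a stuck packet attributable to a specific driver.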

When troubleshooting packet send and receive problems between the front end and the back end, we can enable TrackNblOwner. The pending NBL search in ndiskd then makes it easy to find the Net Buffer Lists still waiting to be processed, read from each NBL the name of the driver that has not finished with it, and pinpoint the driver at fault.

TrackNblOwner itself amounts to a pointer write and adds no memory overhead. A few extra instructions are executed to store the pointer, but on most systems this cost is negligible.

In this case the problem reproduced reliably. Analyzing a dump captured after enabling TrackNblOwner confirmed that the data was waiting on the NIC driver: the NIC never received the completion message from virtio between the front end and the back end, so it never issued a completion itself, and the network appeared to be down.

Examining each pending NBL shows ICMP Echo Replies and ARP Requests waiting to be sent. This also explains what the earlier packet capture showed. While the ARP cache entry was still valid, the system could generate ICMP Echo Replies normally. Once the entry expired and the ARP Request went unanswered, the ICMP Echo Reply could no longer be sent, because the destination MAC address could not be filled in when building the reply.
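The timeline can be replayed with a small Python simulation. It is a sketch under stated assumptions: the 30-second ARP lifetime is an illustrative value rather than the Windows default, the capture point is assumed to sit above the NIC driver, and the simulate function is hypothetical.

```python
ARP_TTL = 30  # assumed ARP cache lifetime in seconds (illustrative value)


def simulate(ping_times, arp_learned_at=0):
    """Replay pings against a NIC that never completes a send."""
    pending = []    # NBLs handed to the NIC and never completed
    captured = []   # what a capture above the NIC would show
    for t in ping_times:
        if t - arp_learned_at < ARP_TTL:
            # Destination MAC is cached, so the reply can be built.
            captured.append((t, "ICMP Echo Reply"))
            pending.append("ICMP Echo Reply")
        else:
            # Entry expired: resolve first; the reply cannot be built yet.
            pending.append("ARP Request")
    return captured, pending


captured, pending = simulate([10, 20, 40, 50])
print(captured)  # replies appear only for t=10 and t=20
print(pending)   # both Echo Replies and ARP Requests are stuck at the NIC
```

The capture shows replies only while the cache entry is valid, while the pending list ends up holding both packet types, mirroring what the dump revealed.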

Original Source

https://www.alibabacloud.com/blog/windows-networking-troubleshooting-7-network-connectivity-debugging-tracknblowner-principle_594997?spm=a2c41.13103575.0.0
