Windows Networking Troubleshooting 2: Unable to Stop Apache

This is the second article in the Windows Networking troubleshooting series. The previous article described the Windows NDIS architecture and how to troubleshoot network problems under it. In this article, we use a real case to look at how Windows TCP/IP is implemented on top of NDIS.

Problem

A service was deployed with Apache on Windows Server 2008 R2 SP1. When we tried to stop the Apache service, it stayed in the pending (stopping) state. Nothing we tried cleared this state until we restarted the machine.

Recovery Plan

We had not previously heard of any known issue that prevents the Apache service from stopping, so at first the problem seemed to lie in the application itself. Even so, we offered the following suggestions, based on experience, just in case:

1. Uninstall unnecessary third-party software, especially security software that installs a filter driver or a WFP callout driver.
2. Disable the advanced features of the network interface controller (NIC), especially TCP Chimney and RSS. Reference: https://blogs.technet.microsoft.com/onthewire/2014/01/21/tcp-offloadingchimney-rsswhat-is-it-and-should-i-disable-it/
3. Check the Windows patch version and install the latest patch.
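
As a hedged illustration of step 2, the sketch below builds the `netsh int tcp set global` commands that are commonly used to turn off TCP Chimney offload and RSS at the operating-system level. The helper names (`build_commands`, `apply`) and the dry-run default are assumptions made for this example and are not part of the original case. The commands need an elevated prompt, and per-adapter offload options in the NIC driver properties are configured separately.

```python
import subprocess

# Global TCP settings to turn off; the keys follow `netsh int tcp set global`.
SETTINGS = {"chimney": "disabled", "rss": "disabled"}


def build_commands(settings=SETTINGS):
    """Return one netsh argument list per setting, plus a final 'show' command."""
    commands = [
        ["netsh", "int", "tcp", "set", "global", f"{name}={value}"]
        for name, value in settings.items()
    ]
    # Show the resulting global TCP parameters so the change can be verified.
    commands.append(["netsh", "int", "tcp", "show", "global"])
    return commands


def apply(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply(build_commands())  # dry run: prints the commands without executing them
```

Keeping the dry run as the default makes it easy to review the exact commands before they change anything on a production server.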

After we tried these steps, the problem persisted, and we confirmed that the system was already on the latest patch. With all our general methods exhausted, we had to reproduce the problem and capture a memory dump for further analysis.

Memory Dump Analysis

From the dump file, we can clearly see that the httpd.exe process of the Apache service does not exit because the Afd.sys driver is still waiting for a completion signal.

Because the AFD resource cannot be released, the application keeps waiting. Even if we kill the application, a zombie process remains, so the only way out is to restart the machine.

We know that Windows AFD resources are tightly coupled to TCP resources: tcpip.sys invokes the afd.sys callback routine, which releases the AFD resources and triggers the signal, only after the corresponding TCP resource has been released.
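
The dependency described here can be sketched as a toy model. This is only an illustration under stated assumptions: `TcpEndpoint` and `AfdSocket` are hypothetical Python classes, not the real tcpip.sys or afd.sys structures, and a `threading.Event` plays the role of the completion signal. The point is that the upper layer can only finish closing once the lower layer actually releases its resource and calls back.

```python
import threading


class TcpEndpoint:
    """Toy lower-layer resource; stands in for the TCP object owned by tcpip.sys."""

    def __init__(self):
        self._release_callback = None

    def register_release_callback(self, callback):
        self._release_callback = callback

    def release(self):
        # Called only when the TCP resource is actually freed.
        if self._release_callback is not None:
            self._release_callback()


class AfdSocket:
    """Toy upper-layer socket; stands in for the AFD side the process waits on."""

    def __init__(self, endpoint):
        self._closed = threading.Event()  # the completion signal
        endpoint.register_release_callback(self._closed.set)

    def wait_for_close(self, timeout):
        """Return True once the signal fires, False if still pending after timeout."""
        return self._closed.wait(timeout)


if __name__ == "__main__":
    healthy = TcpEndpoint()
    sock = AfdSocket(healthy)
    healthy.release()
    print(sock.wait_for_close(0.1))  # True: the lower layer released its resource

    stuck = TcpEndpoint()
    sock = AfdSocket(stuck)
    print(sock.wait_for_close(0.1))  # False: no release happened, close stays pending
```

In the second scenario nothing ever calls `release()`, which mirrors the hang in this case: the waiter is healthy, but the event it depends on never arrives.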

Therefore, the more important question is why the TCP resource hasn’t been released. To find out, we check the reference count on the TCP resource directly.

Windows manages resources as reference-counted objects, and a TCP port is one such object. Before operating on an object, the system calls AddReference so that the object cannot be released while it is still in use, which would cause a memory access violation. When the operation finishes, DeReference is invoked to decrement the reference count. Once the reference count of an object reaches 0, the corresponding routine releases that object. For a TCP listening port, the routine involved is tcpip!TcpDereferenceListener.
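
The reference-counting pattern can be illustrated with a minimal, single-threaded sketch. The `RefCounted` class, its `in_use` helper, and the `listener:80` name are hypothetical and exist only for this example; the real kernel objects use interlocked counters and their own release routines.

```python
from contextlib import contextmanager


class RefCounted:
    """Toy reference-counted object; it is released when the count reaches 0."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 1      # reference held by the creator
        self.released = False

    def add_reference(self):
        if self.released:
            raise RuntimeError(f"{self.name} was already released")
        self.ref_count += 1

    def dereference(self):
        if self.ref_count <= 0:
            raise RuntimeError(f"{self.name} has no references left to drop")
        self.ref_count -= 1
        if self.ref_count == 0:
            self.released = True  # stands in for the object's release routine

    @contextmanager
    def in_use(self):
        """Hold a reference for the duration of an operation, then drop it."""
        self.add_reference()
        try:
            yield self
        finally:
            self.dereference()


if __name__ == "__main__":
    listener = RefCounted("listener:80")
    with listener.in_use():          # balanced: taken and dropped again
        print(listener.ref_count)    # 2 while the operation is running
    listener.add_reference()         # leaked: never dropped
    listener.dereference()           # the creator closes the listener
    print(listener.ref_count, listener.released)  # 1 False: never released
```

A balanced add/drop pair leaves the count unchanged, while a single leaked reference is enough to keep the object alive after its creator has let go.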

In this case, the TCP resource corresponding to port 80 holds 0x36 references. Apart from the one taken by TcpCreateListener, the other 0x35 references may be reference leaks, or references that other drivers or software have added while operating on this structure. For example, netstat takes a reference on a port while it enumerates port information.

Since we have confirmed that no network-related third-party software is installed on the system, we can reasonably conclude that the problem is caused by a TCP resource leak in the operating system itself. At this point, the usual next step is to open a case with Microsoft to analyze the operating system problem further.

Postscript

Just as we were about to open a case with Microsoft, we happened to find that the latest patch (KB4338818, released on July 10) may cause w3svc to hang. Although w3svc is not related to Apache, the underlying problem is essentially the same.

https://support.microsoft.com/en-us/help/4338818/windows-7-update-kb4338818

Microsoft later released update KB4345459 to fix that problem.

https://support.microsoft.com/en-us/help/4345459/stop-error-0xd1-after-a-race-condition-occurs-in-windows-7-service-pac

In our case, the problem was solved after this update was applied.

Reference: https://www.alibabacloud.com/blog/windows-networking-troubleshooting-2-unable-to-stop-apache_594838?spm=a2c41.12911133.0.0
