Analysis of the UDP packet loss problem in Linux systems

Alibaba Cloud
Jan 21, 2022


While working on a server application that was losing UDP packets, I reviewed a lot of material and summarized it in this article for others to refer to.

Before we get started, we'll use a diagram to explain how a Linux system receives network messages.

  1. First, the network message arrives at the NIC over the physical cable.
  2. The NIC and its driver place the messages into the ring buffer using DMA (Direct Memory Access), which does not require CPU involvement.
  3. The kernel reads the message from the ring buffer, runs the IP and TCP/UDP layer logic, and finally puts the message into the application's socket buffer.
  4. The application reads packets from the socket buffer and processes them.

While a UDP message is being received, any stage in the diagram may drop it, either actively or passively, so packet loss can occur in the NIC and driver as well as in the operating system and the application.

The send path is not analyzed here because it is similar to the receive path, only in the opposite direction, and messages are less likely to be lost when sending than when receiving: loss on send occurs only when the application sends messages faster than the kernel and the NIC can process them.

This article assumes that the machine has only one interface, named eth0. If there are multiple interfaces, or the interface is not named eth0, adapt the analysis to your actual setup.

Note: In this article, RX (receive) refers to received messages and TX (transmit) refers to sent messages.

Confirm that UDP packet loss has occurred

To check whether the NIC has dropped packets, run ethtool -S eth0 and look for fields in the output whose names contain bad or drop. Under normal circumstances these counters should be 0; if you see them growing, the NIC is dropping packets.

Another command for viewing packet drop data is ifconfig, whose output includes statistics for RX (received messages) and TX (sent messages):

~# ifconfig eth0
...
RX packets 3553389376 bytes 2599862532475 (2.3 TiB)
RX errors 0 dropped 1353 overruns 0 frame 0
TX packets 3479495131 bytes 3205366800850 (2.9 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
...

In addition, Linux keeps packet drop statistics for each network protocol, which you can view with netstat -s; adding --udp (or -u) limits the output to UDP-related statistics:

~# netstat -s -u
IcmpMsg:
InType0: 3
InType3: 1719356
InType8: 13
InType11: 59
OutType0: 13
OutType3: 1737641
OutType8: 10
OutType11: 263
Udp:
    517488890 packets received
    2487375 packets to unknown port received.
    47533568 packet receive errors
    147264581 packets sent
    12851135 receive buffer errors
    0 send buffer errors
UdpLite:
IpExt:
OutMcastPkts: 696
InBcastPkts: 2373968
InOctets: 4954097451540
OutOctets: 5538322535160
OutMcastOctets: 79632
InBcastOctets: 934783053
InNoECTPkts: 5584838675

In the output above, the following fields indicate UDP packet loss:

  • packet receive errors: if this is non-zero and keeps growing, the system is dropping UDP packets
  • packets to unknown port received: UDP messages arrived for a destination port that no application is listening on. This usually means the service is not running, and it generally does not cause serious problems
  • receive buffer errors: the number of packets dropped because the UDP receive buffer is too small

Note: A non-zero drop count is not necessarily a problem. For UDP, a small number of drops can be expected behavior, for example a packet loss rate (lost packets / packets received) of one in 10,000 or even lower.
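
As a worked example, treating packet receive errors in the netstat output above as the number of lost packets, the loss rate on that machine is far above the one-in-10,000 level:

```latex
\text{loss rate} = \frac{\text{packet receive errors}}{\text{packets received}}
  = \frac{47\,533\,568}{517\,488\,890} \approx 9.19\%
  \gg \frac{1}{10\,000} = 0.01\%
```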

Network card or driver packet loss

As mentioned above, if the rx_***_errors counters in the ethtool -S eth0 output show errors, the NIC itself is causing the system to drop packets, and you need to contact the server or NIC vendor to resolve it.

# ethtool -S eth0 | grep rx_ | grep errors
rx_crc_errors: 0
rx_missed_errors: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
rx_errors: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0

netstat -i also shows the traffic and packet drop statistics of each interface; normally the error and drop columns should be 0.

If the hardware and driver are not at fault, NIC drops are usually caused by a ring buffer that is too small. You can view and set the NIC ring buffer with the ethtool command.

ethtool -g shows the ring buffer sizes of a NIC, as in the following example:

# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256

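Pre-set maximums are the largest ring buffer sizes the NIC supports, and Current hardware settings are the sizes in use. You can change them with ethtool -G, for example ethtool -G eth0 rx 4096; the new value cannot exceed the pre-set maximum.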

Linux System Packet Loss

There are many reasons a Linux system may drop packets. The common ones are UDP message errors, the firewall, an insufficient UDP buffer size, and excessive system load. Each of them is analyzed below.

UDP Message Error

If a UDP message is modified in transit, it can fail the checksum or length check. Linux verifies both when it receives a UDP message and discards the message as soon as an error is detected.

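If you want UDP messages with a bad checksum to still be delivered to the application, you can try disabling UDP checksum verification with a socket option:

int disable = 1;
setsockopt(sock_fd, SOL_SOCKET, SO_NO_CHECK, (void*)&disable, sizeof(disable));

Note that on newer kernels (3.16 and later) SO_NO_CHECK appears to control only the checksum of outgoing packets, so check whether it has the intended effect on your kernel before relying on it.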

Firewall

When the system firewall is dropping packets, the typical symptom is that no UDP packets are received at all, although the firewall may also drop only some of them.

If you are experiencing a very large drop rate, check your firewall rules to ensure that the firewall does not actively drop UDP packets.

UDP buffer size is insufficient

After receiving a message, Linux stores it in a buffer. Because the buffer size is limited, if UDP messages are too large (exceeding the buffer size or the MTU) or arrive too fast, Linux may drop packets outright because the buffer is full.

At the system level, Linux limits how large the receive and send buffers can be configured. The limits and defaults can be viewed in the following files; Linux sets their initial values at startup.

  • /proc/sys/net/core/rmem_max: maximum receive buffer size that can be set
  • /proc/sys/net/core/rmem_default: default receive buffer size
  • /proc/sys/net/core/wmem_max: maximum send buffer size that can be set
  • /proc/sys/net/core/wmem_default: default send buffer size

However, these initial values are not designed for high-volume UDP traffic. If the application receives or sends a very large number of UDP packets, these values need to be increased. The sysctl command applies a change immediately:

sysctl -w net.core.rmem_max=26214400 # Set to 25M

To keep the setting after a reboot, also set the corresponding parameter in /etc/sysctl.conf, for example net.core.rmem_max = 26214400.

If messages are too large, split the data on the sending side so that each message fits within the MTU.

Another configurable parameter is netdev_max_backlog, the number of packets the Linux kernel can queue after reading them from the NIC driver. The default is 1000, and it can be raised to a value such as 2000:

sudo sysctl -w net.core.netdev_max_backlog=2000

System load is too high

High CPU, memory, or I/O load can cause network drops. If the CPU load is too high, the system has no time to compute message checksums, copy memory, and so on, so packets are dropped at the NIC or the socket buffer. If the memory load is too high, the application processes packets too slowly. If the I/O load is too high, the CPU is busy handling I/O waits and has no time to process the UDP packets in the buffer.

A Linux system is made of interconnected components, and a problem in any one of them can affect the normal operation of the others. Excessive load is caused either by a faulty application or by insufficient system resources. The former should be found, debugged, and fixed promptly; for the latter, the shortage should be detected and capacity expanded in time.

Application packet loss

The system UDP buffer sizes discussed above, adjusted with sysctl, are only the maximum values the system allows. Each application still needs to set its own socket buffer size when it creates the socket.

Linux places received messages into the socket buffer, and the application continuously reads messages from it. Two application-related factors therefore determine whether packets are dropped: the size of the socket buffer and how fast the application reads messages.

For the first factor, set the size of the socket receive buffer when the application initializes the socket. For example, the following code sets the socket receive buffer to 20 MB:

int receive_buf_size = 20 * 1024 * 1024; // 20 MB; SO_RCVBUF takes an int, not a 64-bit value
setsockopt(socket_fd, SOL_SOCKET, SO_RCVBUF, &receive_buf_size, sizeof(receive_buf_size));

If you do not write or maintain the program, you may not be able to modify its code. Many applications provide a configuration parameter to adjust this value; refer to the application's official documentation. If no such parameter is available, your only option is to raise an issue with the program's developers.

Obviously, increasing the receive buffer for your app will reduce the likelihood of packet loss, but will also cause your app to use more memory, so use caution.

The second factor is how fast the application reads messages from the buffer. The application should process messages asynchronously, so that as little work as possible happens between two consecutive reads.

Where are packets dropped?

To find out in which kernel function Linux drops packets, use the dropwatch tool, which monitors the system for packet drops and prints the address of the function where each drop occurred:

# dropwatch -l kas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at tcp_v4_do_rcv+cd (0xffffffff81799bad)
10 drops at tcp_v4_rcv+80 (0xffffffff8179a620)
1 drops at sk_stream_kill_queues+57 (0xffffffff81729ca7)
4 drops at unix_release_sock+20e (0xffffffff817dc94e)
1 drops at igmp_rcv+e1 (0xffffffff817b4c41)
1 drops at igmp_rcv+e1 (0xffffffff817b4c41)

With this information, you can look up the corresponding kernel code to see at which step the kernel dropped the message and the likely reason for the loss.

You can also use the Linux perf tool to trace the skb:kfree_skb event; the kernel calls kfree_skb when a network message is discarded:

sudo perf record -g -a -e skb:kfree_skb
sudo perf script

Many articles online explain how to use perf and how to interpret its output.

Summary

  • UDP is a connectionless, unreliable protocol. It suits scenarios where occasional message loss does not affect the program's state, such as video, audio, games, and monitoring. Applications that need higher message reliability should not use UDP; use TCP directly instead. Alternatively, the application layer can implement retries to ensure reliability
  • If the server drops packets, first check your monitoring to see whether the system load is too high. Try to reduce the load, then check whether the packet loss disappears
  • If the system load is too high, tuning UDP alone will not effectively solve the packet loss. If an abnormal application is driving up CPU, memory, or I/O usage, locate it and fix it promptly; if resources are insufficient, monitoring should detect this so that capacity can be expanded quickly
  • For systems that receive or send a large volume of UDP packets, you can reduce the probability of packet loss by tuning the system and application socket buffer sizes
  • The application should process UDP packets asynchronously and avoid doing too much processing between two consecutive receives

