Understanding CPU Interrupts in Linux


In many cases where overall CPU usage is low but performance is still poor, the cause is often that the single CPU core assigned to handle interrupts is fully occupied. In this blog, we will explore interrupt settings in detail and walk through several case studies.

What Is an Interrupt?

In kernel 2.4 and later, Linux supports assigning specific interrupts to specified processors (or processor groups). This is called SMP IRQ affinity, and it controls how the system responds to various hardware events. With it, you can limit or redistribute the server workload so that the server works more efficiently.

Let’s take NIC interrupts as an example. Without the SMP IRQ affinity, all NIC interrupts are associated with CPU 0. As a result, CPU 0 is overloaded and cannot efficiently process network packets, causing a bottleneck in performance.

After you configure the SMP IRQ affinity, multiple NIC interrupts are allocated to multiple CPUs to distribute the CPU workload and speed up data processing. The SMP IRQ affinity requires NICs to support multiple queues. A NIC supporting multiple queues has multiple interrupt numbers, which can be evenly allocated to different CPUs.

You can simulate multi-queue behavior for a single-queue NIC in software by means of receive packet steering (RPS) or receive flow steering (RFS), but the effect is poorer than that of a true multi-queue NIC with RPS/RFS enabled.

What Are RPS and RFS?

Figure 1 Case with only RPS enabled (source: the Internet)

RPS simply distributes data packets to different CPUs. When the application runs on one CPU while the soft interrupt for its packets is handled on another, CPU cache utilization is greatly degraded. RFS addresses this by steering each flow to the CPU where its application is running, so the same CPU runs the application and handles the soft interrupt, fully utilizing the CPU caches. RPS and RFS are usually used together to achieve the best results. They are mainly intended for single-queue NICs in a multi-CPU environment.

Figure 2 Case with both RPS and RFS enabled (source: the Internet)

The values of rps_flow_cnt and rps_sock_flow_entries are rounded up to the nearest power of 2. For a single-queue device, rps_flow_cnt is equal to rps_sock_flow_entries.

Receive flow steering (RFS), together with RPS, inserts data packets into the backlog queue of a specified CPU, and wakes up the CPU for execution.

IRQbalance is a service that automatically distributes interrupts across CPUs and is applicable to most scenarios. However, in scenarios requiring high network performance, we recommend binding interrupts manually.

IRQbalance can cause some issues during operation:

(a) The calculated value is sometimes inappropriate, failing to achieve load balancing among CPUs.

(b) When the system is idle and IRQs are in power-save mode, IRQbalance distributes all interrupts to the first CPU, to make other idle CPUs sleep and reduce energy consumption. When the load suddenly rises, performance may be degraded due to the lag in adjustment.

(c) The CPU specified to handle interrupts changes frequently, resulting in more context switches.

(d) IRQbalance is enabled but does not take effect, that is, does not specify a CPU for handling interrupts.

Viewing the Number of NIC Queues

1. View the number of queues supported by the NIC:

# ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4
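
If the current number of combined channels is lower than the pre-set maximum, it can usually be raised with ethtool -L (shown here as a sketch, assuming eth0 and a driver that permits the change):

# ethtool -L eth0 combined 4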

2. Let’s take an ECS instance running CentOS 7.6 as an example. Interrupts handled by the system are recorded in the /proc/interrupts file. This file is large by default and hard to read, and the more CPU cores there are, the harder it becomes to read.

# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 141 0 0 0 IO-APIC-edge timer
1: 10 0 0 0 IO-APIC-edge i8042
4: 807 0 0 0 IO-APIC-edge serial
6: 3 0 0 0 IO-APIC-edge floppy
8: 0 0 0 0 IO-APIC-edge rtc0
9: 0 0 0 0 IO-APIC-fasteoi acpi
10: 0 0 0 0 IO-APIC-fasteoi virtio3
11: 22 0 0 0 IO-APIC-fasteoi uhci_hcd:usb1
12: 15 0 0 0 IO-APIC-edge i8042
14: 0 0 0 0 IO-APIC-edge ata_piix
15: 0 0 0 0 IO-APIC-edge ata_piix
24: 0 0 0 0 PCI-MSI-edge virtio1-config
25: 4522 0 0 4911 PCI-MSI-edge virtio1-req.0
26: 0 0 0 0 PCI-MSI-edge virtio2-config
27: 1913 0 0 0 PCI-MSI-edge virtio2-input.0
28: 3 834 0 0 PCI-MSI-edge virtio2-output.0
29: 2 0 1557 0 PCI-MSI-edge virtio2-input.1
30: 2 0 0 187 PCI-MSI-edge virtio2-output.1
31: 0 0 0 0 PCI-MSI-edge virtio0-config
32: 1960 0 0 0 PCI-MSI-edge virtio2-input.2
33: 2 798 0 0 PCI-MSI-edge virtio2-output.2
34: 30 0 0 0 PCI-MSI-edge virtio0-virtqueues
35: 3 0 272 0 PCI-MSI-edge virtio2-input.3
36: 2 0 0 106 PCI-MSI-edge virtio2-output.3
......
PIW: 0 0 0 0 Posted-interrupt wakeup event

virtio2-input.0 indicates the network interrupt for the first queue, which is handled by CPU 0 in this output. If the Alibaba Cloud ECS NIC has multiple queues, interrupts input.1, input.2, and input.3 also appear.

3. If the Alibaba Cloud ECS instance has many CPU cores, the file is hard to read at a glance. You can run the following command to find which cores are assigned to handle the NIC interrupts on the instance.

In the following example, four queues are configured on eight CPUs.

# for i in $(egrep "\-input."  /proc/interrupts |awk -F ":" '{print $1}');do cat /proc/irq/$i/smp_affinity_list;done
5
7
1
3

Convert the list into a comma-separated form so that it can be copied into sar:

# for i in $(egrep "\-input."  /proc/interrupts |awk -F ":" '{print $1}');do cat /proc/irq/$i/smp_affinity_list;done |tr -s '\n' ','
5,7,1,3,
# sar -P 5,7,1,3 1    # refresh the usage of CPUs 5, 7, 1, and 3 every second
# sar -P ALL 1        # refresh the usage of all CPU cores every second; when only a few cores handle interrupts, this helps determine whether slow processing is caused by an insufficient number of queues
Linux 3.10.0-957.5.1.el7.x86_64 (iZwz98aynkjcxvtra0f375Z) 05/26/2020 _x86_64_ (4 CPU)
05:10:06 PM CPU %user %nice %system %iowait %steal %idle
05:10:07 PM all 5.63 0.00 3.58 1.02 0.00 89.77
05:10:07 PM 0 6.12 0.00 3.06 1.02 0.00 89.80
05:10:07 PM 1 5.10 0.00 5.10 0.00 0.00 89.80
05:10:07 PM 2 5.10 0.00 3.06 2.04 0.00 89.80
05:10:07 PM 3 5.10 0.00 4.08 1.02 0.00 89.80
05:10:07 PM CPU %user %nice %system %iowait %steal %idle
05:10:08 PM all 8.78 0.00 15.01 0.69 0.00 75.52
05:10:08 PM 0 10.00 0.00 16.36 0.91 0.00 72.73
05:10:08 PM 1 4.81 0.00 13.46 1.92 0.00 79.81
05:10:08 PM 2 10.91 0.00 15.45 0.91 0.00 72.73
05:10:08 PM 3 9.09 0.00 14.55 0.00 0.00 76.36

sar tips:

Display the cores whose "idle" value is less than 10%:

sar -P 1,3,5,7 1 |tail -n+3|awk '$NF<10 {print $0}'

To check whether any single core is fully occupied, replace "1,3,5,7" with "ALL":

sar -P ALL 1 |tail -n+3|awk '$NF<10 {print $0}'

The following example shows configurations with four cores and a memory capacity of 8 GB (ecs.c6.xlarge):

In this example, four queues are configured on four CPUs, but interrupts are handled by CPUs 0 and 2 by default.

# grep -i "input" /proc/interrupts
27: 1932 0 0 0 PCI-MSI-edge virtio2-input.0
29: 2 0 1627 0 PCI-MSI-edge virtio2-input.1
32: 1974 0 0 0 PCI-MSI-edge virtio2-input.2
35: 3 0 284 0 PCI-MSI-edge virtio2-input.3
# for i in $(egrep "\-input." /proc/interrupts |awk -F ":" '{print $1}');do cat /proc/irq/$i/smp_affinity_list;done
1
3
1
3

This is acceptable because hyper-threading is enabled on the physical CPUs and each vCPU is bound to a hyper-thread of a physical core, so the four vCPUs actually map to only two physical cores.

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
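
To confirm which vCPUs are hyper-thread siblings of the same physical core, you can read the CPU topology from sysfs (a quick sketch; the actual pairing depends on the instance):

# for c in /sys/devices/system/cpu/cpu[0-3]; do echo "$(basename $c): $(cat $c/topology/thread_siblings_list)"; done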

4. Disable IRQbalance.
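
On a systemd-based system such as this CentOS 7.6 instance, the service can be stopped and kept from starting at boot with the usual systemctl commands (a sketch; the status output below reflects the state right after stopping it):

# systemctl stop irqbalance
# systemctl disable irqbalance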

# service irqbalance status
Redirecting to /bin/systemctl status irqbalance.service
● irqbalance.service - irqbalance daemon
Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Wed 2020-05-27 14:39:28 CST; 2s ago
Process: 1832 ExecStart=/usr/sbin/irqbalance --foreground $IRQBALANCE_ARGS (code=exited, status=0/SUCCESS)
Main PID: 1832 (code=exited, status=0/SUCCESS)
May 27 14:11:40 iZbp1ee4vpiy3w4b8y2m8qZ systemd[1]: Started irqbalance daemon.
May 27 14:39:28 iZbp1ee4vpiy3w4b8y2m8qZ systemd[1]: Stopping irqbalance daemon...
May 27 14:39:28 iZbp1ee4vpiy3w4b8y2m8qZ systemd[1]: Stopped irqbalance daemon.

5. Configure RPS manually.

5.1 Understand the following files before you configure the affinity manually (IRQ_number is the interrupt number obtained earlier by grepping for "input" in /proc/interrupts).

Go to /proc/irq/${IRQ_number}/ and read the smp_affinity and smp_affinity_list files.

The smp_affinity file is in the bitmask and hexadecimal format.

The smp_affinity_list file is in the decimal format and easy to read.

Modification of either file is synchronized to the other file.
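
For example, for IRQ 27 from the earlier output, whose affinity list currently points to CPU 1, smp_affinity holds the equivalent hexadecimal mask 2 (binary 0010), roughly as follows:

# cat /proc/irq/27/smp_affinity
2
# cat /proc/irq/27/smp_affinity_list
1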

For ease of understanding, let’s look at the smp_affinity_list file in the decimal format.

If you are not clear about this step, refer to the previous output of /proc/interrupts.

# for i in $(egrep "\-input."  /proc/interrupts |awk -F ":" '{print $1}');do cat /proc/irq/$i/smp_affinity_list;done
1
3
1
3

You can run the echo command to change the CPU assigned to handle an interrupt. In the following example, interrupt 27 is allocated to CPU 0 for processing. (Generally, we recommend leaving CPU 0 unused.)

# echo 0 >> /proc/irq/27/smp_affinity_list
# cat /proc/irq/27/smp_affinity_list
0
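
As a sketch of doing this for every queue at once (assuming the virtio input interrupts found by the earlier grep and four CPUs), a simple loop can spread the interrupts round-robin across the cores:

# Sketch: bind each -input. IRQ to its own CPU, round-robin over CPUs 0-3.
cpu=0
for irq in $(egrep "\-input." /proc/interrupts | awk -F ":" '{print $1}'); do
    echo $cpu > /proc/irq/$irq/smp_affinity_list
    cpu=$(( (cpu + 1) % 4 ))
done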

Bitmask:

“f” is a hexadecimal value corresponding to the binary value of “1111”. (When the bits for four CPUs are set to f, all CPUs are used to handle interrupts.)

Each bit in the binary value represents a CPU on the server. A simple demo is shown below:
CPU ID Binary Hexadecimal
CPU 0 0001 1
CPU 1 0010 2
CPU 2 0100 4
CPU 3 1000 8
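
If you prefer to work with the hexadecimal masks directly, a small helper like the following (hypothetical, for machines with fewer than 64 CPUs) converts a list of CPU IDs into the mask expected by smp_affinity or rps_cpus:

# Hypothetical helper: build a hex mask from a list of CPU IDs.
cpus_to_mask() {
    local mask=0 c
    for c in "$@"; do mask=$(( mask | (1 << c) )); done
    printf '%x\n' "$mask"
}
cpus_to_mask 1 3        # prints "a" (CPUs 1 and 3)
cpus_to_mask 0 1 2 3    # prints "f" (all four CPUs)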

5.2 Configure each queue of a NIC separately. For example, specify queue 0 for eth0:

echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus

This method is similar to the method used to configure the interrupt affinity: a mask is used, and setting all the bits allows every CPU to process packets for the queue. For example:
4core, f
8core, ff
16core, ffff
32core, ffffffff
By default, rps_cpus for queue 0 is 0, which means RPS is disabled for that queue.
# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
0
# echo f >>/sys/class/net/eth0/queues/rx-0/rps_cpus
# cat /sys/class/net/eth0/queues/rx-0/rps_cpus
f

6. Configure RFS.

Configure the following two items:

6.1 Control the number of entries in the global table (rps_sock_flow_table) by using a kernel parameter:

# sysctl -a |grep net.core.rps_sock_flow_entries
net.core.rps_sock_flow_entries = 0
# sysctl -w net.core.rps_sock_flow_entries=1024
net.core.rps_sock_flow_entries = 1024
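
Note that sysctl -w only changes the running kernel; to keep the value across reboots, you can also write it to a sysctl configuration file (a sketch using a hypothetical file name):

# echo "net.core.rps_sock_flow_entries = 1024" >> /etc/sysctl.d/99-rfs.conf
# sysctl -p /etc/sysctl.d/99-rfs.conf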

6.2 Specify the number of entries in the hash table of each NIC queue:

#  cat /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
0
# echo 256 >> /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
# cat /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
256

To enable RFS, configure both items.

We recommend that the sum of rps_flow_cnt across all NIC queues on the machine be less than or equal to rps_sock_flow_entries. In this example, four queues are available and each is set to 256 entries (4 x 256 = 1024); increase the values as required.
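
Putting 5.2, 6.1, and 6.2 together, a sketch like the following (assuming eth0, four queues, and the values used in this example) applies RPS and RFS to every RX queue in one pass:

# Sketch: enable RPS and RFS on all RX queues of eth0 with the example values.
sysctl -w net.core.rps_sock_flow_entries=1024
for q in /sys/class/net/eth0/queues/rx-*; do
    echo f   > "$q/rps_cpus"      # let CPUs 0-3 process packets for this queue
    echo 256 > "$q/rps_flow_cnt"  # per-queue flow entries (1024 / 4 queues)
done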

Case Study: Adjustment in Stress Testing

1. Configure RPS where there are 32 cores and 16 queues. Set the mask to ffffffff.

2. Configure RFS: set rps_sock_flow_entries to 65536, so rps_flow_cnt for each of the 16 queues is 65536 / 16 = 4096.

3. Set smp_affinity_list for the queue interrupts to the odd-numbered cores 1, 3, 5, 7, 9, ..., 31.

However, the interrupt load was still unbalanced. Considering cache hits with RFS enabled, we increased the number of client threads to 16. Interrupt processing then became stable, but the traffic still fluctuated.

4. Disable TSO. The traffic then stabilized, fluctuating only between 7.9x and 8.0x.
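
For reference, TSO can usually be toggled and verified with ethtool (a sketch, assuming eth0):

# ethtool -K eth0 tso off
# ethtool -k eth0 | grep tcp-segmentation-offload
tcp-segmentation-offload: off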

With the above settings in place, we can fine-tune during stress testing while observing the service, and then check whether the CPU interrupts of the running instance in the production environment are handled in a balanced manner.

Case Study: CPU Interrupt Processing Anomaly of a Customer

We performed troubleshooting as follows:

1. Check the packet captures from the proxy servers provided by the customer. The captures showed that the peer end returned no packets after SYN packets were retransmitted, indicating a possible anomaly at the peer end.


2. Capture packets on the server. The results showed that the server received no retransmitted packets, indicating a possible anomaly in the source instance.

3. Capture packets on the physical machine. The results showed that the physical machine received no retransmitted packets, indicating a possible anomaly in the system.

4. Log in to the system to check the interrupt settings. In this instance, eight cores and eight queues were configured. Cores 0, 2, 4, and 6 were enabled to handle the eight interrupts by default.


5. Check the number of interrupts handled on each core. We found a large difference in the counts and an anomaly with the interrupt for input 6: its count remained unchanged for a long time under monitoring.
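
To watch whether a particular queue's interrupt count is still increasing, a diff-highlighting watch loop is convenient (a sketch):

# watch -d 'grep "input" /proc/interrupts'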


6. Check the system logs for exception logs of input 6. As expected, an exception alert was found, indicating a suspected qemu-kvm bug.


7. Restart the system to recover, and then bind the interrupts to different cores by referring to the preceding settings.


Follow me to keep abreast with the latest technology news, industry insights, and developer trends.
