Optimizing Network Deployment for Tolerance and Latency Sensitivity Services


By Yuebin

Preface

This article describes a case in which a customer encountered occasional time-outs when querying a Redis cluster. It then presents the diagnosis approach, the troubleshooting method, and best practices for improving service reliability at the network layer.

Problematic Environment

[Figure: topology of the problematic environment]

In this topology, a Redis cluster is built on Elastic Compute Service (ECS) instances and fronted by Redis Proxy, which is implemented with Codis; the generic term Proxy is used in this description. Redis Proxy is attached as a backend server to a Server Load Balancer (SLB) instance, which provides a single entry point for the service.

Symptom

1) The number of time-out errors is proportional to the service volume. The resource usage of SLB and ECS instances is low even outside the service peak period.

2) A burst of time-out errors may occur.

Diagnosis Approach

The diagnosis starts with two kinds of information:

  • Resource Information: This describes the topology of the problematic system, which involves applications, hosts, forwarding devices, and links. For effective troubleshooting, it’s critical to fully understand the role that each resource plays in the topology.
  • Environmental Information: This is the information required to describe the problem. It includes the error logs, the time and frequency of problem occurrence, and the time-out period set at the application layer.

After understanding the resource information and environmental information, it’s easy to pinpoint the problem in specific resources, collect related information based on the problem definition, and identify the root cause by using a data-oriented method of exclusion during problem interpretation and analysis. This enables efficient and accurate troubleshooting.

The resource information for this case is the topology described above. The environmental information includes the configured time-out period (50 ms), the client-reported error (“read timeout”, with no time-outs during the TCP three-way handshake), and the error frequency (about 10 errors per hour during off-peak hours and hundreds of errors per hour during peak hours). Occasionally (once or twice a week), hundreds of “read timeout” and “connect timeout” errors are reported in a burst, during both peak and off-peak hours. According to the customer’s Redis check results, the query time was always less than 10 ms, and Redis Proxy recorded no forwarding logs.

Troubleshooting

tcpdump -i <interface|any> -C <size of each file> -W <number of files> -w <output file name> <capture filter>

The preceding command captures packets on a specific interface based on the filter conditions. The captured packets are saved to files whose names start with the specified prefix. The maximum disk space occupied equals the size of each file multiplied by the number of files, because the files are overwritten cyclically. With cyclic capture running continuously, the packets exchanged at the moment the client reports an error are always on disk.
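As a rough illustration of the sizing rule above, the following sketch computes the disk budget of a ring-buffer capture. The interface name, file size, file count, output prefix, and port filter are placeholder assumptions rather than values from the case; `-C` is counted in units of 1,000,000 bytes.

```shell
# Disk budget of a tcpdump ring buffer, in millions of bytes.
# $1 = size of each file (-C), $2 = number of files (-W)
ring_budget_mb() {
  echo $(( $1 * $2 ))
}

# Hypothetical wrapper: every value below is a placeholder, not taken from the case.
# Needs root; defined here but not started automatically.
start_ring_capture() {
  tcpdump -i eth0 -C 100 -W 20 -w /var/tmp/redis-timeout.pcap 'tcp port 6379'
}

echo "ring buffer needs about $(ring_budget_mb 100 20) MB of disk"
```

Choosing the file size and count so that the ring covers more time than it takes to notice an error keeps the relevant packets from being overwritten before they are collected.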

The captured packet files are opened in Wireshark. The following snippet shows how to filter numerous large-sized packet files captured during cyclic packet capture.

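# The capinfos and tshark commands are installed together with Wireshark. The author's macOS machine is used as the example here.
# Use capinfos to check the start time and end time of each capture file. Keep the files that cover the error time plus or minus the time-out period; the other files are not needed.
~$ capinfos -a -e *cap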
File name: colasoft_packets.cap
Packet size limit: inferred: 66 bytes - 1518 bytes (range)
First packet time: 2019-06-12 09:00:00.005519934
Last packet time: 2019-06-12 09:59:59.998942048
File name: colasoft_packets1.cap
Packet size limit: inferred: 60 bytes - 1518 bytes (range)
First packet time: 2019-06-12 09:00:00.003709451
Last packet time: 2019-06-12 09:59:59.983532957
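# If many files still remain, filter them with tshark. For example, if the error states that a Redis query for a certain key timed out, the following loop finds the file that contains that query:
~$ for f in ./*; do echo "$f"; tshark -r "$f" -Y 'tcp.payload contains "keyname"'; done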

Identify the request, open the file in Wireshark, locate the data packets, and track the stream to identify 5-tuple information and context interaction of the entire stream.
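Before opening any file in Wireshark, the file-selection rule from the capinfos step (keep the files that cover the error time plus or minus the time-out period) can be expressed as a small check. This is only a sketch: the function name is hypothetical, and all times are assumed to be integer epoch milliseconds, converted by hand from the capinfos output.

```shell
# Succeeds when the capture window [first, last] overlaps
# [error - timeout, error + timeout]. All arguments are integer milliseconds.
# $1 = first packet time, $2 = last packet time, $3 = error time, $4 = time-out
file_covers_error() {
  [ "$1" -le $(( $3 + $4 )) ] && [ "$2" -ge $(( $3 - $4 )) ]
}

# Toy values: a file spanning 1000..2000 ms and an error at 2030 ms with a 50 ms time-out.
if file_covers_error 1000 2000 2030 50; then
  echo "keep this file"
else
  echo "skip this file"
fi
```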

A comparison of the packets captured on the client, Redis Proxy, and Redis shows a period of more than 100 ms from when the client sends a request to when it receives a response. Much of that period is consumed by Redis resending a response packet to Redis Proxy due to packet loss. The following figure shows the request-response flowchart.

[Figure: request-response flow between the client, Redis Proxy, and Redis]

According to Alibaba Cloud’s internal monitoring results, the packet loss seen in the captures does not occur on physical links. However, on the ECS instance that hosts Redis Proxy, the drop counters of the front-end and back-end queues, which count packets lost when the back-end driver delivers packets to the front-end driver at the virtualization layer, increase in step with the service time-out frequency. Further troubleshooting shows that the operating system of the ECS instance has not enabled the multi-queue network interface controller (NIC) feature, so only one CPU processes NIC interrupts. During traffic bursts, that CPU cannot process NIC interrupts promptly, so packets accumulate in the front-end and back-end queues and are dropped when the queues overflow.

To solve this problem, we recommend that the customer enable the multi-queue NIC feature and bind the interrupts of the different NIC queues to different CPUs. For Alibaba Cloud ECS, the number of available NIC queues is determined by the instance type. For more information, refer to the ECS instance type document. For more information about how to enable the multi-queue NIC feature and automatically schedule the CPU affinities of NIC queue interrupts by using irqbalance, see the Alibaba Cloud documentation.
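The check described above can be sketched with ethtool, whose `-l` option reports the maximum and current number of combined queues and whose `-L` option changes the current number. The interface name eth0 and both function names are assumptions for illustration; the authoritative procedure is the one in the Alibaba Cloud documentation.

```shell
# Reads `ethtool -l <interface>` output on stdin and prints "<max> <current>".
# The first "Combined:" line is the pre-set maximum, the second the current setting.
combined_queues() {
  awk '/^Combined:/ { v[++n] = $2 } END { print v[1], v[2] }'
}

# Hypothetical helper: raise the queue count to the maximum when it is lower.
# Needs root; eth0 is a placeholder interface name. Not invoked automatically.
enable_all_queues() {
  set -- $(ethtool -l eth0 | combined_queues)
  if [ "$2" -lt "$1" ]; then
    ethtool -L eth0 combined "$1"
  fi
}
```

After the queues are enabled, starting the irqbalance service (for example, with `systemctl start irqbalance` on systemd-based images) lets the system distribute the queue interrupts across CPUs.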

Enabling the multi-queue NIC feature and irqbalance service resolves the problem of hourly access time-out. However, many time-out errors still occur in bursts every few days. According to the customer-collected error messages and the underlying network monitoring results of Alibaba Cloud, the problem of time-out error bursts is due to cross-zone link jitters at the underlying layer of Alibaba Cloud.

Each zone of Alibaba Cloud can be regarded as a data center, and different zones can provide local disaster recovery for each other. Zones must be physically isolated from each other to prevent cross-zone fault spreading, and must be interconnected through local optical cables for mutual access.

The local optical cables that connect zones are much less reliable than the fiber jumpers used inside data centers, and are prone to interruption from road construction and to quality degradation. To maintain service continuity, Alibaba Cloud provides sufficient redundant links and uses technologies such as transmission failover and route switching so that traffic automatically converges onto healthy links when some cross-zone links fail. However, packets are still lost during the switchover. According to the underlying monitoring results of Alibaba Cloud, about 1% of packets are lost over a period of 3 to 5 seconds when a cross-zone link is interrupted; the exact loss rate depends on the proportion of interrupted links to total links. For some latency-sensitive services, this may cause time-out errors for nearly 1 minute. The time-out error bursts mentioned previously fall into this category.

If the resources that a service depends on are scattered across many zones, cross-zone link jitters are more likely to affect the service. For example, if the ECS instances at the client’s end are located in Zones A and B, the SLB instance in Zone C, Redis Proxy in Zone D, and Redis in Zone E, then jitters of the links from A to C, from B to C, from C to D, and from D to E may affect the entire system.
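To make the example concrete, the sketch below counts the distinct cross-zone links that a set of call paths depends on. The function name and the comma-separated path notation are illustrative assumptions; each path lists the zones of the client, SLB, Redis Proxy, and Redis in order.

```shell
# Prints the number of distinct cross-zone links used by the given call paths.
# Each argument is one path written as comma-separated zone names, for example
# "A,C,D,E" for client -> SLB -> proxy -> Redis.
cross_zone_links() {
  for path in "$@"; do
    echo "$path" | awk -F',' '{
      for (i = 1; i < NF; i++)
        if ($i != $(i + 1)) print $i "-" $(i + 1)
    }'
  done | sort -u | awk 'END { print NR }'
}

echo "scattered deployment: $(cross_zone_links A,C,D,E B,C,D,E) cross-zone links"
echo "single-zone deployment: $(cross_zone_links A,A,A,A) cross-zone links"
```

The scattered deployment from the example depends on four cross-zone links, whereas a deployment that keeps the whole call path in one zone depends on none.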

Best Practices

Best Practice for Host Networks

Enable the multi-queue NIC feature and scatter NIC software interrupts for optimal network performance. The following general recommendations are available for ensuring stable network performance:

  • Use VPC instances. VPC provides a range of benefits, such as network tenant isolation, private lines, VPN gateways, and underlying forwarding capabilities that are much more robust than those of classic-network instances. The new-generation instances are implemented based on VPC, so they provide a more powerful network forwarding capability.
  • Use dedicated instances, which employ the technology of strict resource isolation to protect virtual machines from neighbor interference.
  • Enable the multi-queue NIC feature and bind the CPU affinities of NIC software interrupts so that each NIC queue is processed by a specific CPU. This improves network processing performance.
  • Bind multiple NIC queues to dedicated CPUs and bind other processes to other CPUs so that NIC software interrupts are processed by dedicated CPUs. This applies to forwarding-only services that require extremely high network performance.
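# How to bind NIC interrupts to CPUs:
# 1. List the virtio interrupts. In the standard operating system images provided by Alibaba Cloud, virtio0 is the NIC queue.
~$ cat /proc/interrupts | grep virtio
# (other output omitted)
31: 310437168 0 0 0 PCI-MSI-edge virtio0-input.0
32: 346644209 0 0 0 PCI-MSI-edge virtio0-output.0
# 2. Note the interrupt numbers in the first column, then write the CPU affinity mask to the following file (as root):
echo <cpu affinity mask> > /proc/irq/{irq number}/smp_affinity
# For the format of the CPU affinity mask, see the man page at https://linux.die.net/man/1/taskset; it is not described here.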
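The manual binding steps above can be sketched as a short script. The function names are hypothetical, the mask helper is only valid for CPU indexes below 32, and placing queue N on CPU N modulo the CPU count is one possible layout rather than the method prescribed by the source.

```shell
# Reads /proc/interrupts-style text on stdin and prints "<irq> <queue index>"
# for every virtio0 input/output queue.
virtio_queue_irqs() {
  awk '/virtio0-(input|output)\./ {
    irq = $1; sub(/:/, "", irq)
    n = split($NF, part, ".")
    print irq, part[n]
  }'
}

# Prints the hexadecimal smp_affinity mask that selects a single CPU (index < 32).
cpu_mask() {
  printf '%x\n' $(( 1 << $1 ))
}

# Hypothetical helper: bind queue N to CPU (N mod ncpu). Needs root.
# $1 = number of CPUs to spread the queues over. Not invoked automatically.
bind_virtio_irqs() {
  ncpu=$1
  virtio_queue_irqs < /proc/interrupts | while read -r irq queue; do
    cpu_mask $(( queue % ncpu )) > "/proc/irq/$irq/smp_affinity"
  done
}
```

Keeping the input and output interrupts of one queue on the same CPU is a design choice made here for simplicity. Note that a running irqbalance service may later rewrite affinities that were set by hand.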

Best Practice for Physical Networks

Take service tolerance and latency sensitivity into account for deploying services.

  • In terms of service tolerance, if a packet is lost on a TCP connection, in the worst case it is retransmitted only after a retransmission timeout (RTO). The RTO path applies to the tail drop scenario, whereas the fast retransmit mechanism applies to other scenarios. In the Linux kernel, the minimum RTO is defined as HZ/5, that is, 200 ms. In low-latency scenarios such as internal or local mutual access, the worst case is therefore that a lost packet is retransmitted after 200 ms. For critical services, set the request time-out period to a value greater than 200 ms so that TCP has a chance to retransmit. For non-critical services that can do without the result of a query, set a smaller request time-out period to protect the entire system. Service tolerance thus refers to the ability of a service to tolerate errors or retransmission.
  • In terms of latency sensitivity, we recommend that latency-sensitive services be called within a zone to minimize the impact of faulty cross-zone links on these services. Even when different zones provide the same set of services, calls rarely need to cross zones. For example, Redis may be called by the webserver layer in the local zone as its cache service, and databases in other zones are queried only in the few cases of a cache miss. Technologies such as read-only instances further reduce the impact of cross-zone link jitters.
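The service tolerance rule above comes down to simple arithmetic: a request survives one RTO-based retransmission only if its time-out exceeds the 200 ms minimum RTO plus the normal round-trip time. The sketch below applies that check to the 50 ms time-out from this case and to a 250 ms alternative; the 1 ms round-trip time and the function name are assumptions for illustration.

```shell
TCP_RTO_MIN_MS=200   # minimum RTO in the Linux kernel (HZ/5)

# Succeeds when the request time-out leaves room for one RTO-based retransmission.
# $1 = request time-out in ms, $2 = assumed round-trip time in ms
tolerates_one_rto() {
  [ "$1" -gt $(( TCP_RTO_MIN_MS + $2 )) ]
}

for t in 50 250; do
  if tolerates_one_rto "$t" 1; then
    echo "$t ms: survives one retransmission"
  else
    echo "$t ms: times out before the retransmission arrives"
  fi
done
```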

Summary

The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.
