Optimizing Network Deployment for Tolerance and Latency Sensitivity Services

Jun 3, 2020

By Yuebin

Preface

The internal physical networks and many networking products of Alibaba Cloud are infrastructure provided at the underlying layer. Their data plane is not visible to users and, in a certain sense, constitutes a black box. In traditional Internet data centers (IDCs), services are separated from physical networks. Therefore, problems such as service lag, latency, and disconnection are frequently attributed to network issues. The toughest of these problems are those that occur only occasionally and are very hard to reproduce. They are especially hard to resolve in the network sector because, to limit performance overhead and resource consumption, not every data packet is logged. When an occasional time-out is logged at the application layer, the network layer usually has no record of the packets exchanged during the call in question. This makes troubleshooting extremely complex.

This article describes a case in which a client encounters occasional time-outs when querying a Redis cluster. It then presents the diagnosis approach, the troubleshooting method, and best practices for improving service reliability at the network layer.

Problematic Environment

The problem to discuss lies in a submodule of an interactive web application for Redis query. Simply put, the submodule is used to query live comments on a video website with a bullet screen. The submodule has a simple topology, as shown in the following figure.

In this topology, a Redis cluster is built on Elastic Compute Service (ECS) instances, with Redis Proxy deployed at the front end through Codis. For generality, this article uses the generic term Redis Proxy. Redis Proxy is attached as a backend server to a Server Load Balancer (SLB) instance, which provides a single entry point for the service.

Symptom

The client that accesses the self-built Redis system reports time-out errors from time to time. The error rate is higher than when the services were hosted in the original IDC, but it is still within the acceptable range. The time-out errors have the following characteristics:

1) The number of time-out errors is proportional to the service volume. The resource usage of SLB and ECS instances is low even outside the service peak period.

2) A burst of time-out errors may occur.

Diagnosis Approach

To solve this problem, we must first understand its context and environment. To collect complete information in an orderly way, let’s divide the information required for diagnosis into the following two types.

Resource Information: This describes the topology of the problematic system, which involves applications, hosts, forwarding devices, and links. For effective troubleshooting, it’s critical to fully understand the role that each resource plays in the topology.
Environmental Information: This is the information required to describe the problem, including the error logs, the time and frequency of problem occurrence, and the time-out period set at the application layer.

After understanding the resource information and environmental information, it’s easy to pinpoint the problem in specific resources, collect related information based on the problem definition, and identify the root cause by using a data-oriented method of exclusion during problem interpretation and analysis. This enables efficient and accurate troubleshooting.

The resource information was covered above in the description of the system topology. Now, let’s look at the environmental information: the configured time-out period (50 ms), the client-reported error (“read timeout”, which excludes TCP three-way handshake time-outs), and the error frequency (10 errors per hour during off-peak hours and hundreds of errors per hour during peak hours). Occasionally (once or twice per week), hundreds of “read timeout” and “connect timeout” errors are reported in a burst, during both peak and off-peak hours. According to the customer’s Redis check results, the query time was always less than 10 ms, and Redis Proxy recorded no forwarding logs.

Troubleshooting

Because all available logs come from only the two ends of the system (the client and Redis), more information must be collected. Packet capture is the most direct and effective way to investigate time-out errors. However, given the relatively low frequency of the problem and the heavy traffic of the entire system, disk space would quickly run out if packet capture were always on. To address this, run the following command to capture packets cyclically:

tcpdump -i <interface|any> -C <size per file> -W <number of files> -w <output filename> <capture filter>

The preceding command captures packets on a specific interface based on filter conditions. The captured packets are saved to files whose filenames contain the specified prefix. The files are cyclically overwritten, so the maximum occupied disk space is equal to the size of each file multiplied by the number of files. With cyclic packet capture enabled, the packets exchanged around the time of a client-reported error are available for analysis.
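
As a rough illustration of the disk budget described above, the following sketch computes the worst-case space used by a ring capture and prints the matching tcpdump command without running it. The helper names, the interface eth0, the output path, and the sizes are assumptions chosen for illustration. The filter uses port 6379, the default Redis port. Note that tcpdump measures -C in millions of bytes.

```shell
# Hypothetical helper: worst-case disk usage of a tcpdump ring capture.
# $1 = size per file (tcpdump -C, in millions of bytes), $2 = number of files (-W).
ring_budget_mb() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: print (do not run) the ring-capture command for review.
# $1 = interface, $2 = size per file, $3 = number of files,
# $4 = output filename prefix, $5 = capture filter.
ring_capture_cmd() {
    printf 'tcpdump -i %s -C %s -W %s -w %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Illustrative values only: 20 files of 100 million bytes each.
ring_budget_mb 100 20
ring_capture_cmd eth0 100 20 /data/redis_ring.pcap 'tcp port 6379'
```

With these example values the capture never occupies more than about 2,000 million bytes, whatever the traffic volume.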

The captured packet files are opened in Wireshark. The following snippet shows how to filter numerous large-sized packet files captured during cyclic packet capture.

# Any computer with Wireshark installed also has the capinfos and tshark commands. The author uses macOS in this example.
~$ capinfos -a -e *cap # Use capinfos to check the start and end time of each capture file. Keep only the files that cover the error time ± the time-out period. The other files are not needed.
File name:           colasoft_packets.cap
Packet size limit:   inferred: 66 bytes - 1518 bytes (range)
First packet time:   2019-06-12 09:00:00.005519934
Last packet time:    2019-06-12 09:59:59.998942048

File name:           colasoft_packets1.cap
Packet size limit:   inferred: 60 bytes - 1518 bytes (range)
First packet time:   2019-06-12 09:00:00.003709451
Last packet time:    2019-06-12 09:59:59.983532957

# If many files still remain, filter them with tshark. For example, if the error says that a Redis query for a key timed out, the following loop finds the file that contains this query request:
~$ for f in ./*; do echo "$f"; tshark -r "$f" -Y 'tcp.payload contains "keyname"'; done

Identify the request, open the file in Wireshark, locate the data packets, and track the stream to identify 5-tuple information and context interaction of the entire stream.
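
The file-selection rule used with capinfos (keep only the files that cover the error time ± the time-out period) can be sketched as a simple interval-overlap check. This is a minimal sketch: the function name is hypothetical, and all times are assumed to be integers in milliseconds since the epoch, converted beforehand from the capinfos output and the application log.

```shell
# window_overlaps FIRST_MS LAST_MS ERROR_MS TIMEOUT_MS   (hypothetical helper)
# Succeeds (exit status 0) if the window [ERROR - TIMEOUT, ERROR + TIMEOUT]
# overlaps the capture file's time range [FIRST, LAST].
window_overlaps() {
    lo=$(( $3 - $4 ))
    hi=$(( $3 + $4 ))
    [ "$lo" -le "$2" ] && [ "$hi" -ge "$1" ]
}

# Illustrative numbers: a file covering 1000..2000 ms and a 50 ms time-out.
if window_overlaps 1000 2000 2030 50; then
    echo "keep this file"
fi
```

A file is discarded only when the whole window lies before its first packet or after its last packet.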

A comparison of the packets captured on the client, Redis Proxy, and Redis shows a period of more than 100 ms from when the client sends a request to when it receives a response. Much of that period is consumed by Redis resending a response packet to Redis Proxy due to packet loss. The following figure shows the request-response flowchart.

According to the internal monitoring results of Alibaba Cloud, the packet loss shown in the captures does not occur on physical links. However, on the ECS instance that hosts Redis Proxy, packets are lost at the virtualization layer when the back-end driver sends packets to the front-end driver, and the number of packets dropped in the front-end and back-end queues increases in proportion to the service time-out frequency. Further troubleshooting shows that the multi-queue network interface controller (NIC) feature is not enabled in the operating system of the ECS instance, so only one CPU processes NIC interrupts. During traffic bursts, that CPU cannot process NIC interrupts promptly, so the front-end and back-end queues build up and overflow, which causes packet loss.

To solve this problem, we recommended that the customer enable the multi-queue NIC feature and bind the interrupts of different NIC queues to different CPUs through CPU affinity. For Alibaba Cloud ECS, the number of available NIC queues depends on the instance type. For more information, refer to the ECS instance type document. For more information about how to enable the multi-queue NIC feature and automatically schedule the CPU affinities of NIC queue interrupts by using irqbalance, see the Alibaba Cloud documentation.
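
As a hedged sketch of how to check whether multi-queue is fully enabled, the following helpers parse the output of `ethtool -l`, which lists the pre-set maximum and the current number of combined channels. The helper names and the interface name eth0 are assumptions, and the exact output layout may vary by driver, so treat this as an illustration rather than the procedure from the Alibaba Cloud documentation.

```shell
# Hypothetical helper: read `ethtool -l <iface>` output on stdin and print
# "<max> <current>" for the Combined channel count. The first "Combined:"
# line is the pre-set maximum and the second is the current setting.
combined_channels() {
    awk '/^Combined:/ { v[++n] = $2 } END { print v[1], v[2] }'
}

# Hypothetical helper: succeeds if fewer queues are enabled than the NIC supports.
needs_more_queues() {
    set -- $(combined_channels)
    [ "$2" -lt "$1" ]
}

# On a real instance (eth0 is an assumed interface name, root required):
#   ethtool -l eth0 | needs_more_queues && echo "multi-queue not fully enabled"
#   ethtool -L eth0 combined <max>    # raise the queue count to the maximum
```

After the queue count is raised, the interrupts of the queues still need to be spread across CPUs, either by irqbalance or by manual affinity binding.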

Enabling the multi-queue NIC feature and the irqbalance service resolved the access time-outs that occurred every hour. However, many time-out errors still occurred in bursts every few days. According to the customer-collected error messages and the underlying network monitoring results of Alibaba Cloud, these bursts are caused by cross-zone link jitter at the underlying layer of Alibaba Cloud.

Each zone of Alibaba Cloud can be regarded as a data center, and different zones can provide local disaster recovery for each other. Zones must be physically isolated from each other to prevent cross-zone fault spreading, and must be interconnected through local optical cables for mutual access.

The local optical cables that connect zones are much less reliable than the fiber jumpers used inside data centers, and are prone to interruption caused by road construction or quality degradation. To ensure service continuity, Alibaba Cloud provides sufficient redundant links and uses technologies such as transmission failover and route switching so that traffic automatically converges onto healthy links when some cross-zone links fail. However, packets are still lost during the switchover. According to the underlying monitoring results of Alibaba Cloud, about 1% of packets are lost over a period of 3 to 5 seconds when a cross-zone link is interrupted. The exact packet loss rate depends on the proportion of interrupted links to total links. For some latency-sensitive services, this may cause time-out errors for nearly 1 minute. The time-out error bursts mentioned previously are of this kind.

If the resources that a service depends on are scattered across many zones, cross-zone link jitter is more likely to affect the service. For example, if the ECS instances at the client’s end are located in Zones A and B, the SLB instance in Zone C, Redis Proxy in Zone D, and Redis in Zone E, then jitter on any of the links from A to C, from B to C, from C to D, and from D to E may affect the entire system.

Best Practices

From the preceding case study, we can derive two best practices, one for host networks and one for network deployment:

Best Practice for Host Networks

Enable the multi-queue NIC feature and scatter NIC software interrupts for optimal network performance. The following general recommendations are available for ensuring stable network performance:

Use VPC instances. VPC provides a range of benefits, such as network tenant isolation, private lines, VPN gateways, and underlying forwarding capabilities much more robust than those provided by classic-network instances. The new-generation instances are implemented based on VPC, so they provide a more powerful network forwarding capability.
Use dedicated instances, which employ the technology of strict resource isolation to protect virtual machines from neighbor interference.
Enable the multi-queue NIC feature and bind the CPU affinities of NIC software interrupts so that each NIC queue is processed by a specific CPU. This improves network processing performance.
Bind multiple NIC queues to dedicated CPUs and bind other processes to other CPUs so that NIC software interrupts are processed by dedicated CPUs. This applies to forwarding-only services that require extremely high network performance.

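# How to bind NIC software interrupts:
# 1. First, run cat /proc/interrupts | grep virtio. In the standard operating systems provided by Alibaba Cloud, virtio0 is the NIC queue.
~$ cat /proc/interrupts | grep virtio
# (some output omitted)
 31:  310437168          0          0          0   PCI-MSI-edge      virtio0-input.0
 32:  346644209          0          0          0   PCI-MSI-edge      virtio0-output.0
# Note the interrupt numbers in the first column, and then write to the following file to bind the CPU affinity.
echo <cpu affinity mask> > /proc/irq/{irq number}/smp_affinity
# For details about the CPU affinity mask, see the man page at https://linux.die.net/man/1/taskset. They are not described here.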
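
Because the affinity mask is a hexadecimal bitmask in which bit N selects CPU N, a small helper can derive the value to write. This is a minimal sketch: the function name is hypothetical, and it only handles CPU indexes below 32, which fit in a single 32-bit mask group.

```shell
# cpu_mask CPU_INDEX   (hypothetical helper)
# Prints the hexadecimal smp_affinity mask with only that CPU's bit set.
# Valid for CPU indexes 0..31 only.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

cpu_mask 0   # prints 1
cpu_mask 3   # prints 8
cpu_mask 5   # prints 20

# Example with the interrupt numbers from the listing above (root required):
#   echo "$(cpu_mask 0)" > /proc/irq/31/smp_affinity
#   echo "$(cpu_mask 1)" > /proc/irq/32/smp_affinity
```

Binding each queue interrupt to a different CPU in this way spreads the interrupt load that was previously handled by a single CPU.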

Best Practice for Physical Networks

Take service tolerance and latency sensitivity into account for deploying services.

In terms of service tolerance, if packet loss occurs over TCP connections, in the worst case, the lost packets are resent only after a retransmission timeout (RTO). RTO applies to the tail drop scenario, whereas the fast retransmit mechanism applies to other scenarios. The minimum value of RTO is defined as HZ/5, that is, 200 ms, in the Linux kernel. In low-latency scenarios such as internal or local mutual access, the worst case is therefore that a lost packet is resent after 200 ms. For critical services, you must set the request time-out period to a value greater than 200 ms to allow TCP retransmission. For non-critical services, where it does not matter whether a query returns a result, set the request time-out period to a smaller value to protect the entire system. In short, service tolerance indicates the ability of services to tolerate errors or retransmission.
In terms of latency sensitivity, we recommend that latency-sensitive services be called within a zone to minimize the impact of faulty cross-zone links on these services. Though different zones provide the same set of services, services are seldom called across zones. For example, Redis may be called by the webserver layer in the local zone to use its cache service. Databases in other zones are queried only in the few cases of a cache miss. The impact of cross-zone link jitter is minimized through technologies such as read-only instances.
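
The tolerance rule above can be sketched as a simple check of a request time-out against the 200 ms minimum RTO. The function name is hypothetical, and the check ignores the round-trip time, which the text treats as negligible for internal or local access.

```shell
# Minimum TCP retransmission timeout in the Linux kernel, in milliseconds.
RTO_MIN_MS=200

# tolerates_one_rto TIMEOUT_MS   (hypothetical helper)
# Succeeds if the request time-out leaves room for one RTO-based retransmission.
tolerates_one_rto() {
    [ "$1" -gt "$RTO_MIN_MS" ]
}

# 50 ms is the time-out configured in the case above. 250 ms is illustrative.
for timeout_ms in 50 250; do
    if tolerates_one_rto "$timeout_ms"; then
        echo "$timeout_ms ms: a lost packet can be retransmitted before the time-out"
    else
        echo "$timeout_ms ms: a single tail drop already causes a time-out"
    fi
done
```

With the 50 ms time-out from the case study, any packet that is recovered only by RTO surfaces as a “read timeout” at the application layer.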

Summary

The preceding case study and best practices show that we have to make trade-offs when designing and deploying service system architectures. Resource optimization is critical for achieving service objectives in a given environment. Before cloud migration, many customers do not deploy several data centers in their system architectures due to factors such as the costs and locations of data centers. These customers benefit from cloud computing through a cross-generation infrastructure upgrade and inherent disaster recovery capabilities. It is important to consider the complex networking that results from disaster recovery when designing and deploying service system architectures.

The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.

Original Source: Optimizing Network Deployment for Tolerance and Latency Sensitivity Services (Alibaba Clouder, March 19, 2020)

Written by Alibaba Cloud
