Low-Latency Distributed Messaging with RocketMQ — Part 1

This is the first part of the blog — Low-Latency Distributed Messaging with RocketMQ. In this blog, we will discuss about the history of the Alibaba Cloud message engine family. We will talk about reducing latency, as well as the advantages and disadvantages of using page cache. We will then talk about the development of RocketMQ and how it eliminates latency.

History of the Messaging Engine Family

The second generation, also known as pull mode, bears a proprietary message storage system developed by Alibaba Cloud. Its throughput performance was comparable to that of Kafka. However, given Taobao’s application scenarios, especially the scenarios requiring high reliability such as transaction chains, the messaging engine places priority on stability and reliability rather than throughput. The real-time performance of message delivery remains superior to push mode given its adoption of a persistent-connection pull mode.

After the first two generations went live and underwent numerous enhancements for many years, the team at Alibaba Cloud developed the high-performance and low-latency messaging engine RocketMQ. Developed in 2011, it is primarily based on the amalgamation of the pull and push modes. RocketMQ became open source in 2012. After 6 years of testing on transaction chains of Alibaba’s Double Eleven (Singles’ Day) event core, the software survived and thrived.

Alibaba Cloud donated RocketMQ to the Apache Software Foundation (ASF). Ever since, RocketMQ’s popularity flourished, and is becoming one of the most important distributed messaging engine within the Apache Community, after ActiveMQ and Kafka. So far, RocketMQ has provided laudable service for thousands of applications within the Alibaba Group. RocketMQ has played a decisive role in the stability of the group’s incredible mid-range terabyte-scaled message flow.

Figure 1.

The Quest for High Availability and Low Latency

Relationship between Availability and Latency

The performance metrics for an application tends to consider the throughput and latency. Throughput refers to the number of requests that a program can process over a period. Latency is the end-to-end response time.

Low latency has different definitions under different circumstances. For example, in a chat application, we can define low latency as within 200ms, and in a transaction system, we can define it as within 10ms. Compared with throughput, latency is susceptible to the effects of numerous factors, such as the CPU, network, memory, and operating system.

As per Little’s law, when latency becomes too high, the requests residing in the distributed system will increase sharply, rendering some nodes unavailable. The state of unavailability can even spread to other nodes, resulting in the loss of service capacity to the entire system, resulting in the fault termed an avalanche. Therefore, to improve the availability of the overall distributed system, we need to create low-latency applications

Ways of Exploring Low Latency

In 2016, Tmall launched new methods of playing their ‘Red Envelope’ game for Singles’ Day. The game was very sensitive to latency, and could only tolerate latency of within 50ms. During the initial stage of the load testing, there were massive latencies of 50–500ms when RocketMQ wrote messages, resulting in several failures during the peak usage of the Red Envelope game, extensively affecting the front-end business. The picture below is the thermodynamic diagram of latency of red envelope cluster writing messages during load testing.

Figure 2.

As a messaging engine developed in pure Java language, RocketMQ’s self-contained memory modules rely on Page Caches to accelerate and stack. This means that such factors as JVM, GC, kernel, Linux memory management mechanisms, and the file IO will affect its performance. As shown below, the risk of latency exists in each link from sending a message from the client to the final persistence on a disk. From observations of online data, there are accidental latencies of seconds in RocketMQ message writing chain.

Figure 3.

JVM Pause

For other JVM pauses, the system can output JVM pause time to the GC log through -XX: +PrintGCApplicationStoppedTime. The specific reason for the pause is output through -XX: +PrintSafepointStatistics -XX: PrintSafepointStatisticsCount=1, and specific optimizations are performed. In case many pauses of RevokeBias emerge in RocketMQ, the best option is to close the biased locking feature through -XX: -UseBiasedLocking. Additionally, the output of the GC log can produce a file IO and sometimes lead to unwanted pauses. It is possible to output GC logs to tmpfs (memory file system), but tmpfs will occupy memory. To avoid memory waste, you can roll GC logs with -XX: +UseGCLogFileRotation. Despite GC logs producing file IOs, JVM will output some statistical data requested by the jstat command to the directory /tmp(hsperfdata). You can close the features through -XX: +PerfDisableSharedMem, and JMX is used to replace the jstat.

Locking — A Powerful Tool for Synchronization

Locks in Java are non-fair locks by default. Locking does not consider queuing problems and hence the direct attempts to obtain locks. If it fails, queuing starts automatically. Non-fair locks will cause long thread waiting times and higher latencies. Often, the use of fair locks tends to result in greater performance losses to applications.

On the other hand, synchronization will lead to context switches which bring about certain expenses. Context switches are often at a level of microseconds. But if there are too many threads and competitive pressures, dozens of millisecond-level expenses will occur. We can use LockSupport.park to imitate context switches for testing.

To avoid latency resulting from locking, use CAS primitive to unlock RocketMQ kernel chain, reducing latency and significantly improve throughput.

High Memory Consumption

Figure 4.

On the other hand, the kernel will also reclaim anonymous memory pages. After swap-outs of the anonymous memory pages, the next access will produce a file IO, resulting in increased latency, as shown below.

Figure 5.

The latencies produced in the two cases are avoidable by adjustments and optimizations to the kernel parameters (vm.extra_free_kbytes and vm.swappiness).

Memory management is by Linux and occurs in pages, 4KB for each page. Latency arises when read-write competition occurs within the memory on the same page. In such cases, it is necessary for applications to adjust memory access to avoid these latencies.

Page Cache — Advantages and Disadvantages

Figure 6.

In most cases, the read-write speed in this mode is prompt. However, when the operating system performs a writeback of dirty pages, memory reclaim and memory swaps-in and out, large read-write latencies emerge, resulting in an accidental high latency state of the storage engine.

In response to this, RocketMQ uses a variety of optimization techniques, such as memory pre-distribution, file preheating, mlock system calls and read/write splitting to ensure the elimination of the latency resulting from page caches while utilizing their advantages.

Optimization Results

99.995% of latency rates from the optimized RocketMQ’s message writing are within 1ms, and 100% within 100ms, as shown below.

Figure 7.






Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com