In the financial and banking industry, a high-availability system is essential for longer service uptime, more reliable services, and shorter downtime. Financial enterprises need to ensure high availability of the services that they provide to the public and strive for more 9s (more 9s in the SLA means higher availability, for example, 99.999%). However, as the complexity of software systems continues to increase, failures are inevitable. This makes it necessary for enterprises to implement a holistic resilient architecture that is designed to deal with failures.
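To make the "more 9s" idea concrete, the following minimal sketch converts an availability target into a yearly downtime budget. It assumes a 365-day year; the function name is illustrative and not taken from any SLA tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
```

Under this assumption, each additional 9 cuts the downtime budget by a factor of ten, and five 9s leave a little over five minutes per year.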
RPC and RMI, the integration technologies commonly used by many enterprises, are synchronous: failures on the provider side, timeouts, and other factors often degrade the end user experience, and many of these failures cannot be completely eliminated. In addition, for RPC and RMI calls, both service consumers and service providers need to be online at the same time, and they need some mechanism to confirm each other’s call relationship. These disadvantages led to the adoption of message-oriented middleware (MOM), which can be integrated into an enterprise’s architecture to minimize the number of systems affected when a failure occurs.
Message-oriented middleware is a transparent middle layer that is integrated into a distributed system to separate service providers from service consumers.
What Is MQ?
A message queue (MQ) is a cross-process communication method that applications use to pass messages between upstream and downstream systems. Let’s break it down:
- Message: Applications communicate by writing and retrieving inbound and outbound data (messages) in a queue.
- Queue: A queue eliminates the need for applications to send and receive at the same time.
This allows decoupling between upstream and downstream applications. The upstream application sends messages to MQ and the downstream application receives messages from MQ. The upstream and downstream applications no longer depend on each other; instead, they only depend on MQ. Due to the queuing mechanism, MQ can act as a buffer between the upstream and downstream applications. Messages from the upstream application are cached, and the downstream application then pulls messages from MQ when it can, reducing peak traffic.
Benefits of MQ
What is decoupling?
High cohesion and low coupling are software engineering concepts. Low coupling implies that individual components are as independent of each other as possible. Simply put, this requires more transparency in calls among modules. The highest level of transparency is when individual calls have no reliance upon each other. To achieve this, we need to reduce the complexity of interfaces, normalize call methods and transmitted information, reduce the dependency among product modules, and improve reusability.
How to decouple?
Decoupling in an overall enterprise architecture mainly involves two aspects: one is to simplify and reduce interaction, and the other is to add a middle layer to separate two services. MQ acts as this middle layer (as shown in the following diagram).
With MQ, the producer and the consumer don’t have to be aware of each other and they don’t have to be online at the same time. The main interaction flow is shown as follows:
- Producer: produces messages and sends messages to MQ through SDK or API calls (either synchronously or asynchronously);
- MQ: receives and persists messages in the message storage (either synchronously or asynchronously);
- The producer receives responses (message delivery status or exceptions) from MQ;
- The consumer subscribes to the messages and then receives messages from MQ;
- The consumer performs service actions corresponding to the messages;
- Then the consumer confirms the consumption results (such as success, failure, or exception).
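The interaction flow above can be sketched with a small in-memory stand-in for the MQ server. This is an illustration under simplifying assumptions, not a real client API: `InMemoryBroker`, `send`, `deliver`, and the `SEND_OK` status string are hypothetical names, and "persistence" is just an in-process queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InMemoryBroker:
    """Hypothetical stand-in for an MQ server (not a real MQ client API)."""
    store: deque = field(default_factory=deque)  # messages "persisted" and awaiting delivery
    acked: list = field(default_factory=list)    # messages whose consumption was confirmed

    def send(self, body: str) -> str:
        self.store.append(body)  # MQ receives and persists the message
        return "SEND_OK"         # the producer receives the delivery status

    def deliver(self, handler) -> None:
        """Hand each stored message to the consumer and record the result."""
        for _ in range(len(self.store)):
            body = self.store.popleft()
            try:
                handler(body)             # the consumer performs its service action
                self.acked.append(body)   # consumption confirmed as successful
            except Exception:
                self.store.append(body)   # failed consumption: keep for redelivery


broker = InMemoryBroker()
status = broker.send("transfer-out:1001")  # producer side
processed = []
broker.deliver(processed.append)           # consumer side
```

The producer only ever talks to the broker, and the consumer could have been started long after `send` returned, which is the decoupling described above.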
Since a system’s busy and idle hours vary, QPS can differ by orders of magnitude between peak and off-peak periods. Especially during marketing campaigns, traffic can instantly jump beyond the load capacity of the backend systems. In these situations, message-oriented middleware can be used to buffer traffic. The MQ client then pulls messages from the MQ server based on its own processing capacity to reduce or eliminate the bottlenecks on the backend systems.
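The buffering effect can be illustrated with a toy simulation: a burst arrives all at once, and the consumer drains it at a fixed rate per tick. The burst size and capacity below are made-up numbers chosen only for illustration.

```python
from collections import deque


def drain(burst_size: int, capacity_per_tick: int) -> list[int]:
    """Return the queue depth after each tick while a consumer pulls at its own pace."""
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")
    queue = deque(range(burst_size))  # the whole burst is buffered in the queue
    depth_per_tick = []
    while queue:
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()           # the consumer pulls only what it can process
        depth_per_tick.append(len(queue))
    return depth_per_tick


print(drain(burst_size=1000, capacity_per_tick=300))  # [700, 400, 100, 0]
```

The backend never sees more than 300 messages per tick; the cost is that the last messages of the burst wait several ticks before being processed.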
For various reasons, as enterprises digitalize, it is inevitable that software products are provided by different vendors and are designed to solve specific problems. These products often cannot expose services externally because of their closed architectures or a lack of core development capability, which presents integration challenges. This problem can be partially resolved by integrating MQ. With MQ, a system only needs a process that produces messages or responds to specific messages and connects to MQ, without having to establish direct connections to other systems.
In order to provide resilient financial services, the dependencies between internal and external systems need to be isolated. There are two types of payment notifications: synchronous notifications and asynchronous notifications. For synchronous notifications, API calls may time out because of network failures or because the service provider responds slowly due to insufficient processing capability. Asynchronous notifications only need to be delivered successfully within a specific period, which improves the end user experience, the transaction success rate, and overall service efficiency.
MQ Model Selection
All choices are inevitably subject to objective and subjective factors. However, we should select architecture and framework models as objectively as possible and avoid retroactively justifying the selection after we see results. I’ll share our MQ model selection process (I am not saying that subjective factors aren’t relevant, but an engineer always needs to consider structure and quantification).
- Cluster support: To ensure the reliability of the message middleware, it must provide mature clustering solutions for producers, consumers, and the middleware itself;
- Persistence support: To avoid message loss, it must support saving messages to disks or another storage medium;
- Message retry support: If message processing fails, dumping or retrying failed messages must be supported, and delivery must be configurable as at-least-once or at-most-once;
- Distributed transactions support: To ensure service integrity, the selected middleware must support distributed transactions;
- Sequential message consumption: In some scenarios, messages must be consumed in the same order as they were sent to ensure correct processing order;
- Message delay support: When processing consumer-facing (2C) services or connecting to third-party data sources, message delivery may need to be delayed. This requires support for delayed delivery;
- Message accumulation and backtracking: There should be no significant impact on performance when a large number of messages are persisted in message-oriented middleware. Message querying, re-sending, or re-consuming messages at specific points in time should be supported to handle scenarios needing these operations.
- Matching between products and current technology stacks: It is easier for team members to understand the principles of message-oriented middleware and extend functionality if they are very familiar with the source code.
- Broad product adoption (especially within the financial industry): Due to relatively homogeneous services and similar scenario requirements, more users typically means better support for critical scenarios and that potential problems are more likely to be exposed. This makes it easier to find appropriate solutions or perform troubleshooting;
- High product availability: Financial enterprises need to ensure highly-available services. Clusters and high availability are essential to basic message platforms that improve enterprise resilience;
- Product stability: The product provides continuous and stable services, without having to be frequently restarted due to resource leakage or performance issues.
- Product activity: The following information should be reflected in GitHub statistics: whether a product is regularly maintained, whether new features are regularly developed and whether bug fixing is regularly performed.
Model Selection Key Points and Principles
- Find and add frameworks that meet critical requirements to the candidate list;
- Filter framework candidates by combining functionality, non-functionality factors and other important aspects;
- Record quantitative information during the filtering process and avoid settling on a preferred result before the actual filtering takes place; model selection should be a game of numbers;
- Sometimes it is useful to think from a different perspective. A commonly compared model is likely to be the best option. For example, since Kafka is often compared with many MQ frameworks, Kafka is probably the most common framework, and if it is on the candidate list, the focus should be analyzing whether it meets your requirements;
- Follow the “practical” principle and favor reason over sentiment;
- The most suitable option is the best choice. High performance and comprehensive features, while important, aren’t everything.
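One way to keep the selection "a game of numbers" is a weighted scoring matrix. The sketch below is only an illustration: the criteria, the weights, the 1 to 5 scores, and the candidate names are hypothetical placeholders, not measurements of any real product.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores; weights do not need to sum to 1."""
    total_weight = sum(weights.values())
    return sum(scores[criterion] * weight for criterion, weight in weights.items()) / total_weight


# Hypothetical weights reflecting how critical each requirement is.
weights = {"ordering": 3, "delay": 3, "transactions": 2, "ecosystem": 1}

# Hypothetical 1-5 scores recorded during filtering (placeholders, not real data).
candidates = {
    "candidate_a": {"ordering": 5, "delay": 4, "transactions": 4, "ecosystem": 2},
    "candidate_b": {"ordering": 3, "delay": 2, "transactions": 2, "ecosystem": 5},
}

ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
for name in ranking:
    print(f"{name}: {weighted_score(candidates[name], weights):.2f}")
```

Fixing the weights before scoring the candidates is what guards against justifying a preferred result after the fact.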
Suggestions on Model Selection
- When selecting a model, you should consider the next three, five, or even ten years. Ted Neward divided technology familiarity into four stages (layman, explorer, expert, and master). It takes about one year to select and popularize a model, and another year for users to become familiar and proficient with it. Nobody wants a product that no longer meets their needs shortly after they have become familiar with it.
- Prioritize your needs, and select a model that best fits your critical needs.
- Decide the selection deadline and carry out the project as scheduled. Practical project implementation is preferred over making theoretical analyses. The product needs to be developed, and team members need to learn more about it. With a clear path ahead, why not set out on the journey immediately?
Comparison of product characteristics
We recommend building a POC environment to verify the relevant functional indicators as well as usability. During testing, specific application scenarios should be set up based on the features provided by MQ to verify that the service functions work as expected.
Performance testing: Performance results depend on many factors, such as the environment the product runs in, how it is configured, and which stress testing scripts and reports are used.
Indicator comparison: In addition to TPS (the sender’s TPS and the TPS of the consumers’ final processing service), a number of other factors should be compared, such as latency, the number of simultaneous online connections supported (data volume of the producer and data volume of the consumer), Topic configuration (the number of topics, and how the number of queues for each Topic relates to the data volume of the producer and of the consumer), and the performance indicators of the server (CPU, memory, disk IO, and network IO).
Fatigue testing: After running at a certain load level continuously for 24 hours, one week, or longer, how stable is the product? How do the server indicators look? Is a slow upward trend observed?
Restart or failure rehearsal: Restart (or kill) some or all instances of NameServer (the registry), Broker, Producer, and Consumer in turn, and simulate disk failures and network failures, then observe the impact on applications. For example, check whether the RocketMQ service recovers, whether the producer and the consumer restore their services, and whether any messages are missing or duplicated.
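The check for missing or duplicated messages after a failure rehearsal can be automated by comparing the IDs logged on both sides. The sketch below assumes the test harness records every ID the producer sent and every ID the consumer received; `audit_delivery` is a hypothetical helper, not part of RocketMQ.

```python
from collections import Counter


def audit_delivery(sent_ids, received_ids):
    """Compare IDs recorded by the producer and the consumer during a failure drill."""
    received = Counter(received_ids)
    missing = sorted(set(sent_ids) - set(received))                  # sent but never consumed
    duplicated = sorted(i for i, count in received.items() if count > 1)  # consumed more than once
    return missing, duplicated


missing, duplicated = audit_delivery(
    sent_ids=[1, 2, 3, 4, 5],
    received_ids=[1, 2, 2, 4, 5],
)
print("missing:", missing, "duplicated:", duplicated)
```

Duplicates are usually tolerable when consumers are idempotent, while missing messages point to a persistence or failover problem that deserves investigation.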
Reasons for Selecting RocketMQ
- ActiveMQ supports numerous protocols and provides extensive documentation, but it has some limits on performance and sequential delivery;
- Kafka has high throughput, but it provides limited support for distributed transactions, retry of failed consumption, and delayed/scheduled messages;
- RabbitMQ features easy integration with SpringBoot, but it provides insufficient support for distributed transactions and delayed/scheduled messages;
- RocketMQ provides good support for high throughput, sequential messages, delayed messages, and message accumulation, but its support for multiple protocols and multi-language clients is limited.
The following are the reasons why we eventually chose RocketMQ:
- Some financial service scenarios have high requirements on sequential messages;
- Some financial service scenarios require delayed messages;
- Support for distributed transactions is required to ensure eventual consistency;
- Message checking, which is needed to ensure consistency, requires the message-oriented middleware to support message queries;
- RocketMQ provides high consistency (persistence and message retry);
- The Java technology stack used in the industry does not currently require multi-protocol and multi-language support;
- RocketMQ users include well-known Internet finance companies in China, representatives of Internet banks and direct banks (for example, China Minsheng Bank), and enterprise software providers. Judging from the model selection requirements listed above and the main RocketMQ users, RocketMQ is suitable for financial scenarios and well recognized in the industry. More requirements will be met in the future as financial customers continue to use RocketMQ.
Suggestions on Secondary Encapsulation
Encapsulation mainly refers to the abstraction and encapsulation of services, technologies, and data. Encapsulation has the following advantages:
- Transparency: The SDK glue layer added by the secondary encapsulation hides implementation details.
- Reusability: It improves the reusability of the basic code as well as code maintainability and reliability.
- Security: Normalized operations can improve security.
Encapsulate norms into the basic code to apply uniform interaction standards inside an enterprise. These norms include:
- Topic naming norms
- Producer naming norms
- Consumer naming norms
- A universal idempotence scheme encapsulated on top of the MessageId, the message key, or a service ID
With programming norms, we can use names to locate service scenarios such as the corresponding projects and modules, so that the right people can be brought together quickly to handle unknown problems. Of course, it is also necessary to manage metadata such as topics, producers, and consumers, because naming norms cannot hold too much information. Naming norms can also help avoid conflicts. For example, conflicts or misunderstandings among topics can cause message consumption or service transaction problems; conflicts among consumers’ GroupIDs may cause message loss.
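A naming norm is easiest to enforce when the encapsulation layer validates names before they reach the broker. The sketch below assumes a hypothetical convention of `<project>_<module>_<event>` in lowercase; the pattern and the example name are illustrative and not a RocketMQ requirement.

```python
import re

# Hypothetical in-house convention: <project>_<module>_<event>, all lowercase.
TOPIC_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<module>[a-z0-9]+)_(?P<event>[a-z0-9-]+)$"
)


def parse_topic(name: str) -> dict[str, str]:
    """Validate a topic name and return the project, module, and event it encodes."""
    match = TOPIC_PATTERN.match(name)
    if match is None:
        raise ValueError(f"topic name violates the naming norm: {name!r}")
    return match.groupdict()


print(parse_topic("corebank_account_transfer-out"))
```

Because the name itself encodes the owning project and module, an operator looking at a misbehaving topic can tell which team to contact without consulting a separate registry.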
Encapsulation and customization can improve the management abilities, such as batch-querying information, batch-resending information, and message checking.
RocketMQ is an important piece of basic middleware that improves overall service resilience and plays a significant role in various financial transactions.
Let’s take the current account transfer-out scenario for example.
- For security reasons, an SMS notification will be sent for each transfer-out transaction in the current account. This can be achieved through RPC calls, as shown in the following diagram:
- Due to business development as well as security and anti-fraud concerns, anti-fraud services are set up, and event tracking data in the payment scenario needs to be sent for each transfer-out transaction. Two solutions are available for this purpose. The first, shown in the following diagram, keeps using RPC calls: the transfer-out response time becomes the processing time A + SMS call time B + anti-fraud service call time C (or A + max(B, C) if the RPC calls are made asynchronously). In addition to the longer response time, there are more points of failure. If the anti-fraud service or the SMS service goes down, it is very likely that the transfer-out service of the current account goes down as well (the impact of failed calls between systems can be greatly reduced if circuit breaking and isolation are used).
- The other is a proactive solution. Since a second service now needs to know about the event, a third service, a fourth service, or even more will probably follow (as shown in the following figure). In this scenario, the stability of the transfer-out service in the current account is further reduced while the complexity increases sharply, because the requirements of all N systems have to be taken into consideration when processing the transfer-out logic.
- From the domain model perspective, should the transfer-out service for the current account be responsible for handling these events? Obviously not. The transfer-out service should focus only on the account action itself. Other services should depend on the transfer-out event to build their own business logic. Message-oriented middleware is suitable for decoupling in this scenario. Please see the following diagram for details.
This leaves each service as it should be. The transfer-out service only produces the messages of the transfer-out event, without having to handle other services’ dependency on it. Services that require the transfer-out event can subscribe to the produced messages. The transfer-out service and other services are independent.
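The publish/subscribe arrangement can be sketched in a few lines. Everything here is a hypothetical in-process stand-in (the `InMemoryTopics` class, the topic name, and the handlers); a real MQ delivers asynchronously and retries failed consumption, whereas this sketch calls subscribers inline and only records failures.

```python
from collections import defaultdict


class InMemoryTopics:
    """Hypothetical in-process stand-in for MQ topics and subscriptions."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.failed = []  # (topic, subscriber, event) entries a real MQ would retry

    def subscribe(self, topic, name, handler):
        self._subscribers[topic].append((name, handler))

    def publish(self, topic, event):
        for name, handler in self._subscribers[topic]:
            try:
                handler(event)
            except Exception:
                # A failing subscriber never propagates back to the publisher.
                self.failed.append((topic, name, event))


mq = InMemoryTopics()
sms_sent = []


def anti_fraud(event):
    raise RuntimeError("anti-fraud service is down")  # simulated outage


mq.subscribe("transfer-out", "sms", sms_sent.append)
mq.subscribe("transfer-out", "anti-fraud", anti_fraud)


def transfer_out(account, amount):
    """The transfer-out service only publishes the event; it knows no subscribers."""
    event = {"account": account, "amount": amount}
    mq.publish("transfer-out", event)
    return "OK"


result = transfer_out("A-1", 100)
```

Adding a third or fourth downstream service is one more `subscribe` call; `transfer_out` does not change, and the simulated anti-fraud outage does not affect its result.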
Third-Party Delayed Calls
For connections to external banking systems, asynchronous calls are sometimes needed to obtain response results N seconds after service requests are sent. Before delayed messages provided by RocketMQ were used, the common method was to store data in databases or a Redis cache first and then use scheduled polling tasks to perform operations. This method has the following disadvantages:
- It is not a universal solution, and a separate polling task is required for each system;
- Polling tasks may influence performance as the data volume increases;
- The polling period cannot be set too short (it is usually measured in minutes) because a short period increases the stress on the databases, so timeliness is poor;
- The retry mechanisms targeting polling failures need to be implemented separately.
RocketMQ can bring the following advantages:
- It provides a universal delay solution that implements the isolation between message production and consumption;
- It supports high concurrency and has excellent horizontal scalability;
- Timeliness can be significantly improved (accurate down to seconds).
However, the current version does not support arbitrary delay durations; a delay can only be selected from the levels configured in messageDelayLevel. This requires either planning the delay levels before setting up RocketMQ or extending the RocketMQ delay source code to support a specific time granularity.
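Planning around fixed delay levels usually means mapping a desired delay to the nearest configured level. The sketch below assumes the broker keeps the default open-source messageDelayLevel setting (18 levels from 1s to 2h); if the broker configuration differs, the string must be changed to match. The helper names are hypothetical.

```python
# Assumed default broker setting; adjust to match the actual messageDelayLevel value.
DEFAULT_DELAY_LEVELS = "1s 5s 10s 30s 1m 2m 3m 4m 5m 6m 7m 8m 9m 10m 20m 30m 1h 2h"
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_levels(spec: str) -> list[int]:
    """Convert a messageDelayLevel string into a list of delays in seconds."""
    return [int(item[:-1]) * UNIT_SECONDS[item[-1]] for item in spec.split()]


def level_for_delay(seconds: int, spec: str = DEFAULT_DELAY_LEVELS) -> int:
    """Return the 1-based level with the smallest delay that is at least `seconds`."""
    for level, delay in enumerate(parse_levels(spec), start=1):
        if delay >= seconds:
            return level
    raise ValueError(f"no configured delay level covers {seconds} seconds")


print(level_for_delay(10))  # 3, because the third level is 10s
print(level_for_delay(45))  # 5, rounded up to the 1m level
```

The returned number is what a producer would pass as the delay level of a message (for example, through `setDelayTimeLevel` in the Java client). Note that a 45-second request is rounded up to one minute, which is exactly the granularity limitation described above.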
Scenarios Not Suitable for MQ
Why don’t all calls use MQ when it has so many advantages? As economics teaches us, benefits always come with costs.
Disadvantages of MQ:
- Its flexibility increases the overall system complexity.
- Temporary inconsistency may occur because the latency of asynchronous calls is greater than that of synchronous RPC calls.
- It cannot provide strong consistency, which needs to be handled by distributed transactions.
- The service consumer requires idempotence design to avoid repeated calls.
So in the normal course of software development, we don’t have to purposely look for application scenarios for message queues. Instead, when performance bottlenecks occur, we should check whether the service logic includes time-consuming operations that can be processed asynchronously. If such operations are present, use MQ to process them. Otherwise, the blind use of message queues may increase the cost of maintenance and development while providing insignificant performance improvement, which is not worth the investment.
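The idempotence requirement listed among the disadvantages can be sketched as a thin wrapper around the business handler. This is a minimal illustration: `IdempotentConsumer` is a hypothetical class, and the in-memory set stands in for what would normally be a database table or cache with a unique constraint on the business key.

```python
class IdempotentConsumer:
    """Skip messages whose business key has already been processed."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: in production this would be durable storage, not process memory.
        self._processed_keys = set()

    def on_message(self, business_key: str, body: dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if business_key in self._processed_keys:
            return False
        self._handler(body)
        self._processed_keys.add(business_key)  # mark only after successful handling
        return True


balance = {"value": 100}


def debit(body):
    balance["value"] -= body["amount"]


consumer = IdempotentConsumer(debit)
first = consumer.on_message("txn-1", {"amount": 30})
second = consumer.on_message("txn-1", {"amount": 30})  # redelivered duplicate
print(balance["value"], first, second)
```

The account is debited once even though the message arrives twice, which is why a business-level key (rather than delivery count) is a common basis for deduplication.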
Problems encountered with incorrect usage:
- An improper number of topic queues causes unbalanced load, which negatively affects the stability of the Broker in RocketMQ.
- Message accumulation causes longer delivery latency.
- Consumer A consumes the message intended for consumer B due to non-standard consumer cluster ID settings.
- The downtime of one Broker node due to inappropriate topic settings causes exceptional message delivery and consumption.
- Due to non-standard topic and consumer GroupID naming, RocketMQ topics become disordered after running for some time, and their resources can no longer be cleaned up or reclaimed.
Since failures are inevitable, self-management, self-recovery, self-configuration, and other necessary capabilities need to be built in through a series of mechanisms when designing applications. As cloud-native architectures become more popular, MQ is not the only option for designing resilient systems that handle heavy load and various failures. However, MQ is still a good choice and worth a try.