5-year Evolution of Ele.me’s Transaction System — Part 5
Message Cluster Splitting
Let’s analyze the causes of the accident described earlier. In our performance tests on RabbitMQ clusters, the throughput was acceptable. However, the CPU load was very high and triggered RabbitMQ’s self-protection mechanism. As a result, producers had trouble sending messages, and the overloaded nodes even crashed.
With the help of the architect, we finally traced the problem back to the public SOA framework used on the merchant client. Its RabbitMQ client had been packaged independently by that department without a good understanding of some RabbitMQ client parameters, such as the get and fetch modes and the prefetch_count parameter in fetch mode. Choosing proper values for these parameters requires some calculation. Otherwise, consumption capacity cannot be improved even if a machine has spare CPU resources.
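As a rough illustration of the kind of calculation involved, the sketch below estimates a prefetch_count from two measured values. The heuristic (network round-trip time divided by per-message processing time, with some headroom) is a common rule of thumb and an assumption here, not the formula our team used. The function name, its parameters, and the default headroom are hypothetical.

```python
import math


def estimate_prefetch_count(round_trip_ms: float,
                            processing_ms: float,
                            headroom: float = 1.5) -> int:
    """Rule-of-thumb prefetch estimate (an assumption, not Ele.me's formula).

    While one acknowledgement travels to the broker and the next message
    travels back, the consumer can process roughly
    round_trip_ms / processing_ms messages. Buffering that many messages,
    plus some headroom, keeps the consumer from sitting idle.
    """
    if round_trip_ms <= 0 or processing_ms <= 0:
        raise ValueError("timings must be positive")
    return max(1, math.ceil(round_trip_ms / processing_ms * headroom))


if __name__ == "__main__":
    # Example: 50 ms round trip, 5 ms of work per message.
    print(estimate_prefetch_count(50, 5))
```

A value that is too low leaves the consumer waiting on the network even when CPU is available. A value that is too high lets one consumer hoard messages that other consumers could have processed.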
OK, so how does this problem relate to the order system? The answer is hybrid deployment. The cluster used virtual hosts to separate message broadcasts for different businesses, so messages for orders, waybills, and merchant clients all shared the same cluster.
On the day of the accident, the head of the Operation Technology Department demanded that we set up an independent message broadcast cluster for the order system that very day, no matter how we had to move machines around. Working with the Operation Technology Department, we immediately coordinated with all consumers and set up a seven-node cluster that evening. This allowed us to separate order message broadcasts from the shared cluster.
(One year later, the cluster reached its bottleneck that could not be resolved through scale-up. The main reasons were: First, the consumers did not use the features of RabbitMQ to listen to messages, but filtered messages locally. As a result, some processing resources were wasted. Second, as the cluster grew in size, the number of connections could not increase beyond a certain point. We mitigated the second problem by enabling the producer to send a copy of messages to the newly built cluster. However, the ultimate solution was replacing RabbitMQ with MaxQ. MaxQ was developed by Ele.me by using the Go language after we got tired of the problems with RabbitMQ.)
PS: If we could go back in time, I would add the third step of using an asterisk (*) to subscribe to messages and require all consumers to modify the asterisk (*) as needed. After the improvement, we suffered performance deterioration due to insufficient control and governance. Standards, best practices, and recommendations had been provided in the initial documentation. Subsequently, some calculation formulas with adjustable parameters were also provided. However, we could not expect that all consumers would comply with the standards, nor could we depend on the Operation Technology Department to fully control the system. Service providers should also participate in system control.
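To make the difference between broker-side and local filtering concrete, here is a small pure-Python sketch of the pattern rule that AMQP topic exchanges apply to binding keys: words are separated by dots, `*` matches exactly one word, and `#` matches zero or more words. It is only an illustration of the rule. The routing keys are hypothetical, and we are not claiming that the order cluster used a topic exchange.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP-style topic pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too.
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more of the remaining words.
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j == len(k):
            return False
        # '*' matches any single word; otherwise words must be equal.
        return (p[i] == "*" or p[i] == k[j]) and match(i + 1, j + 1)

    return match(0, 0)


if __name__ == "__main__":
    # Hypothetical routing keys, for illustration only.
    print(topic_matches("order.*", "order.created"))
    print(topic_matches("order.#", "order.created.v2"))
```

When a consumer binds its queue with a narrow pattern, the broker delivers only matching messages. A consumer that subscribes to everything and discards most messages locally pays for the transfer and the deserialization of every message it throws away.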
Virtual Commodity Transactions and Innovations
Breakfast Order System
From late 2015 to early 2016, breakfast orders made up only a small share of Ele.me’s orders, yet the breakfast business had a significant impact on the technical architecture at that time.
The interaction between takeout and breakfast is shown in the following figure.
I’m sure many of you have questions about this.
Let me explain the background first:
- We had set up a new breakfast system that was independent of our foodservice business. The system included users, shops, orders, distribution, and other nodes.
- The payment system was specially customized for the takeout business because we were unable to develop an independent payment system and the payment system was coupled to the user system before early 2016.
Therefore, for the sake of quick trial and error, the “innovative” department set up a complete e-commerce prototype for the “innovative business”. In order to use the payment feature, the department “borrowed” the transaction pipeline of the takeout business. This plan was decided on and implemented by the R&D engineers responsible for the breakfast system and the R&D engineers responsible for the payment system. The order system unconsciously became little more than a tool.
When I first learned about the breakfast system, it was already like this. At that time, PPE (the pre-production environment) and PROD were not completely separated. An incorrect operation once caused the asynchronous tasks of PROD to be pulled to PPE and then transferred all at once. In the end, no worker consumed the tasks, which caused the affected orders to be canceled.
Ele.me Delivery Membership Cards
At the beginning of 2016, the business team proposed that Ele.me delivery membership cards be virtualized and sold online. Previously, the cards were physical cards and were promoted by delivery riders offline. At that time, we had just passed the architecture review and needed a low-traffic business model to practice our new architecture vision. Therefore, we developed an order system for selling virtual goods.
We abstracted the simplest state model, as shown in the following figure.
Our main ideas were as follows:
- All transactions must be essentially the same, so main nodes are relatively stable.
- The purchase behavior on the customer side is relatively simple, while the delivery on the business side may vary.
- The more critical the system, the simpler it should be.
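The ideas above can be sketched as a small explicit state machine. The figure with our actual states is not reproduced here, so the state names and allowed transitions below are hypothetical placeholders that only show the shape of such a model.

```python
from enum import Enum


class OrderState(Enum):
    # Hypothetical states; the real model may use different ones.
    CREATED = "created"
    PAID = "paid"
    DELIVERED = "delivered"
    CLOSED = "closed"


# Each state lists the states it may move to. Keeping this table small
# is what keeps the critical path simple.
ALLOWED_TRANSITIONS = {
    OrderState.CREATED: {OrderState.PAID, OrderState.CLOSED},
    OrderState.PAID: {OrderState.DELIVERED, OrderState.CLOSED},
    OrderState.DELIVERED: set(),
    OrderState.CLOSED: set(),
}


def transition(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


if __name__ == "__main__":
    state = transition(OrderState.CREATED, OrderState.PAID)
    print(state.name)
```

In this sketch, business-specific delivery logic stays outside the table: the business side performs its own delivery and then asks the transaction system to move the order to the next state.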
The preceding figure shows the interactions between upstream and downstream nodes. The business team was responsible for product management, marketing, and shopping guidance. The core responsibility of the transaction system was to provide a channel to carry transaction data.
During data design, we thought the buyer and the seller, the target product, and the transaction stage were the most essential elements. Certainly, I can now come up with a more standard model, but we did not think too much back then.
We split the main transaction table into a basic table and an extension table.
The basic table included the buyer ID, the seller ID, the status code, the business type, and the payment amount. Business types were used to distinguish between buyer and seller systems.
The extension table included target products, marketing information, the recipient’s phone number, and other details. This table gave the business side the freedom to enter other information.
(Later, we found that although the upstream node could freely control the target products and marketing information, we had to constrain the paradigm at the code level. Otherwise, the governance would be complex because the business side could insert any information.)
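One simple way to constrain the extension data at the code level is to accept only a known set of fields before serializing it. The sketch below is an assumption about how such a check could look, not our actual implementation. The field names are hypothetical and loosely follow the extension table described above.

```python
import json

# Hypothetical whitelist of fields the business side may store.
ALLOWED_EXTENSION_KEYS = {"target_products", "marketing_info", "recipient_phone"}


def serialize_extension(extension: dict) -> str:
    """Validate extension data against the whitelist, then serialize it."""
    unknown = set(extension) - ALLOWED_EXTENSION_KEYS
    if unknown:
        raise ValueError(f"unsupported extension fields: {sorted(unknown)}")
    # sort_keys gives a stable representation that is easier to compare.
    return json.dumps(extension, sort_keys=True)


if __name__ == "__main__":
    print(serialize_extension({"recipient_phone": "13800000000"}))
```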
We split the table for two reasons: First, once an order is generated, the function of snapshots is basically completed. The most critical work remaining is state maintenance, and high-frequency operations are also concentrated on states. Therefore, minimizing the size of each entry helps ensure the core process. Second, according to our experience with food orders, 2/3 of the storage space is used for details, especially several JSON fields.
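The split can be illustrated with a minimal sketch that uses SQLite from the Python standard library. The table and column names are hypothetical and the database choice is only for illustration. The point is that frequent status updates touch the narrow basic table, while the bulky details are written once and left alone.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a narrow basic table and a separate extension table."""
    conn.executescript("""
        CREATE TABLE order_basic (
            order_id         INTEGER PRIMARY KEY,
            buyer_id         INTEGER NOT NULL,
            seller_id        INTEGER NOT NULL,
            status_code      INTEGER NOT NULL,
            business_type    INTEGER NOT NULL,
            pay_amount_cents INTEGER NOT NULL
        );
        CREATE TABLE order_extension (
            order_id    INTEGER PRIMARY KEY REFERENCES order_basic(order_id),
            detail_json TEXT NOT NULL
        );
    """)


def create_order(conn, buyer_id, seller_id, business_type,
                 pay_amount_cents, detail_json) -> int:
    """Write the snapshot once: one basic row and one extension row."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO order_basic (buyer_id, seller_id, status_code, "
            "business_type, pay_amount_cents) VALUES (?, ?, 0, ?, ?)",
            (buyer_id, seller_id, business_type, pay_amount_cents),
        )
        order_id = cur.lastrowid
        conn.execute(
            "INSERT INTO order_extension (order_id, detail_json) VALUES (?, ?)",
            (order_id, detail_json),
        )
    return order_id


def update_status(conn, order_id: int, status_code: int) -> None:
    """High-frequency path: only the small basic row is rewritten."""
    with conn:
        conn.execute(
            "UPDATE order_basic SET status_code = ? WHERE order_id = ?",
            (status_code, order_id),
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    oid = create_order(conn, 1, 2, 7, 1500, '{"target_products": []}')
    update_status(conn, oid, 1)
    print(conn.execute("SELECT status_code FROM order_basic").fetchone()[0])
```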
After we set up the entire virtual order system, many platform sales services were connected to the platform by using this system. We only spent two to three days developing and testing service access capabilities. Generally, a service could be released within a week. We were very happy, and so was the frontend business team. Because we did not have to deal with large-scale query scenarios, the system could easily complete hundreds of thousands of orders per day, and several dozen CPU cores were more than sufficient.
This system was actually the prototype of a simple PaaS system.
We actually created some derivative services related to transactions. However, due to the impact of the organizational structure, the order team was usually responsible for these services at that time.
For example, our team was in charge of “Punctual Delivery”, and we implemented this technology from scratch. At the same time, a Transaction Compensation Center was derived from this technology to close all claims (such as red envelopes, vouchers, cash, and reward points) during a transaction process.
To improve the user experience involving transactions, we launched a Transaction Notification Center, which later became a universal notification center in Alibaba. The Transaction Notification Center integrated all notification methods used for transactions, including SMS messages, push notifications, and phone calls. This increased the notification delivery rate in extreme cases and reduced the annoyance of repeated notifications.
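The two goals mentioned here, a higher arrival rate and fewer repeated notifications, can be sketched as channel fallback plus deduplication. This is a simplified illustration under our own assumptions, not the design of the real Transaction Notification Center. The class, the channel functions, and the in-memory set are hypothetical, and a production system would need a shared store for the deduplication keys.

```python
from typing import Callable, List, Optional, Set, Tuple

# A channel takes (user_id, text) and reports whether delivery succeeded.
Channel = Callable[[str, str], bool]


class NotificationCenter:
    def __init__(self, channels: List[Tuple[str, Channel]]):
        self._channels = channels                 # tried in the given order
        self._sent: Set[Tuple[str, str]] = set()  # (order_id, event) delivered

    def notify(self, order_id: str, event: str,
               user_id: str, text: str) -> Optional[str]:
        """Deliver once per (order, event); fall back across channels.

        Returns the name of the channel that delivered the message, or
        None if it was a duplicate or every channel failed.
        """
        key = (order_id, event)
        if key in self._sent:
            return None  # already delivered, do not bother the user again
        for name, send in self._channels:
            if send(user_id, text):
                self._sent.add(key)
                return name
        return None


if __name__ == "__main__":
    center = NotificationCenter([
        ("push", lambda user, text: False),  # simulate a failed push
        ("sms", lambda user, text: True),
    ])
    print(center.notify("o1", "paid", "u1", "Payment received"))
```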