Design Ideas for Improving the Transaction System of Ele.me, Alibaba’s Food Delivery Service

Jul 8, 2020

By Sheng He, nicknamed Baicha at Alibaba. Sheng He is from Alibaba’s Local Lifestyle Mid-End R&D Department. He has years of experience in transaction system development.

I joined the business department of Ele.me, Alibaba’s food delivery service, in May 2017 and developed a range of systems related to searching, ordering, timeout, compensation, agreements, delivery, amount calculation, and rating. Later, I was also involved in system upgrade work.

I wrote this article after phase one of the transaction system reconstruction project to reflect on the decision-making process involved. In this article, I choose not to use the term “architecture” to avoid conveying a sense that we would be talking about extremely important decisions and engaging in abstruse technical analysis. Rather, what we did was more of a reconstruction.

Bixuan, a fellow researcher at Alibaba, wrote the following in the article “Routine of System Design”:

I reviewed my past work in system design and found that I followed this routine sequence when designing a system: purpose of system design > goals of system design > goal-centered core design > design principles formulated based on core design > detailed design of each subsystem and module.

One important takeaway is that the first step of system design is to clarify a purpose and formulate measurable goals.

“Soft”ware

As Robert Cecil Martin once said, the word “software” contains the word “soft”, which implies that software is a fundamentally flexible product.

The code of the first edition of the transaction system can be traced back to eight years ago, when the system was disassembled and reconstructed. When I joined Ele.me in 2017, the main system looked something like this:

The system runs online services that handle millions, and even tens of millions, of orders. Based on the performance observed during stress tests, the system can stably support several times the current business volume as long as no changes are made. Once changes are introduced, however, its stability becomes uncertain.

Since I joined Ele.me two years ago, the businesses on the system have changed from only having restaurant take-out services to include new retail and branded catering services, and we now also support commercial deliveries. This means the system must support a growing range of differentiated businesses and parallel business launches. In addition, the change in the company’s organizational structure requires that projects be completed through the collaboration of three teams. This doubles the costs of communication and coordination. As a result, the R&D team cannot fully plan the evolution roadmaps of most systems.

Several months ago, the business team posed a simple requirement: automatically review transaction ratings and impose penalties accordingly. The following figure shows the domain model at the core of the rating system.

This requirement involved several modifications to multiple rating submodules. This workload far exceeded our original expectations. Naturally, the business team was not satisfied. Similar conflicts often occur in other systems. The R&D team failed to develop new functions, only making minor modifications to the system.

This is because it is very difficult to modify most systems based on modified requirements. Minor requirements proposed by the business team may require extensive system modifications. Systems are not supposed to evolve like this. They must be simple and flexible.

Therefore, the core goal of system design is to use an effective software architecture to save manpower during the development and maintenance stages of a project, make each change simple and easy to implement while avoiding bugs, and meet functional and flexibility requirements to the maximum extent at minimal cost.

Source Code is the Design

Architecture design is more than a series of clear-cut architecture diagrams. In 1992 Jack Reeves published the essay “The Source Code is the Design”, in which he proposed the following opinion:

The design of a high-level structure is not complete software design but a structural framework of the detailed design. Our ability to strictly verify high-level design is very limited. Detailed design will eventually have the same impact on high-level design as other factors (or permit such impact). Improving all aspects of design is a process that must go through the entire design cycle.

After trial and error, I have found that the emphasis on detailed design has practical implications. Simply put, top-down designs are unreliable and coding is a part of the design process. I think that system design must proceed from bottom to top. To put it another way, good high-level designs come from the continuous evolution of the abstraction level.

Programming Paradigms

Coding is the starting point for designing a system from bottom to top. The transaction system of Ele.me was developed based on Python, which is flexible enough to quickly produce MVP system versions. This was perfectly adapted to the company’s development status at that time: fast product iteration and high pressure from new projects.

The recent reconstruction uses Java to conform to the company’s development trend. At the end of 2017, we wrote some new services in Go because we expected the existing system framework to encounter a bottleneck when the number of orders reached the next level. However, some developers were not accustomed to writing services in Go because it lacked a framework, generics, and try-catch. Go is not the best choice for solving business problems. However, the simple syntax of Go can minimize the probability of errors on the part of programmers.

Python is highly expressive and flexible, but many programmers misuse it: they overuse its dynamic features and make a great many errors. This may hamper the management and maintenance of large projects. According to the author of Rails, “flexibility is overrated, and constraints are a type of liberation.” This makes sense.

Drawing on my programming experience in C++, Go, Python, and Java, I consider programming paradigms to be an important way to learn any programming language. Simply put, programming paradigms help programmers define what a program is. Sadly, they are often ignored. The old transaction system was written solely in a procedure-oriented programming (POP) style, with no consideration of the business logic, and similar code abounds.

We seem to have forgotten about object-oriented programming (OOP), but I do not mean to suggest that OOP is the optimal programming paradigm. We are supporters of problem-oriented programming. For example, OOP is a core component of Java, but it is not necessarily required by business processes. POP is suitable for business processes when each step is clearly determined. In this case, complex class design is unnecessary and may even cause trouble.

A problem can be divided into different levels, each of which can be solved in an appropriate way. For example, you can use OOP to solve high-level problems and use functional programming (FP) when executing specific logic. We wrote FP-based underlying compute services in Go, which features high performance, simple syntax, and a low probability of errors. We used Go to effectively solve problems.

OOP has shown it can support complex and extensive software design in diverse business scenarios involving transactions. Therefore, we made our first decision by choosing to use a combination of programming paradigms, with OOP at the core.

Principles and Patterns

The difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships. — Linus Torvalds

The basic modules written in a programming language according to a programming paradigm have an impact on the final result. In my opinion, the relationships in the quote from Linus Torvalds are the interactions between classes. The quality of these relationships indicates whether the software design is good or bad. Poorly designed software structures have the following characteristics:

Rigid: Software modification is complex and often requires further changes. For example, when a new marketing type is added to the ordering service, the order center and related upstream and downstream elements must perceive the change and make modifications accordingly.
Fragile: Simple changes may cause unexpected problems in places that are unrelated to the purpose of the change.
Immobile: The design contains parts that would be useful to other systems, but separating these parts is risky and costly. At Ele.me, this showed up as an order center that might not support payment by member cards or other virtual means for takeout.
Unnecessarily complex: The system is overdesigned.
Obscure: With the passage of time, modules and code become more difficult to understand. At Ele.me, this showed up as the core code of the shopping cart feature growing into a single large function spanning nearly 1,000 lines.

After determining an appropriate programming paradigm, we need to extract an upper level to focus on the logic beyond the code. Basic principles and models have been accumulated over years of software engineering and can guide us through encapsulating data and functions, which then can be organized to develop a program.

SOLID

These principles comprise the single responsibility principle (SRP), open closed principle (OCP), Liskov substitution principle (LSP), interface segregation principle (ISP), and dependency inversion principle (DIP), which together are abbreviated as SOLID. Next, we will provide examples of several of these principles.

Single responsibility principle (SRP): One software module is responsible for only one user type. Therefore, the code and data closely related to a user type must be organized together. Most of the time we discover and then divide responsibilities.

In my opinion, the definition of a user is at the core of SRP. I would like to quote Yu Jun’s definition of a user from QCon 2018: “A user is not a person but a set of requirements.” During the reconstruction process, we had a debate about the delivery procedure of the transaction system. Currently, Ele.me supports several distribution modes: distribution by merchants, platform-managed distribution, and selective distribution, such as running errands. These distribution modes have different pricing models, distribution logic, and scenarios. Therefore, we initially divided our code based on these differences.

Later, new retail merchants and food service merchants were split from the merchant group, and the operation models of business parties changed accordingly. This led to different requirements in each distribution mode. To adapt to these changes, we performed secondary separation.

If you find it difficult to analyze from the perspective of SRP, look at the code with conflicts due to merged branches. Remember that different programmers may modify the same module at the same time to meet different requirements.

Dependency inversion principle (DIP): Some consider dependency inversion to be what differentiates OOP from POP. In procedural design, policies depend on details. That is, the upper level depends on the lower level, so a change in the details may force a change in the policies. For example, you can use POP as follows to allow merchants to give away coupons to users who did not receive their takeout orders.

The platform needs to retain dissatisfied users by giving away red envelopes for repurchases because coupons can be used only at specified stores. This requires us to modify the code to add the dependency of the red envelope compensation logic.

However, the problem can be solved even more elegantly based on DIP.

This solution applies to complex scenarios. OOP inverts the dependency of policies on details so that details depend on abstractions. An interface is usually owned by its clients rather than by its implementers, so getting the abstraction right is the key.
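
The coupon and red envelope example above can be sketched as follows. This is a minimal illustration under stated assumptions, not Ele.me’s actual code: the names CompensationPolicy, CouponCompensation, RedEnvelopeCompensation, and compensate_undelivered_order are all hypothetical. The point is only that the high-level compensation flow depends on an abstraction, and each concrete compensation type depends on that same abstraction.

```python
from abc import ABC, abstractmethod


class CompensationPolicy(ABC):
    """Abstraction owned by the high-level compensation flow (hypothetical name)."""

    @abstractmethod
    def compensate(self, user_id: str) -> str:
        """Issue a compensation to the user and return a description of it."""


class CouponCompensation(CompensationPolicy):
    """Detail: a merchant coupon, usable only at the issuing store."""

    def __init__(self, shop_id: str) -> None:
        self.shop_id = shop_id

    def compensate(self, user_id: str) -> str:
        return f"coupon for shop {self.shop_id} issued to {user_id}"


class RedEnvelopeCompensation(CompensationPolicy):
    """Detail: a platform red envelope, usable at any store."""

    def compensate(self, user_id: str) -> str:
        return f"platform red envelope issued to {user_id}"


def compensate_undelivered_order(user_id: str, policy: CompensationPolicy) -> str:
    # The high-level flow knows only the abstraction, so adding a new
    # compensation type does not require changing this function.
    return policy.compensate(user_id)


print(compensate_undelivered_order("u1", CouponCompensation("s42")))
print(compensate_undelivered_order("u1", RedEnvelopeCompensation()))
```

With this shape, adding the red envelope logic means adding one class instead of editing the compensation flow, which is the change that the procedural version forces.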

Open closed principle (OCP): OCP is the first goal of system design and the final result of the other principles. For example, the modules of each business line are split based on SRP, which isolates their changes from each other. The platform abstracts the core business process and lets each business line define its own steps within that process. This requires DIP.
Inversion of control (IoC) is another principle of programming. For example, on a takeout transaction platform, a user pays for a meal and then the merchant delivers the meal. This requires strong coupling between user and merchant, who must meet in person to complete the transaction. Ele.me provides a guarantee for the transaction. The user’s payment is deposited to Ele.me, and then the merchant takes the order and delivers the meal. Finally, Ele.me remits the payment to the merchant after the user confirms to have received the meal. The direct dependency and control between seller and buyer are inverted so that the other party depends on the interface of a standard transaction model.

You can discover these principles in action by summarizing the rules behind working code. No matter which principle is used, we need to constantly modify code based on actual requirements. Principles must be used conditionally. Rigid conformity to principles may result in unnecessarily complex code. For example, code that uses the Factory pattern may contain “new” in violation of DIP.

The Evolution to Patterns

Here by patterns I mean design patterns. I use the word “evolution” because design patterns are not the beginning but the destination of design. The book Design Patterns is not based on the authors’ original ideas but collects common practices accumulated in many actual systems. These practices were systematically organized and presented for the first time in this book. Design patterns may be naturally reflected in system code as long as we conform to the preceding principles. In Agile Software Development: Principles, Patterns, and Practices, one chapter describes the process by which a segment of code slowly evolves into the Observer pattern through adjustment.

Design patterns are helpful. For example, we can use the Template Method pattern to define a complete set of search parameter parsing templates for a search system, and customize different query requirements simply by adding configurations. However, do not let design patterns drive programming. Consider the state machine of a transaction system as an example. A state machine is like a lamp with an On/Off switch but is more complex in transaction scenarios. The following state transition model exists in the takeout transaction scenario:

Nested switch and case statements can implement this finite state machine. The simplified sample code is as follows.

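public class Order {
    // States
    public static final int ACCEPT = 5;
    public static final int SETTLED = 9;
    // ... other states elided

    // Events
    public static final int ARRIVED = 1; // the order has been delivered

    // Current state of the order
    private int state;

    public void event(int event) {
        switch (state) {
            case ACCEPT:
                switch (event) {
                    case ARRIVED:
                        state = SETTLED;
                        // TODO: perform the action for this transition
                        break;
                    // ... other events elided
                    default:
                        break;
                }
                break;
            // ... other states elided
            default:
                break;
        }
    }
}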

The sample code seems acceptable because the process is simplified. However, for a complex state machine that manages the order status, the switch and case statements may expand without limit to make the code difficult to read. Another problem is that the logic and action of the state machine are not split. Design Patterns proposes that a State pattern be implemented as follows.

The State pattern splits the action and logic of the state machine. As the number of states increases, new State classes complicate the system. OCP is not effectively supported. New classes cause changes in the state transition class, and the logic of the state machine is hidden in discrete code.
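
As a rough illustration of the trade-off just described, here is a minimal sketch of the State pattern for the ACCEPT to SETTLED transition. The class and attribute names (OrderState, Accepted, Settled, settled_recorded) are hypothetical and chosen only for this example.

```python
from abc import ABC, abstractmethod


class OrderState(ABC):
    """One subclass per order state; each handles the events it allows."""

    @abstractmethod
    def arrived(self, order: "Order") -> None:
        """Handle the ARRIVED event for an order in this state."""


class Accepted(OrderState):
    def arrived(self, order: "Order") -> None:
        order.settled_recorded = True  # the action for this transition
        order.state = Settled()        # the transition itself lives in this class


class Settled(OrderState):
    def arrived(self, order: "Order") -> None:
        raise ValueError("order is already settled")


class Order:
    def __init__(self) -> None:
        self.state: OrderState = Accepted()
        self.settled_recorded = False

    def arrived(self) -> None:
        # The order delegates the event to its current state object.
        self.state.arrived(self)


order = Order()
order.arrived()
print(type(order.state).__name__)
```

Note that the rule “ACCEPT plus ARRIVED leads to SETTLED” is now buried inside the Accepted class. With many states, the overall transition logic can only be reconstructed by reading every state class, which is the scattering problem mentioned above.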

The old transaction system implemented its state machine by parsing state transition tables. The simplified sample code is as follows.

# Complete the order
add_transition(trigger=ARRIVED,
               src=ACCEPT,
               dest=SETTLED,
               on_start=_set_order_settled_at,
               set_state=_set_state_with_record,  # change the state
               on_end=_push_to_transcore)
...
# Engine
def event_fire(event, current_state):
    for transition in transitions:
        # Match on the source state and the triggering event
        if transition.src == current_state and transition.trigger == event:
            transition.on_start()
            current_state = transition.dest
            transition.on_end()
            break  # at most one transition fires per event
    return current_state

The sample code is readable. The state logic is concentrated in one place, is highly scalable, and is not coupled with the actions. The only disadvantage is the time spent traversing the table, which can be reduced by using a dictionary. Overall, the code has more advantages than disadvantages.
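
The dictionary optimization mentioned above could look like the following sketch. It is an assumption about one possible shape, not the old system’s code: the table is keyed by the pair (source state, trigger), the state values are borrowed from the Java sample, and the callbacks are stand-ins that only append to a list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

ACCEPT, SETTLED = 5, 9  # states (values taken from the Java sample)
ARRIVED = 1             # event


@dataclass(frozen=True)
class Transition:
    dest: int
    on_start: Callable[[], None]
    on_end: Callable[[], None]


# Transition table keyed by (source state, trigger) for constant-time lookup.
transitions: Dict[Tuple[int, int], Transition] = {}


def add_transition(trigger: int, src: int, dest: int,
                   on_start: Callable[[], None],
                   on_end: Callable[[], None]) -> None:
    transitions[(src, trigger)] = Transition(dest, on_start, on_end)


def event_fire(event: int, current_state: int) -> int:
    """Fire an event and return the new state."""
    transition = transitions.get((current_state, event))
    if transition is None:
        raise ValueError(f"no transition from state {current_state} on event {event}")
    transition.on_start()
    transition.on_end()
    return transition.dest


# Stand-in actions that only record that they ran.
log: List[str] = []
add_transition(ARRIVED, ACCEPT, SETTLED,
               on_start=lambda: log.append("set settled_at"),
               on_end=lambda: log.append("push downstream"))

new_state = event_fire(ARRIVED, ACCEPT)
print(new_state, log)
```

The table stays declarative and readable, while the lookup no longer scans every registered transition.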

To cope with business growth, the transaction system must support multiple state machines. This requires the creation of multiple transition tables and business-based scaling and customization, which may complicate coding. To solve this problem, we used level-one orchestration and a process engine during reconstruction. I would like to emphasize our second decision: analyze code problems based on design principles and solve the problems with an appropriate design pattern. I think it is important that you do not program for the sake of design patterns, for example by replacing a global variable with the Singleton pattern just to use a pattern.

Rich Domain Meanings

It is difficult to explain the meaning of “beautiful” without mentioning things with this quality.

So far, I have discussed the policies used to solve static problems. Now I want to talk about the solutions to dynamic problems. The definition of “stability” is unrelated to the frequency of change, but related to the cost of change. For example, a leaf cannot be considered stable even when it remains motionless on a windless day because the leaf will shake with the slightest breeze. We need to write code clearly and appropriately not only to meet current requirements but also to adapt to changes.

To formulate a design oriented toward business changes, we need to understand the core problems of the business and then divide these problems into different domains. Domain-driven design (DDD) is an approach that has proven effective. Our third decision concerned how to use DDD to guide development. I am still in the elementary phase of learning DDD, so I can only give my current insights.

Ubiquitous Language

A well-designed architecture has another major behavior-related impact on systems: it clearly and explicitly reflects the purpose of the system design. Any segment of code should show at first glance what it is for, such as implementing a transaction system application. Our code must be consistent with our business logic. Which of the following two methods of package classification is more understandable?

We can discover the ubiquitous language of a domain to understand the meaning of the domain and then deal with necessary changes. This process depends on many objective conditions, such as whether the team has an expert in this domain. If these objective conditions are unfulfilled, we can find a solution internally. I once saw that a friend of mine, a programmer at DXY, an online professional community for the pharmaceutical and life science sectors, had bought many medical books. I was sure that he had become a believer in DDD.

We visualized domain elements during reconstruction in the belief that the source code is the design. Where a concept in the system domain matched a product concept, we marked it with an agreed-upon annotation. During compilation, we scanned the code, collected the annotated elements, and sent them to the frontend for drawing.
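
A rough sketch of this idea is shown below. The reconstruction collected annotations at compile time; this Python version swaps in a decorator that registers classes when the module is loaded, so it illustrates the approach rather than the actual tooling. The names domain_concept, DOMAIN_REGISTRY, and export_domain_model, as well as the concept descriptions, are hypothetical.

```python
import json
from typing import Dict, List

# Collected domain elements, in declaration order.
DOMAIN_REGISTRY: List[Dict[str, str]] = []


def domain_concept(name: str, description: str = ""):
    """Mark a class as a domain concept agreed upon with the product team."""

    def decorator(cls):
        DOMAIN_REGISTRY.append({
            "name": name,
            "class": cls.__qualname__,
            "description": description,
        })
        return cls  # the class itself is left unchanged

    return decorator


@domain_concept("Order", "A purchase placed by a user with one merchant")
class Order:
    pass


@domain_concept("Rating", "A user's rating of a product or a courier")
class Rating:
    pass


def export_domain_model() -> str:
    """Serialize the collected concepts, for example for a frontend to draw."""
    return json.dumps(DOMAIN_REGISTRY, indent=2)


print(export_domain_model())
```

Because the diagram is generated from the code, it cannot drift away from what the system actually contains.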

Let’s return to the rating domain model. After repeated communication with the product team, we realized that the product team did not expect such a wide variety of rating categories. Only products and takeout delivery couriers needed to be rated. From the perspective of the domain model, the previous design was more oriented toward scenarios than behaviors. The appropriate domain model is as follows.

Bounded Context

Bounded contexts are common in development. Let’s look at a user system as an example. What a user is depends on the perspective. From the user’s own perspective, a user can log on, log out, and change a nickname. From other users’ perspective, a user is only a displayed nickname. From the background administrator’s perspective, a user can be deregistered or forcibly logged out. When this is the case, we need to define a scope within which the user has a single meaning, which is equivalent to the bounded context in DDD.

A bounded context can effectively isolate the different connotations of the same thing. Access to the object model inside a context goes through strict specifications, which protects the consistency of business abstraction behaviors. In the transaction domain, Ele.me took the initiative to support SVIP. The settlement for SVIP must be implemented by the transaction system. We made it less complex by analyzing and dividing problems into the member domain and transaction domain. We developed a mapping to protect the internal business logic of transactions when SVIP cards were introduced into the transaction domain.
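
The mapping between the two domains might be sketched as follows. Everything here is hypothetical, including the types SvipCard and BuyerBenefit, the function to_buyer_benefit, and the discount amount, which is an illustrative number rather than a real business rule. The sketch only shows the transaction domain receiving its own concept instead of the member domain’s object.

```python
from dataclasses import dataclass
from decimal import Decimal


# Member domain (hypothetical model of an SVIP card).
@dataclass(frozen=True)
class SvipCard:
    member_id: str
    level: str
    active: bool


# Transaction domain (hypothetical): it knows benefits, not membership cards.
@dataclass(frozen=True)
class BuyerBenefit:
    buyer_id: str
    delivery_fee_discount: Decimal


def to_buyer_benefit(card: SvipCard) -> BuyerBenefit:
    """Translate a member-domain card into a transaction-domain benefit.

    The 1.00 discount is an illustrative assumption, not a real rule.
    """
    discount = Decimal("1.00") if card.active else Decimal("0.00")
    return BuyerBenefit(buyer_id=card.member_id, delivery_fee_discount=discount)


benefit = to_buyer_benefit(SvipCard(member_id="m1", level="gold", active=True))
print(benefit)
```

If the member domain later changes how cards are modeled, only this translation function needs to change; the transaction logic keeps working with BuyerBenefit.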

Splitting

After coding is completed, the increasing number of programs requires more people to participate in coding. To facilitate collaboration, we need to divide the code into groups that are easy to maintain by individuals or teams. The code can be divided into the following components based on the software change frequency:

Extension: The extension package stores business customizations. It is object-oriented and switches the logic of a program by using polymorphism and plugins. The history of software development technologies is a process of making plugins easier to add in order to build a scalable and maintainable system architecture.
Domain: The domain package stores the core business logic expressed in the ubiquitous language of a domain. The domain package is the most stable of all packages.
Business: The business package stores specific business logic. If the domain package provides a people.run() method, the business package uses this method to deliver takeout or to work out.
Infra: The infrastructure package stores dependencies on databases and middleware, which are details that go beyond the business logic.

Now, let’s turn to dependency layering. Martin Fowler provides a classic layered encapsulation model. A simplified order module is used as an example.

If you do not want to perform conversions between different types or observe strict dependency layering, and you think that some queries (Query, where Query != Read) can bypass the domain layer, then you can use the CQRS model.

Ideally, the domain layer, as the core business logic, does not depend on the details of the infrastructure. This makes the code more predictable.
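
One way to picture this layering is the sketch below, in which the domain layer declares a repository interface and the infrastructure layer implements it. All names (Order, OrderRepository, InMemoryOrderRepository, pay_order) are hypothetical, and the in-memory dictionary stands in for a real database.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


# Domain layer: business rules only, no database or middleware imports.
@dataclass
class Order:
    order_id: str
    paid: bool = False

    def pay(self) -> None:
        if self.paid:
            raise ValueError("order already paid")
        self.paid = True


class OrderRepository(ABC):
    """Interface declared by the domain layer, implemented by infrastructure."""

    @abstractmethod
    def get(self, order_id: str) -> Optional[Order]:
        """Return the order, or None if it does not exist."""

    @abstractmethod
    def save(self, order: Order) -> None:
        """Persist the order."""


# Infrastructure layer: a storage detail (an in-memory stand-in here).
class InMemoryOrderRepository(OrderRepository):
    def __init__(self) -> None:
        self._rows: Dict[str, Order] = {}

    def get(self, order_id: str) -> Optional[Order]:
        return self._rows.get(order_id)

    def save(self, order: Order) -> None:
        self._rows[order.order_id] = order


# Business layer: orchestrates a use case through the abstraction.
def pay_order(repo: OrderRepository, order_id: str) -> Order:
    order = repo.get(order_id)
    if order is None:
        raise KeyError(order_id)
    order.pay()
    repo.save(order)
    return order


repo = InMemoryOrderRepository()
repo.save(Order("o1"))
print(pay_order(repo, "o1"))
```

Because Order and pay_order never import a storage library, they can be tested with the in-memory repository and keep behaving the same way when the storage implementation changes.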

After a single application is divided into individual components, we focus on four core services at the upper level. Booking is divided into Cart, Buy, and Calculate. Eos is divided into Process, Query, and Timeout. The functions of Blink related to merchant orders are divided into Process and Query. The functions of Blink related to logistics and delivery are grouped into Delivery. The following figure shows how the core transaction services are split.

Our fourth decision concerned the split method. It is actually unnecessary to rank these four decisions because they all serve the goal of flexibility. From selecting programming paradigms to composing components and layering, we adopted some doctrines and avoided others. In a sense, a business architecture limits some behaviors of a programmer in a domain so that the programmer can write code in the expected way. This ensures system flexibility and reliability.

“No Silver Bullet”

The first core value of the Agile Manifesto is to value individuals and interactions over processes and tools.

The current status of the system architecture is not important because it will be split in another way in the future. It is important for us to know that there is no silver bullet when we decide how to build a flexible transaction system.

The current system still contains many unsolved problems. For example, after adding a field to the interface of a service, we have to make corresponding changes across the upstream and downstream services. More embarrassingly, we intended to split services for decoupling, only to introduce service release dependencies as the end result. System evolution is an ongoing process. Talent is the key to success because individuals and interactions are more valuable than processes and tools.

The past two years involved constant thinking and practice. The members of the transaction team often debate a range of problems, from changes to an interface field to domain boundary, and engage in extensive discussions to determine an appropriate technical solution. This reminds me of the Metaphysics of Quality explored in the book Zen and the Art of Motorcycle Maintenance: An Inquiry into Values. I came across a comment on the experience of a programmer. A programmer may write an excellent segment of code, but the code has always been there and the programmer just discovered it.

References

A Philosophy of Software Design by John Ousterhout
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values by Robert M. Pirsig
Domain-Driven Design by Eric Evans
Agile Software Development: Principles, Patterns, and Practices by Robert Cecil Martin
Clean Architecture by Robert Cecil Martin
Team Geek by Brian W. Fitzpatrick