I decided to write this article for the following reasons:
First, I have worked in the transaction field for four years and have many stories to tell. I want to keep a record of these stories.
Second, I find that many people only know that the transaction system is distributed, adopts a service-oriented architecture (SOA), and deals with millions or even tens of millions of data entries per day. They find it difficult to understand the thinking and causes behind the transaction system. Drawing on my years of experience, I want to enable everyone to easily understand the evolution process of the transaction system.
Third, I am writing this article to present the facts to you without irrelevant details, rather than giving a summary or describing a methodology. I will not avoid discussing the flaws of the system because it is not perfect. We have had to make many choices. Looking back now, I think some were lucky and some may have been mistakes.
This article is intended to present the whole process to you by describing some development stories and our thinking. I will try to provide a full description, rather than an overview.
Let’s start from the beginning in 2012.
The Origins of Ele.me
Before talking about the order system, let’s go back into “ancient history” and talk about the origins of Ele.me. In the earliest days, there was a system written in Python named Zeus. Zeus contained several core modules of Ele.me, such as the order module, the user module, and the restaurant module. All these modules lived in a single code repository and were deployed on the same machine. Besides Zeus, there were two other core systems: Ele.me for PC, which old hands call the “primary station”, and NaposPC for merchants. These systems communicated with each other over the Thrift protocol. Apart from this main path, all the miscellaneous internal features were provided by a system named Walle, which was written in PHP.
The following figure shows the Zeus system from that time.
Looking at the commit history in Git, I can see that the first commit for the order module was made by Yu Lixin on September 1, 2012. The commit message was “add eos service for zeus. currently only defined a simple get api”. EOS refers to the order system and stands for Ele.me Order Service. This name is still in use and has become a part of the order system for positive transactions today. In fact, for a period of time this name was basically synonymous with the order team.
Subsequently, Zeus was restructured to some extent and renamed Zeus2. I don’t know the exact date this happened.
In October 2014, I came to Ele.me for an interview. Mr. Lei, the head of the merchant client team, was the interviewer. On December 1, I officially joined Ele.me. An HR assistant led me, a complete newbie, to Mr. Lei. Mr. Lei took me to JN and said, “This is the intern.” Then, he turned around and disappeared. I later learned the story behind this: After the interview, Mr. Lei told JN that he had just interviewed an intern who was just barely qualified. It happened that the merchant client team was planning to transition to Java and JN needed an engineer proficient in Python. Then he assigned me to the merchant client team and JN invited him to dinner in return.
During the few months from December 2014 to April 2015, I worked with other employees to migrate the backend of an older big data system to Walis. After my mentor was transferred to the CI team, I took over the migration of Walis from standalone applications to distributed applications.
Creation of the Order Team
For me, it all comes down to luck.
In late April 2015, JN, my supervisor, suddenly came to me one day. He looked very excited and told me that the company planned to establish an order team. He was in charge of the order team and I was the only team member he selected. He told me why he had picked me, and it was clear that I had tricked him into thinking I was much more capable than I actually was.
As a technical engineer, I was excited. Previously, I had only heard of well-known terms such as high concurrency, high traffic, and “distributed,” but I had never expected that I would be able to work on such a system so soon. In addition, the system we previously developed was so marginal that it barely received any requests during the day. In the evenings, the Business Development (BD) team returned to the company after visiting merchants, so traffic peaks happened to occur in the evenings. However, even in the evenings, no more than a dozen or so requests were received on a single key port. The system could take two hours to get its first user and sometimes did not have any requests for a whole day. At that time, we were happy to leave work before 7 p.m. When we first launched the system, we were sternly informed that we might have to work overtime until 8:30 p.m.
The reason why JN was chosen as the leader of the order team was that he was very familiar with all of the company’s systems and businesses and was qualified to develop an order system, even though he started as a frontend engineer and worked on a marginal backend system.
At first, JN and I were the only members of the team. I still hadn’t graduated at that time. I was excited but mostly uneasy.
On May 12, 2015, the order team was officially established. On that day, ZH from a neighboring team joined us. ZH worked on PHP and was originally assigned to work on the Walle system. In addition, our department supervisor came to us and said a nice young Java engineer had happened to join our company on that very day and we could ask him to join the order team. As a result, on the day the team was created, the number of team members doubled to four.
The first task we gave ourselves was to read the code, sort out the businesses, and draw up diagrams. We applied to the CTO for a one-month buffer period, during which we did not have to accept any business requests.
In addition, we invited three engineers from the Zeus system to give us lectures: one responsible for frontend order processes, one responsible for the Python framework, and one responsible for application O&M. In fact, each of these lectures lasted only about an hour. In that month, we read tens of thousands of lines of Python code line by line, with no product documentation and very few comments. Each person was responsible for explaining a part of the code. Finally, I drew a big diagram to present the overall lifecycle, key operations, and key business logic of the order system. We went on to use this diagram for more than a year.
In fact, the product R&D department of Ele.me had grown to hundreds of engineers by the middle of the year. Xuefeng, then the new CTO, joined Ele.me at the beginning of the year. Infrastructure construction started in the second half of 2015 and the entire order system was rapidly built in 2016.
For the order team, this was a period of both chaos and rapid growth. We joked that we were changing tires while driving a sports car.
Decoupling the Zeus System
The first major task that closely involved the order team was decoupling the Zeus system. This task started around June. HC, one of the technical experts I respected most, was in charge of the Python framework. As chief architect, he recently presented the overall technical architecture of Ele.me at the QCon conference in the United States. As I mentioned previously, the Zeus system was a giant standalone application. To allow every part to develop rapidly in the future and reduce the coupling and mutual influence of different components, the company started a Zeus decoupling project. In short, the project aimed to split up the Zeus system.
After more than a month of intensive meetings, the splitting plan was finalized. The discussions were fierce, especially when it came to naming owners of different services after decoupling and drawing the boundaries between services. I was still not qualified to take part in these discussions.
In the end, it was decided that we would split the Zeus system into the following main services:
zeus.eos => Order service
zeus.eus => User service
zeus.ers => Merchant service
zeus.eps => Marketing service (new product)
zeus.sms => SMS service
After the Zeus system was split into services, each service was further restructured and split up. For example, biz.booking was separated from zeus.eos, taking with it the order placement and shopping cart capabilities. In addition, biz.ugc was separated from zeus.eos, taking order evaluation capabilities.
The splitting work can be divided into the following major phases:
- In July, we created a shared code repository and divided the code into modules that could run independently. Specifically, the full Zeus codebase was still packaged and deployed to each server, but each machine started only its designated module and opened only that module's port.
- In August, we added a proxy to the to-be-migrated API of the original service. The proxy could be connected to the API of the new service. The service registry controlled the migration traffic.
- From August to early September, we split and transformed all the scripts and modules.
- In September, we made the code repositories independent. With Git’s powerful filter-branch command, we extracted each module's code and change history from the original code repository. However, the system still used hybrid deployment. In the release tool, releasing an independent application actually meant replacing a directory under the large Zeus project.
- In September, we made configurations independent. Originally, configurations were pushed to the servers by SaltStack and shared by multiple applications on the same server. After optimization, each application could obtain its own configuration through the configuration delivery capability of the service registry. In this phase, the system basically transitioned to soft load balancing.
- In March 2016, we achieved independent physical deployment. This was part of phase 2 of the decoupling task.
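To make the August proxy step more concrete, here is a minimal Python sketch of the idea. It is not the actual Zeus code: ServiceRegistry, migration_proxy, the ratio-based switch, and the get_order functions are hypothetical names chosen for illustration, and the real services talked to each other over Thrift rather than through in-process function calls. The sketch only shows how a proxy in front of an old API could consult a registry to decide how much traffic goes to the new service.

```python
import random
from typing import Any, Callable, Dict


class ServiceRegistry:
    """Hypothetical stand-in for the service registry.

    For each API name it stores the fraction of calls (0.0 to 1.0)
    that should be routed to the new, split-out service.
    """

    def __init__(self) -> None:
        self._ratios: Dict[str, float] = {}

    def set_migration_ratio(self, api_name: str, ratio: float) -> None:
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("ratio must be between 0.0 and 1.0")
        self._ratios[api_name] = ratio

    def migration_ratio(self, api_name: str) -> float:
        # APIs with no entry stay entirely on the legacy implementation.
        return self._ratios.get(api_name, 0.0)


def migration_proxy(
    api_name: str,
    registry: ServiceRegistry,
    legacy_impl: Callable[..., Any],
    new_impl: Callable[..., Any],
    rng: Callable[[], float] = random.random,
) -> Callable[..., Any]:
    """Wrap a legacy API so the registry decides which side serves each call."""

    def proxied(*args: Any, **kwargs: Any) -> Any:
        # The ratio is read on every call, so changing it in the registry
        # shifts traffic without redeploying the caller.
        if rng() < registry.migration_ratio(api_name):
            return new_impl(*args, **kwargs)
        return legacy_impl(*args, **kwargs)

    return proxied


# Illustrative implementations (assumed, not from the original system).
def legacy_get_order(order_id: int) -> dict:
    return {"id": order_id, "served_by": "zeus"}


def new_get_order(order_id: int) -> dict:
    return {"id": order_id, "served_by": "zeus.eos"}


registry = ServiceRegistry()
get_order = migration_proxy("get_order", registry, legacy_get_order, new_get_order)
```

With this shape, a rollout could start at a ratio of 0.0, move up in small steps by calling registry.set_migration_ratio, and be set back to 0.0 immediately if the new service misbehaves. Reading the ratio per call is the design choice that makes the switch instant in both directions.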
This splitting phase also produced another component: zeus_core, a Python SOA framework. Specifically, zeus_core was separated from the system around April, before the business services.
In the end, phase 1 of the decoupling task lasted about half a year. During this period, we did not cause any accidents and almost no emergencies occurred. Looking back now, I would say that this success was due to the efforts and good habits of the engineers and O&M personnel. We did not have any advanced skills or tools and did not have a full-time test developer.