By Cheng Zhe, nicknamed Lanhao at Alibaba.
In this article, we look at experiments conducted at Alibaba to increase the number of online users on its second-hand buy-and-sell platform, Xianyu, which literally translates to “Idle Fish.” Before we get ahead of ourselves, let’s discuss some of the dynamics of the Xianyu platform:
- As a buy-and-sell platform, sellers on the Xianyu platform generally are individual consumer sellers, rather than commercial store sellers. Therefore, it is difficult to organize sellers into marketing promotions in a unified manner.
- At present, the number of daily active users, or DAU, of Xianyu exceeds 20 million, so appropriately supporting such a large number of users is a major test for the operation personnel of the platform.
In early 2019, the Xianyu team at Alibaba conducted multiple experiments on user growth, including the following two experiments, shown in the figures below:
The team conducted the preceding two experiments with the aim of retaining users on Xianyu for a longer time. The longer users spend browsing Xianyu, the more likely they are to discover interesting content, including products and posts in our various curated item groupings, which are referred to in Chinese as “fish ponds.” As such, users may be attracted to return to Xianyu at some later time, and Xianyu can achieve a greater level of user growth. Most of the experiments we conducted produced good business results. However, we also found two problems with these experiments:
- Long research and development period: In the beginning, our team used the fastest implementation solution in order to quickly verify the effectiveness of rule policies. Rather than building a large, comprehensive design, we met each requirement by writing code case by case. As a result, the period from development to launch could be as long as three weeks, mainly because that is the release window for new client versions.
- Low operation efficiency: Due to a slow launch, it can be a long time before we can analyze performance after obtaining business data, and it can take even longer to make adjustments based on the data. Given this, only a few rule policies can be implemented in a year.
One Solution, a Rule Engine Based on Event Streams
To solve these issues, our team turned to an engineering solution: a rule engine based on event streams. We first implemented a layer of business abstraction. The operation personnel analyzed and classified various user behaviors to derive common, specific rules, and each rule was then applied to users in real time as an intervention.
We engineered this business abstraction layer to improve both R&D efficiency and operation efficiency. We treated user behavior as a sequential stream of behavior events. A complete rule can then be defined with a simple event description in a domain-specific language (DSL), together with input and output definitions.
Let’s look at the second experiment in user growth as an example. This example can be briefly expressed in DSL as shown in the following figure.
Limitations of the Rule Engine
In the C2C security service, a rule is likewise an abstraction obtained from a series of behaviors.
Despite our best efforts, these security rules could not be used in the rule engine. Consider this example rule, for instance. If a user is blacklisted twice within one minute, this user will be marked with a high-risk tag. When the first blacklisting event occurs, the rule engine matches the event. Then when the second blacklisting event occurs, the rule engine also matches this event. As such, the rule should be met from the perspective of the rule engine and subsequent operations can be performed. However, one important aspect is that the blacklistings should be performed by two different users to prevent one user from maliciously blacklisting another with multiple devices.
However, this is difficult for the rule engine to discover, as the rule engine only knows that two blacklisting events are matched and the rule is met. This is because the rule engine can match only stateless events and cannot trace back the details of these events for further aggregate computing purposes.
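To make the limitation concrete, here is a minimal Python sketch of a matcher that only counts matching events and keeps no event details. It is an illustration under assumptions, not the Xianyu rule engine: the `Event` fields and the `StatelessCountRule` class are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event; field names are assumptions."""
    name: str    # event type, e.g. "blacklist"
    target: str  # user who was blacklisted
    actor: str   # user who performed the blacklisting
    ts: float    # event time in seconds


class StatelessCountRule:
    """Fires once `needed` matching events were seen for a target.

    Only a per-target match count is kept. The matched events themselves
    are discarded, so the rule cannot inspect who triggered them or when.
    """

    def __init__(self, event_name: str, needed: int) -> None:
        self.event_name = event_name
        self.needed = needed
        self.hits: dict[str, int] = {}

    def on_event(self, event: Event) -> bool:
        if event.name != self.event_name:
            return False
        self.hits[event.target] = self.hits.get(event.target, 0) + 1
        return self.hits[event.target] >= self.needed


rule = StatelessCountRule("blacklist", needed=2)
first = rule.on_event(Event("blacklist", "bob", "mallory", 0.0))
# Same actor again: the rule still fires, which is the false positive
# described above.
second = rule.on_event(Event("blacklist", "bob", "mallory", 5.0))
```

Because the matcher never looks at `actor` or `ts`, one user blacklisting another twice is indistinguishable from two independent blacklistings.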
A New Proposed Solution
Based on the limitations of the rule engine, we re-analyzed and organized our business scenarios. Then we designed a new solution and defined a new DSL based on well-known general solutions in the industry. The syntax is SQL-like, and we chose it mainly for the following reasons:
- SQL already has complete, well-defined semantics, so no additional syntax design is required.
- SQL is a simple language that can be easily learned.
- The operation personnel of Xianyu are proficient in SQL, which can improve the launch efficiency.
Compared with the previous rule engine, the new DSL solution has the following strengths:
- Addition of conditional expressions: Richer and more complex event descriptions are supported, which covers more business scenarios.
- Addition of time expressions: The WITHIN keyword is used to define a time window. When we use keywords such as HAVING, aggregate computing can be performed for events in the time window. Our new solution can solve the preceding problem of rule description for the C2C business.
- Enhanced scalability: Our new solution complies with industry standards and is unrelated to the input and output of specific businesses, which facilitates promotion.
The example below shows how our new solution resolves the problem we discussed in the previous section.
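The same idea can also be sketched outside the DSL. The following Python is a rough analogue of a time window plus an aggregate condition, not the Xianyu DSL or its Blink implementation: the class name, parameters, and the assumption that events arrive in timestamp order are all introduced for illustration.

```python
from collections import defaultdict, deque


class WindowedDistinctRule:
    """Fires when a target is blacklisted by at least `min_distinct_actors`
    different users within `window_seconds`.

    The window plays the role of a WITHIN clause and the distinct-actor
    check plays the role of a HAVING aggregate. Events are assumed to
    arrive in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds: float, min_distinct_actors: int) -> None:
        self.window_seconds = window_seconds
        self.min_distinct_actors = min_distinct_actors
        # target -> deque of (timestamp, actor) still inside the window
        self.recent: dict[str, deque] = defaultdict(deque)

    def on_event(self, target: str, actor: str, ts: float) -> bool:
        window = self.recent[target]
        window.append((ts, actor))
        # Evict events that have fallen out of the time window.
        while window and ts - window[0][0] > self.window_seconds:
            window.popleft()
        distinct_actors = {a for _, a in window}
        return len(distinct_actors) >= self.min_distinct_actors


rule = WindowedDistinctRule(window_seconds=60, min_distinct_actors=2)
rule.on_event("bob", "mallory", 0)               # one actor so far
same_actor = rule.on_event("bob", "mallory", 5)  # still one distinct actor
two_actors = rule.on_event("bob", "alice", 30)   # second distinct actor in window
```

Keeping the events of the window, rather than only a match count, is what allows the aggregate over distinct actors.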
Overall Layered Architecture
To facilitate engineering, we developed the following overall layered architecture around the DSL, which is based on the Event Programming Language (EPL). To rapidly achieve minimum closed-loop verification, we selected Blink as the cloud parsing and computing engine. Blink is an enhanced version of Apache Flink, optimized and upgraded by Alibaba.
The layered architecture comprises the following layers from the top down:
- Business application: This layer is the business end of the entire system and has been implemented in multiple business scenarios.
- Task delivery: This layer provides DSL statement and delivery capabilities for the business application layer and can be used to select target users and associate with the user outreach module.
- User outreach: This module is used to receive computing results from the EPL engine and implement associated actions. This module can also independently provide services for business applications. Each business application can have its own logic and perform user outreach by using the user outreach module.
- EPL engine: Currently, the EPL engine is already able to implement cloud parsing and computing. It can receive the DSL statements in task delivery and then parse and run the DSL statements on Blink.
- Event collection: This module collects behavior events from the server logs and behavior tracking and then outputs the events to the EPL engine in a normalized manner.
The event collection module intercepts all network requests and behavior-tracking data based on aspects and then records the data in a server log stream. In addition, the event collection module cleanses the event stream by using a fact task to obtain desired events based on the format defined earlier. After that, the event collection module outputs the cleansed logs to another log stream for reading by the EPL engine.
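The cleansing step can be pictured as a small filter over raw log lines. The sketch below is an assumption-heavy illustration, not the actual fact task: it assumes JSON log lines and a hypothetical normalized format with `user_id`, `event`, and `ts` fields.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class BehaviorEvent:
    """Hypothetical normalized event format; field names are assumptions."""
    user_id: str
    event: str
    ts: float


def normalize(raw_line: str) -> Optional[BehaviorEvent]:
    """Parse one raw log line, returning None if it is malformed."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    try:
        return BehaviorEvent(
            user_id=str(record["user_id"]),
            event=str(record["event"]),
            ts=float(record["ts"]),
        )
    except (KeyError, TypeError, ValueError):
        return None


def cleanse(raw_lines: Iterable[str], wanted: Set[str]) -> Iterator[BehaviorEvent]:
    """Yield only well-formed events whose type is in `wanted`."""
    for line in raw_lines:
        event = normalize(line)
        if event is not None and event.event in wanted:
            yield event


raw_log = [
    '{"user_id": 1, "event": "view_item", "ts": 10}',
    "not json at all",
    '{"user_id": 2, "event": "login"}',
    '{"user_id": 3, "event": "search", "ts": 11}',
]
cleansed = list(cleanse(raw_log, wanted={"view_item"}))
```

In this sketch, the cleansed events would then be written to the second log stream that the EPL engine reads.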
Because we adopted an SQL-like syntax and because Apache Calcite is a common SQL parsing tool in the industry, we decided to use Calcite and customized a Calcite parser to perform parsing. A single-event DSL statement is translated into Flink SQL. For a multi-event DSL statement, the Blink API is called directly after parsing.
After generating computing results, the EPL engine outputs the results to the user outreach module. The user outreach module first selects an action route to determine which action to respond with. Then, the user outreach module delivers the action to a client over a persistent connection with the client. After receiving the action, the client checks whether the current user behavior permits the display of the action. If so, the client directly implements the action and exposes it to the user. The user may perform a behavior after receiving the response. That behavior is fed back to the user outreach module and may affect subsequent action routing.
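One way to picture this loop is a router that prefers the action users have responded to most often. The sketch below is purely illustrative: the article does not describe how routes are scored, so the smoothed acceptance rate, the `OutreachRouter` class, the action names, and the `client_should_display` check are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class Route:
    action: str
    shown: int = 0
    accepted: int = 0

    def score(self) -> float:
        # Smoothed acceptance rate, so untried routes start at 0.5.
        return (self.accepted + 1) / (self.shown + 2)


class OutreachRouter:
    """Selects an action and updates route statistics from user feedback."""

    def __init__(self, actions: Iterable[str]) -> None:
        self.routes = {action: Route(action) for action in actions}

    def select(self) -> str:
        # Ties resolve to the action that was registered first.
        return max(self.routes.values(), key=lambda r: r.score()).action

    def record(self, action: str, accepted: bool) -> None:
        route = self.routes[action]
        route.shown += 1
        if accepted:
            route.accepted += 1


def client_should_display(current_page: str, blocked_pages: Set[str]) -> bool:
    """Client-side gate: skip the action on pages where it would interrupt."""
    return current_page not in blocked_pages


router = OutreachRouter(["popup", "push"])
first_choice = router.select()
router.record(first_choice, accepted=False)  # user dismissed the action
second_choice = router.select()
```

A dismissed action lowers its route's score, so the next selection moves to the other route. A production system would likely need exploration and per-user context, which this sketch omits.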
Since the new solution was launched, we have implemented it in an increasing number of business scenarios. Here are two examples.
From the above example of Fish Ponds, we can see that this solution is somewhat like algorithm recommendation. In the preceding rental example, the rule is too complex and it is difficult to express the rule in DSL. Therefore, the rule is configured to collect only four browses of different houses for rental. After the rule is triggered, the collected data is provided for the business team that developed the house rental application. This is also the boundary we found during the implementation.
This complete solution can significantly improve R&D efficiency. Generally, the original R&D process can be completed in four business days by writing code case by case. In extreme situations, if the client version needs to be updated, the process may take two to three weeks. However, when SQL is used, the R&D process only takes 0.5 business days. In addition, this solution has the following advantages:
- High performance: The end-to-end computing process takes five seconds on average.
- High reliability: Based on the high reliability of Blink, this solution supported hundreds of millions of operations per second during the Double 11 Shopping Festival.
By implementing this solution in multiple businesses, we found its appropriate boundaries. This solution is applicable to businesses that:
- Have high real-time requirements.
- Have rules formulated by a strong operations team.
- Can be expressed in SQL.
The current solution has the following disadvantages:
- Xianyu’s business has maintained high growth for successive years. In the future, Xianyu may face user traffic that is three times our current level. If all computing is still completed on the cloud, this will pose a significant challenge to the computing power of the cloud.
- The current design does not include algorithm access, but is simply used for selecting target users. To more accurately deliver rules and improve the effectiveness of the rules on users, we need to combine the solution with algorithms.
Therefore, in the future, we will focus on exploring real-time computing capabilities on the client and the integration of algorithm capabilities.