How to Sustain a Growing Platform and Gain Online Users

  • At present, the number of daily active users, or DAU, of Xianyu exceeds 20 million, so appropriately supporting such a large number of users is a major test for the operation personnel of the platform.
  • Low operation efficiency: Because launches are slow, a long time passes before business data can be collected and its performance analyzed, and adjusting based on that data takes even longer. As a result, only a few rule policies can be implemented in a year.

One Solution, a Rule Engine Based on Event Streams

To solve these issues, our team turned to an engineering solution: a rule engine based on event streams. We first implemented a layer of business abstraction. The operation personnel analyzed and classified various user behaviors to derive common, specific rules, and then applied those rules to users in real time as interventions.

Limitations of the Rule Engine

The rule engine that we implemented handled user-growth policies well, so we quickly promoted it to other business modules within Alibaba Group. We also planned to use it in the security service integrated into the Xianyu platform. The following is a description of the consumer-to-consumer (C2C) security service. This security service aims to prevent and stop violations of Xianyu’s terms of use policy, including cutting down on inappropriate posts, as shown below.

A New Proposed Solution

Based on the limitations of the rule engine, we re-analyzed and organized our business scenarios. Then we designed a new solution and defined a new DSL based on well-known general solutions in the industry. The syntax is SQL-like, and we mainly took the following into consideration:

  • SQL is a simple language that can be easily learned.
  • The operation personnel of Xianyu are proficient in SQL, which can improve the launch efficiency.
  • Addition of time expressions: The WITHIN keyword defines a time window. Keywords such as DISTINCT, used after HAVING, perform aggregate computation over the events in that window. This lets the new solution describe the C2C business rules that the earlier rule engine could not.
  • Enhanced scalability: Our new solution complies with industry standards and is unrelated to the input and output of specific businesses, which facilitates promotion.
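
The windowed aggregation that WITHIN and DISTINCT describe can be illustrated with a minimal Python sketch. This is not the Xianyu DSL or its Blink implementation; the class name `WithinDistinctRule`, its parameters, and the event fields are assumptions made for illustration, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class WithinDistinctRule:
    """Hypothetical rule: fire when a user touches at least `threshold`
    distinct items within the last `window_s` seconds."""

    def __init__(self, window_s, threshold):
        self.window_s = window_s
        self.threshold = threshold
        # user_id -> deque of (timestamp, item_id), oldest first
        self._events = defaultdict(deque)

    def on_event(self, user_id, item_id, ts):
        """Record one event and return True if the rule matches now."""
        q = self._events[user_id]
        q.append((ts, item_id))
        # Evict events that have fallen out of the time window.
        while q and q[0][0] <= ts - self.window_s:
            q.popleft()
        # DISTINCT-style aggregation over what remains in the window.
        distinct_items = {item for _, item in q}
        return len(distinct_items) >= self.threshold


rule = WithinDistinctRule(window_s=60, threshold=3)
results = [
    rule.on_event("u1", "a", 0),
    rule.on_event("u1", "b", 10),
    rule.on_event("u1", "b", 20),  # repeat item, still 2 distinct
    rule.on_event("u1", "c", 30),  # third distinct item inside 60 s
]
```

A production engine would keep this state in the stream processor rather than in process memory; the sketch only shows why a time window plus a distinct count is enough to express "N different things within T seconds".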

Overall Layered Architecture

To facilitate engineering, we developed the following overall layered architecture around the DSL, which is written based on the Event Programming Language (EPL). To rapidly achieve minimum closed-loop verification, we selected Blink as the cloud parsing and computing engine. Blink is an enhanced version of Apache Flink, optimized and upgraded by Alibaba.

  • Task delivery: This layer provides DSL authoring and delivery capabilities for the business application layer. It can be used to select target users and associate them with the user outreach module.
  • User outreach: This module is used to receive computing results from the EPL engine and implement associated actions. This module can also independently provide services for business applications. Each business application can have its own logic and perform user outreach by using the user outreach module.
  • EPL engine: Currently, the EPL engine is already able to implement cloud parsing and computing. It can receive the DSL statements in task delivery and then parse and run the DSL statements on Blink.
  • Event collection: This module collects behavior events from the server logs and behavior tracking and then outputs the events to the EPL engine in a normalized manner.

Event Collection

The event collection module intercepts all network requests and behavior-tracking data based on aspects and then records the data in a server log stream. In addition, the event collection module cleanses the event stream by using a fact task to obtain desired events based on the format defined earlier. After that, the event collection module outputs the cleansed logs to another log stream for reading by the EPL engine.
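
As a rough illustration of the cleansing step, the sketch below turns raw log lines into normalized events and drops anything malformed. The JSON log layout, the field names (`uid`, `action`, `time`), and the `Event` type are all assumptions; the article does not specify the actual event format.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event handed to the EPL engine."""
    user_id: str
    name: str
    ts: int
    props: dict


def normalize(raw_line: str) -> Optional[Event]:
    """Parse one raw log line; return None if it is not a usable event."""
    try:
        rec = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(rec, dict):
        return None
    user_id, name, ts = rec.get("uid"), rec.get("action"), rec.get("time")
    if not user_id or not name or not isinstance(ts, int):
        return None
    # Everything except the three core fields is kept as free-form properties.
    props = {k: v for k, v in rec.items() if k not in ("uid", "action", "time")}
    return Event(str(user_id), str(name).lower(), ts, props)


def cleanse(lines: Iterable[str]) -> Iterator[Event]:
    """Yield only well-formed events, in input order."""
    for line in lines:
        event = normalize(line)
        if event is not None:
            yield event


raw = [
    '{"uid": "u1", "action": "Browse", "time": 100, "item": "i9"}',
    "not json at all",
    '{"uid": "u2"}',
]
events = list(cleanse(raw))
```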

EPL Implementation

Because we adopted an SQL-like syntax, and because Apache Calcite is a common SQL parsing tool in the industry, we customized a Calcite parser to parse the DSL. A single-event DSL statement is translated into Flink SQL. A multi-event DSL statement is parsed and then executed by calling the Blink API directly.
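
The single-event versus multi-event split can be sketched as a small planning step. The example below does not use Calcite and does not reproduce the real DSL grammar: the statement shape (`SELECT ... FROM e1, e2 WITHIN n MINUTES`), the regular expression, and the target names `flink_sql` and `multi_event_api` are hypothetical stand-ins used only to show the routing decision.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Toy grammar (an assumption, not the real DSL):
#   SELECT <cols> FROM <event>[, <event>...] [WITHIN <n> SECONDS|MINUTES|HOURS]
_RULE = re.compile(
    r"^\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<events>\w+(?:\s*,\s*\w+)*)"
    r"(?:\s+WITHIN\s+(?P<n>\d+)\s+(?P<unit>SECONDS|MINUTES|HOURS))?\s*$",
    re.IGNORECASE,
)
_UNIT_SECONDS = {"SECONDS": 1, "MINUTES": 60, "HOURS": 3600}


@dataclass
class Plan:
    target: str              # "flink_sql" or "multi_event_api" (hypothetical labels)
    events: List[str]
    window_s: Optional[int]  # None when no WITHIN clause is present


def plan(dsl: str) -> Plan:
    """Decide how a statement would be executed, based on how many events it names."""
    m = _RULE.match(dsl)
    if m is None:
        raise ValueError(f"unsupported statement: {dsl!r}")
    events = [e.strip() for e in m.group("events").split(",")]
    window_s = None
    if m.group("n"):
        window_s = int(m.group("n")) * _UNIT_SECONDS[m.group("unit").upper()]
    target = "flink_sql" if len(events) == 1 else "multi_event_api"
    return Plan(target, events, window_s)


single = plan("SELECT user_id FROM browse")
multi = plan("select user_id from browse, favorite within 10 minutes")
```

A real parser would build a syntax tree and validate columns and predicates; the point here is only that the number of event sources determines which execution path is taken.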

User Outreach

After generating computing results, the EPL engine outputs them to the user outreach module. The user outreach module first selects an action route to determine which action to respond with. It then delivers the action to the client over a persistent connection. After receiving the action, the client checks whether the user's current behavior permits the action to be displayed. If so, the client performs the action and presents it to the user. The user may then respond with a behavior of their own, which is fed back to the action route and may affect later routing.
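
The route-deliver-gate-feedback loop described above can be sketched as follows. The names `ActionRouter` and `client_can_display`, the acceptance-rate heuristic, and the idea of a set of blocked pages are assumptions for illustration; the article does not describe how routes are actually scored or how the client decides.

```python
from collections import defaultdict


class ActionRouter:
    """Hypothetical router: picks, per rule, the candidate action with the
    best observed acceptance rate, and learns from user feedback."""

    def __init__(self, candidates):
        self.candidates = candidates      # rule_id -> list of action names
        self.shown = defaultdict(int)     # action -> times displayed
        self.accepted = defaultdict(int)  # action -> times the user engaged

    def _rate(self, action):
        shown = self.shown[action]
        return self.accepted[action] / shown if shown else 0.0

    def select(self, rule_id):
        actions = self.candidates.get(rule_id, [])
        if not actions:
            return None
        # max() keeps the first candidate on ties, so list order is the default.
        return max(actions, key=self._rate)

    def feedback(self, action, accepted):
        self.shown[action] += 1
        if accepted:
            self.accepted[action] += 1


def client_can_display(current_page, blocked_pages):
    """Client-side gate: do not interrupt the user on sensitive pages."""
    return current_page not in blocked_pages


router = ActionRouter({"r1": ["popup", "push"]})
first_choice = router.select("r1")
router.feedback("popup", accepted=False)
router.feedback("push", accepted=True)
second_choice = router.select("r1")
```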

Use Case

Since the new solution was launched, we have implemented it in an increasing number of business scenarios. Here are two examples.

Summary

This complete solution can significantly improve R&D efficiency. Generally, the original R&D process can be completed in four business days by writing code case by case. In extreme situations, if the client version needs to be updated, the process may take two to three weeks. However, when SQL is used, the R&D process only takes 0.5 business days. In addition, this solution has the following advantages:

  • High reliability: Based on the high reliability of Blink, this solution supported hundreds of millions of operations per second during the Double 11 Shopping Festival.
  • Suitable for businesses whose rules are formulated by a strong operations team.
  • Suitable for rules that can be expressed in SQL.

Future Plans

The current solution has the following disadvantages:

  • Xianyu’s business has maintained high growth for successive years. In the future, Xianyu may face user traffic that is three times our current level. If all computing is still completed on the cloud, this will pose a significant challenge to the computing power of the cloud.
  • The current design does not include algorithm access, but is simply used for selecting target users. To more accurately deliver rules and improve the effectiveness of the rules on users, we need to combine the solution with algorithms.
