Ant Financial’s Innovations and Practices in Online Graph Computing
By Pan Zhenxuan, nicknamed Taichu at Alibaba.
Over the past 15 years, Ant Financial has reshaped how people pay for things in China and transformed the lives of countless people, now having served more than 1.2 billion people around the world. However, these achievements have very much depended on the support of the technologies involved. During the 2019 Apsara Conference held in Hangzhou, Ant Financial shared its technical experience and knowhow from the past 15 years, discussing the company’s future-oriented innovations in financial technologies with the participants.
When attending the Second Belt and Road Forum for International Cooperation in April this year, Jing Xiandong, Chairman and CEO of Ant Financial, said that through nine years of practice, Ant Financial has improved the financing channels for small- and medium-sized enterprises and developed “3–1–0” online lending, that is, a service standard characterized by a 3-minute application process, and 1-second loan granting and all with 0 manual intervention. Even as recently as two years ago, some Chinese netizen grumbled that the “3–1–0” online lending model referred to a 3-week application, 1-month review, and 0 chance of receiving a loan.
So, how did Ant Financial provide convenient financing for small- and medium-sized enterprises by exploring and applying new financial technologies? How does Ant Financial review loan applications within one second? Let’s talk about one of the key technologies, online graph computing.
Online graph computing integrates stream computing with graph computing to implement graph computing in real time. After years of development in this direction, Ant Financial has made groundbreaking achievements in key technologies, and has developed targeted solutions for financial scenarios.
Ant Financial’s Online Graph Computing Scenarios
Ant Financial developed online graph computing technology to solve problems that occur in financial scenarios.
For example, the real-time anti-cashout scenario in financial risk control. Cashout refers to the act of obtaining cash benefits through illegal or fraudulent transactions. This sort of illegal act is often done with credit cards or other accumulated funds.
Ant Credit Pay (known as Huabei in Chinese) is a credit-based product. Users can use Ant Credit Pay to make purchases, and make repayments on a regular basis. Ant Credit Pay’s anti-cashout is a key piece in their overall risk control, and accurately identifying cashout activity in real time is crucial to Ant Credit Pay’s anti-cashout mechanism.
As shown in the figure, the cashout client pays for a transaction by using Ant Credit Pay, and after a series of complex capital flows, ultimately performs cashout by transferring the money back.
Data modeling abstracts the overall capital flow into a capital relationship network, and identifies and analyzes subgraphs based on the capital relationship network to effectively identify cashout activity.
By using real-time anti-cashout examples, we can obtain three basic requirements for online graph computing:
- First, the construction of a financial-grade capital network. Based on the online capital activity of users, a highly reliable, financial-grade capital network is built in real time.
- Second, analysis and decision-making based on real-time subgraphs. Based on a highly reliable financial-grade capital network built in real time, real-time analysis and computing can be performed based on subgraphs, and decisions can ultimately be made online.
- Finally, the construction of a dynamic subgraph network. In the process of analyzing a subgraph, it is necessary to dynamically build and expand subgraphs based on the user’s capital network activity.
Let’s talk about the familiar Ant Forest scenario.
Ant Forest has multiple forms of friend interactions. For example, Alipay friends can collect energy from each other, and friends and relatives can plant a tree together.
In Ant Forest, there are relationships between people, and between people and trees. It not only needs to support real-time relationship data construction, but also has requirements such as support for high-performance and low-latency relational data querying and consistent relationship data modification.
Through this Ant Forest example, we can also find three other requirements for online graph computing.
- First, support for millions of concurrent requests is required. Ant Forest currently has more than half a billion users, and needs to support millions of QPS and millisecond-level responses.
- Second, the storage of trillions of items of relationship data is required. Because Ant Forest has a large data scale, including user relationships, tree planting relationships, and joint planting relationships, there are many, complicated relationships. As a result, it needs to provide a graph storage capability for trillions of records.
- Finally, consistency is required. For example, in the process of transferring green energy between friends, due to the large amount of concurrency, it is necessary to ensure strong consistency in the real-time update process.
Based on the previous examples of real-time anti-cashout and Ant Forest, we will briefly summarize the requirements for financial-grade online graph computing.
The requirements are divided into two parts. The first part is function requirements:
- First, it needs to provide the ability to store massive amounts of graph data.
- Second, it provides low-latency I/O access to the stored massive graph data.
- Finally, after the subgraph relationship is obtained through low-latency I/O, various modes such as stream computing and graph computing are integrated.
The second part is stability requirements. Stability is especially important due to the characteristics of financial scenarios. Here there are two main points:
- First, it must support financial-grade disaster recovery and high reliability, such as fault recovery in five data centers across three regions.
- In addition, it also needs to support low-cost and auto scaling.
Overall Architecture of Ant Financial’s Online Graph Computing
After introducing the scenarios and requirements of Ant Financial’s online graph computing application, let’s learn about the core technologies and considerations of Ant Financial’s online graph computing application.
First, let’s take a look at the overall architecture of Ant Financial’s online graph computing application.
At the top layer, Ant Financial’s online graph computing application provides a unified graph development platform. Based on the unified graph development platform, users can develop jobs based on relational metadata and unified DSL. Currently, unified DSL is developed by combining SQL and Gremlin. At the same time, based on the user’s DSL, a distributed DAG is built in real time to perform operations.
After construction is complete, the distributed jobs will process online log data and event data in real time. There are two processes involved. One process writes log data and event behaviors processed in real time to a high-performance graph database, and builds a high-performance graph cache in real time based on the data, providing fast and efficient subgraph extraction.
The other process dynamically determines whether graphic traversal and computing is needed based on the data processed in real time. Additionally, in the computing process, it quickly extracts subgraphs for computing from the high-performance cache, and then outputs the computing results for online use.
The preceding architecture diagram shows that Ant Financial’s financial-grade online graph computing has three major technical directions:
- First is the computing capability provided by integrating stream computing and graph computing. Stream computing and graph computing must be integrated in one system to implement comprehensive online graph computing.
- The second is the graph cache with high compression ratio. Through high compression, all graph data can be stored in the memory, enabling fast and efficient subgraph extraction.
- Last is the financial-grade massive graph database. The financial-grade massive graph database is used to implement storage of massive amounts of relationship data and high reliability in financial scenarios.
Next, let’s introduce the core technical points of these three aspects.
Integration of Stream Computing and Graph Computing in Online Graph Computing
The first key technology is stream-graph integrated computing.
Take the case of Ant Credit Pay’s anti-cashout scenario as an example. Not every transaction or repayment requires the identification of cashout, so some guidelines must be established first. For example, based on real-time statistics on the number of transactions or the amount of money returned, iterative computing of subgraphs starts only after certain conditions are met. Finally, the results of iterative computing based on graphs are processed before being provided for online use. Therefore, a complete computing process requires the integrated computing of stream computing and graph computing.
In traditional computing methods, stream computing and graph computing are combined to implement the entire computing process. For example, systems such as Flink, GraphX, and Neo4j are used together to achieve this purpose. However, with traditional solutions, users need to learn multiple systems and maintain multiple systems for integration. The integration of multiple systems results in additional data storage and latency. Therefore, here at Ant Financial we broke through the boundaries of the computing mode, integrating stream computing with graph computing in one system to provide stream-graph integrated computing system. Users can use a set of APIs and one computing system to implement the stream-graph integrated computing process. As only one system is required, it can also reduce a user’s O&M costs.
Similarly, in the case of Ant Credit Pay’s anti-cashout, after the integration of stream computing and graph computing is realized, data can be used to make dynamic decisions on whether to perform graph computing and what graph computing algorithms are used. This cannot be set in advance. Therefore, the traditional static datacenter activation coordination (DAG) cannot meet the current needs. Here, we combine the data stream with the control flow and provide the dynamic DAG capability to implement on-demand computing and scale elastically.
After the provision of integrated computing and dynamic computing capabilities, it is particularly important for users to be able to perform fast and convenient development.
Here, we use SQL Plus (Gremlin) for integrated computing development. Users can build the overall pipeline process by using SQL. At the same time, the concept of graph view is introduced: a graph view can be built offline and in real time based on SQL. Based on graph view, you can use SQL and Gremlin to implement data-driven online graph computing.
At the same time, most SQL developers are familiar with it, which can reduce the threshold for users to learn, develop, and debug. Ant Financial is currently paying attention to the latest international standard graph query language (GQL) for graph querying, which will also be integrated in the future.
Based on these three features, you can quickly build an integrated stream-graph job.
Then, when you launch an algorithm policy, you need to verify the policy. You need to perform effective simulation of stream data to determine whether the current algorithm is effective.
Based on the current online graph computing architecture, we can effectively play back historical graph data and requests through effective abstraction of the model to implement simulated integrated architecture capabilities online.
Meanwhile, due to the features of simulation, historical data snapshots are accessed, causing accelerated expansion of graph data storage. In this example, because stream playback is used for simulation, we can prevent excessive data bloat by using a data-driven garbage collection (GC) policy. In addition, the multi-level cache policy is used to improve the throughput of graph simulation.
The preceding are the four key technologies of integrated computing. With these four features, users can efficiently and conveniently build and compute jobs.
Next, we will discuss how to quickly build a subgraph.
High-performance Graph Cache of Online Graph Computing
Here, we will focus on Ant Financial’s high-performance graph cache, which caches data in memory for online service based on perfect hash functions and professional compression capabilities. This enables low latency and high throughput in subgraph scenarios.
Memory usage is particularly important for high-performance graph caching. Here, we can take a look at a diagram showing the compression ratio comparison between the Ant Financial graph cache and similar industry systems. Here, we used the Twitter-based user-follower open dataset.
As you can see, due to the association characteristics of graphs, open-source systems in the industry have been enlarged to a certain extent on the basis of the original graph. However, TigerGraph has about 40% of the memory usage relative to the original size. Ant Financial’s graph cache can achieve 20% of the original memory size. It is the best choice in the industry and reduces memory usage by half.
By using a high compression ratio, we store the graph data completely in the memory. Next, we can compare the RT performance of subgraph extraction.
Also based on Twitter’s user-follower data, we can see that in the 1st degree, 2nd degree, and 3rd degree scenario, especially the 1st degree scenario, the overall latency is only about 20% of that of TigerGraph.
With high-performance graph cache and high-performance querying for subgraphs, online real-time comprehensive computing can be implemented.
GeaBase, a Financial Graph Database for Online Graph Computing
Next, let’s take a look at how to meet the requirements for high reliability and consistency in financial scenarios. Here we focus on Ant Financial’s financial-grade graph database GeaBase.
GeaBase internally implements data sharding by implementing micro shards. In addition, data sharding based on micro shards implements cost-based data migration and automatic load balancing. At the same time, with the help of Ant Financial’s architecture system, the city-level disaster recovery strategy of five centers across three regions has been implemented.
Through comparison, we can see that GeaBase provides better disaster recovery capabilities than HBase in cases of single-host, single-replica, or data center-level faults. Additionally, GeaBase can recover and prevent data loss in city-level disaster recovery scenarios.
GeaBase also implements the consistency capability to meet the requirements for strong data consistency in financial scenarios. GeaBase implements data consistency by implementing the Raft protocol. It also allows the business to choose between final consistency or strong consistency based on its own business needs.
Ant Financial’s online graph computing is widely used in multiple business lines of Ant Financial. It supports more than 100 business scenarios, such as risk control, social networking, and marketing. Currently, Ant Financial has a cluster of more than 2000 machines running 7/24 hours.
At the same time, Ant Financial has gradually exposed these capabilities to external financial-level customers, such as Changshu Rural Commercial Bank and Tailong Commercial Bank. For example, in Changshu Rural Commercial Bank, Ant Financial and Changshu Bank jointly developed the Yi Yan Knowledge Graph project. Based on the online graph computing technology developed by Ant Financial, the project uses massive association analysis in a big data volume environment to perform pre-judgment of the guarantee relationship. This implements second-level pre-judgment and risk alert for the guarantee circle, effectively controlling risk and improving the efficiency of credit approval. Ant Financial has not only achieved the “3–1–0” online lending, but has also provided this capability to its customers through cooperation to strive towards inclusivity in finance.
In the future, Ant Financial will continue to improve on the technical capabilities of online graph computing, opening it to more partners to promote the use of online graph computing in more scenarios.