Application of Apache Flink in Real-time Financial Data Lake

Jun 17, 2021

By Bai Xueyu, R&D engineer of Zhongyuan Bank

This article was written by Bai Xueyu, an R&D engineer on the big data platform at Zhongyuan Bank. It introduces how Zhongyuan Bank applies a real-time financial data lake, covering:

Background
Real-time financial data lake architecture
Scenario practices

Background

Headquartered in Zhengzhou, Henan province, Zhongyuan Bank is the only provincial corporate bank and the largest city commercial bank in Henan. It was listed in Hong Kong on July 19, 2017. Since its establishment, Zhongyuan Bank has pursued a strategy of growing through technology, with the aim of becoming a more technology-oriented and data-oriented bank.

This article covers the real-time financial data lake from three perspectives: background, architecture, and scenario practices.

Business Background of Data Lake

Changes in Decision-Making Model

Today, the decision-making models of banks are changing significantly.

First, traditional bank data analysis focuses on income, cost, profit distribution, and responses to regulatory supervision. This work is complex but follows well-defined rules, and it is essentially financial data analysis. As Internet finance continues to develop, banks' business is being squeezed, and a bank that stands still can no longer meet business needs.

Banks now need to run more targeted marketing and decision analysis, based on a better understanding of customers and on large-scale data collection. As a result, the decision-making model is gradually shifting from traditional financial analysis to KYC-oriented (Know Your Customer) analysis.

Second, traditional banking relies mainly on business personnel to make decisions. However, as the business grows, applications generate massive amounts of data of many types, and business personnel alone can no longer keep up. With problems becoming more complex and the number of influencing factors growing, banks need more comprehensive and intelligent technical means. The decision-making model therefore has to shift from relying purely on business personnel to relying more on machine intelligence.

Problem Analysis

The defining feature of the big data era is the large volume and variety of data. Making use of this data involves several kinds of analysis and data, including:

Traditional financial-oriented and off-line data analysis
Non-financial-oriented data analysis
Frequent changes in events and logs
Real-time data analysis

Diversified digital marketing methods are needed to build a more comprehensive, accurate, and scientific customer profile. Real-time risk decision-making technologies are required to monitor risk in real time, and multi-mode data processing technologies are required to support structured, semi-structured, and unstructured data. Machine learning and AI technologies are also needed to support intelligent problem analysis and decision-making.

Together, these technologies and data-driven decision-making scenarios are driving a major shift from traditional, finance-oriented offline data analysis to customer-oriented real-time data analysis.

Technical Background of Data Lake

In the banking system, the traditional data warehouse architecture, built on standardized and precise processing, handles financial analysis scenarios well. It will remain the mainstream approach for a long time.

Traditional Data Warehouse Architecture

The following figure shows the traditional data warehouse architecture. From the bottom up, it consists of the basic Operational Data Store (ODS), the integrated common data layer, the data mart layer, and the application processing layer. Each layer runs large batch jobs every day to produce the desired business results. Banks have long relied on the traditional data warehouse system because it provides excellent financial analysis solutions, with clear advantages:

Precise and standardized processing
Multi-layer data processing
Unified metric definitions
T+1 data processing
High performance
Accumulated experience
Suitable for financial analysis

However, a traditional data warehouse system also has its shortcomings. These include:

Difficult to change
High unit storage cost
Not suitable for frequently changing and highly real-time data, such as massive logs and events
Poor compatibility for semi-structured data and unstructured data

To sum up, the traditional data warehouse architecture has both advantages and shortcomings and will exist for a long time.

Changes in Data Warehouse Architecture

Analysis based on KYC and machine intelligence needs to handle multiple data types and multiple levels of timeliness in an agile way. A new architecture that complements the data warehouse architecture is therefore required.

Features of the Real-time Financial Data Lake

This brings us to the theme of this article, the real-time financial data lake. It has three main features:

Openness: It supports multiple scenarios, such as AI, unstructured data, and historical data.
Timeliness: It provides an effective architecture for real-time analysis and decision-making.
Integration: It integrates with the data warehouse architecture of the bank to unify the data view.

Overall, the real-time financial data lake is an integrated system. Integration covers six aspects:

Data: It integrates structured, semi-structured, and unstructured data.
Technologies: It integrates cloud computing, big data, data warehouse, stream computing, and batch processing technologies.
Design: Data models can be designed flexibly by theme, including Schema-on-read and Schema-on-write models as well as multi-dimensional and relational data models.
Data management: Metadata management is unified across the data lake and the data warehouse, which gives users a unified development experience.
Physical location: It can be a single large physical cluster or a logical cluster made up of physically scattered clusters.
Data storage: A unified data storage platform meets data migration requirements and reduces storage and O&M costs.

Real-time Financial Data Lake Architecture

Functional Architecture

The functional architecture covers data sources, unified data access, data storage, data development, data services, and data applications.

Data source: It supports structured data, semi-structured data and unstructured data.
Unified data access: A unified data access platform (UDAP) provides intelligent data access according to the data type.
Data storage: It includes a data warehouse and a data lake, with intelligent distribution of hot and cold data between them.
Data development: It includes task development, task scheduling, O&M monitoring and visual programming.
Data service: It includes interactive query, data API, Structured Query Language (SQL) quality assessment, metadata management, and data lineage management.
Data application: It includes digital marketing, digital risk control, digital operations and customer profile.

Logical Architecture

The logical architecture of the real-time financial data lake mainly includes the storage layer, computing layer, service layer, and product layer.

At the storage layer, MPP data warehouses and data lakes that are based on OSS or HDFS allow intelligent storage management.
At the computing layer, a unified metadata service is implemented.
At the service layer, federated data computing and data service APIs are available. Federated data computing is a federated query engine for cross-database queries. It relies on the unified metadata service and queries data in both data warehouses and data lakes.
At the product layer, intelligent services are provided, including packet RPA, license identification, language analysis, customer profile, and smart recommendation. Business analysis is also available, including self-service analysis, customer insights, and visualization. In addition, data development services are provided, including a data development platform and automated governance.
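The federated query idea at the service layer can be illustrated with a small sketch. It is not the bank's engine: Python's standard `sqlite3` module stands in for both stores, and the table names, the sample rows, and the `CATALOG` mapping are hypothetical. It only shows how a shared catalog lets one query join a warehouse table with a lake table.

```python
import sqlite3

# Two separate SQLite databases stand in for the MPP warehouse and the data lake.
conn = sqlite3.connect(":memory:")                    # "main" plays the warehouse
conn.execute("ATTACH DATABASE ':memory:' AS lake")    # second store plays the lake

conn.execute("CREATE TABLE main.customers (customer_id INTEGER PRIMARY KEY, segment TEXT)")
conn.execute("CREATE TABLE lake.click_events (customer_id INTEGER, page TEXT)")
conn.executemany("INSERT INTO main.customers VALUES (?, ?)",
                 [(1, "retail"), (2, "corporate")])
conn.executemany("INSERT INTO lake.click_events VALUES (?, ?)",
                 [(1, "loans"), (1, "wealth"), (2, "loans")])

# Hypothetical unified metadata: logical table name -> physical location.
CATALOG = {"customers": "main.customers", "click_events": "lake.click_events"}

def clicks_per_segment(connection):
    """Join a warehouse table with a lake table, resolved through the catalog."""
    sql = (
        f"SELECT c.segment, COUNT(*) FROM {CATALOG['customers']} AS c "
        f"JOIN {CATALOG['click_events']} AS e ON e.customer_id = c.customer_id "
        "GROUP BY c.segment ORDER BY c.segment"
    )
    return connection.execute(sql).fetchall()

result = clicks_per_segment(conn)
print(result)
```

The caller only uses logical table names; where each table physically lives is resolved by the catalog, which is the role the unified metadata service plays in the architecture described above.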

Practices of Real-time Financial Data Lake Projects

The practices focus mainly on real-time analysis of structured data. As shown in the following figure, the open-source architecture consists of four layers: the storage layer, the table structure layer, the query engine layer, and the federated computing layer.

Together, the storage layer and the table structure layer provide intelligent data distribution, with support for Upsert, Delete, Table Schema, and ACID semantics. They are also compatible with the storage of semi-structured and unstructured data.

The query engine layer and the federated computing layer make up a UADP, a one-stop data development platform for developing both real-time and offline data tasks.

This article focuses on real-time data development. The one-stop stream computing development platform, described next, supports the development, management, and O&M of real-time tasks and keeps them running stably.
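To make the Upsert and Delete semantics concrete, here is a minimal sketch of applying a batch of changes to a table keyed by primary key. It is an illustration only, not how the table structure layer is implemented: the `Change` tuple, the `apply_changes` function, and the sample accounts are all hypothetical. The batch is applied to a copy, so a reader sees either the old snapshot or the new one, which loosely mirrors the atomicity part of ACID.

```python
from typing import Dict, Iterable, Tuple

# A change record: (operation, primary key, row). The row is ignored for deletes.
Change = Tuple[str, str, dict]

def apply_changes(table: Dict[str, dict], changes: Iterable[Change]) -> Dict[str, dict]:
    """Apply a batch of changes to a copy of the table and return the new snapshot."""
    snapshot = dict(table)
    for op, key, row in changes:
        if op == "upsert":
            snapshot[key] = dict(row)      # insert if absent, overwrite if present
        elif op == "delete":
            snapshot.pop(key, None)        # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return snapshot

accounts = {"A1": {"balance": 100}}
new_accounts = apply_changes(accounts, [
    ("upsert", "A1", {"balance": 80}),     # update an existing row
    ("upsert", "A2", {"balance": 500}),    # insert a new row
    ("delete", "A1", {}),                  # remove the first row again
])
print(new_accounts)
```

The original `accounts` dictionary is left untouched, so the previous snapshot stays readable while the new one is built.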

Stream Computing Development Platform

Why do banks need a stream computing development platform? What are its advantages?

Advantages of a Stream Computing Development Platform

The stream computing development platform lowers the barrier to real-time data development and speeds up the business development that depends on it. It is a one-stop platform for developing real-time data tasks, offering visual data development and task management, multi-tenant and multi-project management, unified O&M management, and permission management. The platform is built on Flink SQL, which is productive by nature.

As the use of Flink SQL continues to spread, the capabilities of the stream computing development platform can be extended to branch banks. The branches can then develop real-time data tasks independently, based on their own business requirements.

Architecture

The following figure shows the architecture of the stream computing development platform, including data storage, resource management, computing engines, data development and Web visualization.

The platform supports multi-tenant and multi-project management, as well as O&M monitoring for real-time tasks. It can manage resources on both physical and virtual machines and supports Kubernetes as a unified cloud base. The computing engine is based on Flink and provides data integration, real-time task development, an O&M center, data management, and a visual integrated development environment (IDE) for data development.

“Straight-through” Real-time Scenarios

The specific scenarios are described below, starting with the “straight-through” architecture for real-time scenarios.

Data from different sources is ingested into Kafka in real time. Flink then reads the data from Kafka, processes it, and sends the results to the business side, which can be Kafka or a downstream store such as HBase. Dimension table data is stored in Elastic. The “straight-through” architecture delivers T+0 data timeliness and is mainly used in real-time decision-making scenarios.
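The shape of the straight-through flow can be sketched in plain Python. This is a simulation under stated assumptions, not Flink code: an in-memory list stands in for the Kafka topic, the `CUSTOMER_DIM` dictionary stands in for the dimension table, and a list's `append` stands in for the downstream sink. All names and sample values are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical dimension table, standing in for the dimension store.
CUSTOMER_DIM: Dict[str, dict] = {
    "C001": {"name": "Alice", "branch": "Zhengzhou"},
}

def enrich(event: dict, dim: Dict[str, dict]) -> dict:
    """Join one event with its dimension row (a lookup join on customer_id)."""
    extra = dim.get(event["customer_id"], {})   # unknown customers pass through unchanged
    return {**event, **extra}

def run_straight_through(source: Iterable[dict],
                         dim: Dict[str, dict],
                         sink: Callable[[dict], None]) -> None:
    """Process each event as it arrives and push the result straight downstream."""
    for event in source:
        sink(enrich(event, dim))

out: List[dict] = []
events = [
    {"customer_id": "C001", "amount": 250.0},
    {"customer_id": "C999", "amount": 10.0},
]
run_straight_through(iter(events), CUSTOMER_DIM, out.append)
print(out)
```

Nothing is stored between the source and the sink, which is what makes this path suitable for T+0 decisions and unsuitable for queries over history.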

Real-time Decision-making Analysis

Take collecting on loans that are coming due as an example. The business depends on three figures: account balance, transaction amount, and current balance. Based on these, it must decide which collection channel to use for each case: a message, an intelligent voice call, or phone collection.

If the architecture is based on the original offline data warehouse, the data obtained is T+1. Decisions made on stale data can mean that a customer who has already paid still receives a collection call. With the straight-through architecture, account balance, transaction amount, and current balance are available at T+0, so decisions are made in real time and the user experience improves.
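The effect of data freshness on this decision can be shown with a small sketch. The rules and thresholds below are assumptions made for illustration and are not the bank's actual policy. The field meanings are also assumed: `transaction_amount` is taken to be the latest repayment, and `current_balance` the amount still due.

```python
from dataclasses import dataclass

@dataclass
class AccountView:
    account_balance: float      # funds available in the customer's account
    transaction_amount: float   # assumed: most recent repayment amount
    current_balance: float      # assumed: amount due before that repayment

def choose_channel(view: AccountView) -> str:
    """Pick a collection channel. Rules and thresholds are illustrative only."""
    remaining = view.current_balance - view.transaction_amount
    if remaining <= 0:
        return "none"                 # already repaid, do not contact
    if view.account_balance >= remaining:
        return "message"              # a reminder is likely enough
    if remaining < 10_000:
        return "intelligent_voice"
    return "phone"

# T+1 view: yesterday's snapshot does not include today's repayment.
stale = AccountView(account_balance=500.0, transaction_amount=0.0, current_balance=20_000.0)
# T+0 view: the repayment made today is already visible.
fresh = AccountView(account_balance=500.0, transaction_amount=20_000.0, current_balance=20_000.0)

print(choose_channel(stale))   # phone
print(choose_channel(fresh))   # none
```

The same function gives opposite answers for the same customer, which is the gap between T+1 and T+0 described above.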

Real-time BI Analysis

Consider a requirement such as obtaining, in real time, the sales information for wealth management products from a past period up to the present. A few keywords stand out. “Real-time acquisition” means that T+0 data is required. “From the past period to the present” involves querying historical data. And because the sales of wealth management products are tied to the banking business, which is generally complex, a multi-stream join is required.

This is a real-time BI requirement, and the “straight-through” architecture cannot handle it effectively. That architecture is based on Flink SQL, which cannot effectively handle historical data queries, and because banking business is often complex, a double-stream join is needed as well. Solving this problem calls for a new architecture, different from the “straight-through” architecture for real-time scenarios.

Floor-to-floor Real-time Scenario

After data sources are connected to Kafka, Flink processes the Kafka data in real time and writes the results to the data lake. The data lake is built on open-source components: data is stored on HDFS and S3, and Iceberg is the table format. Flink can also write intermediate results to the data lake and process them step by step to obtain the desired results. These results are then served to applications through query engines such as Flink, Spark, and Presto.
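The step-by-step processing can be sketched as two stages that communicate through persisted tables. This is a plain Python simulation under stated assumptions, not Flink or Iceberg code: the `lake` dictionary stands in for lake tables, the join works on small in-memory batches rather than unbounded streams, and the table names and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Table name -> rows. Stands in for tables stored in the data lake.
lake: Dict[str, List[dict]] = defaultdict(list)

def stage_join(orders: List[dict], products: List[dict]) -> None:
    """Stage 1: join two inputs on product_id and persist the intermediate result."""
    by_id = {p["product_id"]: p for p in products}
    for order in orders:
        product = by_id.get(order["product_id"])
        if product is not None:
            lake["sales_detail"].append({**order, "category": product["category"]})

def stage_aggregate() -> None:
    """Stage 2: recompute totals from every row in the intermediate table, old and new."""
    totals: Dict[str, float] = defaultdict(float)
    for row in lake["sales_detail"]:
        totals[row["category"]] += row["amount"]
    lake["sales_by_category"] = [
        {"category": category, "total": total} for category, total in sorted(totals.items())
    ]

# A historical row that an earlier run already wrote to the lake.
lake["sales_detail"].append({"product_id": "P1", "amount": 100.0, "category": "fund"})

stage_join(
    orders=[{"product_id": "P1", "amount": 50.0}, {"product_id": "P2", "amount": 30.0}],
    products=[{"product_id": "P1", "category": "fund"},
              {"product_id": "P2", "category": "deposit"}],
)
stage_aggregate()
print(lake["sales_by_category"])
```

Because the intermediate table is persisted, the second stage sees historical and newly arrived rows together, which is what the straight-through path cannot offer.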

Real-time Financial Data Lake

Architecture

The real-time financial data lake architecture of Zhongyuan Bank covers both the “straight-through” and the “floor-to-floor” real-time scenarios. Data is ingested into Kafka in real time, and Flink reads and processes it in real time. Dimension table data is stored in Elastic.

There are two cases:

When the business logic is simple, Flink reads event data from Kafka and dimension table data from Elastic in real time, and sends the processing results directly to the business side.
When the business logic is complex, the intermediate results are first written to the data lake and then processed step by step to obtain the final results. These results are served to different applications through query engines.

Data Flow Direction

The following figure shows the data flow of the real-time financial data lake. All source data arrives through Kafka, and Flink SQL reads it in real time for Extract-Transform-Load (ETL) processing. Applications connect to real-time data ETL and to the data lake platform to obtain real-time and quasi-real-time results. Real-time data ETL corresponds to the “straight-through” architecture for real-time scenarios, and the data lake platform corresponds to the “floor-to-floor” architecture.

Features of the Real-time Financial Data Lake

Openness: The data lake supports complex SQL and many financial scenarios.
Timeliness: The data lake supports real-time and quasi-real-time data analysis and processing, and offers two ways for applications to connect: floor-to-floor and non-floor-to-floor.
Integration: With the financial data lake architecture, the data lake supports both stream and batch analysis and processing of structured data. Because it uses distributed storage, it also supports semi-structured and unstructured data.

Achievements

Continuous construction has produced a series of achievements. T+0 data timeliness is now available, more than 20 financial products are supported, and storage costs have been cut by a factor of five.

Best Practices: Intelligent, Real-time, Anti-Fraud

The real-time financial data lake is mainly used for real-time BI and real-time decision-making. A typical real-time decision-making application is the intelligent real-time anti-fraud service, which relies on the real-time computing platform, the knowledge graph platform, the machine learning platform, and real-time data models. It provides a set of data services, including relationship fraud, device fingerprinting, behavior monitoring, location resolution, and generic match, to support anti-fraud in transaction, application, and marketing scenarios.

Currently, 1.4 million risk data records are processed in real time daily, with 110 real-time blocks and 108 real-time alerts per day.

Best Practices: Real-time BI

Take the real-time customer insight platform as an example. Known internally as the Zhiqiu platform, it relies on the real-time computing platform, the knowledge graph platform, the customer profile platform, and the intelligent analysis platform. These platforms work together to provide interactive query services, unified metadata management services, SQL quality assessment services, configuration development services, unified visual data display, and more. The platform supports scenarios such as trend analysis, circle analysis, retention analysis, and customer group analysis.

Common requirements and services for real-time analysis scenarios are now connected, forming a closed loop for visual real-time BI analysis. Branch banks can perform digital real-time BI analysis independently. So far, 26,800 real-time BI analysis cases have been implemented, with more than 10,000 monthly active users on average. The platform also assists with more than 30,000 real-time BI analyses per day.

Original Source: Application of Apache Flink in Real-time Financial Data Lake (Apache Flink Community China, March 26, 2021)

Written by Alibaba Cloud
