Who Comes Next After the Rise of Data Warehouses and Data Lakes?

Business Background

The Typical Real-time Business Scenario

First, let’s look at a typical real-time business scenario, which is also the business scenario for most real-time computing users. The entire processing path constitutes a typical stream computing architecture. Specifically, user behavior data or Binlog entries from database synchronization are written to Kafka, and Flink jobs then subscribe to and consume this real-time data from Kafka. Several tasks are performed along the way. For example, the pre-processing phase may include an online training task that joins the stream with certain dimension tables and attributes. Sometimes these attributes can all be loaded into the compute nodes, but they can also be too large to load, in which case HBase is additionally required for high-concurrency point lookups. Some samples can also be written into HBase so that other components can interact with them. Eventually, the sample data or models generated in real time are pushed to the search engine or algorithm modules for further processing. This is a complete, and quite typical, machine learning path.
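
To make this path concrete, here is a minimal Flink SQL sketch (expressed through the Java Table API) that reads events from Kafka, enriches them with high-concurrency point lookups against an HBase dimension table, and pushes the enriched samples onward. All topic names, table names, fields, and addresses are hypothetical placeholders, not the actual pipeline described above.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RealtimeEnrichmentSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical user-behavior stream coming from Kafka.
        tEnv.executeSql(
            "CREATE TABLE user_events (" +
            "  user_id STRING," +
            "  item_id STRING," +
            "  behavior STRING," +
            "  proc_time AS PROCTIME()" +          // processing-time attribute needed for lookup joins
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'user_events'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.group.id' = 'demo'," +
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json')");

        // Hypothetical dimension table served by HBase for point lookups.
        tEnv.executeSql(
            "CREATE TABLE dim_user (" +
            "  rowkey STRING," +
            "  cf ROW<age INT, gender STRING>," +
            "  PRIMARY KEY (rowkey) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'hbase-2.2'," +
            "  'table-name' = 'dim_user'," +
            "  'zookeeper.quorum' = 'zk:2181')");

        // Enriched samples pushed onward, here simply to another Kafka topic.
        tEnv.executeSql(
            "CREATE TABLE enriched_samples (" +
            "  user_id STRING, item_id STRING, behavior STRING, age INT, gender STRING" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'enriched_samples'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'format' = 'json')");

        // Lookup join: each event triggers a point query against HBase.
        tEnv.executeSql(
            "INSERT INTO enriched_samples " +
            "SELECT e.user_id, e.item_id, e.behavior, d.cf.age, d.cf.gender " +
            "FROM user_events AS e " +
            "JOIN dim_user FOR SYSTEM_TIME AS OF e.proc_time AS d " +
            "ON e.user_id = d.rowkey");
    }
}
```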

The Increasingly Complex Architecture

The aforementioned path and an offline path can complement each other. Although some companies have not yet built the real-time path and rely on the offline path alone, this combination is already a very mature solution. So what problems arise as the online path grows more complex? To see this, let’s change the processing path slightly. The path now writes real-time data to Kafka and performs real-time machine learning or metric computation with Flink. The computing results are then passed to online services, such as HBase or Cassandra, for point lookups, and finally delivered to online dashboards for metrics visualization.
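
The metric-computation part of this modified path might look like the following sketch: a continuously updated aggregation over the Kafka stream whose results are upserted into an HBase table that online services and dashboards read by key. Table names, fields, and addresses are again hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RealtimeMetricsSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical click stream from Kafka.
        tEnv.executeSql(
            "CREATE TABLE clicks (" +
            "  item_id STRING," +
            "  user_id STRING" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'clicks'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.group.id' = 'demo'," +
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json')");

        // Hypothetical serving table in HBase; dashboards and online services read it by key.
        tEnv.executeSql(
            "CREATE TABLE item_metrics (" +
            "  rowkey STRING," +
            "  cf ROW<pv BIGINT, uv BIGINT>," +
            "  PRIMARY KEY (rowkey) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'hbase-2.2'," +
            "  'table-name' = 'item_metrics'," +
            "  'zookeeper.quorum' = 'zk:2181')");

        // Continuously maintained per-item metrics, upserted into HBase by primary key.
        tEnv.executeSql(
            "INSERT INTO item_metrics " +
            "SELECT item_id, ROW(COUNT(*), COUNT(DISTINCT user_id)) " +
            "FROM clicks GROUP BY item_id");
    }
}
```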

A Typical Big Data Architecture: Lambda

The big data architecture described above is the one most companies use. Many companies also extract the parts relevant to their business scenarios and assemble from them a complete big data architecture that processes both real-time and offline data. So far, this looks like a perfect solution to the practical problems. On closer inspection, however, there are potential problems that make the architecture increasingly difficult to maintain. What are these problems? Let’s take a closer look.

Pain Points of Typical Big Data Architectures

Next, let’s go through the open-source storage products on the market one by one and check whether they can meet these business needs.

Simplified Big Data Architecture

How do we solve the problems described in the previous section in such a highly redundant and complex system? A useful practice is to simplify the Lambda architecture. The essence of the business is to perform real-time processing or offline processing (batch processing) on a data source. Starting from the business scenario, we want both real-time and offline data to be stored centrally in a single storage system, and that system must serve the various business needs. This sounds idealistic and complicated, because the system would have to support a wide variety of scenarios. But if we could build it, the architecture would be ideal: it would essentially unify stream and batch computation, since the same set of SQL statements could complete both stream processing and batch processing. Following the same reasoning, it would also solve the storage problem, because stream data and batch data would live in the same product.
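
Flink’s unified Table API already hints at the compute half of this idea: the same SQL text can be executed under a streaming environment or a batch environment. A minimal sketch, using a bounded datagen source and a print sink as stand-ins for real tables:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UnifiedSqlSketch {

    // The same DDL and query text, reused in both modes.
    private static final String SOURCE_DDL =
        "CREATE TABLE orders (" +
        "  order_id BIGINT," +
        "  amount DOUBLE" +
        ") WITH (" +
        "  'connector' = 'datagen'," +
        "  'number-of-rows' = '100')";      // bounded, so it also runs in batch mode

    private static final String SINK_DDL =
        "CREATE TABLE order_totals (total DOUBLE) WITH ('connector' = 'print')";

    private static final String QUERY =
        "INSERT INTO order_totals SELECT SUM(amount) FROM orders";

    public static void main(String[] args) throws Exception {
        // Streaming execution of the query.
        TableEnvironment streamEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        streamEnv.executeSql(SOURCE_DDL);
        streamEnv.executeSql(SINK_DDL);
        streamEnv.executeSql(QUERY).await();

        // Batch execution of the exact same SQL.
        TableEnvironment batchEnv =
            TableEnvironment.create(EnvironmentSettings.inBatchMode());
        batchEnv.executeSql(SOURCE_DDL);
        batchEnv.executeSql(SINK_DDL);
        batchEnv.executeSql(QUERY).await();
    }
}
```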

Seemingly Perfect Data Lakes

With a simplified architecture in mind, let’s look at some products released by the open-source community, such as data lakes.

Insufficient Real-time Performance of Incremental Data Writing

What open-source data lakes call a real-time write is actually an incremental write. The difference: after a real-time write, the written data can be retrieved by a query immediately; after an incremental write, it cannot, because an incremental write batches data to improve throughput. This solution therefore cannot fully meet the real-time requirements for the data.

QPS of a Query

This architecture is expected to support both real-time analysis and dimension-table queries in stream computing. But can the data in this storage sustain high-concurrency queries from Flink, for example at a rate of hundreds of millions of QPS?

Concurrency of a Query

The whole solution is based on an offline computing engine and only supports low concurrency. Supporting a concurrency of several thousand QPS requires a lot of resources, which drives up costs.

My Thoughts on HSAP

What Is HSAP?

After analyzing the preceding problems in detail, we divided computing patterns into the following four categories according to their requirements for query concurrency and query latency.

  • Batch: offline computing (batch processing)
  • Analytical: interactive analysis
  • Serving: online services with high QPS
  • Transaction: money-related business in a traditional database, which most big data scenarios do not require

HSAP (Hybrid Serving and Analytical Processing) refers to handling the Serving and Analytical patterns together in a single system.

Big Data Architectures Based on HSAP

After interfacing the HSAP system with the simplified architecture above, a perfect big data architecture emerges. The HSAP system handles batch processing of offline data, provides the dimension tables that Flink joins during stream processing, and connects to online applications to provide online services such as reports and dashboards.
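
In that setup, the dimension table that Flink joins can live in the HSAP system itself rather than in a separate point-lookup store. A minimal sketch, assuming the HSAP system exposes a PostgreSQL-compatible endpoint so that Flink’s generic JDBC connector can perform the lookup; the tables, columns, credentials, and addresses are hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HsapDimensionJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical event stream from Kafka, with a processing-time attribute for the lookup join.
        tEnv.executeSql(
            "CREATE TABLE events (" +
            "  user_id STRING," +
            "  amount DOUBLE," +
            "  proc_time AS PROCTIME()" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'events'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.group.id' = 'demo'," +
            "  'format' = 'json')");

        // Hypothetical dimension table stored in the HSAP system,
        // reached over its PostgreSQL-compatible endpoint via the JDBC connector.
        tEnv.executeSql(
            "CREATE TABLE dim_user (" +
            "  user_id STRING," +
            "  city STRING," +
            "  PRIMARY KEY (user_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:postgresql://hsap-endpoint:5432/demo_db'," +
            "  'table-name' = 'dim_user'," +
            "  'username' = 'user'," +
            "  'password' = 'secret'," +
            "  'lookup.cache.max-rows' = '100000'," +
            "  'lookup.cache.ttl' = '60s')");

        // Print sink stands in for whatever consumes the enriched stream.
        tEnv.executeSql(
            "CREATE TABLE enriched (user_id STRING, amount DOUBLE, city STRING) " +
            "WITH ('connector' = 'print')");

        // Lookup join: each streaming event is enriched with the latest dimension row.
        tEnv.executeSql(
            "INSERT INTO enriched " +
            "SELECT e.user_id, e.amount, d.city " +
            "FROM events AS e " +
            "JOIN dim_user FOR SYSTEM_TIME AS OF e.proc_time AS d " +
            "ON e.user_id = d.user_id");
    }
}
```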

The PostgreSQL Ecosystem

Then, how do online applications and systems use the HSAP system once it is introduced? An ecosystem is required to reduce the difficulty of use. After repeated research, we chose the PostgreSQL ecosystem for its advantages.

A New-generation Real-time Interactive Engine, Hologres

Based on the preceding introduction, we come to the key topic of today: the new-generation real-time interactive engine released on Alibaba Cloud, called Hologres. Why the name Hologres? It is a blend of Holographic and Postgres.

Architecture of Hologres

The architecture of Hologres is simple. At the bottom is a centralized storage system, which can be the Apsara distributed file system of Alibaba Cloud, HDFS, OSS, or S3. On top of the storage layer is a computing layer that provides computing services in an MPP-based architecture. Above the computing layer, the FE layer distributes the plan for each query to the computing nodes. At the top sits the PostgreSQL ecosystem: Hologres can be queried as soon as Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) drivers are available.
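
Because Hologres speaks the PostgreSQL protocol, a query can be issued with the standard PostgreSQL JDBC driver alone. A minimal sketch, in which the endpoint, database, credentials, and table are hypothetical placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HologresJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Hologres endpoint; the standard PostgreSQL JDBC driver is used.
        String url = "jdbc:postgresql://your-hologres-endpoint:80/demo_db";

        try (Connection conn = DriverManager.getConnection(url, "access_id", "access_key");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT item_id, SUM(amount) AS gmv FROM orders GROUP BY item_id LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("item_id") + " -> " + rs.getDouble("gmv"));
            }
        }
    }
}
```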

Hologres: Cloud-native

1) Storage and Computing Disaggregation

  • No locking: Write throughput increases linearly as resources scale out, until the CPU is fully utilized.
  • Memory management: Hologres provides data caching and supports high-concurrency queries.
  • Vectorization: Hologres accelerates queries for column-oriented data through vectorization.
  • Storage optimization: Because Hologres controls its own query engine, it can tailor how data is laid out in storage to that engine, providing better query performance.

Typical Applications Based on Hologres

The following describes a typical application of Hologres at Alibaba. Data is written to Flink and pre-processed in real time, for example through real-time ETL or real-time training, and the processing results are written directly to Hologres. Hologres supports dimension-table joins, result caching, complex real-time interactions, offline queries, and federated queries, so the entire business system can perform its data operations through Hologres alone. The online system accesses the data in Hologres through the PostgreSQL ecosystem. This removes the need to integrate with other systems and resolves the query and storage problems of the architecture described earlier.
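
The write path of this application can be sketched as follows. In real deployments a dedicated Hologres connector for Flink is typically used; this sketch deliberately sticks to the generic JDBC connector over the PostgreSQL-compatible endpoint, and every table name, field, credential, and address in it is a hypothetical placeholder.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkToHologresSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical raw events arriving through Kafka.
        tEnv.executeSql(
            "CREATE TABLE raw_orders (" +
            "  order_id BIGINT," +
            "  item_id STRING," +
            "  amount DOUBLE," +
            "  ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.group.id' = 'demo'," +
            "  'format' = 'json')");

        // Result table living in Hologres, addressed through its PostgreSQL-compatible endpoint.
        tEnv.executeSql(
            "CREATE TABLE item_gmv (" +
            "  item_id STRING," +
            "  gmv DOUBLE," +
            "  PRIMARY KEY (item_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:postgresql://your-hologres-endpoint:80/demo_db'," +
            "  'table-name' = 'item_gmv'," +
            "  'username' = 'access_id'," +
            "  'password' = 'access_key')");

        // Real-time ETL: aggregate the stream and upsert the result into Hologres.
        // Dashboards and online services then read item_gmv with any PostgreSQL client.
        tEnv.executeSql(
            "INSERT INTO item_gmv " +
            "SELECT item_id, SUM(amount) FROM raw_orders GROUP BY item_id");
    }
}
```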

A Real-time Data Warehouse without Compromises Powered by Flink + Hologres

Ultimately, to build a true real-time data warehouse without compromises, we need only Flink and Hologres. Flink handles the extract, transform, and load (ETL) processing of both stream and batch data and writes the processed data into Hologres for centralized storage and querying. The business side can then provide online services by connecting directly to Hologres.

