By Zhang Liangmo, senior product expert at Alibaba Cloud
Ironically enough, when it comes to designing a data warehouse, “data” is the element most likely to be ignored. The concept of the data warehouse was proposed in 1990, and it has since gone through four main stages: the evolution from databases into data warehouses, the MPP architecture, data warehouses in the big data era, and today’s cloud-native data warehouses. At each stage, data warehouses have faced different challenges.
Challenges Faced by Traditional Data Warehouses
First, the startup cost is high, the construction period is long, and it is difficult to verify value quickly.
Traditional data warehouses have a long lifecycle, from purchasing servers and building the physical warehouse to building the logical warehouse. For the team building a data warehouse, the first challenge is therefore how to shorten this construction period.
Second, it is challenging to process diversified data, embrace new technologies, and maximize data value.
With the advent of big data, the second challenge for traditional data warehouses, which mostly manage structured data, is to manage semi-structured data in a unified and comprehensive way.
Third, it is difficult to share enterprise data assets, and data-driven innovation is costly.
Data warehouses place great emphasis on management and security, which makes it challenging to share and exchange data within the organization and across the ecosystem. For example, many isolated data silos still exist between enterprise departments or between businesses. Data sharing costs are high, and there is no unified, enterprise-level way to acquire or export data. As a result, data consumers struggle to obtain data and analyze it on their own, and enterprises rely heavily on IT departments to meet their data needs.
Fourth, the platform architecture is complex with high operation costs.
As more types of data are processed in ever larger volumes, different technologies are layered on top of each other, making data warehouse architectures more complex. An enterprise usually runs data warehouses built on several technologies at the same time, so simplifying the architecture is also a major challenge. Typically, a dedicated team of engineers is needed to manage such complex data platforms, and resource utilization remains low.
Fifth, it is challenging to realize the scalability, elasticity, and flexibility needed for businesses.
Enterprises whose business is growing rapidly often need to handle large promotional events, backfill data, and respond to unplanned events. This raises the challenge of quickly scaling data warehouse performance and improving response times across business peaks and valleys.
How does the new data warehouse, as driven by technologies and services, deal with these challenges faced by traditional data warehouses? There are six major driving forces here.
- First, we hope to have a unified data platform that can connect, store and process a variety of data.
- Second, enterprises want data-driven support and decision-making in real time, which demands better timeliness.
- Third, the data volume has become very large. To find the desired data from such a large amount of data, a map is required for data management and governance.
- Fourth, in a traditional data warehouse, data is managed centrally and must reside in a single storage system. Driven by new businesses, data now needs to be connected rather than stored in one place.
- Fifth, we need to support more intelligent applications on top of the data warehouse, turning information into services and making businesses information-driven. This is the driving force behind intelligent data warehouses.
- Sixth, different roles in the data field have different requirements for data platforms. For example, data engineers, data analysts, and data scientists have different requirements for data platforms in terms of response time, processing speed, data volume, and development languages. Therefore, providing more analytics services is the sixth driver for data management platforms.
Through this continuous evolution, the data warehouse has taken on more and more new meaning compared with 30 years ago. Looking at this new meaning, we can clearly see four evolution trends: cloud native, the lake house, offline-realtime unification, and SaaS-based service models.
Cloud Native — The Evolution Direction for Data Warehouse Infrastructure
Cloud native is the basic evolution direction of data warehouse infrastructure. Traditional data warehouses run on physical servers or on hosted servers in the cloud. By going cloud native, a data warehouse can build on more basic cloud services, including storage, networking, and monitoring services. Self-service and elasticity become available out of the box, and the cloud data warehouse can integrate more deeply with other cloud services, for example extracting log data from various sources into the warehouse, managing data across the whole pipeline, and feeding machine learning. Cloud native, in short, means building on and integrating with cloud services in a native way.
Alibaba Cloud has many business scenarios and systems. How can we manage data generated in these scenarios? How does a data warehouse help us out and how does it evolve?
As shown in the figure, the cloud native solution makes full use of the elastic computing, storage, and security capabilities of the cloud at the underlying layer, and Alibaba Cloud shields users from all of this complexity. As a user of the data platform, you only need to activate the service and create a project on the web platform; within five minutes you have a working data warehouse and can start developing the models behind it. This greatly shortens service delivery and simplifies both the underlying architecture of the data warehouse and the process of building the technical stack. Another feature is the extensibility of the cloud native data warehouse: whether a job requires 1 CU or 10,000 CUs, the platform schedules resources to process the data as needed. Cloud native therefore brings nearly unlimited extensibility.
Lake House: The Evolution Direction for the Data Architecture of Data Warehouse
For the lake house, let’s first discuss what drives it forward. The data warehouse is still the best solution for enterprise data management so far. Most enterprises have their own data warehouses, possibly built on different technologies, and in terms of processing policies, semantic support, scenario optimization, and engineering experience, the data warehouse has proved an optimal solution in practice. As data grows, however, enterprises need more flexible and agile data exploration.
Meanwhile, unknown data may need to be stored before it can be explored. Data warehouses and data lakes bring enterprises different advantages in terms of processing policies, semantic support, and use cases: data warehouses are easy to manage and deliver high data quality, while data lakes offer flexibility and room for exploration. How to combine the two advantages, analysis optimization and data exploration, is the question behind the “lake house”.
In MaxCompute’s data warehouse-based scenario, the proven engineering and management experience of the data warehouse was combined with the flexibility of the data lake for data management and processing, producing a new data management architecture called the lake house, first proposed in 2019.
MaxCompute-based data warehouses provide a secure, reliable, and structured way to manage data. In addition, DataWorks provides data lineage, data maps, data governance, and other capabilities. How can these capabilities be extended into the data lake? For data lakes, whether built on cloud OSS or on Hadoop HDFS, how can we add exploration capabilities that improve data processing performance, manageability, and security on top of their existing flexibility?
All we need to do is connect the data warehouse to the data lake: discover the metadata of the data lake, manage it in a structured and unified way, and combine it with the flexibility and convenience of the lake through Data Lake Formation (DLF). This is the warehouse-centered data management architecture of the lake house, and it takes the data warehouse one step further in enterprise data management.
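To make the idea concrete, here is a minimal Python sketch of a unified catalog that discovers metadata from raw lake files and merges it with curated warehouse tables. This is not the actual DLF or MaxCompute API; every class and field name here is a hypothetical stand-in for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    """Minimal table metadata: name, source system, and column names."""
    name: str
    source: str          # "warehouse" or "lake"
    columns: tuple


def discover_lake_tables(lake_files):
    """Infer table metadata from raw lake files (dicts standing in for
    semi-structured JSON records found under each path prefix)."""
    tables = {}
    for path, record in lake_files.items():
        table = path.split("/")[0]          # "clicks/part-0.json" -> "clicks"
        tables.setdefault(table, set()).update(record.keys())
    return {
        name: TableMeta(name, "lake", tuple(sorted(cols)))
        for name, cols in tables.items()
    }


def unified_catalog(warehouse_tables, lake_files):
    """Merge warehouse metadata with metadata discovered in the lake.
    Curated warehouse definitions win on name collisions."""
    catalog = discover_lake_tables(lake_files)
    catalog.update(warehouse_tables)        # curated entries take precedence
    return catalog


# Curated warehouse tables vs. raw files sitting in the lake.
warehouse = {"orders": TableMeta("orders", "warehouse", ("id", "amount"))}
lake = {
    "clicks/part-0.json": {"user": "a", "url": "/x"},
    "clicks/part-1.json": {"user": "b", "ts": 1},
}
catalog = unified_catalog(warehouse, lake)
```

Once metadata from both sides sits in one catalog, lake tables can be governed and queried alongside warehouse tables, which is the essence of the warehouse-centered lake house described above.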
Offline-Realtime Unification — The Evolution Direction for Data Analysis of Data Warehouse
In enterprise data warehouses, data is collected through subscriptions such as SLS and Kafka, and its use falls into three types. The first is to archive data in the data warehouse and then analyze it in full. The second is real-time query and analysis; in a risk control scenario, for example, real-time correlation analysis is required to find the call records of a phone number over the past three years. The third is multi-dimensional query.
Once real-time data is correlated, it can be processed in batches, processed in real time, and queried. The acquisition, computing, and application of real-time data are the three core aspects of a data warehouse’s development from offline to real-time. The core is computing, which can be divided into active computing and passive computing. Offline computing is typically passive: data warehouse engineers must define tasks and schedule jobs before new results are computed.
In addition to passive computing, offline-realtime unification requires active computing. As data flows in, new or intermediate results are calculated automatically, with no manual intervention, even when a job is inserted or restarted. Real-time computing maximizes this active mode: active computation delivers the desired result data without rescheduling any jobs.
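The difference between passive and active computing can be sketched in a few lines of Python. This is a toy aggregate, not how any specific engine is implemented:

```python
class PassiveAggregate:
    """Passive (offline) computing: a result exists only after a
    scheduled batch job re-scans the full dataset."""
    def __init__(self):
        self.rows = []
        self.total = None          # stale until the next scheduled run

    def insert(self, value):
        self.rows.append(value)    # inserting data computes nothing

    def scheduled_run(self):       # e.g. a nightly batch job
        self.total = sum(self.rows)


class ActiveAggregate:
    """Active (real-time) computing: every insert immediately updates
    the intermediate result, so no job has to be (re)scheduled."""
    def __init__(self):
        self.total = 0

    def insert(self, value):
        self.total += value        # result is maintained incrementally


passive, active = PassiveAggregate(), ActiveAggregate()
for v in (10, 20, 5):
    passive.insert(v)
    active.insert(v)
# active.total is already 35; passive.total stays None until scheduled_run()
```

The active aggregate always holds an up-to-date result as data flows in, while the passive one serves stale data until its next scheduled run, which is exactly the gap that offline-realtime unification closes.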
Offline-realtime unification solves some business problems, but the architecture behind it is very complex. Alibaba Cloud therefore proposed an offline-realtime unified data warehouse architecture in which only a few core products are needed. The data sources include transaction data, user data, and device data generated by each server. Through Log Service, the data is regularly archived to Hologres, while real-time data warehouses process it with stream computing. A full data warehouse then performs active computing, passive computing, and real-time data acquisition, and the result data can be analyzed directly in Hologres without any relocation. Real-time data acquisition, computing, and analysis services are integrated together, greatly simplifying the architecture. This is what we call today’s offline-realtime unified cloud data warehouse.
SaaS Mode — The Evolution Direction for the Data Warehouse Service Model
How are services delivered on top of this evolution of the data warehouse’s infrastructure, data management architecture, and data analytics architecture? The answer is to deliver data warehouses to customers in SaaS mode, which simplifies the use of data warehouse services to the greatest extent.
There are several ways to build a data warehouse. The first, as everyone knows, is to build it on physical servers. The second is to build a semi-hosted cloud data warehouse based on Hadoop on the cloud, or on one of various MPP databases. The third and fourth are deeply cloud native. The third is the typical Snowflake approach, in which the basic cloud services are not exposed to data warehouse managers: the IaaS layer is embedded into the PaaS layer, and the final data warehouse is exposed as a complete SaaS service through the web. In 2021, 13 global vendors participated in the Forrester evaluation, of which only three delivered data warehouse services in SaaS mode: Google’s BigQuery, Snowflake, and Alibaba Cloud MaxCompute.
From self-built to cloud native, the cloud data warehouse service has minimized the complexity of data warehouse management. The architecture has fewer layers, with no clusters or software to manage; delivering the warehouse as a service eliminates O&M of the underlying components, and background upgrades are handled by the cloud vendor. Users manage only their own data and data models, and use the data warehouse service through the web. Data stored in the warehouse is billed by storage capacity, just like cloud storage, and computing is billed the same way, which fully reflects the advantages of SaaS. SaaS mode also brings powerful elasticity for matching business needs: many customers need only 10,000 units of computing power for daily business but 30,000 during Double 11. A SaaS-mode service can provide enough elasticity to meet the warehouse’s varying workloads while remaining completely imperceptible to users.
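As a back-of-the-envelope illustration of why that elasticity matters, compare provisioning for the peak year-round with paying per use. The uniform per-CU price and the three-day peak are illustrative assumptions for the sketch, not Alibaba Cloud’s actual pricing:

```python
def fixed_capacity_cost(peak_cu, days, price_per_cu_day=1.0):
    """Self-built cluster: capacity must be provisioned for the peak
    all year round, whether it is used or not."""
    return peak_cu * days * price_per_cu_day


def pay_per_use_cost(daily_cu, peak_cu, days, peak_days, price_per_cu_day=1.0):
    """SaaS data warehouse: pay only for what each day actually consumes."""
    return (daily_cu * (days - peak_days) + peak_cu * peak_days) * price_per_cu_day


# The scenario from the text: 10,000 units daily, 30,000 during Double 11.
fixed = fixed_capacity_cost(peak_cu=30_000, days=365)
elastic = pay_per_use_cost(daily_cu=10_000, peak_cu=30_000, days=365, peak_days=3)
```

Under these toy assumptions the fixed-capacity cluster costs nearly three times as much as the elastic service, because it pays for peak capacity on every ordinary day.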
To sum up, data warehouses have evolved from the databases of 1990 into data warehouses, then the MPP architecture, then data warehouses in the big data era, and now cloud-native data warehouses. Cloud-native infrastructure, the lake house data architecture, offline-realtime unified data analysis, and SaaS-based service models are the four main directions of evolution. Alibaba Cloud is building a new data warehouse architecture along these lines to improve the experience of data management.
You can learn more about Alibaba Cloud MaxCompute by visiting https://www.alibabacloud.com/product/maxcompute