DataWorks: A Platform for Developing and Governing a Data Lake

Definition

According to Wikipedia: “A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning.”

A Top Priority

By that definition, even a USB flash drive qualifies as a data lake. So do Object Storage Service (OSS), Hadoop Distributed File System (HDFS), and Pangu, the Apsara Distributed File System; all of them strictly conform to the data lake definition. When drawing up the technical specification for an enterprise data lake, a top priority is therefore choosing a specific storage medium or system as the data lake solution. Each option has its own advantages and disadvantages: some storage systems respond faster to random reads or deliver better throughput for batch reads, while others offer lower storage costs, better scalability, or better-structured data organization. The goal is to exploit the strengths of each storage system while avoiding its weaknesses.

Three Challenges

Metadata management, data integration, and data development are the three major challenges of a data lake. As a general-purpose big data platform, DataWorks addresses not only the problems of data warehouse scenarios but also these core pain points of the data lake scenario.

Metadata Management

Users need unified, centralized management of the data lake, which is its first core capability. The data governance capability of DataWorks lets you manage the metadata of the various storage systems in the data lake. It currently manages metadata for 11 types of cloud data sources: OSS, E-MapReduce (EMR), MaxCompute, Hologres, MySQL, PostgreSQL, SQL Server, Oracle, AnalyticDB for PostgreSQL, AnalyticDB for MySQL v2.0, and AnalyticDB for MySQL v3.0. Functionally, DataWorks supports metadata collection, storage and retrieval, online metadata services, data preview, category tagging, data lineage, data exploration, impact analysis, and resource optimization.
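To make the ideas of metadata retrieval, data lineage, and impact analysis more concrete, here is a minimal Python sketch of an in-memory catalog. It is an illustration only, not the DataWorks interface: the class MetadataCatalog, its methods, and the table names are all hypothetical, and impact analysis is modeled simply as a breadth-first walk over lineage edges.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    """Metadata for one table. All fields are illustrative assumptions."""
    source_type: str  # e.g. "OSS", "MaxCompute", "Hologres"
    name: str
    tags: tuple = ()


class MetadataCatalog:
    """Hypothetical in-memory catalog; not a DataWorks API."""

    def __init__(self):
        self._tables = {}                     # (source_type, name) -> TableMeta
        self._downstream = defaultdict(set)   # lineage edges: upstream -> downstream

    def register(self, meta):
        """Collect and store metadata under one centralized key space."""
        self._tables[(meta.source_type, meta.name)] = meta

    def add_lineage(self, upstream, downstream):
        """Record that `downstream` is derived from `upstream`."""
        self._downstream[upstream].add(downstream)

    def search(self, keyword):
        """Retrieve tables whose name contains `keyword`, across all sources."""
        return sorted(key for key in self._tables if keyword in key[1])

    def impact(self, start):
        """Impact analysis: every table reachable downstream of `start`."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for child in self._downstream.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return sorted(seen)


# Usage with made-up table names spanning three storage systems.
catalog = MetadataCatalog()
raw = ("OSS", "raw_orders")
clean = ("MaxCompute", "dwd_orders")
report = ("Hologres", "ads_order_report")
for key in (raw, clean, report):
    catalog.register(TableMeta(*key))
catalog.add_lineage(raw, clean)
catalog.add_lineage(clean, report)

print(catalog.impact(raw))  # tables affected if raw_orders changes
```

Keying every table by its source type and name is what gives the sketch a single, centralized view over several storage systems; a production catalog would also persist this metadata and collect it automatically.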

Data Integration

When managing data in a data lake, you face the challenge of moving and transforming data among various storage systems. To address this, DataWorks provides data integration capabilities that cover data import and export as well as data format conversion for 40 types of common data sources. It supports both offline and real-time synchronization and handles complex network environments when connecting to external systems.
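As a small illustration of what data format conversion involves, the following Python sketch converts records between CSV and JSON Lines using only the standard library. The function names and the sample data are assumptions made for this example; it does not show how DataWorks implements synchronization.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text):
    """Convert CSV text with a header row into JSON Lines text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row, sort_keys=True) for row in reader)


def jsonl_to_csv(jsonl_text, columns):
    """Convert JSON Lines text back into CSV with the given column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(json.loads(line))
    return out.getvalue()


# Made-up sample records for the round trip.
sample = "id,city\n1,Hangzhou\n2,Beijing\n"
as_jsonl = csv_to_jsonl(sample)
round_trip = jsonl_to_csv(as_jsonl, ["id", "city"])

print(as_jsonl)
print(round_trip == sample)
```

Note that CSV carries no type information, so every value arrives in JSON as a string. A real integration job would also need a schema mapping between source and target, which this sketch leaves out.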

Data Development

Once storage management and data transport are in place, you must find better ways to serve the business by putting the data in the data lake to work. This requires introducing various computing engines. The computing platform business unit provides open-source computing engines such as Spark, Presto, Hive, and Flink, as well as proprietary engines such as MaxCompute and Hologres. The difficulty lies in making full use of each engine's advantages while allowing data in the data lake to be accessed and computed on freely. To solve this problem, DataWorks provides a convenient way to move data between engines within an all-in-one data development environment. From ad hoc queries to periodic ETL development, DataWorks supports the unified development and O&M of computing tasks for each computing engine.
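The idea of developing tasks for several engines in one environment can be sketched as a small dependency-ordered pipeline. This is a minimal Python illustration under stated assumptions: the runners are stand-ins that only format a string instead of submitting a job, and the task names, SQL text, and the run_pipeline function are hypothetical rather than part of DataWorks. It uses graphlib from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def make_runner(engine):
    """Return a stand-in runner; a real one would submit the job to the engine."""
    def run(statement):
        return f"{engine}: {statement}"
    return run


# One runner per engine, all behind the same call signature.
RUNNERS = {name: make_runner(name) for name in ("hive", "spark", "maxcompute")}

# Hypothetical periodic ETL tasks; "deps" lists the tasks that must finish first.
TASKS = {
    "ingest": {"engine": "hive", "sql": "LOAD raw_orders", "deps": []},
    "clean": {"engine": "spark", "sql": "BUILD dwd_orders", "deps": ["ingest"]},
    "report": {"engine": "maxcompute", "sql": "BUILD ads_order_report", "deps": ["clean"]},
}


def run_pipeline(tasks, runners):
    """Run tasks in dependency order, dispatching each to its engine's runner."""
    graph = {name: task["deps"] for name, task in tasks.items()}
    log = []
    for name in TopologicalSorter(graph).static_order():
        task = tasks[name]
        log.append(runners[task["engine"]](task["sql"]))
    return log


for line in run_pipeline(TASKS, RUNNERS):
    print(line)
```

Because every engine sits behind the same runner signature, the pipeline definition stays the same whether a step runs on an open-source or a proprietary engine; only the "engine" field changes.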
