Data Conversion Practices of MaxCompute Data Warehouse

Data Architecture and Process

  1. The temporary layer contains incremental data and full data.
  2. The basic data layer can permanently store data. It contains a core model and general summary data. The core model contains customer, offering, event, channel, and code data. This layer creates model tables by using the entity and attribute naming specifications of the data warehouse. Tables at this layer can be classified as primary tables, history tables, and incremental tables. These tables store historical data and feature efficient usage and convenient design.
  3. The application layer contains data marts that cover customer analysis, sales analysis, and offering inventory analysis. Unlike the basic data layer, which permanently stores data, the application layer only stores required data. However, like the basic data layer, the application layer also creates model tables by using the entity and attribute naming specifications of the data warehouse.

ETL Algorithms

There are five extract, transform, load (ETL) strategies: full table replace (M1), update or insert (M2), loading (M3), full history table (M4), and incremental history table (M5). The following describes the update or insert (to primary tables) (M2), loading (M3), and full history table (M4) ETL algorithms.

Update or insert (to primary tables) algorithm

This algorithm applies to keeping the latest-state table. It compares data based on primary keys (or specified fields). If records exist in the target table, the existing records are updated. Otherwise, new records are inserted into the table. Full Outer Join is required to implement this algorithm because MaxCompute does not support “update or delete.”

Loading algorithm

The loading algorithm is used to add incremental data directly to target tables. It applies to incremental, unmodified data such as flows, transactions, events, and CDRs.

Full history table algorithm

A full history table must contain the s_date (start date) and e_date (end date) fields, which record the data modification track in the table. The s_date field indicates the start date of data in the table. If this field is missing in a service table during initial loading, the earliest date is used. The e_date field indicates the expiration date or renewed effective date of data in the table. During initial loading, this field must be set to the latest date. The full history table algorithm involves two loadings and four data types. The two loadings include initial loading and routine loading. During initial loading, full data is loaded to the history table, in which the start date is the service date or the earliest date and the end date is the latest date. During routine loading, all fields excluding s_date and e_date are compared, and Full Outer Join is used to generate new, expired, and unmodified data. In this way, expired data is found directly in the history table. The four data types include currently new data, currently expired data, currently unmodified data, and expired data. Currently new data indicates data whose start date is the data date and end date is the latest date. Currently expired data indicates data whose start date remains unchanged and end date is the data date. Currently unmodified data indicates data whose start date and end date remain unchanged. Expired data indicates data whose start date and end date remain unchanged but the data itself is invalid. All the preceding types of data can be directly inserted into the new history table.

NULL Value Processing

NULL is a SQL keyword indicating unknown data or values. It does not specify data types or data features. Upon comparison of any value with NULL, false is returned and the comparison result is empty. In the OLTP system, most table fields are set to NULL. During table association or field convergence involving the NULL value, the result may be different from business personnel’s expectations. Except for special cases, we recommend that you convert the NULL value of fields into non-NULL values before saving data to tables in the data warehouse. The NULL issue is related to database technical processing. Because of the NULL value, the computing result may be different from business personnel’s expectations. You need to exercise caution when running SQL statements and analyzing data involving the NULL value.

ETL on MaxCompute and DataWorks

Unified ETL script development

The ETL program reads column schema from the MaxCompute metadata table and generates a unified script based on the column schema. The ETL logic is fixed. You can use the ETL program to generate algorithm scripts and modify the NULL processing content in the scripts. If the Python and Python ODPS plug-in have been installed, name the program scripts_gen.py and set the ODPS configuration file, target table name, source table name, primary keys, and ETL algorithm parameters to generate a SQL script file.

ETL task mapping

Before developing ETL conversion tasks, you need to sort the mapping between tasks to facilitate task development and progress records. During or after development, you need to record the field-level mapping and conversion in a detailed document to facilitate unified task management and maintenance.

ETL conversion task development — example

ETL development procedure

There are three steps in the ETL development procedure: generate an ETL script file, create a DataWorks task, and test and release the script file. The detailed process is as follows:

  1. Run the scriptsGen.py program, set parameters based on the ETL algorithm, generate a unified ETL script file, and modify the NULL processing content in the script file.
  2. Create the corresponding directory and task on the DataWorks data development page, and copy the SQL script file to the new task.
  3. Test the running, set scheduling parameters, and click Submit.

Summary

In this article, we discussed about the MaxCompute data architecture as well as data conversion practices of MaxCompute data warehouse using three ETL algorithms. We have also seen how to implement ETL on Alibaba Cloud using DataWorks.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com