For enterprises, data has become an important asset; and as an asset, it certainly needs to be managed. Increases in business lead to more and more data applications. As enterprises go about creating data warehouses, their requirements for data management grow, and data quality is a vital component of data warehouse construction that cannot be overlooked. This article offers suggestions on how to generate quality data while constructing a MaxCompute data warehouse, and thus provides guidance for practical data governance.
Data Quality Assurance Principles
There are different standards for evaluating the quality of data among different industries and even among different enterprises. In this article, we evaluate four main aspects of data; namely, integrity, accuracy, consistency and timeliness.
Integrity indicates whether the data records and information are complete, and whether there is any missing data. Missing data includes missing records and missing information in certain fields within records, both of which may cause inaccurate statistical results. You could say that 'integrity' is the most basic assurance of data quality. For example, if one day a relatively stable business data volume of one million records per day suddenly drops by 10,000 records, it could well be the result of missing records. The other example is information missing from a certain field within a record. For instance, in a table containing results for college entrance examinations, if test scores correspond to exam registration numbers, the null value of this field should be 0. If the number is larger than 0, it indicates that the information is missing.
Accuracy refers to whether the information and data recorded within the data are accurate, and whether there are any exceptions or erroneous information. For example, negative scores on academic transcripts or order forms with no buyer information. Both of these are problematic. Ensuring the accuracy of records is also an essential principle of data quality.
Consistency is generally reflected in a wide range of data warehouses. For example, if the company has many business data warehouse branches, any single piece of data must be consistent throughout. For example, the type and length of the user ID must remain consistent throughout, from its online business database processing, to its data warehouse, to each data application node. Thus, the MaxCompute Data Warehouse Construction Standards Guide contains 'public layer' processing to ensure data consistency.
Ensure timely data output so as to retain the data's value. For example, decision-making analysts generally want to see data from the previous day rather than waiting for three to five days to see a data analysis result. Otherwise, the value of the data is lost, making data analysis completely meaningless.
Data Quality Management Processes
To manage data quality and formulate quality management standards that meet the above data quality principles, several aspects must be considered:
- Which data requires quality management?
- At what stage should data quality management be conducted?
- How is it done specifically?
Definition of Data Quality
Defining which data requires quality management can generally be done through data asset classification and metadata application link analysis. Determine the data asset level based on the degree of impact of the application. According to the data link lineage, the data asset level is pushed to all stages of data production and processing, so as to determine the asset level of the data involved in the link and the different processing methods adopted based on the different asset levels in each processing link.
Data Asset Level Definition
When it comes to data asset levels, from a quality management perspective, the nature of impact that data has on the business can be divided into five classes, namely: destructive in nature, global in nature, local in nature, general in nature, and unknown in nature. The different levels decline in importance incrementally. Specific definitions of each are as follows:
- Destructive in nature: namely, if something goes wrong with the data, it will cause major asset losses, and bring about a major loss in earnings.
- Global in nature: namely, data is directly or indirectly used for enterprise-level business, performance evaluation, and important decision-making.
- Local in nature: namely, data is directly or indirectly used for operations and reports of some business lines. If a problem occurs, it may affect the business line or result in loss of work efficiency.
- General in nature: namely, data that is primarily used for everyday data analysis and causes minimal impact when problems arise.
- Unknown in nature: namely, data application scenarios that cannot be identified.
For example, in the label level of the table, the asset level can be marked ‘Asset’: destructive-A1, global-A2, local-A3, general-A4, unknown-Ax. Degree of importance: A1> A2> A3> A4> Ax. If a piece of data appears in multiple application scenarios, the higher value principle is observed.
Implementation of Data Asset Levels
After the data asset levels have been defined, the question of how to implement them is considered, and how to classify the massive quantity of data in the data warehouse into asset levels. We can start with the data flow link.
The basic process of MaxCompute data processing: data is generated from the business system and imported into the data warehouse system (MaxCompute) through a synchronization tool (DataWorks data integration or Alibaba Cloud DTS). In the data warehouse, a series of operations occur, wherein the data is cleaned, processed, integrated, and its algorithms and models calculated. It is then output to the data product through a synchronization tool for consumption. The data from the entire process is stored in the form of tables, and the data flow link is roughly depicted as follows:
On the data flow link, identify which tables are consumed by which application products, sort these application products into separate data asset levels, then combine the upstream and downstream lineage of the data, and label the entire link as a certain level of asset. For example, for an A2-level data application product, Table 1, Table 2, and Table 3 are the export tables of the data warehouse (MaxCompute) imported to this data product, and several tables are marked with the A2-xxx data product tag. Based on a lineage trace, mark these tables with an A2 tag so that they are always marked to the source data business system.
Use the above method to confirm the data asset level and define different importance levels for different data.
Since we now know that the importance level of data varies for different data asset levels, we can adopt different measures to safeguard such data. Next, we will look at how to safeguard data of different levels in MaxCompute-based data warehouses.
Card Verification During Data Processing
Online System Card Verification
Card verification during online system data processing mainly refers to card verification during data generation in the business system. The data produced by the online business system is also the data source of the data warehouse. However, the online business system is generally complex and changeable, and each change will inevitably bring about data changes. The data warehouse must adapt to ever-changing business developments and ensure data accuracy in a timely manner. Therefore, it is also important to consider how to efficiently notify MaxCompute-based offline data warehouses of online business changes. Here we will introduce two methods for your reference: tools and personnel. In addition to using tools to automatically capture every business change, developers are also required to automatically provide notification when they become aware of any business changes.
Tools — the release platform. When a major business change occurs, offline developers subscribed to the release process will be notified of such changes to content. When the business system is complex enough and the day-to-day change notices are frequent, if offline services are notified of every change, unnecessary waste will occur and the efficiency of business iteration will also be affected. In this situation, we can use the system of data asset level identification to tag the business and figure out the changes that affect data processing for high-level data assets, such as relevant financial reports. If transformations to the business system affect the calculation of financial statements, and the agreed calculation criterion has been changed by the business system’s release notifications, offline personnel must be informed, and offline developers must also actively pay attention to such release change notifications.
Note: The release platform referred to here is not provided by Alibaba Cloud, but is a general term that refers to the respective release platforms for each enterprise’s online business.
Tools — database change awareness. As the business develops, database resizing or DDL changes will inevitably occur in the business database (the data source of MaxCompute data warehouse), and offline developers must be automatically notified of such changes. When a MaxCompute-based data warehouse extracts offline data, the DataWorks data integration tool may restrict a business database table. If the database table is expanded or migrated, and the data integration tool is unaware of such changes, it may cause errors in the data extraction; and if the data integration tool is incorrect, a series of downstream applications dependent on the table may be affected. Therefore, it is recommended that the business database also require a database table change notification.
The tool is only a supplementary device, with the core element still being the operator. When data asset levels have been established both upstream and downstream, the process should be explained to online developers, so that they may recognize core data assets and raise their level of data risk awareness. Through methods such as staff training, business developers should be informed of offline data demands, offline data processing procedures, and data product application methods, so that they may understand the importance and value of data. At the same, they should also be informed of the consequences of making errors. This allows business developers to keep data objectives in mind while working toward their business objectives, thus ensuring consistency on both business and data ends.
Offline System Card Verification
First, let’s take another look at the basic process of data processing using MaxCompute: data is generated from the business system, and enters the data warehouse system (MaxCompute) through a synchronization tool (DataWorks data integration or Alibaba Cloud DTS ). In the data warehouse, a series of operations occur, wherein the data is cleaned, processed, integrated, and its algorithms and models calculated. It is then output to the data product through a synchronization tool for consumption.
It is only through data processing that the data warehouse model and data warehouse code are constructed. During this entire process, an important question for offline data warehouses is how to ensure the quality of data processing.
MaxCompute processes data. You can use DataWorks, MaxCompute studio, or submit various tasks directly to MaxCompute SDK for processing. No matter what tool you use, you will have to go through the process of code development-> testing and release-> O & M, and modification. You can perform card verification for each link in this process.
- Card verification based on the submitted code. Check related rules before submitting SQL statements. Currently, there are no tools available on the public cloud to perform this verification, but users with the necessary skills may develop related tools on their own. Rule categories include:
- Code specification rules, such as table naming rules, lifecycle settings, and table comments.
- Code quality rules, such as 0 notification for the denominator, reminders for the impact of NULL values on calculation results, and incorrect field insertion sequences.
- Code performance rules, such as partition pruning failure, large table scanning reminders, and repeated computing detection.
- Card verification when the task is released online. To ensure the accuracy of online data, each change must be tested before releasing data to the online production environment. Moreover, such releases will only be considered successful when the production environment has been tested and passed.
- Task changes or data rerunning are inevitable during offline data processing. Before updating the changes, you need to notify downstream users of the reasons, logic, and time of the changes. If the downstream users have no objection to the change, the release should be carried out at the agreed time to minimize the impact of the change on downstream users.
Data Risk Point Monitoring
The previous section described how to ensure the consistency of online and offline data through card verification during data processing. This section describes how to ensure data accuracy through monitoring data risk points.
Online Data Risk Point Monitoring
During the process of generating data in the online business system, data quality must be ensured, primarily by monitoring data according to certain business rules. MaxCompute itself does not come with any supplementary tools, so users must implement such tools themselves. To follow are some suggestions for your reference.
One example is to perform rule validation for the records in the database table and create some monitoring rules. In the business system, data is verified when it is stored in the database during each business process. In a transaction system, monitoring rules are configured for the time the order was taken, the order end time, order payment amount, and order status flow. The time the order was taken cannot be later than the current day, nor can it be earlier than the business system launch time. Once an exception occurs, the verification fails. When the business is complex and there are many rules, and the operating costs such as rule configuration are high, the data asset level is also monitored.
Offline Data Risk Point Monitoring
This section describes the monitoring of offline data risk points during MaxCompute-based data warehouse construction. It reports primarily on the monitoring of data accuracy and data output timeliness.
Data accuracy is the key to data quality. Therefore, data accuracy has become a top priority for direct data connections and the principal form of assurance for all offline systems during processing. Below, we describe how to ensure the accuracy of MaxCompute offline data using the DataWorks data quality tool, DQC.
Note: To use DQC, you must use DataWorks for task scheduling and execution.
Let’s first get to know the DQC tool architecture: DQC uses DataSet as the monitoring object. When the offline MaxCompute data changes, DQC verifies the data and blocks the production link to avoid the spread of problematic data pollution. Furthermore, DQC allows for the management of verification history results, which enables you to analyze and grade the data quality.
As shown in the figure above, DQC primarily configures data quality verification rules so as to automatically monitor data quality during data processing. DQC can monitor data quality and issue alerts. It does not process data output, and when an alert is triggered, the recipient of the alert must determine how best to handle the situation.
DQC data monitoring rules can be divided into strong rules and weak rules. Strong rules: once an alert has been triggered, the execution of the task will be blocked (the task is set to a failed state, so that the downstream tasks will not be triggered); weak rules: an alert is issued but the execution of the task is not blocked. Based on Alibaba’s internal experience, DQC provided some templates for common rules, including: fluctuation in the table’s number of rows compared to ’N’ days before; fluctuation in the tablespace size compared to ’N’ days before; fluctuation in the maximum / minimum / mean value of the field compared to ’N’ days before, the field null value / unique number, and so on. For more information, see the Data Quality module in the DataWorks user manual.
The DQC workflow is shown in the following figure:
From this we can see that the DQC check is actually also an SQL task, but that it is nested in the primary task. If there are too many checks, it will affect the overall performance of the task’s execution. Therefore, the questions of what data needs to be configured by DQC rules and what rules should be configured are also determined according to the data asset level. For example, A1 and A2 data monitoring rates must exceed 90%, and more than three types of rule classes are required so that no firm requests are made for unimportant data assets.
Similar rules are configured by offline developers to ensure data accuracy. Of course, different businesses are also constrained by business rules, which arise from the business needs of data products or consumption. A consumption node is configured and pushed up to the starting point of the offline system for monitoring to minimize the impact of rules.
Timeliness of Data
To ensure data accuracy, data needs to be provided in a timely manner, otherwise the value of the data diminishes considerably, or is even rendered worthless. So, ensuring data timeliness is also a top priority for ensuring data quality.
For offline tasks based on MaxCompute, such as those with daily time intervals, time requirements are placed on the data output of some important daily tasks. For example, some decision reports must be generated at 9:00 or earlier. To ensure data integrity, daily tasks generally start to run at 0:00, when the data of the previous day is computed. Most of these tasks are run at night. To ensure that the data is generated on time, the order of priority in which tasks are executed (especially when the project involves many tasks and resources are limited), as well as task execution failures or alerts for exceeding timeout limits must all be considered. The “importance” of the important tasks here again refers to the classification of data asset levels mentioned above. The higher the importance level, the higher the assurance of priority.
- Task priority The priority level for tasks on the MaxCompute platform are all the same and cannot be configured. Therefore, to implement a “priority” function for MaxCompute tasks, you have to start from the scheduling platform, where priority tasks can be scheduled first.
To schedule tasks on the DataWorks platform, when the corresponding project uses pre-paid resources (pre-purchased fixed computing resources are only available for the current project), the “intelligent monitor” tool can be used to set priority tasks. DataWorks scheduling is a tree structure. When the priority-level of a leaf node is configured, the priority is transmitted to all upstream nodes, and the leaf node is usually the consumption node of service businesses. Therefore, when it comes to priority settings, first determine the asset level of the task. The higher the level of the task, the higher the consumption node priority configuration, with priority scheduled tasks given priority status of computing resources to ensure that such tasks are produced on time.
When the project to which the DataWorks node task belongs uses the post-paid resources of MaxCompute (pay-as-you-go for computing without fixed resources), the priority of the smart monitoring configuration is invalid. Therefore, you need to assess whether you want to purchase prepaid resources so as to optimize tasks, reduce unnecessary resource waste, and strive to complete computing tasks more efficiently given limited resources.
- Task alert Task alarms and priorities are similar. You can use DataWorks’ “smart monitoring” tool to perform configuration. Simply configure the leaf nodes to transfer them upstream. Errors or possible latencies during task execution are inevitable. To ensure the most important data (high asset level) gets produced, we need “errors” to be handled immediately, and “possible” delays to be detected and dealt with.
- DataWorks — Intelligent Monitoring. When MaxCompute uses DataWorks to schedule offline tasks, DataWorks provides an intelligent monitoring tool to monitor and issue alerts for the scheduled tasks.
Intelligent Monitor is a monitoring and analysis system for DataWorks task operations. Intelligent Monitor determines if, when, how, and to whom alarms are reported, as well as the object to which the alarm is reported based on the monitoring rules and task running situation. It automatically selects the most appropriate time, method, and object to report an alarm.
Intelligent Monitor aims to:
- Reduce your configuration costs.
- Prevent invalid alarms.
- Automatically cover all important tasks (for quantities that you are unable to handle by yourself).
Data Quality Measurement
The previous section provides solutions to ensure the data quality of MaxCompute-based data warehouses. However, whether these solutions are suitable or need to be improved requires a set of metrics for measurement.
For example, if frequent alerts are sent from the DataWorks Intelligent Monitoring system, each time a data quality event occurs, it is necessary to analyze the cause, the handling process, and subsequent preventative solutions for similar future incidents. Serious data quality incidents are upgraded to faults, and faults must be defined, classified, processed, and reviewed.
Related Tool Links
- DataWorks-data quality management tools, Documentation, Tool Interface.
- DataWorks-intelligent monitoring tool, Documentation, Tool Interface.