N Methods for Migrating Data to MaxCompute
Bin Fu, a data technical expert at Alibaba Cloud, delivered a live speech entitled “N Methods for Migrating Data to MaxCompute” in the MaxCompute developer exchange group on DingTalk in 2018. In the presentation, he introduced an internal combat demo system developed by Alibaba Cloud, which processes big data, both offline and real-time, in an automated, full-link manner, and demonstrated how to use the system to process big data. The following article is based on that speech.
Big Data Center Architecture
The preceding figure shows the overall architecture of the big data combat demo system, which migrates data to the cloud in offline mode. To support its applications, the architecture must include a data source system, a data warehouse, a big data application platform, a web/application platform, process scheduling, data processing and display, and a real-time processing platform. The data warehouse consists of the temporary layer, the basic data layer, and the application layer.
In the entire architecture, migrating data to the cloud has the highest priority, and the appropriate migration method varies with the data source.
First, data is uploaded to MaxCompute, a big data storage and processing platform. Then, data is processed and transmitted to the temporary layer. After simple conversion, data is then transmitted to the basic data layer. Finally, data is aggregated at the application layer to provide services.
In this architecture, processes are scheduled through DataWorks and the DataX protocol tools, data is displayed through QuickBI and DataV, and real-time data is processed through DataHub + StreamCompute. After processing, the system stores the resulting data in Relational Database Service (RDS) on the big data application platform and displays it through QuickBI and DataV.
Methods for Data Migration to (Synchronization with) the Cloud
There are many methods for migrating data to the cloud. Typical MaxCompute built-in tools for migrating data to the cloud include Tunnel, DataX, and DataWorks:
- Tunnel uses Tunnel commands to upload and download data, process data files, and perform other operations.
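As an illustrative sketch, the Tunnel commands below upload a local CSV file into a partitioned MaxCompute table and download it back. They are issued through the odpscmd client; the project, table, and file names are hypothetical:

```shell
# Hypothetical project/table names; assumes odpscmd is installed and configured.
odpscmd -e "tunnel upload /tmp/store_sales.csv demo_project.tmp_store_sales/pt='20180101' -fd ',';"
odpscmd -e "tunnel download demo_project.tmp_store_sales/pt='20180101' /tmp/store_sales_out.csv -fd ',';"
```

The `-fd` option sets the field delimiter; partitioned tables take a `table/partition` target as shown.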
- DataX acts as a synchronization tool for offline data, efficiently synchronizing data among various heterogeneous data sources and uploading it to MaxCompute. Heterogeneous data sources include MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, OTS, OSS, and MaxCompute. Tunnel is also required: if Tunnel is not used together with DataX, the synchronized data may become inconsistent after being migrated to MaxCompute, which complicates development.
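For illustration only, a minimal DataX job file for synchronizing a MySQL table to a MaxCompute table might look like the following, using the `mysqlreader` and `odpswriter` plug-ins. All connection details, table names, and credentials are placeholders:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 2 },
      "errorLimit": { "record": 0 }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "demo_user",
            "password": "demo_password",
            "column": ["id", "order_date", "amount"],
            "splitPk": "id",
            "connection": [
              {
                "table": ["orders"],
                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/demo_db"]
              }
            ]
          }
        },
        "writer": {
          "name": "odpswriter",
          "parameter": {
            "project": "demo_project",
            "table": "tmp_orders",
            "partition": "pt='20180101'",
            "column": ["id", "order_date", "amount"],
            "accessId": "<access-id>",
            "accessKey": "<access-key>",
            "odpsServer": "http://service.odps.aliyun.com/api"
          }
        }
      }
    ]
  }
}
```

The job would then be launched with DataX's Python entry point, for example `python datax.py mysql_to_maxcompute.json`.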
- DataWorks uses its data integration function to define a data synchronization task and migrates data to the cloud through that task. DataWorks operates in wizard or script mode on the GUI, in compliance with DataX protocols. To use DataWorks for data integration, users perform the following four steps:
- Step 1: Configure the data source (database and table) and data flow direction (database and table).
- Step 2: Configure field mapping relationships and ensure that the source table fields on the left of the GUI correspond to the target table fields on the right in one-to-one mode.
- Step 3: Filter the source table with a WHERE condition (entered without the WHERE keyword itself) to control which data is loaded from the source table.
- Step 4: Limit the synchronization rate, set the splitting key to the source table's primary key, and specify the maximum number of error records; if the number of error records exceeds this threshold, the synchronization task is terminated.
Then, data can be migrated to MaxCompute.
Methods for Real-Time Data Migration to the Cloud
Logstash is a simple and powerful distributed log collection framework. It is often deployed with ElasticSearch and Kibana to form the famous ELK stack, which is well suited to analyzing log data. Alibaba Cloud StreamCompute provides a DataHub output/input plug-in for Logstash to help users collect more data on DataHub. With Logstash, users can easily access more than 30 data sources from the Logstash open source community. Logstash also supports filters to customize the processing of transmitted fields, among other functions.
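As a sketch (the log path, endpoint, project, and topic names are placeholders), a Logstash pipeline that ships web access logs to DataHub through the output plug-in might look like this:

```
input {
  file {
    path => "/var/log/nginx/access.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  datahub {
    access_id => "<access-id>"
    access_key => "<access-key>"
    endpoint => "https://dh-cn-hangzhou.aliyuncs.com"
    project_name => "demo_project"
    topic_name => "nginx_access_log"
    dirty_data_continue => true
    dirty_data_file => "/tmp/datahub_dirty.data"
  }
}
```

The filter section illustrates the field-customization capability mentioned above; the exact plug-in option names should be checked against the plug-in's documentation for the installed version.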
Alibaba Cloud DataHub is a streaming data processing platform. It allows users to release, subscribe to, and distribute streaming data. It can also help users easily create streaming data–based analysis tasks and applications. DataHub provides services to continuously collect, store, and process large amounts of streaming data generated by all kinds of mobile devices, application software, web services, sensors, and other items. Users can write application programs or use stream computing engines to process streaming data that has been written on DataHub, such as real-time web access logs, application logs, and various events. In addition, a variety of real-time data processing results are produced, including real-time charts, alarm information, and real-time statistics. Compared to Logstash, DataHub delivers better performance and is more suitable for processing complex data.
Data Migration and Real-Time Data Synchronization with the Cloud
Data Transmission Service (DTS) supports data transmission among data sources such as relational databases, NoSQL, and big data (OLAP). It integrates data migration, data subscription, and real-time data synchronization. Compared to third-party data streaming tools, DTS can provide more diversified, high-performance, and highly secure and reliable transmission links. It also offers plenty of convenient functions that greatly facilitate the creation and management of transmission links.
DTS is dedicated to resolving difficulties of long-distance and millisecond-level asynchronous data transmission in public cloud and hybrid cloud scenarios. With the remote multi-active infrastructure that supports the Alibaba Double 11 Shopping Carnival as its underlying data streaming infrastructure, DTS provides real-time data streams for thousands of downstream applications. It has been running online stably for three years.
Implementation at Data Architecture Layers
Implementation of Enterprise Data Models at Data Architecture Layers
When enterprise data models are implemented at the data architecture layers, each layer has its own specifications. After being uploaded to MaxCompute, data is processed and transmitted to the temporary layer; after simple conversion, it is transmitted to the basic data layer and finally aggregated at the application layer.
- The temporary layer follows the same model design as the source system, with unified naming rules. It stores only temporary data and supports extract, transform, and load (ETL) processing of user data.
- The basic data layer uses a 3NF design. It is intended for data integration, long-term historical data storage, detailed data, and general summary data. It can serve ad hoc queries and provides data sources for the application layer.
- The application layer model is designed in denormalized or star/snowflake form for one or several applications. This layer serves report queries and data mining.
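The layer-to-layer flow can be sketched in MaxCompute SQL. The table names (with `tmp_`, `bdl_`, and `adl_` prefixes) and columns below are illustrative, not part of the original system:

```sql
-- Temporary layer: land the uploaded data unchanged.
INSERT OVERWRITE TABLE tmp_store_sales PARTITION (pt='20180101')
SELECT sold_date, store_id, item_id, sales_price
FROM ods_upload_store_sales;

-- Basic data layer (3NF): simple cleansing and type conversion.
INSERT OVERWRITE TABLE bdl_store_sales PARTITION (pt='20180101')
SELECT TO_DATE(sold_date, 'yyyymmdd') AS sold_date,
       store_id,
       item_id,
       CAST(sales_price AS DOUBLE) AS sales_price
FROM tmp_store_sales
WHERE pt = '20180101';

-- Application layer: aggregate for reporting and analysis.
INSERT OVERWRITE TABLE adl_store_sales_daily PARTITION (pt='20180101')
SELECT store_id, SUM(sales_price) AS total_sales
FROM bdl_store_sales
WHERE pt = '20180101'
GROUP BY store_id;
```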
Data sources are generated as follows:
Download the TPC-DS tool from the TPC official website. Use the downloaded tool to generate TPC-DS data files, and then generate HDFS, HBase, OSS, and RDS data sources as follows:
- To generate HDFS and HBase data sources, use Hadoop client commands to load the TPC-DS data files onto HDFS and HBase.
- To generate an OSS data source, use OSS client commands to load the TPC-DS data files onto OSS.
- To generate an RDS data source, load the TPC-DS data files onto RDS through the data integration function provided by DataWorks.
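As a hedged sketch of the generation and loading steps above, the commands might look like the following; paths, bucket names, table layouts, and cluster details are assumptions:

```shell
# Generate 1 GB of TPC-DS data files (pipe-delimited .dat files).
cd tpcds-kit/tools && ./dsdgen -SCALE 1 -DIR /tmp/tpcds

# HDFS: load a data file with the Hadoop client.
hadoop fs -mkdir -p /user/demo/tpcds
hadoop fs -put /tmp/tpcds/store_sales.dat /user/demo/tpcds/

# HBase: load the same delimited file with the ImportTsv utility
# (hypothetical column mapping).
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  '-Dimporttsv.separator=|' \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:sold_date,cf:item,cf:price \
  store_sales /user/demo/tpcds/store_sales.dat

# OSS: copy the files with the ossutil client.
ossutil cp -r /tmp/tpcds oss://demo-bucket/tpcds/
```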
Data Migration to the Cloud
The directory structure and naming rules for data migration tasks must meet the following requirements:
- The root directory must be named in the format 01_Data import.
- In the directory structure, subdirectories must be created for different data sources. Data import tasks of the same data source must be placed in the same subdirectory.
- Subdirectories must be named in the following format: Source name + “To” + Target name.
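Under these rules, a hypothetical directory tree might look like this:

```
01_Data import/
├── FTPToMaxCompute/      # all FTP import tasks
├── RDSToMaxCompute/      # all RDS import tasks
└── OSSToMaxCompute/      # all OSS import tasks
```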
Note that task types and naming rules vary in script mode and wizard mode.
The development of data migration tasks consists of four modules: data source configuration, task development in script mode, task development in wizard mode, and task scheduling attribute configuration. The specific procedures of these four modules are as follows:
- Data source configuration: The configuration of an FTP data source is used as an example. The procedure is as follows: (1) Add a data source. (2) Configure structured storage for FTP. (3) Select and fill in attributes. (4) Test connectivity. (5) Click to complete the configuration. Then, the configured data source can be viewed on the data source page.
- Task development in script mode: The task for migrating data from an FTP data source to MaxCompute is used as an example. The procedure is as follows: (1) Create a task on the data integration page. (2) Select the script mode. (3) Click to enter the configuration page. (4) Generate a configuration file template. (5) Configure FTP Reader. (6) Configure MaxCompute Writer. (7) Click to save the task and name it by following the naming rules.
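As an illustrative fragment of such a script-mode configuration (the host, path, credentials, and column layout are placeholders), the FTP Reader and MaxCompute Writer sections might look like this:

```json
{
  "reader": {
    "name": "ftpreader",
    "parameter": {
      "protocol": "sftp",
      "host": "ftp.example.com",
      "port": 22,
      "username": "demo_user",
      "password": "demo_password",
      "path": ["/upload/store_sales.csv"],
      "fieldDelimiter": ",",
      "column": [
        { "index": 0, "type": "long" },
        { "index": 1, "type": "string" },
        { "index": 2, "type": "double" }
      ]
    }
  },
  "writer": {
    "name": "odpswriter",
    "parameter": {
      "project": "demo_project",
      "table": "tmp_store_sales",
      "partition": "pt='20180101'",
      "column": ["id", "sold_date", "sales_price"]
    }
  }
}
```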
- Task development in wizard mode: The task for migrating data from an RDS data source to MaxCompute is used as an example. The procedure is as follows: (1) Choose Data Integration>Synchronization Task>Wizard mode. (2) Select a data source. (3) Select a table. (4) (Optional) Add data filter conditions. (5) (Optional) Configure the splitting key. (6) Select the target data source. (7) Select the target table. (8) Fill in partition information. (9) Select cleansing rules. (10) Configure field mapping relationships. (11) Set parameters related to channel control. (12) Integrate with the splitting key. (13) Fill in the task name. (14) Select the storage location. (15) Confirm that the task is created.
- Task scheduling attribute configuration: After a DataWorks task is created, relevant attributes can be configured. If users click Submit under Data Integration, they can configure the initial attributes of a newly created task. If users choose Data Development>Scheduling Configuration, they can modify or add task attributes.