N Methods for Migrating Data to MaxCompute

Bin Fu, a data technical expert from Alibaba Cloud, delivered a live talk entitled “N Methods for Migrating Data to MaxCompute” in the MaxCompute developer exchange group on DingTalk in 2018. In the presentation, he introduced an internal hands-on demo system developed by Alibaba Cloud. The system processes big data, both offline and real-time, automatically and end to end. He also showed how to use the demo system to process big data. This article is based on that talk.

Big Data Center Architecture

Methods for Data Migration to (Synchronization with) the Cloud

There are many methods for migrating data to the cloud. Typical MaxCompute built-in tools for migrating data to the cloud include Tunnel, DataX, and DataWorks:

  1. DataX is an offline data synchronization tool that efficiently synchronizes data among heterogeneous data sources and uploads data to MaxCompute. Supported data sources include MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, OTS, OSS, and MaxCompute. Tunnel is also required: if Tunnel is not used together with DataX, the data may become inconsistent after it is migrated to MaxCompute, which complicates development.
  2. DataWorks migrates data to the cloud through a synchronization task that is defined with its data integration function. The task is configured on the GUI in wizard or script mode and complies with DataX protocols. Data integration in DataWorks involves the following four steps:
     Step 1: Configure the data source (database and table) and the data destination (database and table).
     Step 2: Configure the field mapping and ensure that the source table fields on the left of the GUI correspond one-to-one to the target table fields on the right.
     Step 3: Control which source data is loaded by specifying a filter condition. Enter the condition without the WHERE keyword.
     Step 4: Limit the synchronization rate, set the splitting key to the source table primary key, and specify the maximum number of error records. If the number of error records exceeds this threshold, the data synchronization task is terminated.
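
The four steps above can be sketched as a job description assembled in Python. This is a minimal illustration, not the configuration that DataWorks generates: the layout follows the job format used by the open-source DataX project (`mysqlreader`, `odpswriter`, `splitPk`, `errorLimit`), and the helper `build_sync_job`, the host, the project, and the table and column names are all hypothetical. Credentials and endpoints are omitted.

```python
import json


def build_sync_job(source, target, column_map, where="", split_pk=None,
                   channels=1, max_error_records=0):
    """Assemble a DataX-style job description as a plain dict (illustrative only)."""
    if not column_map:
        raise ValueError("at least one field mapping is required")
    source_cols = [src for src, _ in column_map]
    target_cols = [dst for _, dst in column_map]
    # Step 2: every source field maps to exactly one target field and vice versa.
    if len(set(source_cols)) != len(source_cols) or len(set(target_cols)) != len(target_cols):
        raise ValueError("field mapping must be one-to-one")
    # Step 3: the filter condition is given without the WHERE keyword.
    if where.strip().lower().split()[:1] == ["where"]:
        raise ValueError("enter the filter condition without the WHERE keyword")

    # Step 1: source side (database and table).
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "column": source_cols,
            "where": where,
            "splitPk": split_pk or "",
            "connection": [{"jdbcUrl": [source["jdbc_url"]],
                            "table": [source["table"]]}],
        },
    }
    # Step 1: destination side (MaxCompute project and table).
    writer = {
        "name": "odpswriter",
        "parameter": {
            "project": target["project"],
            "table": target["table"],
            "column": target_cols,
        },
    }
    # Step 4: rate limit (number of channels) and error-record threshold.
    setting = {"speed": {"channel": channels},
               "errorLimit": {"record": max_error_records}}
    return {"job": {"setting": setting,
                    "content": [{"reader": reader, "writer": writer}]}}


# Hypothetical example values.
job = build_sync_job(
    source={"jdbc_url": "jdbc:mysql://example-host:3306/tpcds", "table": "customer"},
    target={"project": "demo_project", "table": "ods_customer"},
    column_map=[("c_customer_sk", "customer_sk"), ("c_first_name", "first_name")],
    where="c_birth_year >= 1980",
    split_pk="c_customer_sk",
    channels=2,
    max_error_records=10,
)
print(json.dumps(job, indent=2))
```

Validating the mapping and the filter condition before the job is submitted surfaces configuration mistakes early, instead of as error records during synchronization.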

Methods for Real-Time Data Migration to the Cloud

Logstash

Logstash is a simple and powerful distributed log collection framework. It is often combined with Elasticsearch and Kibana to form the well-known ELK stack, which is well suited to analyzing log data. Alibaba Cloud StreamCompute provides DataHub output and input plug-ins for Logstash to help users collect more data into DataHub. With Logstash, users can easily access more than 30 data sources supported by the Logstash open source community. Logstash also supports filters, which customize the processing of transmitted fields, among other functions.

DataHub API

Alibaba Cloud DataHub is a streaming data processing platform. It allows users to publish, subscribe to, and distribute streaming data, and it helps users easily create analysis tasks and applications based on streaming data. DataHub continuously collects, stores, and processes large amounts of streaming data generated by mobile devices, application software, web services, sensors, and other sources. Users can write application programs or use stream computing engines to process streaming data that has been written to DataHub, such as real-time web access logs, application logs, and various events. The processing produces a variety of real-time results, including real-time charts, alarm information, and real-time statistics. Compared to Logstash, DataHub delivers better performance and is more suitable for processing complex data.
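
The publish and subscribe model described above can be illustrated with a toy in-memory topic. This is only a conceptual sketch and does not use the DataHub SDK or its API: the class `ToyTopic`, its methods, and the subscriber names are hypothetical. It shows one property of the model, namely that each subscriber keeps its own read position and therefore sees the full stream independently.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a streaming topic (not the DataHub SDK)."""

    def __init__(self):
        self._records = []                # append-only log of published records
        self._cursors = defaultdict(int)  # next unread position per subscriber

    def publish(self, record):
        """Append one record to the end of the stream."""
        self._records.append(record)

    def poll(self, subscriber, limit=10):
        """Return up to `limit` unread records and advance this subscriber's cursor."""
        start = self._cursors[subscriber]
        batch = self._records[start:start + limit]
        self._cursors[subscriber] = start + len(batch)
        return batch


topic = ToyTopic()
for path in ["/home", "/cart", "/home"]:
    topic.publish({"event": "page_view", "path": path})

# Two independent subscribers consume the same stream at their own pace.
stats_batch = topic.poll("realtime_stats")
alarm_batch = topic.poll("alarms", limit=2)

# A real-time statistic computed from the consumed records.
home_views = sum(1 for record in stats_batch if record["path"] == "/home")
print(home_views, len(alarm_batch))
```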

Data Migration and Real-Time Data Synchronization with the Cloud

Data Transmission Service (DTS) supports data transmission among data sources such as relational databases, NoSQL, and big data (OLAP). It integrates data migration, data subscription, and real-time data synchronization. Compared to third-party data streaming tools, DTS can provide more diversified, high-performance, and highly secure and reliable transmission links. It also offers plenty of convenient functions that greatly facilitate the creation and management of transmission links.

Implementation at Data Architecture Layers

Implementation of Enterprise Data Models at Data Architecture Layers

Data Generation

Data sources are generated as follows:

  1. To generate an OSS data source, use OSS client commands to load the TPC-DS data files onto OSS.
  2. To generate an RDS data source, load the TPC-DS data files onto RDS through the data integration function provided by DataWorks.
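
The OSS step can also be scripted instead of run through client commands. The following is a minimal sketch under stated assumptions: the helpers `plan_uploads` and `upload_all` are hypothetical, the `.dat` suffix assumes the data files come from the TPC-DS data generator, and `bucket` is any object that offers `put_object_from_file(key, filename)`, which is the method that the `oss2` Python SDK's `Bucket` class provides.

```python
import os


def plan_uploads(local_dir, prefix, suffix=".dat"):
    """Return sorted (local_path, object_key) pairs for the data files in local_dir."""
    pairs = []
    for name in sorted(os.listdir(local_dir)):
        path = os.path.join(local_dir, name)
        if os.path.isfile(path) and name.endswith(suffix):
            pairs.append((path, f"{prefix.rstrip('/')}/{name}"))
    return pairs


def upload_all(bucket, local_dir, prefix, suffix=".dat"):
    """Upload every planned file and return the object keys that were written."""
    uploaded = []
    for path, key in plan_uploads(local_dir, prefix, suffix):
        bucket.put_object_from_file(key, path)
        uploaded.append(key)
    return uploaded


# With the oss2 package installed, a bucket could be created roughly like this
# (the credentials, endpoint, and bucket name are placeholders):
#   import oss2
#   bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)
#   upload_all(bucket, "/data/tpcds", "tpcds/raw")
```

Separating the planning step from the upload makes it possible to review the object keys before any data is transferred.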

Data Migration to the Cloud

The directory structure and naming rules for cloud migration tasks must meet the following requirements:

  1. In the directory structure, subdirectories must be created for different data sources. Data import tasks of the same data source must be placed in the same subdirectory.
  2. Subdirectories must be named in the following format: Source name + “To” + Target name.

Tasks are developed and configured as follows:

  1. Task development in script mode: The task for migrating data from an FTP data source to MaxCompute is used as an example. The procedure is as follows: (1) Create a task on the data integration page. (2) Select the script mode. (3) Click to enter the configuration page. (4) Generate a configuration file template. (5) Configure FTP Reader. (6) Configure MaxCompute Writer. (7) Click to save the task and name it by following the naming rules.
  2. Task development in wizard mode: The task for migrating data from an RDS data source to MaxCompute is used as an example. The procedure is as follows: (1) Choose Data Integration > Synchronization Task > Wizard mode. (2) Select a data source. (3) Select a table. (4) (Optional) Add data filter conditions. (5) (Optional) Configure the splitting key. (6) Select the target data source. (7) Select the target table. (8) Fill in partition information. (9) Select cleansing rules. (10) Configure field mapping relationships. (11) Set parameters related to channel control. (12) Integrate with the splitting key. (13) Fill in the task name. (14) Select the storage location. (15) Confirm that the task is created.
  3. Task scheduling attribute configuration: After a DataWorks task is created, its attributes can be configured. Clicking Submit under Data Integration lets users configure the initial attributes of a newly created task. Choosing Data Development > Scheduling Configuration lets users modify or add task attributes.
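
The directory and naming requirements above can be expressed as a small helper. This is an illustrative sketch only: the function names and the task names are hypothetical, and the helper does not interact with DataWorks. It applies the rule "source name + To + target name" and places tasks that share a data source in the same subdirectory.

```python
from collections import defaultdict


def subdirectory_name(source, target):
    """Apply the naming rule: source name + "To" + target name."""
    if not source or not target:
        raise ValueError("source and target names must be non-empty")
    return f"{source}To{target}"


def group_tasks(tasks):
    """Group (task_name, source, target) tuples by their subdirectory."""
    layout = defaultdict(list)
    for task_name, source, target in tasks:
        layout[subdirectory_name(source, target)].append(task_name)
    return dict(layout)


# Hypothetical task names for the FTP and RDS examples.
layout = group_tasks([
    ("ftp_customer_import", "FTP", "MaxCompute"),
    ("rds_orders_import", "RDS", "MaxCompute"),
    ("rds_items_import", "RDS", "MaxCompute"),
])
for directory, names in layout.items():
    print(directory, names)
```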
