N Methods for Migrating Data to MaxCompute

Big Data Center Architecture

Methods for Data Migration to (Synchronization with) the Cloud

  1. Tunnel: uses Tunnel commands to upload and download data, process data files, and perform other operations (a minimal upload sketch follows this list).
  2. DataX: serves as an offline data synchronization tool that efficiently synchronizes data among various heterogeneous data sources and uploads data to MaxCompute. Supported heterogeneous data sources include MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, OTS, OSS, and MaxCompute. Tunnel is also required: if DataX is not used together with Tunnel, the synchronized data may become inconsistent after it is migrated to MaxCompute, which causes development difficulties.
  3. DataWorks: uses the Data Integration function of DataWorks to define a data synchronization task and migrate data to the cloud through that task. Data Integration runs in wizard or script mode on the GUI and complies with the DataX protocol. To use DataWorks for data integration, perform the following four steps (see the job configuration sketch after this list):
     Step 1: Configure the data source (source database and table) and the data flow direction (target database and table).
     Step 2: Configure field mapping so that the source table fields on the left of the GUI correspond one-to-one to the target table fields on the right.
     Step 3: Filter the source table data based on a filter condition (entered without the WHERE keyword) to control which data is loaded from the source table.
     Step 4: Limit the synchronization rate, set the splitting key (typically the primary key of the source table), and specify the maximum number of error records. If the number of error records exceeds the specified threshold, the data synchronization task is terminated.
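
For the Tunnel method, the Tunnel commands mentioned above are run from the MaxCompute client (roughly in the form tunnel upload <local_file> <project.table>;). As a minimal sketch, the same upload path can also be driven from the PyODPS SDK, which writes through the Tunnel service; the project name, table schema, endpoint, credentials, and records below are placeholder assumptions rather than values from the article.

    # Hedged sketch: upload records into a MaxCompute table via the Tunnel
    # service using PyODPS. "my_project", "ods_user" (id BIGINT, name STRING),
    # the endpoint, and the credentials are placeholder assumptions.
    from odps import ODPS

    o = ODPS(
        access_id='<AccessKey ID>',
        secret_access_key='<AccessKey Secret>',
        project='my_project',
        endpoint='https://service.cn-hangzhou.maxcompute.aliyun.com/api',
    )

    table = o.get_table('ods_user')

    # open_writer() uses MaxCompute Tunnel under the hood to upload the rows.
    with table.open_writer() as writer:
        writer.write([[1, 'alice'], [2, 'bob']])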
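
Because DataWorks script mode complies with the DataX protocol, the four steps above map loosely onto a job description of the following shape. This is a hedged sketch for an RDS for MySQL source and a MaxCompute target, not a configuration taken from the article: the connection details, columns, partition, rate, and error threshold are all placeholder assumptions.

    # Hedged sketch of a DataX-style synchronization job (the format that
    # DataWorks script mode is based on). All concrete values are placeholders.
    import json

    job = {
        "job": {
            "setting": {
                "speed": {"channel": 2},       # Step 4: limit the synchronization rate
                "errorLimit": {"record": 10},  # Step 4: maximum number of error records
            },
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "<user>",
                        "password": "<password>",
                        "column": ["id", "name", "gmt_create"],  # Step 2: source fields
                        "where": "gmt_create >= '2023-01-01'",   # Step 3: filter without the WHERE keyword
                        "splitPk": "id",                         # Step 4: splitting key (primary key)
                        "connection": [{
                            "table": ["user"],
                            "jdbcUrl": ["jdbc:mysql://rds-host:3306/source_db"],
                        }],
                    },
                },
                "writer": {
                    "name": "odpswriter",
                    "parameter": {
                        "project": "my_project",                 # Step 1: data flow direction
                        "table": "ods_user",
                        "partition": "pt=20230101",
                        "column": ["id", "name", "gmt_create"],  # Step 2: one-to-one field mapping
                        "accessId": "<AccessKey ID>",
                        "accessKey": "<AccessKey Secret>",
                    },
                },
            }],
        }
    }

    # Write the job description out so it can be pasted into a script-mode task.
    with open("mysql_to_maxcompute.json", "w") as f:
        json.dump(job, f, indent=2)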

Methods for Real-Time Data Migration to the Cloud

Logstash

DataHub API
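
For the DataHub API path, a minimal sketch using the pydatahub SDK is shown below, assuming an existing DataHub project named rt_project and a tuple topic user_events whose schema is (user_id BIGINT, event STRING); the names, endpoint, and credentials are placeholder assumptions.

    # Hedged sketch: write one record to a DataHub topic with the pydatahub SDK.
    # Project, topic, schema, endpoint, and credentials are placeholder assumptions.
    from datahub import DataHub
    from datahub.models import TupleRecord

    dh = DataHub(
        '<AccessKey ID>',
        '<AccessKey Secret>',
        endpoint='https://dh-cn-hangzhou.aliyuncs.com',
    )

    # Fetch the topic so records can be built against its tuple schema.
    topic = dh.get_topic('rt_project', 'user_events')

    record = TupleRecord(schema=topic.record_schema, values=[1001, 'login'])
    dh.put_records('rt_project', 'user_events', [record])

On the DataHub side, a data connector can then archive the topic into a MaxCompute table, which is what turns this into a real-time migration path to the cloud.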

Data Migration and Real-Time Data Synchronization with the Cloud

Implementation at Data Architecture Layers

Implementation of Enterprise Data Models at Data Architecture Layers

Data Generation

  1. To generate HDFS and HBase data sources, use Hadoop client commands to load the TPC-DS data files onto HDFS and HBase.
  2. To generate an OSS data source, use OSS client commands to load the TPC-DS data files onto OSS (see the sketch after this list).
  3. To generate an RDS data source, load the TPC-DS data files onto RDS through the data integration function provided by DataWorks.
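
As a sketch of item 2, the following uses the oss2 Python SDK to upload one generated TPC-DS file into an OSS bucket; the bucket name, endpoint, and file paths are placeholder assumptions, and the OSS command-line client can achieve the same result.

    # Hedged sketch: load a local TPC-DS data file into an OSS bucket using the
    # oss2 SDK. Bucket name, endpoint, and paths are placeholder assumptions.
    import oss2

    auth = oss2.Auth('<AccessKey ID>', '<AccessKey Secret>')
    bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'tpcds-source')

    # Upload one generated table file; repeat for the remaining .dat files.
    bucket.put_object_from_file('tpcds/store_sales.dat', '/data/tpcds/store_sales.dat')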

Data Migration to the Cloud

  1. The root directory must be named in the 01_Data Import format.
  2. In the directory structure, subdirectories must be created for different data sources. Data import tasks of the same data source must be placed in the same subdirectory.
  3. Subdirectories must be named in the following format: Source name + “To” + Target name.
  1. Data source configuration: The configuration of an FTP data source is used as an example. The procedure is as follows: (1) Add a data source. (2) Configure structured storage for FTP. (3) Select and fill in attributes. (4) Test connectivity. (5) Click to complete the configuration. Then, the configured data source can be viewed on the data source page.
  2. Task development in script mode: The task for migrating data from an FTP data source to MaxCompute is used as an example. The procedure is as follows: (1) Create a task on the data integration page. (2) Select the script mode. (3) Click to enter the configuration page. (4) Generate a configuration file template. (5) Configure FTP Reader (see the reader sketch after this list). (6) Configure MaxCompute Writer. (7) Save the task and name it by following the naming rules.
  3. Task development in wizard mode: The task for migrating data from an RDS data source to MaxCompute is used as an example. The procedure is as follows: (1) Choose Data Integration>Synchronization Task>Wizard mode. (2) Select a data source. (3) Select a table. (4) (Optional) Add data filter conditions. (5) (Optional) Configure the splitting key. (6) Select the target data source. (7) Select the target table. (8) Fill in partition information. (9) Select cleansing rules. (10) Configure field mapping relationships. (11) Set parameters related to channel control. (12) Integrate with the splitting key. (13) Fill in the task name. (14) Select the storage location. (15) Confirm that the task is created.
  4. Task scheduling attribute configuration: After a DataWorks task is created, relevant attributes can be configured. If users click Submit under Data Integration, they can configure the initial attributes of a newly created task. If users choose Data Development>Scheduling Configuration, they can modify or add task attributes.
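
For step (5) of the script-mode task above, the generated template's FTP Reader section takes roughly the following shape, shown here as a Python dict for readability. The host, path, delimiter, and column layout are placeholder assumptions; the MaxCompute Writer side mirrors the odpswriter block in the earlier MySQL-to-MaxCompute sketch.

    # Hedged sketch of a DataX-style FTP Reader block for a script-mode task.
    # All connection details and the column layout are placeholder assumptions.
    ftp_reader = {
        "name": "ftpreader",
        "parameter": {
            "protocol": "ftp",
            "host": "ftp.example.com",
            "port": 21,
            "username": "<user>",
            "password": "<password>",
            "path": ["/upload/ods_user.csv"],
            "fieldDelimiter": ",",
            # Fields are addressed by position in the delimited file and typed
            # so they can be mapped onto the MaxCompute table columns.
            "column": [
                {"index": 0, "type": "long"},
                {"index": 1, "type": "string"},
            ],
        },
    }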
