How to Synchronize Data from Hive to MaxCompute?

1) Features, Technical Architecture, and Principles of MMA

1.1 MMA Features

MaxCompute Migration Assist (MMA) is a data migration tool for MaxCompute. It covers the migration of batch processing, storage, data integration, and job orchestration and scheduling. MMA also provides a migration evaluation and analysis feature that automatically generates migration evaluation reports. These reports help you identify compatibility issues, such as data type mapping problems and syntax issues, before you synchronize data from Hive to MaxCompute.

1.2 MMA Architecture

The following figure displays the MMA architecture. The left side shows the customer’s Hadoop cluster, and the right side shows Alibaba Cloud big data services, mainly DataWorks and MaxCompute.

1.3 Technical Architecture and Principles of MMA Agent

MMA supports the batch migration of data and workflows through a client and a server. The MMA client installed on your server provides the following features:

  • Generate DDL and user-defined table function (UDTF) statements.
  • Create tables in batches and migrate Hive data in batches.

The client is made up of components that run in sequence:

  • Meta Processor converts Hive metadata in batches into MaxCompute DDL statements, based on the results generated by Meta Carrier. The output includes the table creation statements and the data type conversion statements.
  • The built-in ODPS Console component creates MaxCompute tables in batches by running the MaxCompute DDL statements generated by Meta Processor.
  • Finally, Data Carrier creates Hive SQL jobs in batches. Each Hive SQL job synchronizes multiple tables or partitions concurrently.
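To make the Meta Processor step more concrete, the following Python sketch converts a Hive table definition into a MaxCompute DDL statement and collects warnings for types it cannot map. It is not MMA's implementation. The mapping table, the fallback to STRING, and the function names `map_type` and `build_ddl` are assumptions made for illustration, and only primitive Hive types are handled.

```python
import re

# Illustrative mapping (assumption): MMA's real conversion rules may differ.
# The target types assume that MaxCompute 2.0 data types are enabled.
HIVE_TO_MAXCOMPUTE = {
    "tinyint": "TINYINT", "smallint": "SMALLINT", "int": "INT",
    "bigint": "BIGINT", "float": "FLOAT", "double": "DOUBLE",
    "decimal": "DECIMAL", "string": "STRING", "varchar": "VARCHAR",
    "char": "CHAR", "boolean": "BOOLEAN", "timestamp": "TIMESTAMP",
    "date": "DATE", "binary": "BINARY",
}


def map_type(hive_type):
    """Return (maxcompute_type, warning_or_None) for one Hive column type."""
    cleaned = hive_type.strip()
    match = re.match(r"[a-z_]+", cleaned.lower())
    key = match.group(0) if match else ""
    if key not in HIVE_TO_MAXCOMPUTE:
        # Hypothetical fallback: keep the column but flag it for review.
        return "STRING", f"no mapping for Hive type '{cleaned}', fell back to STRING"
    # Preserve a precision or length suffix such as (10,2).
    suffix = cleaned[len(key):]
    return HIVE_TO_MAXCOMPUTE[key] + suffix, None


def build_ddl(table, columns, partitions=()):
    """Build a CREATE TABLE statement and a list of compatibility warnings."""
    warnings = []

    def render(cols):
        rendered = []
        for name, hive_type in cols:
            mc_type, warning = map_type(hive_type)
            if warning:
                warnings.append(f"{table}.{name}: {warning}")
            rendered.append(f"{name} {mc_type}")
        return ", ".join(rendered)

    ddl = f"CREATE TABLE IF NOT EXISTS {table} ({render(columns)})"
    if partitions:
        ddl += f" PARTITIONED BY ({render(partitions)})"
    return ddl + ";", warnings


if __name__ == "__main__":
    statement, issues = build_ddl(
        "sales",
        [("id", "bigint"), ("amount", "decimal(10,2)"), ("tags", "array<string>")],
        [("dt", "string")],
    )
    print(statement)
    for issue in issues:
        print("WARNING:", issue)
```

Returning the warnings alongside the statement mirrors the idea of pairing each generated DDL statement with an entry in a compatibility report.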

2) Data Migration Demonstration on MMA

2.1 Prepare the Environment

MMA requires JDK V1.6 or later and Python V3 or later, as shown in the following figure. The host that runs MMA submits Hive SQL jobs through the Hive client, so it must be able to access the Hive Server and connect to MaxCompute.

The right side of the figure shows a network issue that a customer encountered when synchronizing data with MMA. In this example, you have an on-premises data center (IDC) and an Elastic Compute Service (ECS) instance in Alibaba Cloud, and the IDC is connected to Alibaba Cloud through a private line. Before installing MMA, you can access MaxCompute directly from the ECS instance, but access from machines in the IDC fails. In this case, add a virtual border router (VBR) to the private line so that the IDC can reach the ECS instance, and MaxCompute as well, over the network.
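The prerequisites above can be checked with a short script before you install anything. The sketch below is one possible pre-flight check, not part of MMA. The function names are hypothetical, and the host and port you pass to `can_connect` depend on your own Hive Server and MaxCompute endpoint.

```python
import re
import shutil
import socket
import sys


def parse_java_version(version_output):
    """Extract (major, minor) from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2) or 0)


def java_is_supported(version_output, minimum=(1, 6)):
    """True if the reported Java version is at least `minimum`."""
    version = parse_java_version(version_output)
    return version is not None and version >= minimum


def can_connect(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_host():
    """Report whether the local tools that MMA relies on are available."""
    return {
        "python3": sys.version_info >= (3,),
        "java_on_path": shutil.which("java") is not None,
        "hive_on_path": shutil.which("hive") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_host().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Running `can_connect` from a machine in the IDC against your MaxCompute endpoint is a quick way to tell whether you are in the private-line situation described above.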

2.2 Download and Compile the Toolkit

Download the compiled toolkit, as shown in the following figure. You can also download the source code from the GitHub address available on the MMA website, and then compile it locally based on your Hive version.

2.3 Perform MMA Agent Operations

  • Use meta-carrier to ingest Hive metadata: The host must already have a Hadoop environment installed, with a local Hive Server. Download the odps-data-carrier.zip package and decompress it locally. After decompression, the following directories are displayed:
  • The bin directory contains the key files of MMA: meta-carrier, meta-processor, odps_ddl_runner (used to create tables in batches), and hive_udtf_sql_runner (used to synchronize data). The libs directory contains the JAR packages and libraries that MMA depends on. The res/console/bin directory contains the ODPSCMD tool and the odps_config.ini configuration file.
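After decompressing the package, you can confirm that the layout matches the description above. The following sketch assumes only the directory and file names listed in this section. The function name `missing_entries` is hypothetical, and files are matched by prefix because the actual file names in your package may carry an extension.

```python
from pathlib import Path

# Entries taken from the directory description in this article.
EXPECTED_FILES = {
    "bin": [
        "meta-carrier",
        "meta-processor",
        "odps_ddl_runner",
        "hive_udtf_sql_runner",
    ],
    "res/console/bin": ["odps_config.ini"],
}


def missing_entries(root):
    """Return the expected entries that are absent under `root`."""
    root = Path(root)
    missing = []
    if not (root / "libs").is_dir():
        missing.append("libs/")
    for directory, names in EXPECTED_FILES.items():
        folder = root / directory
        for name in names:
            # Prefix match so that a suffix such as .py or .sh still counts.
            found = folder.is_dir() and any(folder.glob(name + "*"))
            if not found:
                missing.append(f"{directory}/{name}")
    return missing


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    problems = missing_entries(target)
    print("layout ok" if not problems else "missing: " + ", ".join(problems))
```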

2.4 Use DataWorks to Automatically Migrate Data and Workflows

MMA V1.0 does not provide workflow migration as a service, so you currently need to use an offline tool. Generate the directory according to the template, as shown in the following figure. If you use open-source components for your workflows, store the configuration in the corresponding directory of the template. If you do not use open-source components and rely on, for example, an in-house workflow scheduling and orchestration service, generate the workflow data so that it matches the directory structure of the template. In both cases, compress the directory into a ZIP archive and upload it to DataWorks. The backend automatically parses the archive and loads it into DataWorks workflows. When the upload is complete, DataWorks creates the MaxCompute tables in batches from the MaxCompute DDL statements and then starts a DataX data synchronization job to complete the batch data migration.
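The packaging step can be scripted. The sketch below compresses a prepared workflow directory into a ZIP archive with paths stored relative to the directory root. The function name `package_workflow_dir` is hypothetical, and the sketch assumes nothing about the template contents. Check the template to see whether DataWorks expects a top-level folder inside the archive.

```python
import zipfile
from pathlib import Path


def package_workflow_dir(source_dir, archive_path):
    """Compress every file under source_dir into a ZIP archive."""
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(str(source))
    archive = Path(archive_path)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            # Skip directories and the archive itself if it lives inside source.
            if path.is_file() and path.resolve() != archive.resolve():
                zf.write(path, path.relative_to(source).as_posix())
    return archive


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 3:
        sys.exit("usage: package.py <workflow_dir> <archive.zip>")
    print(package_workflow_dir(sys.argv[1], sys.argv[2]))
```

Storing relative paths keeps the archive layout identical to the directory you generated from the template.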

2.5 Migration Solutions for Other Types of Jobs

  • UDF and MapReduce job migration: Upload the JAR package directly to MaxCompute, enable MaxCompute V2.0 support, and set the Hive compatibility flag to true. You can then migrate the UDFs and MapReduce jobs from Hive to MaxCompute. Note that UDFs and MapReduce jobs on MaxCompute cannot directly access the file system, networks, or external data sources.
  • External table migration: In principle, you can migrate structured data to tables on MaxCompute. If you need to access external files through external tables, we recommend that you first migrate the data from Hadoop Distributed File System (HDFS) to OSS or Tablestore, and then create external tables on MaxCompute to access the files.
  • Spark job migration: MaxCompute is fully compatible with open-source Spark syntax. You only need to download the Spark On MaxCompute client and add the MaxCompute connection parameters when writing Spark SQL statements. Everything else is the same as in open-source Spark SQL.
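For the UDF and MapReduce bullet above, the flags are typically set at the session level in front of the SQL that calls the migrated function. The sketch below prepends them to a statement. It assumes that "V2.0 support" refers to the `odps.sql.type.system.odps2` flag and that the Hive compatibility flag is `odps.sql.hive.compatible`. The helper name `with_session_flags` and the UDF `my_hive_udf` are hypothetical.

```python
# Session flags (assumption: these correspond to the article's
# "V2.0 support" and "Hive compatibility flag").
HIVE_COMPAT_FLAGS = {
    "odps.sql.type.system.odps2": "true",
    "odps.sql.hive.compatible": "true",
}


def with_session_flags(sql, flags=None):
    """Return the SQL statement preceded by one `set key=value;` line per flag."""
    flags = HIVE_COMPAT_FLAGS if flags is None else flags
    header = [f"set {key}={value};" for key, value in flags.items()]
    body = sql.strip().rstrip(";") + ";"
    return "\n".join(header + [body])


if __name__ == "__main__":
    # my_hive_udf is a placeholder for a UDF registered from your uploaded JAR.
    print(with_session_flags("SELECT my_hive_udf(name) FROM users"))
```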

2.6 View Migration Evaluation Reports

After the MaxCompute DDL statements are created, the system generates both the DDL SQL statements and the migration evaluation report, report.html. The migration evaluation report is a compatibility report. It specifies whether the mappings between the data structures of Hive tables and those of MaxCompute tables are risky, and identifies the level of risk. The report also provides details and warning messages, such as incompatible data types or syntax warnings. View this report before migration to evaluate the migration risks.

3) Conclusion

In this article, you have learned how to synchronize Hive data to MaxCompute using MaxCompute Migration Assist (MMA) on Alibaba Cloud. We also explored the features, technical architecture, and implementation principles of MMA, and walked through a data migration demonstration.
