How to Synchronize Data from Hive to MaxCompute?

1) Features, Technical Architecture, and Principles of MMA

1.1 MMA Features

1.2 MMA Architecture

1.3 Technical Architecture and Principles of MMA Agent

  • Automatically obtain the Hive metadata
  • Generate DDL and user-defined table function (UDTF) statements
  • Create tables in batches and migrate Hive data in batches
  • Meta Carrier automatically extracts the Hive metadata and generates a Hive Metastore structure locally.
  • Meta Processor converts the Hive metadata into MaxCompute DDL statements in batches based on the results generated by Meta Carrier, including table creation statements and data type conversions.
  • The built-in ODPS Console component allows you to create MaxCompute tables in batches by using the MaxCompute DDL statements generated by Meta Processor.
  • Finally, Data Carrier generates Hive SQL jobs in batches; each job synchronizes the data of multiple tables or partitions concurrently. A command-level sketch of this end-to-end flow follows this list.
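
The four components above correspond to a four-step command flow. The following sketch is illustrative only: the meta-carrier flags shown (such as -u for the Hive Metastore Thrift URI) and the exact spelling of the Python tools' options are assumptions, so run each tool with -h to confirm them before use.

    # 1. Pull the Hive metadata into a local directory named meta
    #    (the -u Thrift URI flag and its value are assumed placeholders)
    bin/meta-carrier -u thrift://<hive-metastore-host>:9083 -o meta

    # 2. Convert the metadata into MaxCompute DDL and Hive UDTF SQL statements
    bin/meta-processor -i meta -o output

    # 3. Create the MaxCompute tables in batches through the bundled ODPSCMD tool
    python36 bin/odps_ddl_runner.py --input output

    # 4. Run the generated Hive SQL jobs to migrate the data concurrently
    python36 bin/hive_udtf_sql_runner.py --input_all output --parallelism 5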

2) Data Migration Demonstration on MMA

2.1 Prepare the Environment

2.2 Download and Compile the Toolkit

2.3 Perform MMA Agent Operations

  • Use meta-carrier to ingest Hive metadata: Install the Hadoop environment and a local HiveServer on the host in advance. Download the odps-data-carrier.zip package and decompress it locally. After decompression, the following directories are available:
  • The bin directory contains the key files of MMA: meta-carrier, meta-processor, odps_ddl_runner (used to create tables in batches), and hive_udtf_sql_runner (used to synchronize data). The libs directory contains the JAR packages and libraries that MMA depends on. The res/console/bin directory contains the ODPSCMD tool and the odps_config.ini configuration file.
  • Use network-measurement-tool: The network-measurement-tool tests the network from the Hive cluster to each MaxCompute region. Based on the measured transmission speed and the approximate data volume, you can estimate the data transmission time. The tool connects to the MaxCompute endpoints in all regions and sorts them in descending order of connection speed. In the sample run, HANGZHOU was the fastest with a connection time of 51 ms, while KUALA_LUMPUR was the slowest at 3,393 ms.
  • Use sql-checker to check whether Hive SQL can be executed on MaxCompute: The sql-checker checks Hive SQL syntax compatibility and determines whether a Hive SQL statement can run on MaxCompute. Specify the metadata directory as the input parameter, along with the default project and the SQL statement to check. If the execution result reports good compatibility, the statement can run on MaxCompute.
  • Use meta-processor to generate MaxCompute DDL statements and Hive UDTF SQL statements: The Hive Metastore metadata was pulled in the first step; meta-processor now converts it into MaxCompute DDL statements. Run bin/meta-processor -h to view the parameters: -i specifies the input and -o specifies the output directory. The -i value is the output of the first command, that is, the directory that stores the Hive metadata pulled by meta-carrier, for example bin/meta-processor -i meta -o output. The execution result of bin/meta-processor is saved in the output directory. View the tree structure of the output directory: a dma_demo directory, named after the MaxCompute project, is generated in it. The .sql files in the hive_udtf_sql directory under dma_demo are used for batch data migration. The dma_demo directory also contains the odps_ddl directory, whose .sql files hold the table creation statements used to create tables in batches in the next step.
  • Use odps_ddl_runner.py to create tables and partitions in batches: After the DDL statements are generated, create the tables in batches with the ODPSCMD client tool. The odps_config.ini file that ships with the toolkit contains the basic parameters, among which project_name, access_id, access_key, and end_point are required (see the configuration sketch after this list). After setting the parameters, run python36 bin/odps_ddl_runner.py -h to view the parameters of the batch table creation tool: the input parameter is the output directory generated by meta-processor, and the odpscmd parameter is optional because the tool locates odpscmd in its default directory. During table creation, the tool starts ODPSCMD and submits the table creation statements to MaxCompute through the client. Run show tables to check whether the five tables were created, and then check whether the partitions were created. The partitions created on MaxCompute must match those on Hive, because the table structures on the two sides must be identical.
  • Use hive_udtf_sql_runner.py to migrate data: Run python36 bin/hive_udtf_sql_runner.py to execute the SQL statements in the .sql files in the output directory (see the migration sketch after this list). Among its parameters, input_all migrates all data in the output directory in batches; to migrate only a single table or a single partition, use the input_single_file parameter instead; parallelism sets the degree of parallelism (DOP). After the data is migrated, check whether the tables contain data in MaxCompute and compare it with the corresponding tables on Hive. If the results match, the data was migrated completely.
  • Advanced feature 1: metadata generation for a specified database or table: You can restrict metadata generation to the tables you care about; the meta-carrier tool can pull metadata for a specified database or for individual tables within it.
  • Advanced feature 2: flexible mapping from Hive to MaxCompute: If you need to customize a table on MaxCompute, you can change the table name by adding a prefix or suffix and customize the field names. For example, modify the JSON file generated with the MaxCompute DDL statements to customize the table name or field names.
  • Advanced feature 3: data migration from a single table or partition: In the preceding example, data was migrated from five tables in batches. To demonstrate migration from a single partition, first run drop table inventory, then recreate the table: run python36 bin/odps_ddl_runner.py with the output directory specified to create tables in batches again, which recreates the inventory table and its five partitions. Because the newly created partitions contain no data, use the input_single_file parameter to point to the SQL file of the desired partition (for example, the second partition): output/dma_demo/hive_udtf_sql/single_partition/inventory_1.sql (see the single-partition sketch after this list). Check the execution result and compare the data in the second partition on Hive with the single partition migrated to MaxCompute; if the data is consistent, the migration is complete.
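
The required odps_config.ini entries named in the batch table creation step above can be filled in as in the following sketch; all values are placeholders for your own project, credentials, and region.

    # odps_config.ini -- required entries (all values are placeholders)
    project_name=dma_demo
    access_id=<your-accesskey-id>
    access_key=<your-accesskey-secret>
    end_point=<your-maxcompute-endpoint>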
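
The batch data migration and verification steps above can be sketched as follows. The --input_all and --parallelism spellings follow the parameter names described in the article but should be confirmed against the tool's -h output, and odpscmd -e is used here only as one convenient way to run an ad hoc query.

    # Migrate all tables under the output directory in batches (the parallelism value is arbitrary)
    python36 bin/hive_udtf_sql_runner.py --input_all output --parallelism 5

    # Spot-check one table on the MaxCompute side and compare with the same query on Hive
    res/console/bin/odpscmd -e "select count(*) from inventory;"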
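
For the single-partition scenario in advanced feature 3, a minimal sketch under the same assumptions about flag spellings looks like this:

    # Recreate the inventory table and its five partitions after dropping the table
    python36 bin/odps_ddl_runner.py --input output

    # Migrate only the second partition by pointing at its generated SQL file
    python36 bin/hive_udtf_sql_runner.py \
        --input_single_file output/dma_demo/hive_udtf_sql/single_partition/inventory_1.sql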

2.4 Use DataWorks to Automatically Migrate Data and Workflows

2.5 Migration Solutions for Other Types of Jobs

  • UDF and MapReduce job migration: You can upload the JAR packages directly to MaxCompute, which provides MaxCompute V2.0 support, and enable the Hive compatibility flag. Set the Hive compatibility flag to true and then migrate the UDFs and MapReduce jobs from Hive to MaxCompute (see the compatibility flag sketch after this list). Note that UDFs and MapReduce jobs cannot directly access the file system, networks, or external data sources.
  • External table migration: In principle, you should migrate structured data into MaxCompute tables. If you need to access external files through external tables, we recommend migrating the data from Hadoop Distributed File System (HDFS) to OSS or Tablestore first and then creating external tables on MaxCompute to access those files (see the external table sketch after this list).
  • Spark job migration: Spark on MaxCompute is fully compatible with open-source Spark syntax. You only need to download the Spark on MaxCompute client and add the MaxCompute connection parameters when writing Spark SQL statements (see the Spark client configuration sketch after this list); everything else follows standard Spark SQL syntax.
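
For the UDF and MapReduce migration above, Hive compatibility is switched on per session before the migrated job or query runs. The flag name odps.sql.hive.compatible reflects my understanding of MaxCompute's session settings, and the UDF and column names are illustrative.

    # Enable Hive compatibility for the session, then call the migrated UDF in that same session
    # (my_hive_udf and col1 are illustrative names)
    res/console/bin/odpscmd -e "set odps.sql.hive.compatible=true; select my_hive_udf(col1) from inventory;"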
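
For the external table route above, a minimal sketch of a MaxCompute external table over CSV files that were already moved from HDFS to OSS might look like the following; the storage handler class, column list, and OSS path are assumptions to adapt to your own data.

    # Create a MaxCompute external table over CSV files previously copied from HDFS to OSS
    # (columns, bucket, and path are illustrative)
    res/console/bin/odpscmd -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS ext_inventory (
        item_id BIGINT,
        item_name STRING,
        quantity BIGINT
      )
      STORED BY 'com.aliyun.odps.CsvStorageHandler'
      LOCATION 'oss://<your-bucket>/inventory/';
    "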
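
For Spark job migration above, the MaxCompute connection parameters are usually supplied through the Spark on MaxCompute client's configuration. The spark.hadoop.odps.* key names below reflect my understanding of that client's settings, and all values are placeholders.

    # spark-defaults.conf entries for the Spark on MaxCompute client (all values are placeholders)
    spark.hadoop.odps.project.name = <your-project>
    spark.hadoop.odps.access.id = <your-accesskey-id>
    spark.hadoop.odps.access.key = <your-accesskey-secret>
    spark.hadoop.odps.end.point = <your-maxcompute-endpoint>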

2.6 View Migration Evaluation Reports

3) Conclusion
