Migrate Data from Hadoop Hive to MaxCompute

By Si Xiang

This document describes how to migrate data from Hadoop Hive to Alibaba Cloud MaxCompute.

1. Prepare the Environment

1.1 Hadoop Cluster Environment

Before migrating data from Hadoop Hive, ensure that your Hadoop cluster works properly. The following Hadoop environments are supported:

  • HDFS 2.8.5
  • YARN 2.8.5
  • Hive 3.1.1
  • Ganglia 3.7.2
  • Spark 2.3.2
  • HUE 4.1.0
  • Zeppelin 0.8.0
  • Tez 0.9.1
  • Sqoop 1.4.7
  • Pig 0.14.0
  • Knox 1.1.0
  • ApacheDS 2.0.0

1.2 Prepare Hadoop Hive Data

Hive script:

insert into hive_sale values
('2019-04-14','外套','品牌A','lilei',3,500.6,7),
('2019-04-15','生鲜','品牌B','lilei',1,303,8),
('2019-04-16','外套','品牌C','hanmeimei',2,510,2),
('2019-04-17','卫浴','品牌A','hanmeimei',1,442.5,1),
('2019-04-18','生鲜','品牌D','hanmeimei',2,234,3),
('2019-04-19','外套','品牌B','jimmy',9,2000,7),
('2019-04-20','生鲜','品牌A','jimmy',5,45.1,5),
('2019-04-21','外套','品牌E','jimmy',5,100.2,4),
('2019-04-22','生鲜','品牌G','peiqi',10,5560,7),
('2019-04-23','卫浴','品牌F','peiqi',1,445.6,2),
('2019-04-24','外套','品牌A','ray',3,777,3),
('2019-04-25','卫浴','品牌G','ray',3,122,3),
('2019-04-26','外套','品牌C','ray',1,62,7);

Log on to the Hadoop cluster, save the preceding statements as a Hive SQL script, and run it with the hive command to initialize the sample data.
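The INSERT statements above require that the hive_sale table already exists. The original DDL is not shown; the following is only a sketch, with column names and types inferred from the sample rows:

```sql
-- Hypothetical schema for hive_sale, inferred from the sample rows;
-- adjust column names and types to match your actual data.
CREATE TABLE IF NOT EXISTS hive_sale (
  sale_date   STRING,
  category    STRING,
  brand       STRING,
  buyer_id    STRING,
  cnt         INT,
  price       DOUBLE,
  star_level  INT
);
```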

[Screenshot: running the Hive SQL script]

Query tables.

[Screenshot: Hive table query results]
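The query shown in the screenshot can be reproduced with a simple statement, for example:

```sql
-- Verify that the sample rows were inserted.
SELECT * FROM hive_sale;
```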

1.3 MaxCompute Environment

Activate MaxCompute. For more information, see: https://www.alibabacloud.com/help/doc-detail/58226.htm

Install and configure a MaxCompute client. For more information, see: https://www.alibabacloud.com/help/doc-detail/27804.htm

1.4 Create a Table in MaxCompute

[Screenshot: creating the table in MaxCompute]

You need to consider data type mapping between Hive and MaxCompute when creating a table. For more information, see: https://www.alibabacloud.com/help/doc-detail/54081.htm

Create a table by using the odpscmd CLI tool. For how to install and configure odpscmd, see: https://www.alibabacloud.com/help/doc-detail/27804.htm

Note: MaxCompute 2.0 supports basic and complex data types. For more information, see: https://www.alibabacloud.com/help/doc-detail/27821.htm
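As a sketch only (the column names are assumptions inferred from the sample Hive data, not taken from the original article), the MaxCompute DDL run in odpscmd might look like this:

```sql
-- Hypothetical MaxCompute table mirroring the assumed Hive schema.
-- Hive INT/DOUBLE map to MaxCompute BIGINT/DOUBLE in the basic type set.
CREATE TABLE IF NOT EXISTS hive_sale (
  sale_date   STRING,
  category    STRING,
  brand       STRING,
  buyer_id    STRING,
  cnt         BIGINT,
  price       DOUBLE,
  star_level  BIGINT
);
```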

2. Migrate Data from Hadoop Hive to Alibaba Cloud MaxCompute

2.1 Upload Files through Tunnel

2.1.1 Create Hive Data Files

Log on to Hive and run the relevant SQL statements. The following example exports the data to the local file system, with fields separated by commas (,) and one record per line.
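For example, a Hive statement of the following form produces such files (the output directory is illustrative):

```sql
-- Export hive_sale to a local directory as comma-delimited text,
-- one record per line. The directory path is an assumption.
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/hive_sale_data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM hive_sale;
```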

[Screenshot: exporting Hive data to a local file]

List data files.

[Screenshot: listing the exported data files]

2.1.2 Upload Files by Running the Tunnel Command

Go to the MaxCompute console and run the Tunnel upload command to upload data.
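The command takes roughly the following form (the local file path is illustrative; -fd sets the field delimiter):

```
-- Upload the exported file into the MaxCompute table hive_sale.
tunnel upload /home/hadoop/hive_sale_data/000000_0 hive_sale -fd ",";
```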

[Screenshot: Tunnel upload in the MaxCompute console]

2.2 Upload Files through Data Integration in DataWorks

2.2.1 Create a Custom Resource Group

In most cases, the data nodes of the Hadoop cluster cannot be reached from the network where the default DataWorks resource group resides. You can create a custom resource group so that the DataWorks synchronization task runs on the master node of the Hadoop cluster, because the network between the master node and the data nodes is generally reachable.

Run the hadoop dfsadmin -report command to view the data nodes of the Hadoop cluster.

[Screenshot: output of hadoop dfsadmin -report]

As shown in the preceding figure, the data node has only an internal network address and cannot communicate with the default resource group of DataWorks. Therefore, you need to customize a resource group and configure the master node to run the DataWorks synchronization task.

Go to the Data Integration page of DataWorks, choose Resource Group from the left-side navigation pane, and click Add Resource Group, as shown in the following figure.

[Screenshot: Add Resource Group page in DataWorks]

When adding a server, enter the UUID, IP address, and other information about the server. Use the Internet IP address of the master node, because its internal network IP address may be unreachable from DataWorks.

To obtain the UUID, run the dmidecode | grep UUID command on the master node CLI, as shown in the following figure.

[Screenshot: obtaining the UUID with dmidecode]

After the server is added, ensure that the network between the master node and DataWorks is reachable. Install the agent of the custom resource group as instructed, and then check whether the agent status is Available. If it is, the custom resource group has been added successfully.

[Screenshot: agent status shown as Available]

2.2.2 Create a Data Source

After you create a project in DataWorks, a default data source named odps_first is created automatically, so you only need to add a data source for the Hadoop cluster. Go to the Data Integration page of DataWorks, choose Data Source from the left-side navigation pane, and click Add Data Source. In the Add Data Source dialog box that appears, select HDFS.

[Screenshot: Add Data Source dialog box]

In the dialog box that appears, set Data Source Name and DefaultFS.
If the Hadoop cluster is an HA cluster, the address is hdfs://IP:8020. If the Hadoop cluster is a non-HA cluster, the address is hdfs://IP:9000. In this example, the Hadoop server is connected to DataWorks through the Internet. Therefore, enter the Internet IP address.

[Screenshot: HDFS data source configuration]

After the configuration is complete, click Test Connectivity. If a message indicates that the connectivity test is successful, the data source has been added.

2.2.3 Configure a Data Synchronization Task

On the Data Analytics page of DataWorks, right-click the new Data Integration node and choose Create Data Integration Node > Data Sync from the context menu. In the Apply Template dialog box that appears, select a data source type, as shown in the following figure.

[Screenshot: Apply Template dialog box]

The synchronization task can be switched to script mode and configured with a script.
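As an illustrative sketch only (not the original script), a script-mode configuration with an HDFS reader and a MaxCompute (ODPS) writer typically takes the following shape; the address, column names, and all other values here are assumptions:

```json
{
  "type": "job",
  "steps": [
    {
      "stepType": "hdfs",
      "name": "Reader",
      "category": "reader",
      "parameter": {
        "path": "/user/hive/warehouse/hive_sale",
        "defaultFS": "hdfs://<master-node-internet-ip>:9000",
        "fileType": "text",
        "fieldDelimiter": ",",
        "column": [
          {"index": 0, "type": "string"},
          {"index": 1, "type": "string"},
          {"index": 2, "type": "string"},
          {"index": 3, "type": "string"},
          {"index": 4, "type": "long"},
          {"index": 5, "type": "double"},
          {"index": 6, "type": "long"}
        ]
      }
    },
    {
      "stepType": "odps",
      "name": "Writer",
      "category": "writer",
      "parameter": {
        "datasource": "odps_first",
        "table": "hive_sale",
        "truncate": true,
        "column": ["sale_date", "category", "brand", "buyer_id", "cnt", "price", "star_level"]
      }
    }
  ]
}
```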

The path parameter indicates where the data is stored in the Hadoop cluster. To confirm the storage location, log on to the master node and run the hdfs dfs -ls /user/hive/warehouse/hive_sale command.

After the configuration is complete, click Run. If a message indicates that the task has run successfully, the synchronization is complete.

[Screenshot: successful synchronization task run]

2.2.4 Verify the Results

On the Data Analytics or Query page of DataWorks, run SELECT * FROM hive_sale; to verify the results, as shown in the following figure.

[Screenshot: query results in DataWorks]

You can also perform an SQL table query by using odpscmd.

[Screenshot: query results in odpscmd]

Original Source

https://www.alibabacloud.com/blog/migrate-data-from-hadoop-hive-to-maxcompute_594979?spm=a2c41.13103412.0.0
