Migrate Data from Hadoop Hive to MaxCompute
By Si Xiang
This document describes how to migrate data from Hadoop Hive to Alibaba Cloud MaxCompute.
1. Prepare the Environment
1.1 Hadoop Cluster Environment
Before migrating data from Hadoop Hive, ensure that your Hadoop cluster works properly. The following Hadoop environments are supported:
- HDFS 2.8.5
- YARN 2.8.5
- Hive 3.1.1
- Ganglia 3.7.2
- Spark 2.3.2
- HUE 4.1.0
- Zeppelin 0.8.0
- Tez 0.9.1
- Sqoop 1.4.7
- Pig 0.14.0
- Knox 1.1.0
- ApacheDS 2.0.0
1.2 Prepare Hadoop Hive Data
CREATE TABLE IF NOT EXISTS hive_sale(
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n';
insert into hive_sale values
Log on to the Hadoop cluster, save the Hive SQL statements in a script file (hive_data.sql in this example), and then run the following commands to execute the script and check the results.
hive -f hive_data.sql
hive -e 'show tables';
hive -e 'select * from hive_sale';
1.3 MaxCompute Environment
Activate MaxCompute. For more information, see: https://www.alibabacloud.com/help/doc-detail/58226.htm
Install and configure a MaxCompute client. For more information, see: https://www.alibabacloud.com/help/doc-detail/27804.htm
1.4 Create a Table in MaxCompute
CREATE TABLE IF NOT EXISTS maxcompute_sale(
You need to consider data type mapping between Hive and MaxCompute when creating a table. For more information, see: https://www.alibabacloud.com/help/doc-detail/54081.htm
Create a table by using the odpscmd CLI tool. For how to install and configure odpscmd, see: https://www.alibabacloud.com/help/doc-detail/27804.htm
Note: MaxCompute 2.0 supports basic and complex data types. For more information, see: https://www.alibabacloud.com/help/doc-detail/27821.htm
2. Migrate Data from Hadoop Hive to Alibaba Cloud MaxCompute
2.1 Upload Files through Tunnel
2.1.1 Create Hive Data Files
Log on to Hive and run the relevant SQL statements. The following example exports the table to a local directory, one row per line, with fields separated by commas (,).
insert overwrite local directory '/home/sixiang/' row format delimited fields terminated by ',' select * from hive_sale;
List the exported data files in the local directory (/home/sixiang/ in this example).
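Before uploading, it can help to confirm which files the export produced and that the rows are comma-separated as expected. The following is a minimal Python sketch, not part of the original procedure. The function names are hypothetical, and it assumes that the exported files are plain UTF-8 text and that hidden files in the directory (such as checksum files) can be ignored.

```python
import csv
from pathlib import Path


def list_export_files(export_dir):
    """Return the visible data files in export_dir, sorted by name.

    Hidden files (names starting with a dot) are skipped on the
    assumption that they are not data files.
    """
    return sorted(
        p for p in Path(export_dir).iterdir()
        if p.is_file() and not p.name.startswith(".")
    )


def preview_rows(data_file, limit=5):
    """Read up to `limit` comma-separated rows from one exported file."""
    rows = []
    with open(data_file, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter=","):
            rows.append(row)
            if len(rows) >= limit:
                break
    return rows
```

For example, preview_rows(list_export_files("/home/sixiang/")[0]) returns the first rows of the first exported file as lists of strings.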
2.1.2 Upload Files by Running the Tunnel Command
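Open the MaxCompute client (odpscmd) and run the tunnel upload command to upload the data file to the target table. In this example, daniel is the MaxCompute project name and maxcompute_sale is the table name.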
tunnel upload /home/sixiang/000000_0 daniel.maxcompute_sale;
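If the export produces more than one file, each file needs its own upload command. The following Python sketch only builds the command strings in the same form as the command above, so that they can be reviewed and then pasted into the MaxCompute client. The function name is hypothetical, and the sketch assumes that file paths contain no spaces and that hidden files are not data files.

```python
from pathlib import Path


def build_tunnel_commands(export_dir, target_table):
    """Return one `tunnel upload` command string per visible file.

    target_table is written as project.table, as in the example above.
    """
    commands = []
    for path in sorted(Path(export_dir).iterdir()):
        if path.is_file() and not path.name.startswith("."):
            commands.append(f"tunnel upload {path} {target_table};")
    return commands


if __name__ == "__main__":
    # Print the commands for the directory and table used in this example.
    for command in build_tunnel_commands("/home/sixiang/", "daniel.maxcompute_sale"):
        print(command)
```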
2.2 Upload Files through Data Integration in DataWorks
2.2.1 Create a Custom Resource Group
In most cases, the data nodes of the MaxCompute project cannot reach the data nodes of the Hadoop cluster over the network. The master node of the Hadoop cluster, however, can generally reach its own data nodes. You can therefore create a custom resource group so that the DataWorks synchronization task runs on the master node of the Hadoop cluster.
Run the hadoop dfsadmin -report command to view the data nodes of the Hadoop cluster.
As shown in the preceding figure, the data node has only an internal network address and cannot communicate with the default resource group of DataWorks. Therefore, you need to create a custom resource group and configure the master node to run the DataWorks synchronization task.
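The check described above can also be done on saved report text. The following Python sketch is an illustration only. It assumes that the report lists each data node on a line of the form "Name: IP:port (hostname)", and the function names are hypothetical.

```python
import ipaddress
import re

# Matches report lines of the assumed form "Name: 192.168.0.12:50010 (host)".
NAME_LINE = re.compile(
    r"^\s*Name:\s*(\d{1,3}(?:\.\d{1,3}){3}):\d+", re.MULTILINE
)


def datanode_addresses(report_text):
    """Extract the IPv4 address of each data node from the report text."""
    return NAME_LINE.findall(report_text)


def internal_only(addresses):
    """Return True if there is at least one address and all are private."""
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )
```

If internal_only returns True for the addresses in the report, the data nodes are reachable only inside the cluster network, which is the situation that calls for a custom resource group.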
Go to the Data Integration page of DataWorks, choose Resource Group from the left-side navigation pane, and click Add Resource Group, as shown in the following figure.
When adding a server, enter the UUID, IP address, and other information about the server. Enter the public (Internet) IP address of the master node, because the internal network IP address may be unreachable from DataWorks.
To obtain the UUID, run the dmidecode | grep UUID command on the master node CLI, as shown in the following figure.
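To avoid copying the UUID by hand, the value can be extracted from the command output. The following Python sketch is a hypothetical helper that assumes the output contains a line of the form "UUID: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx". It works on text that has already been captured, so it does not run dmidecode itself.

```python
import re

# A UUID written as 8-4-4-4-12 hexadecimal digits after the "UUID:" label.
UUID_PATTERN = re.compile(
    r"UUID:\s*([0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12})"
)


def extract_uuid(dmidecode_output):
    """Return the first UUID in the text, or None if no UUID line is found."""
    match = UUID_PATTERN.search(dmidecode_output)
    return match.group(1) if match else None
```

Note that dmidecode usually requires root privileges, so the output is typically captured with sudo before it is passed to a helper like this one.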
After the server is added, ensure that the network between the master node and DataWorks is reachable. Install the agent of the custom resource group as instructed, and then check the agent status. If the status is Available, the custom resource group has been added.
2.2.2 Create a Data Source
After you create a project in DataWorks, the data source is set to odps_first by default. Therefore, you only need to add a Hadoop cluster data source as follows: Go to the Data Integration page of DataWorks, choose Data Source from the left-side navigation pane, and click Add Data Source. In the Add Data Source dialog box that appears, select HDFS.
In the dialog box that appears, set Data Source Name and DefaultFS.
If the Hadoop cluster is an HA cluster, the address is hdfs://IP:8020. If the Hadoop cluster is a non-HA cluster, the address is hdfs://IP:9000. In this example, the Hadoop server is connected to DataWorks through the Internet. Therefore, enter the Internet IP address.
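The address rule above can be written as a small helper. This Python sketch simply encodes the two cases stated in this section. The function and parameter names are hypothetical.

```python
def default_fs(ip_address, is_ha):
    """Build the DefaultFS value from the master node IP and cluster mode.

    Port 8020 is used for an HA cluster and port 9000 for a non-HA
    cluster, as described in this section.
    """
    port = 8020 if is_ha else 9000
    return f"hdfs://{ip_address}:{port}"
```

For a non-HA cluster, default_fs("IP", False) returns hdfs://IP:9000, where IP stands for the Internet IP address of the master node.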
After the configuration is complete, click Test Connectivity. If a message indicates that the connectivity test is successful, the data source has been added.
2.2.3 Configure a Data Synchronization Task
On the Data Analytics page of DataWorks, right-click the new Data Integration node and choose Create Data Integration Node > Data Sync from the context menu. In the Apply Template dialog box that appears, select a data source type, as shown in the following figure.
The script is as follows:
The path parameter indicates the location where the data is stored in the Hadoop cluster. To confirm the storage location, log on to the master node and run the hdfs dfs -ls /user/hive/warehouse/hive_sale command.
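To show where the path parameter and the DefaultFS value fit together, the following Python sketch assembles a reader and writer description as a dictionary and serializes it to JSON. It is an illustration under assumptions and not the script from the original article: the key names (stepType, parameter, fileType, fieldDelimiter, and so on), the data source name hdfs_source, and the column names passed in are all assumed for the example and should be checked against the DataWorks documentation for the HDFS reader and MaxCompute writer.

```python
import json


def build_sync_config(default_fs, path, table, columns,
                      hdfs_source="hdfs_source", odps_source="odps_first"):
    """Build a synchronization job description as a Python dict.

    All key names are assumptions for illustration. `columns` lists the
    target column names in the order of the fields in each text line.
    """
    reader = {
        "stepType": "hdfs",
        "name": "Reader",
        "category": "reader",
        "parameter": {
            "datasource": hdfs_source,
            "defaultFS": default_fs,
            "path": path,
            "fileType": "text",
            "fieldDelimiter": ",",
            "encoding": "UTF-8",
            # Source fields are addressed by position in each line.
            "column": [
                {"index": i, "type": "string"} for i in range(len(columns))
            ],
        },
    }
    writer = {
        "stepType": "odps",
        "name": "Writer",
        "category": "writer",
        "parameter": {
            "datasource": odps_source,
            "table": table,
            "partition": "",
            "truncate": False,
            "column": list(columns),
        },
    }
    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader, writer],
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


def to_json(config):
    """Serialize the job description for review."""
    return json.dumps(config, indent=2)
```

Generating the description from parameters keeps the path, the DefaultFS value, and the column order in one place, which makes mismatches between the Hive data files and the target table easier to spot before the task is run.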
After the configuration is complete, click Run. If a message indicates that the task ran successfully, the synchronization task is complete.
2.2.4 Verify the Results
On the Data Analytics or Query page of DataWorks, run SELECT * FROM hive_sale to verify the results, as shown in the following figure.
You can also query the table by using odpscmd.