Migrate Data from Hadoop Hive to MaxCompute

1. Prepare the Environment

1.1 Hadoop Cluster Environment

Before migrating data from Hadoop Hive, ensure that your Hadoop cluster works properly. The following Hadoop environments are supported:

  • HDFS 2.8.5
  • YARN 2.8.5
  • Hive 3.1.1
  • Ganglia 3.7.2
  • Spark 2.3.2
  • HUE 4.1.0
  • Zeppelin 0.8.0
  • Tez 0.9.1
  • Sqoop 1.4.7
  • Pig 0.14.0
  • Knox 1.1.0
  • ApacheDS 2.0.0

1.2 Prepare Hadoop Hive Data

Hive script:

CREATE TABLE IF NOT EXISTS hive_sale(
create_time timestamp,
category STRING,
brand STRING,
buyer_id STRING,
trans_num BIGINT,
trans_amount DOUBLE,
click_cnt BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n';
insert into hive_sale values
('2019-04-14','外套','品牌A','lilei',3,500.6,7),
('2019-04-15','生鲜','品牌B','lilei',1,303,8),
('2019-04-16','外套','品牌C','hanmeimei',2,510,2),
('2019-04-17','卫浴','品牌A','hanmeimei',1,442.5,1),
('2019-04-18','生鲜','品牌D','hanmeimei',2,234,3),
('2019-04-19','外套','品牌B','jimmy',9,2000,7),
('2019-04-20','生鲜','品牌A','jimmy',5,45.1,5),
('2019-04-21','外套','品牌E','jimmy',5,100.2,4),
('2019-04-22','生鲜','品牌G','peiqi',10,5560,7),
('2019-04-23','卫浴','品牌F','peiqi',1,445.6,2),
('2019-04-24','外套','品牌A','ray',3,777,3),
('2019-04-25','卫浴','品牌G','ray',3,122,3),
('2019-04-26','外套','品牌C','ray',1,62,7);
hive -f hive_data.sql
hive -e 'show tables';
hive -e 'select * from hive_sale';

1.3 MaxCompute Environment

Activate MaxCompute. For more information, see: https://www.alibabacloud.com/help/doc-detail/58226.htm

1.4 Create a Table in MaxCompute

CREATE TABLE IF NOT EXISTS maxcompute_sale(
create_time STRING,
category STRING,
brand STRING,
buyer_id STRING,
trans_num BIGINT,
trans_amount DOUBLE,
click_cnt BIGINT
);

2. Migrate Data from Hadoop Hive to Alibaba Cloud MaxCompute

2.1 Upload Files through Tunnel

2.1.1 Create Hive Data Files

Log on to Hive and run the relevant SQL statements. The following example shows how to separate the data that is exported to the local device with commas (,) by line.

insert overwrite local directory  '/home/sixiang/' row format delimited fields terminated by ',' select * from hive_sale;

2.1.2 Upload Files by Running the Tunnel Command

Go to the MaxCompute console and run the Tunnel upload command to upload data.

tunnel upload /home/sixiang/000000_0 daniel.maxcompute_sale;

2.2 Upload Files through Data Integration in DataWorks

2.2.1 Create a Custom Resource Group

In most cases, the network between the MaxCompute project data node and the data node of the Hadoop cluster is unreachable. You can customize a resource group to run the synchronization task of DataWorks on the master node of the Hadoop cluster. In general, the network between the master node and data node of the Hadoop cluster is reachable.

2.2.2 Create a Data Source

After you create a project in DataWorks, the data source is set to odps_first by default. Therefore, you only need to add a Hadoop cluster data source as follows: Go to the Data Integration page of DataWorks, choose Data Source from the left-side navigation pane, and click Add Data Source. In the Add Data Source dialog box that appears, select HDFS.

2.2.3 Configure a Data Synchronization Task

On the Data Analytics page of DataWorks, right-click the new Data Integration node and choose Create Data Integration Node > Data Sync from the context menu. In the Apply Template dialog box that appears, select a data source type, as shown in the following figure.

{
"configuration": {
"reader": {
"plugin": "hdfs",
"parameter": {
"path": "/user/hive/warehouse/hive_sale/",
"datasource": "hadoop_to_odps",
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "long"
},
{
"index": 5,
"type": "double"
},
{
"index": 6,
"type": "long"
}
],
"defaultFS": "hdfs://xxx.xxx.xxx.xxx:9000",
"fieldDelimiter": ",",
"encoding": "UTF-8",
"fileType": "text"
}
},
"writer": {
"plugin": "odps",
"parameter": {
"partition": "",
"truncate": false,
"datasource": "odps_first",
"column": [
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"
],
"table": "maxcompute_sale"
}
},
"setting": {
"errorLimit": {
"record": "1000"
},
"speed": {
"throttle": false,
"concurrent": 1,
"mbps": "1",
"dmu": 1
}
}
},
"type": "job",
"version": "1.0"
}

2.2.4 Verify the Results

On the Data Analytics or Query page of DataWorks, run select * FROM hive_sale to verify the results, as shown in the following figure.

Original Source

https://www.alibabacloud.com/blog/migrate-data-from-hadoop-hive-to-maxcompute_594979?spm=a2c41.13103412.0.0

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Alibaba Cloud

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com