Using DataX-On-Hadoop to Migrate Data from Hadoop to MaxCompute

How DataX-On-Hadoop Works

What Is DataX-On-Hadoop?

How to Run DataX-On-Hadoop
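
Submit the job to the Hadoop cluster with the hadoop jar command, passing the DataX JAR, the driver class, and a job description file: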

./bin/hadoop jar datax-jar-with-dependencies.jar com.alibaba.datax.hdfs.odps.mr.HdfsToOdpsMRJob ./bvt_case/speed.json
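
A sample job description (speed.json in the command above) that reads two string columns from HDFS text files and writes them into a partitioned MaxCompute table: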
{
  "core": {
    "transport": {
      "channel": {
        "speed": {
          "byte": "-1",
          "record": "-1"
        }
      }
    }
  },
  "job": {
    "setting": {
      "speed": {
        "byte": 1048576
      },
      "errorLimit": {
        "record": 0
      }
    },
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "path": "/tmp/test_datax/big_data*",
            "defaultFS": "hdfs://localhost:9000",
            "column": [
              {
                "index": 0,
                "type": "string"
              },
              {
                "index": 1,
                "type": "string"
              }
            ],
            "fileType": "text",
            "encoding": "UTF-8",
            "fieldDelimiter": ","
          }
        },
        "writer": {
          "name": "odpswriter",
          "parameter": {
            "project": "",
            "table": "",
            "partition": "pt=1,dt=2",
            "column": [
              "id",
              "name"
            ],
            "accessId": "",
            "accessKey": "",
            "truncate": true,
            "odpsServer": "http://service.odps.aliyun.com/api",
            "tunnelServer": "http://dt.odps.aliyun.com",
            "accountType": "aliyun"
          }
        }
      }
    ]
  }
}

Advanced Configuration Parameters for DataX-On-Hadoop Tasks
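
The advanced parameters are set in the core and job sections of the job description file; the skeleton below shows where they sit (reader and writer omitted):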

{
  "core": {
    "transport": {
      "channel": {
        "speed": {
          "byte": "-1",
          "record": "-1"
        }
      }
    }
  },
  "job": {
    "setting": {
      "speed": {
        "byte": 1048576
      },
      "errorLimit": {
        "record": 0
      }
    },
    "content": [
      {
        "reader": {},
        "writer": {}
      }
    ]
  }
}
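
Job descriptions can also contain ${...} variables that are resolved when the job is submitted. For example, a reader path may embed a ${dt} placeholder:
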
"path": "/tmp/test_datax/dt=${dt}/abc.txt"
./bin/hadoop jar datax-jar-with-dependencies.jar com.alibaba.datax.hdfs.odps.mr.HdfsToOdpsMRJob datax.json -p "-Ddt=20170427 -Dbizdate=123" -t hdfs_2_odps_mr
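
The -p option supplies values for these placeholders at submission time; with -Ddt=20170427 as above, ${dt} expands to 20170427 and the reader path becomes:

"path": "/tmp/test_datax/dt=20170427/abc.txt"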

HDFS Reader

Introduction

  1. Supports the TextFile, ORCFile, RCFile, SequenceFile, CSV, and Parquet file formats. The content stored in the file must be a two-dimensional table in the logical sense.
  2. Supports reading multiple data types (all represented as strings), column pruning, and column constants.
  3. Supports recursive reading and the wildcards "*" and "?" in paths (see the path examples after this list).
  4. Supports ORCFile with Snappy or Zlib compression.
  5. Supports SequenceFile with LZO compression.
  6. Supports concurrent reading of multiple files.
  7. Supports the following compression formats for CSV files: gzip, bz2, zip, lzo, lzo_deflate, and snappy.
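
For illustration, a few path values the reader would accept; the directory and file names here are placeholders in the spirit of the samples in this article:

"path": "/tmp/test_datax/big_data*"    // all files whose names start with big_data
"path": "/tmp/test_datax/*"            // every file under the directory
"path": "/tmp/test_datax/part-000?"    // ? matches exactly one character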

Function Description
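
A reader-only template (the writer block is left empty here) that reads the first two columns of every file under /tmp/test_datax/ as strings: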

{
  "core": {
    "transport": {
      "channel": {
        "speed": {
          "byte": "-1048576",
          "record": "-1"
        }
      }
    }
  },
  "job": {
    "setting": {
      "speed": {
        "byte": 1048576
      },
      "errorLimit": {
        "record": 0
      }
    },
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "path": "/tmp/test_datax/*",
            "defaultFS": "hdfs://localhost:9000",
            "column": [
              {
                "index": 0,
                "type": "string"
              },
              {
                "index": 1,
                "type": "string"
              }
            ],
            "fileType": "text",
            "encoding": "UTF-8",
            "fieldDelimiter": ","
          }
        },
        "writer": {}
      }
    ]
  }
}
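
To read every column as a string, column can be set to a single asterisk:
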
"column": ["*"]
{
  "type": "long",
  "index": 0          // read the first column of the source file as an integer (long) field
},
{
  "type": "string",
  "value": "alibaba"  // HDFS Reader generates a constant column whose value is the string "alibaba"
}
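
When fileType is csv, options for the underlying CSV parser are passed through csvReaderConfig, for example:
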
"csvReaderConfig":{
"safetySwitch": false,
"skipEmptyRecords": false,
"useTextQualifier": false
}
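
For an HDFS cluster with NameNode high availability, the HA settings are passed through hadoopConfig:
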
"hadoopConfig":{
"dfs.nameservices": "testDfs",
"dfs.ha.namenodes.testDfs": "namenode1,namenode2",
"dfs.namenode.rpc-address.youkuDfs.namenode1": "",
"dfs.namenode.rpc-address.youkuDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs":
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}

Type Conversion

  1. Long: indicates an integer string in the HDFS file, such as 123456789.
  2. Double: indicates a floating-point string in the HDFS file, such as 3.1415.
  3. Boolean: indicates a Boolean string in the HDFS file, such as true or false (case-insensitive).
  4. Date: indicates a date string in the HDFS file, such as 2014-12-31.

Note: The Timestamp data type supported by Hive can be accurate to nanoseconds, so Timestamp data stored in TextFile and ORCFile files can be in a format like "2015-08-21 22:40:47.397898389". If the converted data type is set to Date for DataX, the nanosecond part is truncated after conversion. To retain the nanosecond part, set the converted data type to String for DataX (see the sketch below).
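
As a sketch, a column list that exercises these conversions might look like the following; the column indexes and their contents are hypothetical:

"column": [
  { "index": 0, "type": "long" },     // integer string, e.g. 123456789
  { "index": 1, "type": "double" },   // floating-point string, e.g. 3.1415
  { "index": 2, "type": "boolean" },  // true or false
  { "index": 3, "type": "date" },     // date string, e.g. 2014-12-31
  { "index": 4, "type": "string" }    // a Hive Timestamp read as String to keep the nanosecond part
]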

Reading by Partition
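
Hive stores each partition of a table in its own HDFS directory, so a single partition can be read by pointing path at that partition's directory. For example, to read only the 20150820 partition of mytable01: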

"path": "/user/hive/warehouse/mytable01/20150820/*"

MaxCompute Writer

Introduction

Implementation Principles

Function Description
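
A sample writer configuration that truncates and then loads the pt=1,dt=2 partition of a MaxCompute table (the reader block is left empty here):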

{
  "core": {
    "transport": {
      "channel": {
        "speed": {
          "byte": "-1048576",
          "record": "-1"
        }
      }
    }
  },
  "job": {
    "setting": {
      "speed": {
        "byte": 1048576
      },
      "errorLimit": {
        "record": 0
      }
    },
    "content": [
      {
        "reader": {},
        "writer": {
          "name": "odpswriter",
          "parameter": {
            "project": "",
            "table": "",
            "partition": "pt=1,dt=2",
            "column": [
              "col1",
              "col2"
            ],
            "accessId": "",
            "accessKey": "",
            "truncate": true,
            "odpsServer": "http://service.odps.aliyun.com/api",
            "tunnelServer": "http://dt.odps.aliyun.com",
            "accountType": "aliyun"
          }
        }
      }
    ]
  }
}
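
Assuming variable substitution applies throughout the job file (as shown for the reader path earlier), the target partition could also be parameterized and supplied at submission time with -p; a hypothetical sketch, with -Dbizdate passed as in the earlier command:

"partition": "pt=${bizdate}"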
