MaxCompute Data Ingestion from OSS

Jan 2, 2019

By Lin En Shu, Solution Architect

MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data import solutions and distributed computing models, enabling users to query massive datasets effectively, reduce production costs, and ensure data security.

Alibaba Cloud offers a platform called DataWorks for data ingestion, data processing, and data management in MaxCompute. It provides fully hosted workflow services and a one-stop development and management interface to help enterprises mine and explore the full value of their data. DataWorks uses MaxCompute as its core computing and storage engine to provide massive offline data processing, analysis, and mining capabilities.

Currently, data from the following data sources can be imported to or exported from the workspace through the data integration function: RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, DRDS, OSS, Oracle, FTP, dm, HDFS, and MongoDB.

This document focuses on data ingestion from Alibaba Cloud’s Object Storage Service (OSS).

Solution Architecture

In this solution architecture, the user ingests data from OSS into a MaxCompute (ODPS) table through the web-based DataWorks platform.

Prerequisites and Preparation

An Alibaba Cloud account
A sample dataset
A sample database table definition

Getting Started

Setting Up OSS Bucket

Visit the OSS console and select Create Bucket.

Fill in the bucket information (sample values for this tutorial below).

The new bucket name is visible on the left panel of the console.

Go to Files and select Create Directory.

Note that the source file cannot be at the root of the bucket, so a directory must be created.
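
The directory requirement can be checked on the object key before uploading. The following is a minimal sketch, not part of DataWorks or any OSS SDK; the helper name is_valid_source_key is hypothetical, and it only inspects the key string, assuming "/" separates directories in OSS object keys.

```python
def is_valid_source_key(object_key: str) -> bool:
    """Return True if the key names a file that sits inside a directory.

    Hypothetical helper for illustration: a key with no directory part
    refers to a file at the bucket root, which this tutorial must avoid.
    """
    key = object_key.lstrip("/")  # ignore a leading slash, if any
    directory, _, file_name = key.rpartition("/")
    return bool(directory) and bool(file_name)


print(is_valid_source_key("sample_dataset/sample_telco_calls.csv"))  # True
print(is_valid_source_key("sample_telco_calls.csv"))                 # False
```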

Go into the directory and upload the source CSV file. A sample can be downloaded here.

After the file is uploaded successfully, it is visible in the OSS console.

OSS Security Token Authorization

For DataWorks to access files in the OSS bucket, a security token must be authorized in OSS.

Press Security Token in the OSS console.

Press Start Authorization; the OSS security token for sub-account access through RAM and STS will then be configured.

Set Up OSS as a Data Source in DataWorks

Go to DataWorks and then Data Integration.

On the Data Integration main page, press New Source to create a data source for synchronization from OSS.

Select OSS as the data source.

Configure the OSS data source information (sample values in this tutorial below).

After that, press test connectivity to check whether DataWorks can connect to the OSS bucket.

If the test succeeds, a green box saying “connectivity test successfully” pops up at the top right corner.

In DataWorks Data Integration, click Data Sources on the left navigation panel; the newly created OSS data source will be visible there.

Data Ingestion: Data Source from OSS

Go to Sync Tasks on the left panel in Data Integration.

Press Wizard Mode to set up data ingestion from OSS.

Step 1: Configure Data Ingestion Source

The data source is the OSS data source created earlier. The object prefix is the path of the source file within the OSS bucket. With the setup above, it is “sample_dataset/sample_telco_calls.csv”.

Select “csv” as the version/type of the source file, and “,” (comma) as the delimiter.

If the source data file in OSS has a header row, select “Yes” for header.

Press “data preview” to preview the data and validate that it is correct.
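
The delimiter and header settings can also be checked locally before configuring the wizard. Below is a minimal sketch using Python's standard csv module; the function name preview_csv and the sample columns are made up for illustration, since the real file's columns are not listed in this article.

```python
import csv
import io
from itertools import islice


def preview_csv(text, delimiter=",", has_header=True, limit=5):
    """Return (header, first rows) of CSV text, similar to a data preview.

    header is None when has_header is False; rows are lists of strings.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None) if has_header else None
    rows = list(islice(reader, limit))
    return header, rows


# Made-up sample content, standing in for the uploaded CSV file.
sample = "caller_id,call_minutes\nA001,12\nA002,7\n"
header, rows = preview_csv(sample)
print(header)
print(rows)
```

If the header option is set incorrectly, the header line shows up as the first data row, which is easy to spot in a preview like this.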

Step 2: Configure Data Ingestion Target

Choose odps_first(odps) as the data ingestion target. odps_first is the default data repository for MaxCompute.

Before data can be ingested into MaxCompute, a table has to be created in MaxCompute.

Press Create New Target Table.

Enter the table creation statement. (Sample table in this tutorial)
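
For readers who want to draft the statement from a column list, here is a minimal sketch that builds a CREATE TABLE statement as text. The helper build_create_table is hypothetical, and the column names and types are assumptions chosen for illustration; they are not the tutorial's actual sample table. Only the table name telco_call_mins_oss comes from this article.

```python
def build_create_table(table_name, columns):
    """Build a CREATE TABLE statement from (column name, type) pairs."""
    if not columns:
        raise ValueError("at least one column is required")
    column_lines = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n{column_lines}\n);"


# Assumed columns for illustration only; replace with the real schema.
ddl = build_create_table(
    "telco_call_mins_oss",
    [("caller_id", "STRING"), ("call_minutes", "BIGINT")],
)
print(ddl)
```

Listing the columns in the same order as the source file keeps the mapping step later in the wizard straightforward.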

Press next after selecting the newly created table.

Step 3: Configure Source and Target Column Mapping

It is important to ensure that the columns in the source data file are mapped to the correct columns of the target table. The recommended approach is to keep the columns in the source data in the same order as the columns in the ODPS table.

Press “peer mapping” to map source to target.

Press next once the column mapping is correct. In this tutorial, the columns of the source data file and the columns of the ODPS table are the same, so the straight lines mapped from source to target are correct.
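
The idea behind this mapping can be sketched in a few lines, assuming that peer mapping pairs columns by position, as the straight lines in this tutorial suggest. The function peer_mapping and the column names are illustrative, not a DataWorks API.

```python
def peer_mapping(source_columns, target_columns):
    """Pair source and target columns by position: first to first, and so on."""
    if len(source_columns) != len(target_columns):
        raise ValueError(
            f"column count mismatch: {len(source_columns)} source, "
            f"{len(target_columns)} target"
        )
    return list(zip(source_columns, target_columns))


# Hypothetical column names; identical order on both sides.
columns = ["caller_id", "call_minutes"]
for source, target in peer_mapping(columns, columns):
    print(f"{source} -> {target}")
```

Raising an error on a count mismatch is a choice made for this sketch; it surfaces a schema difference early instead of silently dropping a column.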

Step 4: Configure Channel Control

Select the Maximum Operating Rate and Concurrent Jobs, then press Next.

Step 5: Configuration Preview

Verify the configuration and, if everything is configured correctly, press Save.

Give the data ingestion task a name to save it.

After saving, press the “operation” button to start data ingestion from OSS.

Monitor the log in the bottom panel to check the status of the data synchronization task.

If the data synchronization ends with return code: [0], the task was successful.
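
If the log text is copied out of the console, the same check can be scripted. This is a minimal sketch: only the "return code: [0]" fragment comes from this tutorial, while the surrounding log lines and the function name sync_succeeded are invented for illustration.

```python
import re

# Matches fragments such as "return code: [0]" and captures the number.
RETURN_CODE_PATTERN = re.compile(r"return code:\s*\[(-?\d+)\]")


def sync_succeeded(log_text):
    """Return True if the last return code reported in the log is 0."""
    codes = RETURN_CODE_PATTERN.findall(log_text)
    return bool(codes) and int(codes[-1]) == 0


# Invented log excerpt; real DataWorks logs contain more detail.
example_log = "task started\ntask finished with return code: [0]\n"
print(sync_succeeded(example_log))
```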

Data Validation: Validate the Ingested Data via MaxCompute Data Development

Go to Data Development in the top panel and query the table into which the data has been ingested.

Run “select * from xxxxx_demo.telco_call_mins_oss ;”

The result is displayed in the log tab. Validate the data queried from the MaxCompute ODPS table against the source file.
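
For larger files, comparing by eye is impractical. The sketch below shows one way to compare the source CSV with rows exported from the query result, assuming the rows are available as a list of tuples. The function name compare_with_source and the sample data are hypothetical, and values are compared as strings, so formatting differences (for example 7 versus 7.0) would be reported as mismatches.

```python
import csv
import io
from collections import Counter


def compare_with_source(source_text, queried_rows, has_header=True):
    """Return (missing, unexpected) row counts between file and table.

    missing: rows in the source file that are absent from the query result.
    unexpected: rows in the query result that are absent from the file.
    """
    reader = csv.reader(io.StringIO(source_text, newline=""))
    if has_header:
        next(reader, None)  # skip the header line
    source = Counter(tuple(row) for row in reader)
    queried = Counter(tuple(str(v) for v in row) for row in queried_rows)
    missing = source - queried
    unexpected = queried - source
    return sum(missing.values()), sum(unexpected.values())


# Made-up data standing in for the source file and the query result.
source_text = "caller_id,call_minutes\nA001,12\nA002,7\n"
print(compare_with_source(source_text, [("A001", 12), ("A002", 7)]))
```

A result of (0, 0) means every source row was found in the table and no extra rows appeared.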

Conclusion

Ingesting data from OSS in the DataWorks IDE is user friendly and can be done end to end through a web-based interface. This lets customers, especially business users, complete ingestion quickly and simply, so they can focus their time and effort on more important tasks such as running big data computations.