MaxCompute Data Ingestion from OSS

By Lin En Shu, Solution Architect

MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data import solutions and distributed computing models, enabling users to query massive datasets effectively, reduce production costs, and ensure data security.

Alibaba Cloud offers a platform called DataWorks for users to perform data ingestion, data processing, and data management in MaxCompute. It provides fully hosted workflow services and a one-stop development and management interface to help enterprises mine and explore the full value of their data. DataWorks uses MaxCompute as its core computing and storage engine to provide massive offline data processing, analysis, and mining capabilities.

Currently, data from the following data sources can be imported to or exported from the workspace through the data integration function: RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, DRDS, OSS, Oracle, FTP, dm, HDFS, and MongoDB.

This document focuses on data ingestion from Alibaba Cloud’s Object Storage Service (OSS).

Solution Architecture

In this solution architecture, the user ingests data from OSS into a MaxCompute (ODPS) table through the web-based DataWorks platform.

Prerequisites and Preparation

  1. An Alibaba Cloud account.
  2. A sample dataset.
  3. A definition for a sample database table.

Getting Started

Setting Up OSS Bucket

Visit the OSS console and select Create Bucket.

Fill in the bucket information.

The new bucket name is visible on the left panel of the console.

Go to Files and select Create Directory. This tutorial uses a directory named sample_dataset.

Note that the source file cannot be at the root of the bucket, hence a directory must be created.

Go into the directory and upload the source CSV file. A sample can be downloaded here.
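
The directory requirement above can be illustrated with a small Python sketch. It is not part of OSS or DataWorks. It only shows how a directory and a file name combine into the object path used later in this tutorial, and it rejects a file placed at the bucket root. The helper name build_object_key is made up for illustration.

```python
import posixpath


def build_object_key(directory: str, filename: str) -> str:
    """Join a bucket directory and a file name into an object key.

    Hypothetical helper: raises if the file would sit at the bucket root,
    mirroring the tutorial's note that a directory must be created.
    """
    directory = directory.strip("/")
    filename = filename.lstrip("/")
    if not directory:
        raise ValueError("the source file must be inside a directory, not at the bucket root")
    if not filename:
        raise ValueError("a file name is required")
    return posixpath.join(directory, filename)


# Directory and file name taken from the object prefix used in this tutorial.
print(build_object_key("sample_dataset", "sample_telco_calls.csv"))
```

The resulting string is the value entered as the object prefix when the sync task is configured.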

After the file is uploaded successfully, it is visible in the OSS console.

OSS Security Token Authorization

For DataWorks to access files in the OSS bucket, a security token must be authorized in OSS.

Press Security Token in the OSS console.

Press Start Authorization. This configures the OSS security token for sub-account access through RAM and STS.

Set Up OSS as a Data Source in DataWorks

Go to DataWorks and then Data Integration.

On the Data Integration main page, press New Source to create a data source for OSS.

Select OSS as the data source.

Configure the OSS data source information.

After that, press Test Connectivity to check whether DataWorks can connect to the OSS bucket.

If the test succeeds, a green box appears in the top right corner with the message “connectivity test successfully”.

In DataWorks Data Integration, click Data Sources in the left navigation panel. The newly created OSS data source is listed there.

Data Ingestion: Data Source from OSS

Go to Sync Tasks in the left panel of Data Integration.

Press Wizard Mode to set up data ingestion from OSS.

Step 1: Configure Data Ingestion Source

The data source is the OSS data source created earlier. The object prefix is the full path of the source file within the OSS bucket. With the setup above, it is “sample_dataset/sample_telco_calls.csv”.

Select “csv” as the version/type of the source file, and “,” (comma) as the delimiter.

If the source data file in OSS has a header row, select “Yes” for the header option.

Press “data preview” to preview the data and confirm that it is correct.
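
The source settings above (CSV type, comma delimiter, header row) can also be sanity-checked locally before the wizard is configured. The following is a minimal Python sketch, not DataWorks code. The function name preview_csv and the sample columns are assumptions made up for illustration, and the real sample file may have different columns.

```python
import csv
import io
from itertools import islice


def preview_csv(text, delimiter=",", has_header=True, limit=5):
    """Return (header, rows) holding at most `limit` data rows of CSV text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # When the file has a header row, consume it separately from the data.
    header = next(reader, None) if has_header else None
    rows = list(islice(reader, limit))
    return header, rows


# Hypothetical content; replace with the text of the real source file.
SAMPLE = (
    "caller_id,call_date,call_minutes\n"
    "A001,2018-01-01,12\n"
    "A002,2018-01-02,7\n"
)

header, rows = preview_csv(SAMPLE)
print(header)
for row in rows:
    print(row)
```

If the header option is set incorrectly, the header row shows up as the first data row, which is the kind of problem the preview is meant to catch.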

Step 2: Configure Data Ingestion Target

Choose odps_first (odps) as the data ingestion target. odps_first is the default data repository for MaxCompute.

Before data can be ingested into MaxCompute, a table has to be created in MaxCompute.

Press Create New Target Table.

Enter the table creation statement for the target table. In this tutorial the table is named telco_call_mins_oss.
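
The exact statement used in the original tutorial is not reproduced here. As a rough sketch, the Python snippet below assembles a CREATE TABLE statement of the general shape MaxCompute accepts. The table name comes from the validation query later in this tutorial. The column names and types are hypothetical placeholders and should be replaced with the columns of the actual source file.

```python
def build_create_table(table, columns):
    """Build a CREATE TABLE statement from (name, type) pairs."""
    if not columns:
        raise ValueError("at least one column is required")
    column_lines = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{column_lines}\n);"


# Hypothetical columns, for illustration only; they are NOT the tutorial's schema.
HYPOTHETICAL_COLUMNS = [
    ("caller_id", "STRING"),
    ("call_date", "STRING"),
    ("call_minutes", "BIGINT"),
]

print(build_create_table("telco_call_mins_oss", HYPOTHETICAL_COLUMNS))
```

Keeping the column list in the same order as the CSV columns makes the later mapping step straightforward.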

After the newly created table is selected, press Next.

Step 3: Configure Source and Target Column Mapping

It is important that the columns in the source data file map correctly to the columns of the target table. The recommended approach is to keep the columns in the source file in the same order as the columns in the ODPS table.

Press “peer mapping” to map source to target.
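
To make the ordering requirement concrete, here is a minimal Python sketch of mapping by position, assuming that peer mapping pairs the first source column with the first target column, the second with the second, and so on. The function names and column names are hypothetical.

```python
def map_by_position(source_columns, target_columns):
    """Pair source and target columns strictly by their position."""
    if len(source_columns) != len(target_columns):
        raise ValueError(
            f"column count mismatch: {len(source_columns)} source "
            f"vs {len(target_columns)} target"
        )
    return list(zip(source_columns, target_columns))


def mismatched_pairs(mapping):
    """Return the pairs whose source and target names differ."""
    return [(src, dst) for src, dst in mapping if src != dst]


# Hypothetical column names, for illustration only.
source = ["caller_id", "call_date", "call_minutes"]
target = ["caller_id", "call_minutes", "call_date"]  # deliberately out of order

mapping = map_by_position(source, target)
print(mismatched_pairs(mapping))
```

A non-empty result means that positional mapping would load values into the wrong columns, so the source file or the table definition should be reordered first.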

Press Next after the column mapping is done correctly. In this tutorial the columns of the source data file and the columns of the ODPS table are the same, so the straight lines drawn from source to target are correct.

Step 4: Configure Channel Control

Select the Maximum Operating Rate and the number of Concurrent Jobs, then press Next.

Step 5: Configuration Preview

Verify the configuration. If everything is configured correctly, press Save.

Enter a name for the data ingestion task to save it.

After saving, press the “operation” button to start data ingestion from OSS.

Monitor the log in the bottom panel to check the status of the data synchronization task.

If the data synchronization ends with return code: [0], the task was successful.
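
If the log text is copied out of the console, the same check can be scripted. The sketch below is a hypothetical Python helper, not a DataWorks feature. It assumes only that the log contains the phrase “return code: [N]” as described above, and it treats the last such occurrence as the final status.

```python
import re

# Matches the phrase quoted in this tutorial, e.g. "return code: [0]".
RETURN_CODE_PATTERN = re.compile(r"return code:\s*\[(-?\d+)\]")


def sync_succeeded(log_text):
    """Return True if the last return code found in the log is 0."""
    codes = RETURN_CODE_PATTERN.findall(log_text)
    # No return code at all is treated as "not confirmed successful".
    return bool(codes) and int(codes[-1]) == 0


print(sync_succeeded("return code: [0]"))
print(sync_succeeded("return code: [1]"))
```

Any other content of the log is ignored, so the helper stays valid even if the surrounding log format differs from what is assumed here.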

Data Validation: Validate the Ingested Data via MaxCompute Data Development

Go to Data Development in the top panel and query the table into which the data has been ingested.

Run “select * from xxxxx_demo.telco_call_mins_oss ;”

The result is displayed in the log tab. Validate the data queried from the MaxCompute ODPS table against the source file.
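
For larger files, comparing by eye becomes impractical. The following Python sketch shows one way to compare the source CSV with the rows returned by the query, assuming the query results have been exported or copied into a list of rows. The function name compare_rows and the result keys are made up for illustration. Values are compared as strings, so types that print differently in MaxCompute and in the CSV (for example floating point numbers) may need normalizing first.

```python
import csv
import io
from collections import Counter


def compare_rows(source_csv_text, queried_rows, has_header=True):
    """Report rows that differ between the source CSV and the queried rows."""
    reader = csv.reader(io.StringIO(source_csv_text))
    if has_header:
        next(reader, None)  # skip the header row
    # Counters keep duplicates, so repeated rows are compared correctly.
    source = Counter(tuple(row) for row in reader)
    target = Counter(tuple(str(value) for value in row) for row in queried_rows)
    return {
        "missing_in_table": sorted((source - target).elements()),
        "unexpected_in_table": sorted((target - source).elements()),
    }


# Hypothetical data, for illustration only.
source_text = "caller_id,call_minutes\nA001,12\nA002,7\n"
table_rows = [("A001", 12)]

print(compare_rows(source_text, table_rows))
```

Two empty lists indicate that every source row arrived in the table exactly once and that no extra rows were loaded.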

Conclusion

Ingesting data from OSS in the DataWorks IDE is user-friendly and can be done end to end through the web interface. This lets customers, especially business users, complete ingestion quickly and simply, so they can focus their time and effort on more important tasks such as running big data computations.

Related Products in this Solution

Product: Product link for reference
OSS: https://www.alibabacloud.com/product/oss
DataWorks: https://www.alibabacloud.com/product/ide
MaxCompute: https://www.alibabacloud.com/product/maxcompute

Reference: https://www.alibabacloud.com/blog/maxcompute-data-ingestion-from-oss_594310?spm=a2c41.12451005.0.0
