Processing Open Source Data Stored on OSS with MaxCompute

The latest unstructured computing framework (in beta test) of MaxCompute 2.0 intends to establish a channel between computing and storage from the storage medium and storage format dimensions. This article describes how to store open source data in popular formats (such as ORC, PARQUET, SEQUENCEFILE, RCFILE, AVRO, and TEXTFILE) on OSS and use an unstructured framework to process the data on MaxCompute.

As Alibaba’s most widely used big data platform, MaxCompute stores data measured in exabyte (EB). The huge data volume and requirements on high-performance data reading and writing in large-scale computing pose great challenges for MaxCompute. In the big data era, there are diversified data sources. After over ten years’ development of the Open Source Community, various data formats are constantly emerging. By using a variety of computing frameworks in different scenarios, our users have accumulated data in various formats. The MaxCompute team hopes to figure out methods of opening powerful computing capabilities of MaxCompute for data stored in open source formats and mining information from the data on MaxCompute.

To avoid reinventing the wheel, the implementation method of the Open Source Community is directly invoked in the unstructured framework to parse data in most open source formats. In addition, a seamless connection to MaxCompute is implemented.

Creating an External Table to Bind OSS External Data

The unstructured data framework of MaxCompute uses external tables to enable connections between MaxCompute and various types of data. The method of writing OSS data is similar to that of reading OSS data. To write OSS data, you need to use the CREATE EXTERNAL TABLE statement to first create an external table. The DDL statement format used to create an external table for reading open source data is as follows:

DROP TABLE [IF EXISTS] <external_table>;CREATE EXTERNAL TABLE [IF NOT EXISTS] <external_table>
(<column schemas>)
[PARTITIONED BY (partition column schemas)]
[ROW FORMAT SERDE '<serde class>']
STORED AS <file format>
LOCATION 'oss://${accessKeyId}:${accessKeySecret}@${endpoint}/${bucket}/${userPath}/'

The syntax of this statement is quite similar to the Hive syntax. Pay attention to the following points of the CREATE EXTERNAL TABLE statement:

  1. The STORED AS keyword instead of STORED BY used by a common unstructured external table is used in the statement. This is a unique implementation for reading open source data.
  2. The <column schemas> of the external table must match the schema of data stored on OSS.
  3. The ROW FORMAT SERDE parameter is optional and needs to be used only for special formats such as TEXTFILE.
  4. The STORED AS parameter is followed by file format names such as ORC, PARQUET, RCFILE, SEQUENCEFILE, and TEXTFILE.

In the preceding example, OSS plaintext AK data is specified for LOCATION, which only applies to scenarios where users are insensitive to AK data confidentiality. In scenarios with high data security requirements, for example, in a multi-user scenario or an out-of-band cluster, we recommend that you use the STS or RAM system to perform advance authentication to avoid use of plaintext AK data.

Example 1: Associate Data Stored in PARQUET Format on OSS

The following is a specific example: Assume that some PARQUET files are stored in the same path on OSS, and each PARQUET file stores data with a 16-column schema (including four columns of the bigint type, four columns of the double type, and eight columns of the string type). You can use the following statement to associate the data:

CREATE EXTERNAL TABLE tpch_lineitem_parquet
(
l_orderkey bigint,
l_partkey bigint,
l_suppkey bigint,
l_linenumber bigint,
l_quantity double,
l_extendedprice double,
l_discount double,
l_tax double,
l_returnflag string,
l_linestatus string,
l_shipdate string,
l_commitdate string,
l_receiptdate string,
l_shipinstruct string,
l_shipmode string,
l_comment string
)
STORED AS PARQUET
LOCATION 'oss://${accessKeyId}:${accessKeySecret}@oss-cn-hangzhou-zmf.aliyuncs.com/bucket/parquet_data/';

Example 2: Associate a Partition Table with Data Stored in TEXTFILE Format on OSS

If each line of data is in JSON format, the data can be stored in TEXTFILE format on OSS. If the data is organized using multiple directories on OSS, you can use a MaxCompute partition table to associate with the data. The following statement can be used:

CREATE EXTERNAL TABLE tpch_lineitem_textfile
(
l_orderkey bigint,
l_partkey bigint,
l_suppkey bigint,
l_linenumber bigint,
l_quantity double,
l_extendedprice double,
l_discount double,
l_tax double,
l_returnflag string,
l_linestatus string,
l_shipdate string,
l_commitdate string,
l_receiptdate string,
l_shipinstruct string,
l_shipmode string,
l_comment string
)
PARTITIONED BY (ds string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'oss://${accessKeyId}:${accessKeySecret}@oss-cn-hangzhou-zmf.aliyuncs.com/bucket/text_data/';

Subdirectories in the OSS table directory can be organized using partition names, for example:

oss://${accessKeyId}:${accessKeySecret}@oss-cn-hangzhou-zmf.aliyuncs.com/bucket/text_data/ds=20170102/'
oss://${accessKeyId}:${accessKeySecret}@oss-cn-hangzhou-zmf.aliyuncs.com/bucket/text_data/ds=20170103/'
...

In this case, the following DDL statements can be used to add partitions:

ALTER TABLE tpch_lineitem_textfile ADD PARTITION(ds="20170102");
ALTER TABLE tpch_lineitem_textfile ADD PARTITION(ds="20170103");

OSS partition directories may not be organized using partition names or may not be in the table directory, for example:

oss://${accessKeyId}:${accessKeySecret}@oss-cn-hangzhou-zmf.aliyuncs.com/bucket/text_data_20170102/;
oss://${accessKeyId}:${accessKeySecret}@oss-cn-hangzhou-zmf.aliyuncs.com/bucket/text_data_20170103/;
...

In this case, the following DDL statements can be used to add partitions:

ALTER TABLE tpch_lineitem_textfile ADD PARTITION(ds="20170102")
LOCATION 'oss://${accessKeyId}:${accessKeySecret}@oss-cn-hangzhou-zmf.aliyuncs.com/bucket/text_data_20170102/';
ALTER TABLE tpch_lineitem_textfile ADD PARTITION(ds="20170103")
LOCATION 'oss://${accessKeyId}:${accessKeySecret}@oss-cn-hangzhou-zmf.aliyuncs.com/bucket/text_data_20170103/';
...

Reading and Processing Open Source Data on OSS

If you compare the preceding examples, you will find that you only need to change the format name following STORED AS for different file types. The following provides examples about processing of the external table (tpch_lineitem_parquet) corresponding to data in PARQUET format. To process a data file in another format, you only need to specify the file format name (PARQUET, ORC, TEXTFILE, RCFILE, or TEXTFILE) in the DDL statement for creating the external table. The data processing statement is the same.

Directly Reading and Processing Open Source Data on OSS

A created external table can be used as a common MaxCompute table to process the data stored on OSS.

SELECT l_returnflag,
l_linestatus,
SUM(l_extendedprice*(1-l_discount)) AS sum_disc_price,
AVG(l_quantity) AS avg_qty,
COUNT(*) AS count_order
FROM tpch_lineitem_parquet
WHERE l_shipdate <= '1998-09-02'
GROUP BY
l_returnflag,
l_linestatus;

In the preceding statement, the tpch_lineitem_parquet external table is used as a common internal table. The only difference in processing using an external table and an internal table is that during processing using an external table, the internal computing engine of MaxCompute reads the corresponding data in PARQUET format from OSS for processing.

It should be emphasized that external tables are directly used here. Each time data in an external table is read, OSS I/O operations are involved. In addition, the performance optimization of MaxCompute for internal storage becomes ineffective. As a result, performance deteriorates. In scenarios where repeated computing needs to be performed on data and there are high requirements for computing efficiency, we recommend that you import data to MaxCompute before computing.

Note: For the tpch_lineitem_textfile table in the preceding examples, ROW FORMAT and STORED AS are used. Therefore, a flag (using only STORED AS and the default value TRUE of odps.sql.hive.compatible) needs to be set before data reading. Otherwise, an error is reported.

SELECT * FROM tpch_lineitem_textfile LIMIT 1;
FAILED: ODPS-0123131:User defined function exception - Traceback:
com.aliyun.odps.udf.UDFException: java.lang.ClassNotFoundException: com.aliyun.odps.hive.wrapper.HiveStorageHandlerWrapper
--A Hive compatibility flag needs to be set manually.
set odps.sql.hive.compatible=true;
SELECT * FROM tpch_lineitem_textfile LIMIT 1;
+------------+------------+------------+--------------+------------+-----------------+------------+------------+--------------+--------------+------------+--------------+---------------+----------------+------------+-----------+
| l_orderkey | l_partkey | l_suppkey | l_linenumber | l_quantity | l_extendedprice | l_discount | l_tax | l_returnflag | l_linestatus | l_shipdate | l_commitdate | l_receiptdate | l_shipinstruct | l_shipmode | l_comment |
+------------+------------+------------+--------------+------------+-----------------+------------+------------+--------------+--------------+------------+--------------+---------------+----------------+------------+-----------+
| 5640000001 | 174458698 | 9458733 | 1 | 14.0 | 23071.58 | 0.08 | 0.06 | N | O | 1998-01-26 | 1997-11-16 | 1998-02-18 | TAKE BACK RETURN | SHIP | cuses nag silently. quick |
+------------+------------+------------+--------------+------------+-----------------+------------+------------+--------------+--------------+------------+--------------+---------------+----------------+------------+-----------+

Importing Open Source Data on OSS before Computing

First create an internal table tpch_lineitem_internal whose schema is the same as that of an external table, import open source data on OSS to the MaxCompute internal table, and store the data in cFile format on MaxCompute.

CREATE TABLE tpch_lineitem_internal LIKE tpch_lineitem_parquet;INSERT OVERWRITE TABLE tpch_lineitem_internal
SELECT * FROM tpch_lineitem_parquet;

Perform the same operations on the internal table.

SELECT l_returnflag,
l_linestatus,
SUM(l_extendedprice*(1-l_discount)) AS sum_disc_price,
AVG(l_quantity) AS avg_qty,
COUNT(*) AS count_order
FROM tpch_lineitem_internal
WHERE l_shipdate <= '1998-09-02'
GROUP BY
l_returnflag,
l_linestatus;

Data computing is far more efficient if data is imported to MaxCompute in advance.

Conclusion

Open source data formats are generated by various data processing ecosystems. The unstructured data processing framework of MaxCompute implements interconnection between computing and storage, thereby enabling Alibaba Cloud’s core computing platform to access various types of data. Based on this implementation, apps that depend on different data formats can be implemented on MaxCompute. We will provide specific case studies and introductions for the apps such as DNA computing. We expect more apps that can process and analyze open source data to achieve success based on the powerful computing capabilities of MaxCompute.

Reference: https://www.alibabacloud.com/blog/processing-open-source-data-stored-on-oss-with-maxcompute_594002?spm=a2c4.12048833.0.0

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store