Processing OSS Data Files in Different Formats Using Data Lake Analytics

Alibaba Cloud Data Lake Analytics (DLA) is a serverless, interactive query and analysis service. You can query and analyze data stored in Object Storage Service (OSS) and Table Store instances simply by running standard SQL statements, without having to move the data.

Currently, DLA has been officially launched on Alibaba Cloud. You can apply for a trial of this out-of-the-box data analysis service.

Visit the official documentation page to apply for activation of the DLA service.

In addition to plain text files such as CSV and TSV files, DLA can also query and analyze data files in other formats, such as ORC, Parquet, JSON, RCFile, and Avro. DLA can even query geographical JSON data in line with the ESRI standard and files matching the specified regular expressions.

This article describes how to use DLA to analyze files stored in OSS based on the file format. DLA provides various built-in Serializers/Deserializers (SerDes) for file processing. Instead of writing your own programs, you can choose one or more SerDes to match the formats of the data files in your OSS instances. Contact us if the built-in SerDes do not meet your needs for processing special file formats.

Storage Format and SerDe

After creating a table using DLA, run the SHOW CREATE TABLE statement to query the full table creation statement.

After the table is created, view the details about the table you just created.

If you compare the table creation statement with the table details, you will notice that STORED AS TEXTFILE in the table creation statement has been replaced with ROW FORMAT SERDE … STORED AS INPUTFORMAT … OUTPUTFORMAT.

DLA uses INPUTFORMAT to read the data files stored in the OSS instance and uses SERDE to resolve the table records from the files.
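To illustrate, the following sketch uses a hypothetical table and OSS path; the SerDe and I/O format classes shown are the standard Hive defaults for plain-text files:

```sql
-- Hypothetical creation statement for a plain-text table
CREATE EXTERNAL TABLE city (
  city_name STRING,
  country STRING,
  population INT
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities/';

-- What SHOW CREATE TABLE typically returns: STORED AS TEXTFILE
-- has been expanded into the underlying SerDe and I/O formats.
CREATE EXTERNAL TABLE city (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://bucket-for-testing/oss/text/cities/';
```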

The following table lists the file formats that DLA currently supports. To create a table for a file in any of these formats, simply run STORED AS. DLA automatically selects a suitable SERDE, INPUTFORMAT, and OUTPUTFORMAT.

STORED AS TEXTFILE: The data file is stored as plain text. This is the default file format. Each row in the file corresponds to a record in the table. Common JSON data also belongs to this type.
STORED AS ORC: The data file is stored in the ORC format.
STORED AS PARQUET: The data file is stored in the Parquet format.
STORED AS RCFILE: The data file is stored in the RCFile format.
STORED AS AVRO: The data file is stored in the Avro format.

You can also run STORED AS with the specified SerDe and a special column separator.

The following section describes some examples.

Examples

CSV Files

CSV files are actually plain text files. You can run STORED AS TEXTFILE to create a table for a CSV file.

Separate columns with commas (,) by running ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.

Common CSV files

For example, the content of the data file oss://bucket-for-testing/oss/text/cities/city.csv is as follows:
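A hypothetical sample with three columns (city name, country, and population):

```
Beijing,CN,21540000
Tokyo,JP,13960000
```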

The table creation statement is as follows:
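A minimal sketch, assuming a hypothetical three-column file (city name, country, population); adjust the column list and LOCATION to your data:

```sql
CREATE EXTERNAL TABLE city (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities/';
```

Note that LOCATION points to the directory containing the file, not to the file itself.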

Process quoted fields with OpenCSVSerde

Note the following when using OpenCSVSerde:

  1. You can specify the field separator, field content quote character, and escape character for a row field, for example, WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\").
  2. Row separators inside a field are not supported.
  3. All fields are of the STRING type.
  4. You can use a function in the SQL statement to convert data of other types and then process the data.
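A sketch using OpenCSVSerde (hypothetical table name, columns, and path); note that every column must be declared as STRING:

```sql
CREATE EXTERNAL TABLE city_quoted (
  city_name STRING,
  country STRING,
  population STRING  -- OpenCSVSerde treats every field as STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\"",
  "escapeChar" = "\\"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/csv_quoted/';
```

To work with numeric values, cast in the query, for example CAST(population AS BIGINT).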

Custom separator

You must customize the column separator (FIELDS TERMINATED BY), escape character (ESCAPED BY), and row terminator (LINES TERMINATED BY).

You must specify the characters in the table creation statement.
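A sketch, assuming '|' as the column separator, backslash as the escape character, and '\n' as the row terminator (hypothetical table and path):

```sql
CREATE EXTERNAL TABLE city_custom (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/custom/';
```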

TSV Files

Similar to CSV files, TSV files are plain text files; columns are separated by tab characters.

For example, the content of the data file oss://bucket-for-testing/oss/text/cities/city.tsv is as follows:
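A hypothetical sample with three tab-separated columns (city name, country, and population):

```
Beijing	CN	21540000
Tokyo	JP	13960000
```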

The table creation statement is as follows:
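A minimal sketch, assuming the hypothetical three-column layout (city name, country, population):

```sql
CREATE EXTERNAL TABLE city_tsv (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities/';
```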

Files with Multi-Character Data Field Separators

If your data field separator contains multiple characters, you can use the following sample table creation statement. The data field separator of each row is "||", which can be replaced with the specific separator string.

Example:
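A common choice in the Hive ecosystem for multi-character separators is MultiDelimitSerDe; a sketch, assuming it is available in DLA (hypothetical table and path):

```sql
CREATE EXTERNAL TABLE city_multi_delim (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim" = "||")
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/multi_delim/';
```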

JSON Files

JSON files that DLA can process are stored as plain text. When creating a table, you must specify STORED AS TEXTFILE and define the SERDE.

In a JSON file, each row must contain a complete JSON object.

For example, the following file format is unacceptable:
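A hypothetical example, with one object split across several lines:

```json
{
  "id": 1,
  "name": "Beijing"
}
```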

The file must be changed to:
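One complete JSON object per line (hypothetical data):

```json
{"id": 1, "name": "Beijing"}
{"id": 2, "name": "Tokyo"}
```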

JSON Data without Nesting

The table creation statement is as follows:
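A sketch for flat JSON objects with two hypothetical fields, id and name, using the Hive-compatible JsonSerDe:

```sql
CREATE EXTERNAL TABLE json_flat (
  id INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/json_flat/';
```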

JSON File with Nesting

Use struct and array to define the nested JSON data.

For example, the following is your raw data: (NOTE: A complete piece of JSON data must be placed in one row no matter whether nesting is required, so that the data can be processed by DLA.)
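A hypothetical single-line record containing a nested object and an array:

```json
{"id": 1, "owner": {"name": "Alice", "phones": ["1234", "5678"]}}
```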

After being formatted using the online JSON formatting tool, the data content is as follows:
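Pretty-printed for readability only (hypothetical data; this multi-line form is not how the file should be stored):

```json
{
  "id": 1,
  "owner": {
    "name": "Alice",
    "phones": ["1234", "5678"]
  }
}
```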

The table creation statement is as follows: (NOTE: The path specified in LOCATION must be the directory of the JSON data file and all JSON files in the directory must be identified as the table data.)
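A sketch mapping the hypothetical nested structure with struct and array:

```sql
CREATE EXTERNAL TABLE json_nested (
  id INT,
  -- the nested object becomes a STRUCT, the list becomes an ARRAY
  owner STRUCT<name: STRING, phones: ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/json_nested/';
```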

Query the table:
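For example (hypothetical table name):

```sql
SELECT * FROM json_nested;
```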

For the nested structure defined by struct, you can use "." for hierarchical object reference. For the array structure defined by array, you can use "[array subscript]" for object reference. (NOTE: The array subscript starts from 1.)
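For example, given a hypothetical column owner of type STRUCT&lt;name: STRING, phones: ARRAY&lt;STRING&gt;&gt;, the owner's name and first phone number are referenced as:

```sql
-- "." navigates into the struct; "[1]" selects the first array element
SELECT owner.name, owner.phones[1] FROM json_nested;
```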

Use JSON Functions to Process Data

For example, store the nested JSON value of “value_string” as a string:
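A hypothetical record in which the value_string field holds a nested JSON document as an escaped string:

```json
{"id": 1, "value_string": "{\"inner_key\": \"inner_value\", \"inner_num\": 42}"}
```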

After being formatted using the online JSON formatting tool, the data content is as follows:
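Pretty-printed for readability only (hypothetical data):

```json
{
  "id": 1,
  "value_string": "{\"inner_key\": \"inner_value\", \"inner_num\": 42}"
}
```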

The table creation statement is as follows:
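A sketch that keeps the nested JSON as a plain string column (hypothetical table and fields):

```sql
CREATE EXTERNAL TABLE json_string_value (
  id INT,
  value_string STRING  -- raw JSON text, parsed later with JSON functions
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/json_string/';
```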

After the table is created, query it:
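For example (hypothetical table name):

```sql
SELECT * FROM json_string_value;
```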

The following SQL statement is an example to show how to use json_parse, json_extract_scalar, json_extract, and other common JSON functions:
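A sketch, assuming the hypothetical value_string column described earlier; json_parse converts the string to a JSON value, json_extract_scalar returns a scalar as text, and json_extract returns a JSON value:

```sql
SELECT id,
       json_extract_scalar(json_parse(value_string), '$.inner_key') AS inner_key,
       json_extract(json_parse(value_string), '$.inner_num')        AS inner_num
FROM json_string_value;
```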

ORC Files

ORC is an optimized column-store file format supported by the Apache open source project Hive. Compared with CSV files, ORC files require less storage space and deliver better query performance.

For ORC files, you only need to run STORED AS ORC when creating a table.

For example,
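A minimal sketch (hypothetical table, columns, and path):

```sql
CREATE EXTERNAL TABLE orc_city (
  city_name STRING,
  country STRING,
  population INT
)
STORED AS ORC
LOCATION 'oss://bucket-for-testing/oss/orc/cities/';
```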

Parquet Files

Parquet is a column-store file format supported by the Apache open source project Hadoop.

When using DLA to create a table, you must run STORED AS PARQUET.

For example,
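A minimal sketch (hypothetical table, columns, and path):

```sql
CREATE EXTERNAL TABLE parquet_city (
  city_name STRING,
  country STRING,
  population INT
)
STORED AS PARQUET
LOCATION 'oss://bucket-for-testing/oss/parquet/cities/';
```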

RCFile Files

RCFile is a column-store file format that effectively stores relational table structures in a distributed system and can be efficiently read and processed.

When using DLA to create a table, you must run STORED AS RCFILE.

For example,
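A minimal sketch (hypothetical table, columns, and path):

```sql
CREATE EXTERNAL TABLE rcfile_city (
  city_name STRING,
  country STRING,
  population INT
)
STORED AS RCFILE
LOCATION 'oss://bucket-for-testing/oss/rcfile/cities/';
```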

Avro Files

When using DLA to create a table for an Avro file, you must run STORED AS AVRO. The defined field must meet the schema of the Avro file.

If you are not sure whether the defined field meets the schema, obtain the schema using the tool provided by Avro and create a table accordingly.

Download avro-tools-<version>.jar to your local machine from the Apache Avro website and run the following command to obtain the schema of the Avro file:
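A hypothetical invocation (replace the version and file name with yours; getschema is a standard avro-tools subcommand):

```shell
java -jar avro-tools-1.8.2.jar getschema part-00000.avro
```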

The table creation statement is as follows. The name field corresponds to the column name in the table and the type must be converted to that supported by Hive, as listed in the following table.
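A sketch, assuming the schema contains three fields (hypothetical names and types):

```sql
CREATE EXTERNAL TABLE avro_city (
  city_name STRING,  -- Avro "string" maps to Hive string
  country STRING,
  population BIGINT  -- Avro "long" maps to Hive bigint
)
STORED AS AVRO
LOCATION 'oss://bucket-for-testing/oss/avro/cities/';
```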

In most cases, an Avro type can be directly converted to the corresponding Hive type. If a type does not have the corresponding Hive type, the type is converted to a similar one. The following table describes the relationship between Avro types and Hive types.

Avro type: Hive type
null: void
boolean: boolean
int: int
long: bigint
float: float
double: double
bytes: binary
string: string
record: struct
map: map
list: array
union: union
enum: string
fixed: binary

Files Matching Regular Expressions

Generally, files of this type are stored in the OSS instance as plain text. Each row is a record in the table and matches a specified regular expression.

For example, Apache WebServer log files are of this type.

The content of a log file is as follows:
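A hypothetical line in the Apache common log format:

```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
```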

Each row of the file is expressed using the following regular expression and columns are separated with space:
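A regular expression commonly used for this format, with seven capture groups, one per column:

```
([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)
```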

The table creation statement for the preceding file format is as follows:
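A sketch using Hive's RegexSerDe (hypothetical table and path; backslashes and quotes are escaped inside the SQL string):

```sql
CREATE EXTERNAL TABLE webserver_log (
  host STRING,
  identity STRING,
  user_name STRING,
  time_stamp STRING,
  request STRING,
  status STRING,
  size STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/webserver_logs/';
```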

Query result
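A query sketch (hypothetical table name):

```sql
SELECT * FROM webserver_log;
```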

Geographical JSON Data Files of Esri ArcGIS

DLA supports the SerDe to process geographical JSON data files of Esri ArcGIS. For more information about the geographical JSON data format, visit https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats.

Example:
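A sketch based on the Esri spatial-framework-for-hadoop classes (hypothetical table, columns, and path; the SerDe and input format come from that framework):

```sql
CREATE EXTERNAL TABLE earthquakes (
  earthquake_date STRING,
  latitude DOUBLE,
  longitude DOUBLE,
  magnitude DOUBLE
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://bucket-for-testing/oss/esri_json/';
```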

Conclusion

The preceding examples show that Alibaba Cloud Data Lake Analytics (DLA) supports files in most open source storage formats. The speed for DLA to query and analyze data varies depending on the format and size of the file stored in the OSS instance. We recommend that you set the file to the ORC format for storage and query.

DLA is continuously being optimized to accelerate queries, and it will support more data sources in the future to provide a better big data analysis experience for you.

Reference: https://www.alibabacloud.com/blog/processing-oss-data-files-in-different-formats-using-data-lake-analytics_594067

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.
