Processing OSS Data Files in Different Formats Using Data Lake Analytics

Alibaba Cloud Data Lake Analytics (DLA) is a serverless, interactive query and analysis service. You can query and analyze data stored in Object Storage Service (OSS) and Table Store instances simply by running standard SQL statements, without having to move the data.

Currently, DLA has been officially launched on Alibaba Cloud. You can apply for a trial of this out-of-the-box data analysis service.

Visit the official documentation page to apply for activation of the DLA service.

In addition to plain text files such as CSV and TSV files, DLA can also query and analyze data files in other formats, such as ORC, Parquet, JSON, RCFile, and Avro. DLA can even query geographical JSON data in line with the ESRI standard and files matching the specified regular expressions.

This article describes how to use DLA to analyze files stored in OSS based on the file format. DLA provides various built-in Serializers/Deserializers (SerDes) for file processing. Instead of writing your own programs, you can choose one or more SerDes to match the formats of the data files in your OSS instances. Contact us if the built-in SerDes do not meet your needs for processing special file formats.

Storage Format and SerDe

After creating a table using DLA, run the SHOW CREATE TABLE statement to query the full table creation statement.

After the table is created, view the details about the table you just created.

If you compare the table creation statement with the table details, you will notice that STORED AS TEXTFILE in the table creation statement has been replaced with ROW FORMAT SERDE … STORED AS INPUTFORMAT … OUTPUTFORMAT.

DLA uses INPUTFORMAT to read the data files stored in the OSS instance and uses SERDE to resolve the table records from the files.
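To illustrate, the following sketch uses a hypothetical table and OSS path; the SerDe and I/O format classes shown are the standard Hive defaults for plain-text files:

```sql
-- Hypothetical creation statement for a plain-text table
CREATE EXTERNAL TABLE city (
  city_name STRING,
  country STRING,
  population INT
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities/';

-- What SHOW CREATE TABLE typically returns: STORED AS TEXTFILE
-- has been expanded into the underlying SerDe and I/O formats.
CREATE EXTERNAL TABLE city (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://bucket-for-testing/oss/text/cities/';
```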

The following table lists the file formats that DLA currently supports. To create a table for a file in any of these formats, simply run STORED AS. DLA automatically selects a suitable SERDE, INPUTFORMAT, and OUTPUTFORMAT.

STORED AS TEXTFILE: The data file is stored as plain text. This is the default file format. Each row in the file corresponds to a record in the table. Common JSON data also belongs to this type.
STORED AS ORC: The data file is stored in the ORC format.
STORED AS PARQUET: The data file is stored in the Parquet format.
STORED AS RCFILE: The data file is stored in the RCFile format.
STORED AS AVRO: The data file is stored in the Avro format.

You can also run STORED AS with the specified SerDe and a special column separator.

The following section describes some examples.

Examples

CSV Files

CSV files are actually plain text files. You can run STORED AS TEXTFILE to create a table for a CSV file.

Separate columns with commas (,) by running ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.

Common CSV files

For example, the content of the data file oss://bucket-for-testing/oss/text/cities/city.csv is as follows:
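A hypothetical sample with three columns (city name, country, and population):

```
Beijing,CN,21540000
Tokyo,JP,13960000
```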

The table creation statement is as follows:
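A minimal sketch, assuming a hypothetical three-column file (city name, country, population); adjust the column list and LOCATION to your data:

```sql
CREATE EXTERNAL TABLE city (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities/';
```

Note that LOCATION points to the directory containing the file, not to the file itself.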

Process quoted fields with OpenCSVSerde

Note the following when using OpenCSVSerde:

  1. You can specify the field separator, field content quote character, and escape character for a row field, for example, WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\").
  2. Row separators inside a field are not supported.
  3. All fields are of the STRING type.
  4. You can use a function in the SQL statement to convert data of other types and then process the data.
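A sketch using OpenCSVSerde (hypothetical table name, columns, and path); note that every column must be declared as STRING:

```sql
CREATE EXTERNAL TABLE city_quoted (
  city_name STRING,
  country STRING,
  population STRING  -- OpenCSVSerde treats every field as STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\"",
  "escapeChar" = "\\"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/csv_quoted/';
```

To work with numeric values, cast in the query, for example CAST(population AS BIGINT).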

Custom separator

You must customize the column separator (FIELDS TERMINATED BY), escape character (ESCAPED BY), and row terminator (LINES TERMINATED BY).

You must specify the characters in the table creation statement.
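A sketch, assuming '|' as the column separator, backslash as the escape character, and '\n' as the row terminator (hypothetical table and path):

```sql
CREATE EXTERNAL TABLE city_custom (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/custom/';
```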

TSV Files

Similar to CSV files, TSV files are plain text files; columns are separated by tab characters.

For example, the content of the data file oss://bucket-for-testing/oss/text/cities/city.tsv is as follows:
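A hypothetical sample with three tab-separated columns (city name, country, and population):

```
Beijing	CN	21540000
Tokyo	JP	13960000
```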

The table creation statement is as follows:
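A minimal sketch, assuming the hypothetical three-column layout (city name, country, population):

```sql
CREATE EXTERNAL TABLE city_tsv (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/text/cities/';
```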

Files with Multi-Character Data Field Separators

If your data field separator contains multiple characters, you can use the following sample table creation statement. The data field separator of each row is "||", which can be replaced with the specific separator string.

Example:
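A common choice in the Hive ecosystem for multi-character separators is MultiDelimitSerDe; a sketch, assuming it is available in DLA (hypothetical table and path):

```sql
CREATE EXTERNAL TABLE city_multi_delim (
  city_name STRING,
  country STRING,
  population INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim" = "||")
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/multi_delim/';
```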

JSON Files

JSON files that DLA can process are stored as plain text. When creating a table, you must specify STORED AS TEXTFILE and define the SERDE.

In a JSON file, each row must contain a complete JSON object.

For example, the following file format is unacceptable:
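A hypothetical example, with one object split across several lines:

```json
{
  "id": 1,
  "name": "Beijing"
}
```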

The file must be changed to:
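One complete JSON object per line (hypothetical data):

```json
{"id": 1, "name": "Beijing"}
{"id": 2, "name": "Tokyo"}
```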

JSON Data without Nesting

The table creation statement is as follows:
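A sketch for flat JSON objects with two hypothetical fields, id and name, using the Hive-compatible JsonSerDe:

```sql
CREATE EXTERNAL TABLE json_flat (
  id INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/json_flat/';
```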

JSON File with Nesting

Use struct and array to define the nested JSON data.

For example, the following is your raw data: (NOTE: A complete piece of JSON data must be placed in one row no matter whether nesting is required, so that the data can be processed by DLA.)
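A hypothetical single-line record containing a nested object and an array:

```json
{"id": 1, "owner": {"name": "Alice", "phones": ["1234", "5678"]}}
```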

After being formatted using the online JSON formatting tool, the data content is as follows:
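Pretty-printed for readability only (hypothetical data; this multi-line form is not how the file should be stored):

```json
{
  "id": 1,
  "owner": {
    "name": "Alice",
    "phones": ["1234", "5678"]
  }
}
```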

The table creation statement is as follows: (NOTE: The path specified in LOCATION must be the directory of the JSON data file and all JSON files in the directory must be identified as the table data.)
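A sketch mapping the hypothetical nested structure with struct and array:

```sql
CREATE EXTERNAL TABLE json_nested (
  id INT,
  -- the nested object becomes a STRUCT, the list becomes an ARRAY
  owner STRUCT<name: STRING, phones: ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/json_nested/';
```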

Query the table:
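For example (hypothetical table name):

```sql
SELECT * FROM json_nested;
```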

For the nested structure defined by struct, you can use "." for hierarchical object reference. For the array structure defined by array, you can use "[array subscript]" for object reference. (NOTE: The array subscript starts from 1.)
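For example, given a hypothetical column owner of type STRUCT&lt;name: STRING, phones: ARRAY&lt;STRING&gt;&gt;, the owner's name and first phone number are referenced as:

```sql
-- "." navigates into the struct; "[1]" selects the first array element
SELECT owner.name, owner.phones[1] FROM json_nested;
```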

Use JSON Functions to Process Data

For example, store the nested JSON value of “value_string” as a string:
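A hypothetical record in which the value_string field holds a nested JSON document as an escaped string:

```json
{"id": 1, "value_string": "{\"inner_key\": \"inner_value\", \"inner_num\": 42}"}
```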

After being formatted using the online JSON formatting tool, the data content is as follows:
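Pretty-printed for readability only (hypothetical data):

```json
{
  "id": 1,
  "value_string": "{\"inner_key\": \"inner_value\", \"inner_num\": 42}"
}
```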

The table creation statement is as follows:
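A sketch that keeps the nested JSON as a plain string column (hypothetical table and fields):

```sql
CREATE EXTERNAL TABLE json_string_value (
  id INT,
  value_string STRING  -- raw JSON text, parsed later with JSON functions
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/json_string/';
```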

After the table is created, query it:
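For example (hypothetical table name):

```sql
SELECT * FROM json_string_value;
```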

The following SQL statement is an example to show how to use json_parse, json_extract_scalar, json_extract, and other common JSON functions:
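A sketch, assuming the hypothetical value_string column described earlier; json_parse converts the string to a JSON value, json_extract_scalar returns a scalar as text, and json_extract returns a JSON value:

```sql
SELECT id,
       json_extract_scalar(json_parse(value_string), '$.inner_key') AS inner_key,
       json_extract(json_parse(value_string), '$.inner_num')        AS inner_num
FROM json_string_value;
```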

ORC Files

ORC is an optimized column-store file format supported by the Apache open source project Hive. Compared with CSV files, ORC files require less storage space and deliver better query performance.

For ORC files, you only need to run STORED AS ORC when creating a table.

For example,
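A minimal sketch (hypothetical table, columns, and path):

```sql
CREATE EXTERNAL TABLE orc_city (
  city_name STRING,
  country STRING,
  population INT
)
STORED AS ORC
LOCATION 'oss://bucket-for-testing/oss/orc/cities/';
```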

Parquet Files

Parquet is a column-store file format supported by the Apache open source project Hadoop.

When using DLA to create a table, you must run STORED AS PARQUET.

For example,
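A minimal sketch (hypothetical table, columns, and path):

```sql
CREATE EXTERNAL TABLE parquet_city (
  city_name STRING,
  country STRING,
  population INT
)
STORED AS PARQUET
LOCATION 'oss://bucket-for-testing/oss/parquet/cities/';
```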

RCFile Files

RCFile is a column-store file format that effectively stores relational table structures in a distributed system and can be efficiently read and processed.

When using DLA to create a table, you must run STORED AS RCFILE.

For example,
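A minimal sketch (hypothetical table, columns, and path):

```sql
CREATE EXTERNAL TABLE rcfile_city (
  city_name STRING,
  country STRING,
  population INT
)
STORED AS RCFILE
LOCATION 'oss://bucket-for-testing/oss/rcfile/cities/';
```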

Avro Files

When using DLA to create a table for an Avro file, you must run STORED AS AVRO. The defined field must meet the schema of the Avro file.

If you are not sure whether the defined field meets the schema, obtain the schema using the tool provided by Avro and create a table accordingly.

Download avro-tools-<version>.jar to your local machine from the Apache Avro website and run the following command to obtain the schema of the Avro file:
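A hypothetical invocation (replace the version and file name with yours; getschema is a standard avro-tools subcommand):

```shell
java -jar avro-tools-1.8.2.jar getschema part-00000.avro
```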

The table creation statement is as follows. The name field corresponds to the column name in the table and the type must be converted to that supported by Hive, as listed in the following table.
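A sketch, assuming the schema contains three fields (hypothetical names and types):

```sql
CREATE EXTERNAL TABLE avro_city (
  city_name STRING,  -- Avro "string" maps to Hive string
  country STRING,
  population BIGINT  -- Avro "long" maps to Hive bigint
)
STORED AS AVRO
LOCATION 'oss://bucket-for-testing/oss/avro/cities/';
```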

In most cases, an Avro type can be directly converted to the corresponding Hive type. If a type does not have the corresponding Hive type, the type is converted to a similar one. The following table describes the relationship between Avro types and Hive types.

Avro type: Hive type
null: void
boolean: boolean
int: int
long: bigint
float: float
double: double
bytes: binary
string: string
record: struct
map: map
list: array
union: union
enum: string
fixed: binary

Files Matching Regular Expressions

Generally, files of this type are stored in the OSS instance as plain text. Each row is a record in the table and matches a specified regular expression.

For example, Apache WebServer log files are of this type.

The content of a log file is as follows:
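A hypothetical line in the Apache common log format:

```
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
```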

Each row of the file is expressed using the following regular expression and columns are separated with space:
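A regular expression commonly used for this format, with seven capture groups, one per column:

```
([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)
```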

The table creation statement for the preceding file format is as follows:
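A sketch using Hive's RegexSerDe (hypothetical table and path; backslashes and quotes are escaped inside the SQL string):

```sql
CREATE EXTERNAL TABLE webserver_log (
  host STRING,
  identity STRING,
  user_name STRING,
  time_stamp STRING,
  request STRING,
  status STRING,
  size STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/oss/webserver_logs/';
```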

Query result
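A query sketch (hypothetical table name):

```sql
SELECT * FROM webserver_log;
```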

Geographical JSON Data Files of Esri ArcGIS

DLA supports the SerDe to process geographical JSON data files of Esri ArcGIS. For more information about the geographical JSON data format, visit https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats.

Example:
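A sketch based on the Esri spatial-framework-for-hadoop classes (hypothetical table, columns, and path; the SerDe and input format come from that framework):

```sql
CREATE EXTERNAL TABLE earthquakes (
  earthquake_date STRING,
  latitude DOUBLE,
  longitude DOUBLE,
  magnitude DOUBLE
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://bucket-for-testing/oss/esri_json/';
```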

Conclusion

The preceding examples show that Alibaba Cloud Data Lake Analytics (DLA) supports files in most open source storage formats. The speed for DLA to query and analyze data varies depending on the format and size of the file stored in the OSS instance. We recommend that you set the file to the ORC format for storage and query.

DLA is continuously being optimized to accelerate queries, and it will support more data sources in the future to provide a better big data analysis experience for you.

Reference: https://www.alibabacloud.com/blog/processing-oss-data-files-in-different-formats-using-data-lake-analytics_594067

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.
