Embrace Serverless Analytics with Alibaba Cloud

Is This Article Series for Me?

Prerequisites

What Is Serverless Computing?

Why Serverless Computing?

Cost Effective

Zero Administration

Low Overhead

Serverless Computing in Analytics

  1. Extracting data from multiple sources, transforming it, and storing it in a data warehouse.
  2. Transforming it again to make it suitable for data targets or BI systems.

Data Lake Analytics

  1. Serverless Architecture (Pay-Per Execution)
  2. Data Federation (Cloud Native Data Sources)
  3. Database Like User Experience (SQL Compatible)
  4. High-performance Analysis Engine (XIHE Engine)

Scenario — 1 (OSS + DLA + Quick BI)

Understanding the Data

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
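Each log line follows the Apache Common Log Format: host, identity, user ID, timestamp, request, status code, and response size. As a quick local sanity check (a sketch only, not part of the DLA setup), the same regular expression used later in the RegexSerDe table definition can be applied in Python to the sample line above (Java's `\\[` becomes `\[` in a raw Python string):

```python
import re

# Same pattern as the later "input.regex" SerDe property,
# rewritten as a raw Python string.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)

line = ('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
        '"GET /history/apollo/ HTTP/1.0" 200 6245')

# Each capture group maps to one column of the table defined later.
host, identity, user_id, time_stamp, request, status, size = \
    LOG_PATTERN.fullmatch(line).groups()

print(host, status, size)  # 199.72.81.55 200 6245
```

If a line fails to match, `fullmatch` returns `None`, which is how RegexSerDe ends up emitting NULL columns for malformed rows.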

Ingesting the Data into OSS

ossutil mb oss://bucket [--acl=acl] [--storage-class sc] [-c file]
ossutil64 mb oss://apachelogs
  1. The bucket name must comply with the naming conventions.
  2. The bucket name must be unique among all existing buckets in Alibaba Cloud OSS.
  3. The bucket name cannot be changed after being created.
ossutil cp src-url dest-url
ossutil64 cp accesslog.txt oss://apachelogs/

Processing the Data stored in OSS Using DLA

CREATE SCHEMA my_test_schema with DBPROPERTIES (
LOCATION = 'oss://xxx/xxx/',
catalog='oss');
CREATE SCHEMA apacheloganalyticsschema with DBPROPERTIES (
LOCATION = 'oss://apachelogs/',
catalog='oss' );
CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
[STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
LOCATION oss_path
CREATE EXTERNAL TABLE apacheloganalyticstable (
host STRING,
identity STRING,
user_id STRING,
time_stamp STRING,
request STRING,
status STRING,
size STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s" )
STORED AS TEXTFILE LOCATION 'oss://apachelogs/accesslog.txt';
select * from apacheloganalyticstable limit 5;
select count(distinct host) as "Unique Host" from apacheloganalyticstable where status = '200';
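The distinct-host query above can be validated locally before running it against DLA. The sketch below (hypothetical sample lines in the same log format, except the first, which is the real sample from earlier) mirrors the SQL logic in Python: parse each line, filter on status 200, and count distinct hosts:

```python
import re

LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)

# First line comes from the article's sample; the other two are
# illustrative lines in the same format.
lines = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    '199.72.81.55 - - [01/Jul/1995:00:00:09 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866',
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 404 0',
]

# Equivalent of: select count(distinct host) ... where status = '200'
unique_hosts = {
    m.group(1)                              # group 1 = host
    for m in (LOG_PATTERN.fullmatch(l) for l in lines)
    if m and m.group(6) == '200'            # group 6 = status
}
print(len(unique_hosts))  # 1
```

Only the two status-200 lines share one host, so the local count is 1; the DLA query applies the same logic across the whole bucket.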

Summary

Resources

  1. NASA Dataset Source — https://www.kaggle.com/vinzzhang/nasa-access-log-1995#NASA_access_log_95
  2. Icons — Icons8.com


Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website: https://www.alibabacloud.com
