Embrace Serverless Analytics with Alibaba Cloud
“Where there is data smoke, there is business fire.” — Thomas Redman
If you are reading this article, you are probably already familiar with cloud computing. Cloud computing has had a huge impact by giving businesses access to virtually unlimited computing resources anywhere, anytime. More recently, industries and businesses have been moving towards “Serverless Computing”, so it is not surprising to see its footprint in Business Intelligence and Analytics (BIA) architectures.
Since serverless computing is a relatively new term, this article walks you through its concepts and benefits. I will then introduce Alibaba Cloud Data Lake Analytics (DLA) and discuss how it compares with traditional methods of analytics. We will finish with typical DLA scenarios, each illustrated with a use case.
Is This Article Series for Me?
This article is meant for everyone! This includes students and newcomers who want to familiarize themselves with the general concepts of serverless computing and big data analytics, as well as professional data engineers and analysts who want to leverage serverless analytics to reduce cost and time.
This article covers what serverless computing is, why it matters, how serverless architecture is taking root in Business Intelligence and Analytics (BIA), and how to leverage serverless analytics with Alibaba Cloud Data Lake Analytics. We will analyze and visualize data from different data sources, such as Alibaba Cloud Object Storage Service (OSS), Table Store, and ApsaraDB for RDS, using Alibaba Cloud DLA and Alibaba Cloud Quick BI. To follow along, you need to activate at least OSS, DLA, and Quick BI.
What Is Serverless Computing?
“Focus on your application development, not on how to deploy and maintain the infrastructure”
Serverless Computing doesn’t mean that there is no server; it is a software development approach that aims to eliminate the need to manage servers on our own. In general, Serverless Computing is a cloud computing model that lets you build more and manage less by avoiding long-running virtual resources.
In Serverless Computing, the code runs in “stateless compute clusters that are ephemeral”, i.e. the clusters are automatically provisioned and invoked for specific tasks, and the resources are released once the tasks complete. It all happens in a matter of seconds, which significantly optimizes resource usage and reduces cost.
For illustration, imagine a machine that starts up to complete a task and stops automatically once the task is done. Serverless computing is often equated with Function as a Service (FaaS) because the resources run just for the duration of a function.
Alibaba Cloud provides FaaS under the name Alibaba Cloud Function Compute, a fully managed, event-driven compute service. It allows you to focus on writing and uploading code without the need to manage infrastructure such as servers.
The above figure illustrates the key difference between serverless computing, and Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) cloud computing models.
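To make the FaaS idea concrete, here is a minimal sketch of an event-driven function in Python. The handler(event, context) entry point follows the convention used for Python event functions in Function Compute; the payload shape (a JSON object with a "name" field) is purely an assumption for illustration.

```python
import json


def handler(event, context):
    """Stateless entry point: everything it needs arrives in `event`."""
    # The payload may arrive as bytes or str; the JSON shape is hypothetical.
    if isinstance(event, (bytes, bytearray)):
        event = event.decode("utf-8")
    payload = json.loads(event) if event else {}
    name = payload.get("name", "world")
    # Nothing is kept between invocations, so the platform is free to
    # release the instance as soon as the function returns.
    return json.dumps({"message": f"Hello, {name}!"})


if __name__ == "__main__":
    # Local invocation; in the cloud the platform supplies event and context.
    print(handler(b'{"name": "DLA"}', None))
```

Because the function holds no state of its own, it can be started, run, and discarded per request, which is what makes pay-per-execution billing possible.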
Why Serverless Computing?
Serverless Computing provides virtual resources almost instantly for a function (a specific task), allowing you to run code more flexibly and reliably. This leads to the following benefits:
Unlike PaaS models, which run nonstop to serve requests, serverless computing is event-driven: resources are allocated on the fly and invoked to serve specific tasks, so you pay only for the computing time you actually use (pay per execution).
With serverless approaches, we do not need to worry about provisioning and managing instances. Serverless applications scale autonomously with demand. There is no need for manual scaling and tuning, but the operations team still has to monitor the applications.
Thanks to its layers of abstraction, deployment in a serverless environment is less complex. Deploy the code to the environment, and you are ready to go to market.
Serverless Computing in Analytics
In the pipeline, a Business Intelligence and Analytics architecture is divided into two important conceptual components that derive business value from the data:
- Extracting the data from multiple sources, transforming it, and storing it in a data warehouse.
- Transforming it again to make it suitable for data targets or BI systems.
From this illustration, we can see that an analytics architecture is a chain of storage and transformation processes. The only difference in a serverless architecture is that some of these processes run as stateless functions in the cloud.
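One way to picture this is to write each stage as a stateless function whose output depends only on its input, so that any stage could run as an independent, short-lived invocation. The stage names and the sample records in this sketch are hypothetical.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def extract(raw_lines: Iterable[str]) -> List[Record]:
    """Extract: turn raw 'host,status' lines into records."""
    records = []
    for line in raw_lines:
        host, status = line.strip().split(",")
        records.append({"host": host, "status": status})
    return records


def transform(records: List[Record]) -> List[Record]:
    """Transform: keep only successful requests."""
    return [r for r in records if r["status"] == "200"]


def load(records: List[Record]) -> Dict[str, int]:
    """Load: aggregate into a shape a BI tool could consume (hits per host)."""
    hits: Dict[str, int] = {}
    for r in records:
        hits[r["host"]] = hits.get(r["host"], 0) + 1
    return hits


if __name__ == "__main__":
    raw = ["a.example,200", "b.example,404", "a.example,200"]
    print(load(transform(extract(raw))))
```

None of the stages keeps anything between calls, so each one could be replaced by a managed, on-demand service without changing the others.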
Data Lake Analytics
“Data Lake refers to storage where we have data in its natural state.”
Alibaba Cloud provides Data Lake Analytics, which does not require any ETL tools. It allows you to use standard SQL and business intelligence (BI) tools to efficiently analyze your data stored in the cloud at extremely low cost.
Benefits of Data Lake Analytics:
- Serverless Architecture (Pay-Per Execution)
- Data Federation (Cloud Native Data Sources)
- Database Like User Experience (SQL Compatible)
- High-performance Analysis Engine (XIHE Engine)
Scenario — 1 (OSS + DLA + Quick BI)
We are going to process, transform, analyze, and visualize cold data stored in OSS using DLA and Quick BI. The example use case is turning website log data into consumable insights.
Understanding the Data
The dataset we are going to analyze is NASA’s Apache web server access log. Before analyzing it in detail, I would like to give an overview of what the Apache access.log is and why it is essential to analyze log data.
You can download the dataset from the source listed at the end of this article.
What Is Apache access.log?
A log is a record of events that occurred in a system. The Apache access.log is a file that captures information about the events that occurred on an Apache web server.
For instance, when someone visits your website, an entry is recorded in the access.log file. It contains information such as the client IP, resource path, status code, and, depending on the configured log format, the browser used.
Apache log format:
"%h %l %u %t \"%r\" %>s %b"
Let’s break down the log format:
%h — Client IP
%l — Identity of Client (will return hyphen if the information is not available)
%u — User-ID of Client (will return hyphen if the information is not available)
%t — Timestamp
“%r” — Request Line (it includes http method, resource path, and http protocol used)
%>s — Status Code
%b — Size of the object requested
A resulting log entry looks like this:
184.108.40.206 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
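To see how the format maps onto a real entry, the sketch below splits the sample line into its fields with a regular expression. The pattern and the field names are my own choices for illustration; they are not part of Apache or of the dataset.

```python
import re
from typing import Dict, Optional

# One named group per directive in "%h %l %u %t \"%r\" %>s %b".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of one access-log line, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ('184.108.40.206 - - [01/Jul/1995:00:00:01 -0400] '
              '"GET /history/apollo/ HTTP/1.0" 200 6245')
    for field, value in parse_line(sample).items():
        print(f"{field}: {value}")
```

Lines that do not follow the format return None, which makes malformed entries easy to count or skip before loading the data anywhere.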
Why Is It Essential to Analyze Log Data?
Log analysis is the process of making sense of system-generated logs. It helps businesses understand customer behavior, comply with security policies, and troubleshoot systems.
Ingesting the Data into OSS
We need to upload the data into Object Storage Service (OSS) so that Data Lake Analytics (DLA) and Quick BI can process, transform, and analyze the website log data.
We will use the OSS command line utility, ossutil (you can also use the console to create a bucket and upload the data into it). Please download and install the tool from the official website. For more details, see Download and Install ossutil.
After downloading, installing, and configuring ossutil, follow the steps below to ingest the data into OSS.
Create a bucket in OSS
ossutil mb oss://bucket [--acl=acl] [--storage-class sc] [-c file]
ossutil64 mb oss://apachelogs
You can see that the bucket was created successfully. Note the following about bucket names before we upload the file into it:
- The bucket name must comply with the naming conventions.
- The bucket name must be unique among all existing buckets in Alibaba Cloud OSS.
- The bucket name cannot be changed after being created.
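Since a bucket cannot be renamed, it can be worth checking a candidate name before creating it. The sketch below encodes the naming conventions as I understand them (3 to 63 characters; lowercase letters, digits, and hyphens only; starting and ending with a letter or digit). Treat it as an approximation and confirm against the official OSS documentation; uniqueness can only be verified by the service itself.

```python
import re

# 3-63 characters; lowercase letters, digits, hyphens;
# must start and end with a lowercase letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Local syntax check only; it cannot tell whether the name is already taken."""
    return _BUCKET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["apachelogs", "ApacheLogs", "-logs", "ab"]:
        print(candidate, is_valid_bucket_name(candidate))
```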
Upload the file into a bucket
ossutil cp src-url dest-url
ossutil64 cp accesslog.txt oss://apachelogs/
You can see that the file was ingested into the bucket successfully. Now we need to process it using Data Lake Analytics (DLA).
Processing the Data stored in OSS Using DLA
After uploading the data to OSS, we can use Data Lake Analytics to process it.
Unlike traditional systems such as Hadoop and Spark, Data Lake Analytics uses the serverless architecture discussed earlier in this article. Therefore, users need not worry about provisioning infrastructure, which is taken care of by the vendor. You pay only for the volume of data scanned to produce the result during query execution, i.e. pay per execution.
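As a rough illustration of what billing by scanned volume means, the sketch below converts the bytes a query scans into a cost. The price per terabyte is a made-up placeholder, not DLA's actual rate, and real billing may involve minimums or rounding that are ignored here; check the official pricing page for real numbers.

```python
def estimate_query_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of one query when billing is proportional to the data scanned."""
    tb_scanned = bytes_scanned / 1024 ** 4
    return tb_scanned * price_per_tb


if __name__ == "__main__":
    # Hypothetical numbers: a 200 MB log file and a placeholder price of 5.0 per TB.
    cost = estimate_query_cost(200 * 1024 ** 2, price_per_tb=5.0)
    print(f"estimated cost: {cost:.6f}")
```

The practical takeaway is that queries scanning less data cost less, so partitioning and selective queries matter more than cluster sizing.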
So, we only need to use DDL statements to create the table and describe the structure of the data stored in OSS for DLA. Currently, you need to apply for access to DLA before you can use it; for more details, have a look at Alibaba Cloud Data Lake Analytics.
Assuming you have access to DLA and have configured the DLA connection in SQL Workbench, we will use the Apache web log imported into OSS as an example to understand how to use Data Lake Analytics. For more details on how to create a connection in SQL Workbench, in a shell, or in other database tools, please have a look at the official documentation.
Creating a Schema
CREATE SCHEMA my_test_schema with DBPROPERTIES (
LOCATION = 'oss://xxx/xxx/',
catalog = 'oss'
);
Note: Currently, a DLA schema name must be globally unique within its region. If a schema name already exists, an error message will be returned.
CREATE SCHEMA apacheloganalyticsschema with DBPROPERTIES (
LOCATION = 'oss://apachelogs/',
catalog = 'oss'
);
Note: Your OSS LOCATION path must end with “/” to indicate that it is a path.
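A stray space or a missing trailing slash in the LOCATION value is easy to overlook. As a small sketch, the hypothetical helper below (not part of any Alibaba Cloud SDK) builds the path from a bucket name and an optional prefix and always ends it with “/”.

```python
def oss_location(bucket: str, prefix: str = "") -> str:
    """Build an OSS directory path that always ends with '/'."""
    bucket = bucket.strip()
    prefix = prefix.strip().strip("/")
    return f"oss://{bucket}/{prefix}/" if prefix else f"oss://{bucket}/"


if __name__ == "__main__":
    print(oss_location("apachelogs"))
    print(oss_location("apachelogs ", "/logs/1995"))
```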
Creating a table
CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
[STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
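CREATE EXTERNAL TABLE apacheloganalyticstable (
-- One STRING column per regex capture group; names other than host, status, and size are assumed.
host STRING,
identity STRING,
user_id STRING,
time_stamp STRING,
request STRING,
status STRING,
size STRING )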
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s" )
STORED AS TEXTFILE LOCATION 'oss://apachelogs/accesslog.txt';
You can see that the table was created successfully. We can now execute ad-hoc queries against the table to see the results.
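Before querying, it can help to confirm locally that the regular expression really splits a log line into seven values. The sketch below applies the same pattern in Python, with the SQL string escaping removed, and mimics how Hive's RegexSerDe treats a line: the whole line must match, and a line that does not match yields empty (NULL) columns. Whether DLA behaves identically in every edge case is an assumption.

```python
import re
from typing import List, Optional

# The "input.regex" value from the table definition, without the SQL string escaping.
SERDE_REGEX = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)


def serde_columns(line: str) -> List[Optional[str]]:
    """One value per capture group, or all None when the line does not match."""
    match = SERDE_REGEX.fullmatch(line)
    if match is None:
        return [None] * SERDE_REGEX.groups
    return list(match.groups())


if __name__ == "__main__":
    sample = ('184.108.40.206 - - [01/Jul/1995:00:00:01 -0400] '
              '"GET /history/apollo/ HTTP/1.0" 200 6245')
    print(serde_columns(sample))
    print(serde_columns("not a log line"))
```

Note that the timestamp keeps its square brackets and the request keeps its double quotes, because both sit inside the capture groups.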
Querying the database
Data Lake Analytics complies with standard SQL and supports a variety of functions. Therefore, you can run ad-hoc queries against the data stored in OSS just as you would in an ordinary database.
SELECT * FROM apacheloganalyticstable LIMIT 5;
SELECT COUNT(DISTINCT host) AS "Unique Host" FROM apacheloganalyticstable WHERE status = '200';
As you can see, DLA processes the data within a matter of milliseconds, which is blazingly fast, and the performance is continuously improving.
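If you want to sanity-check a query result on a small sample, the second query can be reproduced locally in a few lines. This sketch assumes the records are dictionaries with "host" and "status" keys; the sample rows are invented for illustration and are not taken from the dataset.

```python
from typing import Dict, Iterable


def count_unique_hosts(records: Iterable[Dict[str, str]], status: str = "200") -> int:
    """Local equivalent of counting distinct hosts for one status code."""
    return len({r["host"] for r in records if r["status"] == status})


if __name__ == "__main__":
    rows = [
        {"host": "a.example", "status": "200"},
        {"host": "a.example", "status": "200"},
        {"host": "b.example", "status": "404"},
        {"host": "c.example", "status": "200"},
    ]
    print(count_unique_hosts(rows))
```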
I hope that this article has given you a better understanding of serverless computing and why it is worth adopting. We also discussed how serverless architecture can be applied to Business Intelligence and Analytics (BIA) and introduced Alibaba Cloud Data Lake Analytics.
We started off with a scenario of analyzing cold data stored in OSS using DLA and Quick BI. If you followed the steps correctly, you should have ingested the data into OSS, created a table in DLA, and be able to query the data as you would in a database.
In the next article of this series, we will analyze and visualize the data using Quick BI to transform the logs into consumable insights. Stay tuned.
- Nasa Dataset Source — https://www.kaggle.com/vinzzhang/nasa-access-log-1995#NASA_access_log_95
- Icons — Icons8.com