Big Data Application Case Study — Technical Architecture of a Big Data Platform

Data architecture

Technology selection

Next, let's turn to our thinking on technology selection. There is no single best technical architecture, only the architecture most appropriate for our applications. Successful IT planning starts from the business and provides the most appropriate technical architecture for each specific business scenario.

Functional requirements

First, let's take a look at our functional requirements. Take our advertising business as an example: our goal is to handle 10 billion messages a day. The requirements for big data capabilities are as follows:

Non-functional requirements

· We want to outsource infrastructure installation and O&M to a cloud platform.

Why we chose Alibaba Cloud

We finally settled on Alibaba Cloud, and in particular its E-MapReduce product, after a comprehensive evaluation of domestic cloud service providers. A cluster is ready shortly after purchase, and Hive, Spark, HBase, and other open-source big data components are available immediately.
First, we had to select the data storage engine.

Technical architecture

E-MapReduce

Alibaba Cloud's E-MapReduce is the core product of our big data platform. It covers Hive, Spark, HBase, Storm, and other core open-source components in the big data field, as well as industry-leading query engines such as Phoenix and Presto. Interactive components such as Zeppelin and Hue are also available out of the box.
E-MapReduce has frequent new releases and its components are constantly updated, but a purchased cluster cannot be upgraded conveniently in place. To stay up to date, we chose a monthly subscription rather than an annual one: when the monthly resources expire, we purchase new resources to upgrade, and the old resources are automatically destroyed if not renewed. Alibaba Cloud's E-MapReduce supports increasing the number of nodes but does not allow reductions. With this rolling mode, we can also adjust the cluster size and various configurations at any time.
This rolling mode works well for the computing cluster, but what about data storage? The machines used for E-MapReduce all have high configurations and would be wasted if used only to store data. Data can be stored in OSS and loaded with Hive; however, to use HBase you still need to store data on E-MapReduce, and once the data is on E-MapReduce, the cluster cannot be destroyed at will. We therefore separated the data cluster from the computing cluster, so that the computing cluster can be destroyed and upgraded at any time while the data cluster reliably provides services over the long term. The two clusters have different configurations: the computing cluster uses SSDs for faster processing, while the data cluster (HBase) uses ultra cloud disks for larger capacity.
So when is the pay-as-you-go option used? According to our calculations, if the computing duration exceeds seven days, it is more cost-effective to purchase a monthly subscription cluster directly. Pay-as-you-go clusters are therefore reserved for temporary bursts of computing tasks.
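As a rough illustration of that break-even point, here is a minimal back-of-the-envelope sketch. The hourly and monthly rates are hypothetical placeholders, not Alibaba Cloud's actual prices; under these assumed rates the crossover happens at about seven days.

```python
# Back-of-the-envelope comparison of pay-as-you-go vs. monthly subscription.
# HOURLY_RATE and MONTHLY_RATE are hypothetical placeholders, not real
# Alibaba Cloud prices; with these values the break-even point is ~7 days.
HOURLY_RATE = 10.0      # pay-as-you-go cost per node-hour (assumed)
MONTHLY_RATE = 1680.0   # monthly subscription cost per node (assumed)

def cheaper_mode(compute_days: float) -> str:
    """Return the cheaper billing mode for a job of the given duration."""
    pay_as_you_go_cost = compute_days * 24 * HOURLY_RATE
    return "monthly subscription" if pay_as_you_go_cost > MONTHLY_RATE else "pay-as-you-go"

for days in (1, 3, 7, 10):
    print(f"{days:>2} days of computing -> {cheaper_mode(days)}")
```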

Ticket management

The ticket service is one of the most attractive reasons for us to choose Alibaba Cloud. Our O&M team often encounters complicated issues that require urgent solutions, and team members can conveniently open a ticket to ask Alibaba Cloud's engineers for help. Working through these issues together has also taught us a great deal.

Software overview

Based on the technology overview above, the software design of our technical architecture is as follows:

Sample scenarios

Batch calculation: LogTail + LogHub + LogShipper + OSS + Hive + SparkSQL
Batch calculation focuses on data collection. LogTail configures the collection rules, LogShipper automatically delivers the data to OSS, Hive loads the data directly to form a data warehouse, and SparkSQL queries the Hive data directly from the Zeppelin interface. The entire ETL process is very smooth and requires almost no coding; a sketch of the Hive/SparkSQL end of the pipeline follows below.
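To make the Hive/SparkSQL steps concrete, here is a minimal PySpark sketch of the kind of job we run from Zeppelin. The OSS bucket path and the log schema (ts, ad_id, event, partitioned by dt) are hypothetical placeholders; the LogTail and LogShipper steps are configured in the Log Service console and are not shown here.

```python
# Minimal PySpark sketch of the Hive/SparkSQL steps in the batch pipeline.
# Assumptions: the OSS location and the log schema (ts, ad_id, event, dt)
# are hypothetical placeholders; LogShipper has already delivered the raw
# logs to OSS.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ad-log-batch")
         .enableHiveSupport()          # use the Hive metastore on E-MapReduce
         .getOrCreate())

# Expose the shipped logs to Hive as an external table (no data is copied).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ad_logs (
        ts     BIGINT,
        ad_id  STRING,
        event  STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 'oss://my-log-bucket/ad-logs/'
""")
spark.sql("MSCK REPAIR TABLE ad_logs")   # pick up newly shipped partitions

# Query the warehouse directly with SparkSQL, e.g. daily impressions per ad.
daily = spark.sql("""
    SELECT dt, ad_id, COUNT(*) AS impressions
    FROM ad_logs
    WHERE event = 'impression'
    GROUP BY dt, ad_id
""")
daily.show(20)
```

Because the table is external, dropping it or destroying the computing cluster does not touch the raw logs in OSS, which fits the rolling-upgrade mode described above.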

Outlook

Spark 2.0 has been released, Hadoop 3.0 has an alpha release, and HBase 2.0 has snapshot builds. Many features in these components are highly anticipated. We will pay close attention to new E-MapReduce releases from Alibaba Cloud and hope to try out these new open-source components soon.
