Drilling into Big Data — A Gold Mine of Information (1)

Apr 11, 2019

By Priyankaa Arunachalam, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud’s incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

The volume of data generated every day is hard to pin down because it keeps growing at a rapid rate. Although data is everywhere, the intelligence that we can glean from it matters more. These large volumes of data are what we call “Big Data”. Organizations generate and gather huge volumes of data in the belief that it will help them advance their products and improve their services. For example, a shop may hold customer information, stock details, purchase history, and website visits.

Often, organizations store this data for regular business activities but fail to use it for further analytics or to strengthen business relationships. Data that is left unanalyzed and unused is what we call “Dark Data”.

“Big Data is indeed a buzzword, but it is one that is frankly under-hyped,” Ginni Rometty

The problem of untangling insights from data obtained from multiple sources has been around since the earliest software applications. The work is normally time consuming, and because data moves so fast, the results are often obsolete before they can inform a decision. The main aim of this blog series is to make effective use of big data and extend the use of business intelligence to decipher insights quickly and accurately from raw enterprise data on Alibaba Cloud.

What Is Big Data?

In the simplest terms, when the data you have is too large to be stored and analyzed by traditional databases and processing tools, then it is “Big Data”. If you have heard about the 3Vs of big data, then it is simple to understand the underlying definition of big data.

Volume — Massive amount of data from various sources.
Variety — Variety of non-standard formats in which the data is generated, both by humans and by machines.
Velocity — High speed at which the data is generated, stored, processed, and retrieved.

Big Data and Analytics

Every individual and organization has data in one form or another, which they have traditionally managed with spreadsheets, Word documents, and databases. With emerging technologies, the size and variety of data are increasing day by day, and it is no longer possible to analyze the data through traditional means.

The most important aspect of big data analytics is understanding your data. A good way to do this is to ask yourself these questions:

Where do I get this data?
What sort of data is it?
How do I bring this data into a big data environment?
Where do I store this received data?
How do I process and analyze the stored data?
What insights can I bring out of it?
How can these insights transform my Business?

Before exploring Alibaba Cloud’s E-MapReduce, this article aims to answer the questions listed above so that you can get started with big data.

Data Sources and Types

Data is typically generated when a user interacts with a physical device, software, or system. These interactions can be classified into three types:

Transactional Data — The most important data to consider. It is the data recorded daily by large retailers and B2B companies. It is collected for every event that occurs, for example, products purchased, quantities, stock changes, customer information, distributor details, and much more.
Social Data — Common or public data that can provide remarkable insights to companies. For example, a customer may tweet about a product, or like and comment on a purchase. This helps companies predict consumer behavior, purchasing patterns, and sentiment, and it is typically treated as a kind of CRM data.
Machinery Data — A major source of real-time data, produced by electronic devices such as sensors and machines, and also by web logs.

For most enterprises, data can be categorized into the following types.

Structured Data — When you are able to place data in a relational database with a schema enforced, then data is called “Structured”. Analyzing becomes easier due to pre-defined structures and relations between data. A common type of structured data is a table.
Unstructured Data — Although Big Data is a collection of many kinds of data, it is often estimated that around 90% of it is unstructured. Data that has its own internal structure but does not fit cleanly into a database is termed “Unstructured”. This includes text documents, audio, video, image files, emails, presentations, web content, and streaming data.
Semi-structured Data — This type of data cannot be accommodated in a relational database but can be tagged, which makes analysis easier. XML and JSON documents, and the records held in many NoSQL databases, are considered semi-structured.
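
As a rough illustration of the three categories, the sketch below holds one purchase in each form using only the Python standard library. The table, field names, and values are invented for the example and do not come from any real dataset.

```python
import json
import sqlite3

# Structured: a fixed schema enforced by a relational table (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer TEXT NOT NULL, product TEXT NOT NULL, qty INTEGER NOT NULL)"
)
conn.execute("INSERT INTO purchases VALUES (?, ?, ?)", ("alice", "kettle", 2))
structured_row = conn.execute("SELECT customer, product, qty FROM purchases").fetchone()

# Semi-structured: no table schema, but keys tag each value, so fields can vary per record.
semi_structured = json.loads(
    '{"customer": "alice", "product": "kettle", "qty": 2, "tags": ["gift"]}'
)

# Unstructured: free text; any fields have to be extracted by further processing.
unstructured = "Alice said she loved the kettle and bought two as gifts."

print(structured_row)
print(semi_structured["tags"])
print(len(unstructured.split()), "words of free text")
```

The structured row can be queried directly with SQL, the JSON record can be navigated by key, and the free text needs further processing (for example, text mining) before it yields comparable fields.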

Big Data Ecosystem

Hadoop

Whenever we talk about big data, it is not uncommon to hear the name Hadoop.

Hadoop is an open source framework that manages distributed storage and data processing for big data applications running in clusters. It is mainly used for batch processing. The core parts of Apache Hadoop are:

Hadoop Distributed File System (HDFS) — a storage part
MapReduce — a processing part

Because the data is large, Hadoop splits files into blocks and distributes them across the nodes in a cluster. Each block is replicated on several nodes (three by default), so the loss of a single node does not mean the loss of data.
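
The idea of splitting a file into blocks and spreading copies across nodes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how HDFS is implemented: the function names, node names, and the round-robin placement are hypothetical (real HDFS placement is rack-aware), and the block size is tiny so the output stays readable.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the payload into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes in round-robin order.

    Round-robin is a simplification chosen for this example.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Toy sizes: a 1000-byte "file" and 256-byte blocks (HDFS defaults to 128 MB blocks).
blocks = split_into_blocks(b"x" * 1000, block_size=256)
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, holders)
```

Each block ends up on three of the five nodes, so no single node holds the whole file and any one node can fail without losing a block.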

HDFS — The primary storage system used by Hadoop applications. HDFS is a distributed file system that stores files as data blocks and replicates them across other nodes.
MapReduce — MapReduce reads data from HDFS and first splits the input. The parts can then be processed simultaneously on different nodes, which is what we call distributed processing.
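
To make the map and reduce steps concrete, here is a minimal single-process sketch of the classic word count in plain Python. It only imitates the programming model: the function names and the two input splits are made up for the example, and a real Hadoop job would run the map and reduce tasks on different nodes.

```python
from collections import defaultdict


def map_phase(split: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group all values by key so each key is handled by one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Two splits stand in for blocks of a larger file.
splits = ["big data is big", "data is everywhere"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` depends only on its own split, the calls could run in parallel on separate machines, which is the property MapReduce exploits.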

How to Get Data into a Big Data Environment?

Sqoop — The word Sqoop is derived from “SQL + Hadoop”, which reflects its purpose: transferring data between Hadoop and relational database servers. When the data is structured and arrives in batches, you can use Sqoop as a loading tool to push it into Hadoop.
Apache Flume — A data-flow service for efficiently collecting, aggregating, and pushing large amounts of streaming data into Hadoop.
Kafka — Used to handle real-time streaming data and support real-time analysis. When data is unstructured and streaming, Kafka and Flume can be combined to build the processing pipeline.

Where to Store the Data?

HDFS — As mentioned earlier, HDFS is the primary storage system for Hadoop applications.
Apache HBase — A column-oriented data store built to run on top of HDFS. It is a non-relational Hadoop database.
Apache Cassandra — A highly scalable, distributed NoSQL database designed to handle large amounts of data with no single point of failure.

How to Process the Data?

Spark — Apache Spark has largely overtaken MapReduce because it can process data in memory, whereas MapReduce has to read from and write to disk between stages. As a result, Spark can be much faster (the Spark project has cited speedups of up to 100 times for some in-memory workloads) and lets data workers efficiently run streaming, machine learning, or SQL workloads. Other tools such as Storm, Samza, and Flink have also emerged, and users can choose among them based on their requirements.
Hive — Apache Hive is data warehouse software built on top of Apache Hadoop for querying and analysis. It makes work easier for SQL developers by providing a SQL-like interface to the stored data.
Impala — Like Hive, Impala offers SQL access to data in Hadoop; it is a distributed SQL query engine for Apache Hadoop.
Apache Pig — Tools like MapReduce and Spark require programming knowledge that many data analysts do not have. Apache Pig, originally developed at Yahoo, addresses this with a high-level scripting language called Pig Latin for analyzing massive datasets.
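
The difference between writing processing logic by hand and using a SQL-like interface can be shown with a small Python sketch. SQLite from the standard library stands in for the query engine here purely as an assumption for illustration; it is not Hive or Impala, and the table and values are invented.

```python
import sqlite3
from collections import defaultdict

sales = [("kettle", 2), ("toaster", 1), ("kettle", 3)]

# Programmatic style: the developer writes the grouping logic by hand.
totals_by_code = defaultdict(int)
for product, qty in sales:
    totals_by_code[product] += qty

# Declarative style: the same aggregation expressed as a SQL query.
# SQLite is only a stand-in; a GROUP BY in HiveQL reads much the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
totals_by_sql = dict(
    conn.execute("SELECT product, SUM(qty) FROM sales GROUP BY product")
)

# Both approaches produce the same totals.
print(dict(totals_by_code) == totals_by_sql)
```

The query states what result is wanted and leaves the execution plan to the engine, which is why SQL-style tools lower the barrier for analysts who do not program.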

Data Analytics and Business Intelligence Tools

Now that we have figured out how to collect, store, and process the data, we need a tool to visualize it and make business intelligence possible. Various business intelligence tools can add value to big data, such as Alibaba Cloud’s DataV and QuickBI.

Resource Management and Scheduling

Apart from this main cycle, we will also focus on some resource management tools, such as:

YARN — Yet Another Resource Negotiator
ZooKeeper

Scheduling tools such as Oozie, Azkaban, Cron, and Luigi also play a major role in scheduling Hadoop and Sqoop jobs when you have a large number of tasks to run.
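
One thing workflow schedulers have to do is run tasks in an order that respects their dependencies. The sketch below shows only that ordering step, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The task names are hypothetical, and real schedulers add triggers, retries, and monitoring on top of this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
# All task names are hypothetical.
dependencies = {
    "sqoop_import": set(),
    "clean_data": {"sqoop_import"},
    "hive_aggregate": {"clean_data"},
    "export_report": {"hive_aggregate"},
    "refresh_dashboard": {"hive_aggregate"},
}

# static_order() yields every task only after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())

for step, task in enumerate(run_order, start=1):
    print(step, task)
```

The import always comes first and the two downstream tasks come last; their relative order is not fixed, because neither depends on the other.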

Big Data in Today’s Business

At the end of the day, it’s up to organizations to use all this data to create valuable insights and transform their businesses. Every organization has its own data in huge volumes; the more efficiently that data is used, the more potential the company has to grow. Organizations can use the business insights produced by this entire process to increase their efficiency and make better decisions, and so outsmart their peers and competitors in the market.

In the next article, we will show you how to build a big data environment on Alibaba Cloud with Object Storage Service and E-MapReduce.

Reference: https://www.alibabacloud.com/blog/drilling-into-big-data-a-gold-mine-of-information_594661?spm=a2c41.12741461.0.0