Drilling into Big Data — A Gold Mine of Information (1)

What Is Big Data?

  • Volume — Massive amounts of data from various sources.
  • Variety — A wide range of non-standard formats in which data is generated by both humans and machines.
  • Velocity — The high speed at which data is generated, stored, processed, and retrieved.

Big Data and Analytics

  • Where do I get this data?
  • What sort of data is it?
  • How do I bring this data into a big data environment?
  • Where do I store this received data?
  • How do I process and analyze the stored data?
  • What insights can I bring out of it?
  • How can these insights transform my Business?

Data Sources and Types

  • Transactional Data — The most important data to consider. It is recorded daily by large retailers and B2B companies and is collected for every event that occurs, for example the number of products, products purchased, stock changes, customer information, distributor details, and much more.
  • Social Data — Public data that can give companies remarkable insights. For example, a customer may tweet about a product, or like and comment on a purchase. This helps companies predict consumer behavior, purchasing patterns, and sentiment, which typically makes it a kind of CRM data.
  • Machinery Data — A major source of real-time data, generated by electronic devices such as sensors and machines, and even by web logs.
  • Structured Data — Data that can be placed in a relational database with an enforced schema is called “Structured”. Analysis becomes easier because of the pre-defined structures and relations between the data. A common type of structured data is a table.
  • Unstructured Data — Although Big Data is a collection of many kinds of data, surveys are often cited as saying that about 90% of it is unstructured. Data that has its own internal structure but does not fit neatly into a database is termed “Unstructured”. This includes text documents, audio, video, image files, emails, presentations, web content, and streaming data.
  • Semi-structured Data — This type of data does not fit neatly into a relational table but carries tags or markers, which makes analysis easier. XML and JSON documents, and the records held in many NoSQL databases, are considered semi-structured.
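
To make the three data shapes concrete, here is a minimal Python sketch. The table name, field names, and sample values are invented for illustration, and sqlite3 is only a stand-in for a relational database with an enforced schema.

```python
import json
import sqlite3

# Structured: rows that fit a schema enforced by a relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases ("
    "customer TEXT NOT NULL, product TEXT NOT NULL, qty INTEGER NOT NULL)"
)
conn.execute("INSERT INTO purchases VALUES (?, ?, ?)", ("alice", "laptop", 1))
structured_rows = conn.execute(
    "SELECT customer, product, qty FROM purchases"
).fetchall()

# Semi-structured: no fixed schema, but every field is tagged by name.
semi_structured = json.loads(
    '{"customer": "bob", "likes": ["laptop", "phone"], "comment": "Great battery"}'
)
liked_products = semi_structured["likes"]

# Unstructured: free text with no tags; any structure has to be inferred.
unstructured = "Just bought the new laptop and the battery life is great!"
mentions_laptop = "laptop" in unstructured.lower()
```

The structured row can be queried with SQL, the semi-structured record can be addressed by tag, and the unstructured text needs some form of text analysis (here, a naive keyword check) before it yields anything.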

Big Data Ecosystem


  • Hadoop Distributed File System (HDFS) — the storage part
  • MapReduce — the processing part
  • HDFS — The primary storage system used by Hadoop applications. HDFS is a distributed file system that stores files as data blocks and replicates those blocks across other nodes.
  • MapReduce — MapReduce receives data from HDFS and first splits the input into parts. All parts can then be processed simultaneously, which is called distributed processing.
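
The two ideas above, block storage with replication and split-then-process, can be sketched in plain Python. This is a toy model under stated assumptions: the node names are made up, the round-robin replica placement is a simplification and not the placement policy HDFS actually uses, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, lines_per_block):
    """Cut the input into blocks, a stand-in for HDFS data blocks."""
    return [lines[i:i + lines_per_block] for i in range(0, len(lines), lines_per_block)]


def place_replicas(num_blocks, nodes, replication):
    """Toy placement: copy each block to `replication` distinct nodes, round-robin."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data is big", "data is everywhere", "big insights"]
blocks = split_into_blocks(lines, lines_per_block=1)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c"], replication=2)

# Each block is mapped independently, so on a cluster these calls could run in parallel.
mapped = [map_phase(block) for block in blocks]
word_counts = reduce_phase(shuffle(chain.from_iterable(mapped)))
```

Because every `map_phase` call only sees its own block, the work can be spread across the nodes that hold the blocks, which is the point of distributed processing.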

How to Get Data into a Big Data Environment?

  • Sqoop — The name Sqoop is derived from “SQL + Hadoop”, which reflects its purpose: transferring data between Hadoop and relational database servers. When the data is structured and arrives in batches, you can use Sqoop as a loading tool to push it into Hadoop.
  • Apache Flume — A data-flow service for efficiently collecting, aggregating, and moving large amounts of streaming data into Hadoop.
  • Kafka — Used on real-time streaming data to enable real-time analysis. When data is unstructured and streaming, Kafka and Flume together can form the processing pipeline.
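
The difference between batch loading and streaming ingestion can be illustrated without any of these tools. The sketch below is not the Sqoop, Flume, or Kafka API: `batch_import`, `ToyTopic`, the `orders` table, and the event strings are all hypothetical names chosen for the example, with sqlite3 standing in for the relational source.

```python
import sqlite3


def batch_import(conn: sqlite3.Connection) -> list[tuple]:
    """Batch style: copy every row of the (hypothetical) orders table in one pass."""
    return conn.execute("SELECT id, item FROM orders ORDER BY id").fetchall()


class ToyTopic:
    """Streaming style: an append-only log that consumers read from an offset."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def publish(self, event: str) -> None:
        self.log.append(event)

    def read_from(self, offset: int) -> list[str]:
        return self.log[offset:]


# Batch: the source table already exists and is loaded in one go.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)", [("laptop",), ("phone",)])
batch_rows = batch_import(conn)

# Streaming: events keep arriving, and the consumer picks up only what is new.
topic = ToyTopic()
topic.publish("click:home")
topic.publish("click:cart")
offset = 0
first_poll = topic.read_from(offset)
offset += len(first_poll)
topic.publish("click:checkout")
second_poll = topic.read_from(offset)
```

The batch path sees a complete snapshot, while the streaming consumer tracks an offset and processes events as they arrive.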

Where to Store the Data?

  • HDFS — As mentioned earlier, HDFS is the primary storage system for Hadoop applications.
  • Apache HBase is a column-oriented data store built to run on top of HDFS. It is a non-relational Hadoop database.
  • Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data with no single point of failure.
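
As a rough picture of what "column-oriented" and "non-relational" mean here, the following sketch models a table as row keys mapped to sparse cells addressed by column family and qualifier. `ToyColumnStore` and its methods are hypothetical and are not the HBase or Cassandra client API; the real systems add versioning, persistence, and distribution on top of this idea.

```python
from collections import defaultdict


class ToyColumnStore:
    """In-memory sketch: row key -> {(family, qualifier): value}, with sparse rows."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)] = value

    def get(self, row_key, family, qualifier, default=None):
        return self.rows.get(row_key, {}).get((family, qualifier), default)

    def scan_column(self, family, qualifier):
        """Return one column across all rows, in row-key order, skipping empty cells."""
        column = (family, qualifier)
        return {
            key: cells[column]
            for key, cells in sorted(self.rows.items())
            if column in cells
        }


store = ToyColumnStore(families=["profile", "activity"])
store.put("user2", "profile", "name", "Bob")
store.put("user1", "profile", "name", "Alice")
store.put("user1", "activity", "last_login", "2022-01-15")

names = store.scan_column("profile", "name")
missing = store.get("user2", "activity", "last_login")
```

Rows do not need the same columns: `user2` has no `last_login` cell at all, which a fixed relational schema would have to represent as a NULL.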

How to Process the Data?

  • Spark — Apache Spark has overtaken MapReduce in popularity because Spark can process data in memory, while MapReduce has to read from and write to disk between steps. As a result, Spark can be up to 100 times faster for some workloads and lets data workers efficiently run streaming, machine learning, or SQL workloads. Other tools such as Storm, Samza, and Flink have also emerged. Users can choose among them based on their requirements.
  • Hive — Apache Hive is a data warehouse software project built on top of Apache Hadoop for querying and analysis. It makes work easier for SQL developers because it provides a SQL-like interface to the stored data.
  • Impala — Like Hive, Impala is a distributed SQL query engine for Apache Hadoop.
  • Apache Pig — Processing tools such as MapReduce and Spark require programming knowledge that many data analysts do not have. Apache Pig, developed at Yahoo, addresses this with a language called Pig Latin for analyzing massive datasets.
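
The appeal of a SQL-like interface is easiest to see side by side with a hand-written data flow. In the sketch below, sqlite3 stands in for a SQL engine such as Hive or Impala, and the step-by-step version loosely mirrors the load, filter, group, and aggregate style of a Pig script. The `sales` data and column names are invented for illustration, and neither half is actual HiveQL or Pig Latin.

```python
import sqlite3
from collections import defaultdict

sales = [
    ("north", "laptop", 1200.0),
    ("south", "phone", 650.0),
    ("north", "phone", 700.0),
    ("south", "cable", 15.0),
]

# Declarative style: one SQL statement describes the result.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sales)
sql_totals = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM sales WHERE amount >= 100 GROUP BY region"
    )
)

# Data-flow style: the same result built as explicit steps.
filtered = [row for row in sales if row[2] >= 100]   # filter
grouped = defaultdict(list)                          # group by region
for region, _product, amount in filtered:
    grouped[region].append(amount)
flow_totals = {region: sum(amounts) for region, amounts in grouped.items()}  # aggregate
```

Both versions produce the same totals per region; which style is more comfortable depends on whether the analyst thinks in queries or in pipeline steps.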

Data Analytics and Business Intelligence Tools

Resource Management and Scheduling

  • YARN — Yet Another Resource Negotiator
  • ZooKeeper

Big Data in Today’s Business




Follow me to keep abreast of the latest technology news, industry insights, and developer trends. Alibaba Cloud website: https://www.alibabacloud.com
