Drilling into Big Data — Getting started with OSS and EMR (2)

By Priyankaa Arunachalam, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud’s incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

In the first article of the series, we walked through the basics of big data. In this article, we will get the environment ready. In the early days, setting up a big data environment was a big deal in itself. Nowadays, with emerging cloud technologies, the number of steps has been reduced, making things simpler. This article covers various big data solutions from Alibaba Cloud and shows you how to get started with these services.

Data Storage

The most fundamental requirement of big data is storage. Alibaba Cloud's Object Storage Service (OSS) is a cloud-based storage service that helps in storing extremely large quantities of data of different types and from different sources. It is well suited for large volumes of multimedia files. Regardless of the data type or the access frequency, OSS can help. It even includes migration tools to move data from on-premises systems or third-party providers to OSS.

  • On the Alibaba Cloud home page, move to the "Products" tab and select Object Storage Service under Storage.

Click on Buy Now. The pricing is based on the amount of data you store: the more you store, the lower the per-unit cost. Alibaba Cloud offers free storage of up to 5 GB.


Agree to the terms and conditions to enable OSS, and you will see the Order Complete page.


Now you can create a bucket to use with E-MapReduce. Go to the OSS console and click on Create Bucket.


In the Create Bucket wizard, fill in the necessary details. We will stick to "demo1" as the bucket name and "Singapore" as the region throughout this article.
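If you prefer to script this step rather than use the console, the same bucket can be created with the oss2 Python SDK. Treat this as a minimal sketch: the AccessKey placeholders and the Singapore endpoint are assumptions you must replace with your own values (creating an AccessKey is covered later in this article).

    # Minimal sketch using Alibaba Cloud's oss2 SDK (pip install oss2).
    # The AccessKey values and endpoint below are placeholders/assumptions.
    import oss2

    auth = oss2.Auth('<your-AccessKeyId>', '<your-AccessKeySecret>')
    endpoint = 'https://oss-ap-southeast-1.aliyuncs.com'  # Singapore region

    bucket = oss2.Bucket(auth, endpoint, 'demo1')
    bucket.create_bucket(oss2.BUCKET_ACL_PRIVATE)  # same effect as the console wizard
    print('Created bucket:', bucket.bucket_name)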


If needed, you can change the configuration of the bucket. On the left panel, you will see the bucket you just created; click on it and move to the Basic Settings tab.


Change the settings wherever necessary. Click on Configure under Logging and enable logs.

  • OSS is now ready to use with the logs enabled.
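As a quick sanity check that the bucket is usable, a short sketch like the one below uploads a test object and lists the bucket's contents using the same oss2 SDK; the credentials and endpoint are again placeholders.

    # Verify the bucket works: upload a test object, read it back, list keys.
    import oss2

    auth = oss2.Auth('<your-AccessKeyId>', '<your-AccessKeySecret>')
    bucket = oss2.Bucket(auth, 'https://oss-ap-southeast-1.aliyuncs.com', 'demo1')

    bucket.put_object('test/hello.txt', b'hello from OSS')
    print(bucket.get_object('test/hello.txt').read())

    # List everything currently stored in the bucket.
    for obj in oss2.ObjectIterator(bucket):
        print(obj.key, obj.size)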

Data Processing

Storage is all set. Moving on to data processing, there are two main Alibaba Cloud products to look at:

  • MaxCompute — Alibaba’s platform for processing Big Data
  • E-MapReduce — A rich framework for managing and processing Big Data

In this article, we will focus on big data with Alibaba Cloud's E-MapReduce.

What Is E-MapReduce?

Alibaba Cloud Elastic MapReduce, also known as EMR or E-MapReduce, is a fully managed service that allows you to create Hadoop clusters for big data applications within minutes. It is built on ECS and uses open-source tools like Apache Hadoop and Spark (covered in the first article), which form the core of E-MapReduce, to quickly process and analyze huge amounts of data through a user-friendly web interface.

Why Choose E-MapReduce?

E-MapReduce takes care of most of the basic tasks required for cluster creation and provisioning, while at the same time providing an integrated framework for managing and using clusters. It utilizes the complete capabilities of Hadoop and Spark, so you need not provision Hadoop from scratch. Because it is based on Spark, you can even stream large volumes of data. It also integrates easily with other Alibaba Cloud products such as Elastic Compute Service (ECS) and OSS.

What Is a Hadoop Cluster?

We came across the term “Hadoop” in our first article. So, what is a cluster?

A cluster is a collection of nodes, where a node is a process running on a physical machine. A Hadoop cluster has two main advantages. First, the data is huge and you cannot expect it to be uniform; a Hadoop cluster helps here because it divides the data into blocks and each node processes its blocks in parallel. Second, big data grows every day, so the cluster setup needs to scale, that is, add or remove nodes whenever needed. A Hadoop cluster solves this too, as it is linearly scalable.
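To make the "blocks" idea concrete, here is a small back-of-the-envelope calculation, assuming Hadoop's usual defaults of 128 MB blocks and a replication factor of 3 (the replication factor comes up again in the cluster types below).

    # Rough HDFS layout arithmetic, assuming the common defaults:
    # 128 MB blocks (dfs.blocksize) and replication factor 3 (dfs.replication).
    import math

    file_size_gb = 50        # example dataset size
    block_size_mb = 128
    replication = 3

    blocks = math.ceil(file_size_gb * 1024 / block_size_mb)
    raw_storage_gb = file_size_gb * replication

    print(f'{file_size_gb} GB file -> {blocks} blocks, '
          f'{raw_storage_gb} GB of raw storage across the cluster')

So a 50 GB file becomes 400 blocks spread across the worker nodes, and it is stored three times over for fault tolerance.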

Hadoop follows a master-slave model, where the two main components are:

  • Master node- A cluster consists of a single master, which runs the NameNode, Secondary NameNode and JobTracker. The NameNode stores the metadata of HDFS, the Secondary NameNode keeps a backup of the NameNode data, and the JobTracker monitors the parallel processing of data using MapReduce.
  • Worker node- A cluster can have any number of worker nodes. Each worker node runs a DataNode, which stores the actual data, and a TaskTracker service, which is secondary to the JobTracker.

Types of Clusters

  • Single node cluster- Also known as a pseudo-distributed cluster, where the NameNode and DataNode run on the same machine.
  • Multinode cluster- Also known as a distributed cluster, where one node acts as the master and the other nodes as slaves. The default replication factor for this type of cluster is 3.
  • High Availability cluster- In the standard configuration, the NameNode is a single point of failure: the whole cluster becomes unavailable if the NameNode goes down, whether due to a planned or unplanned event. A High Availability cluster runs two NameNodes at the same time, an active NameNode and a standby/passive NameNode. If one NameNode goes down, the other automatically takes over, reducing cluster downtime.

In Alibaba Cloud, each node is an ECS instance, where one serves as the master instance and the others as worker/core instances. Most business scenarios use a multinode cluster, since there is a huge amount of data to process and analyze.

Let’s create a simple cluster in EMR.

  • Log in to your Alibaba Cloud account and click on "Console" in the top right corner. This leads to a dashboard showing various information such as resources used, billing, etc.
  • On the left, there are various icons for navigation. Among them, select "Products" and choose E-MapReduce under Analysis.

This leads to the EMR console.


You need a default EMR role to start with the service. If you haven't set this up already, you will see a warning.


In that case, click “Go to RAM” and set up a default EMR role by clicking on Confirm Authorization Policy.


Next, make sure you have an AccessKey. On the top right, hover over your user name and select AccessKey from the drop-down menu.


Ignore the security tips. Clicking on "Get started with Sub User's AccessKey" will take you to the Document Center, where you can find steps to get started.


Continue with Manage Access Key and proceed with "Create Access Key".


In a few seconds, you will see the access key created.
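Rather than pasting the new key pair into scripts, a common pattern is to keep it in environment variables and read it at runtime. The variable names below are just a convention assumed for this sketch, not something OSS or EMR requires.

    # Read the AccessKey pair from environment variables instead of hard-coding it.
    #   export ALIBABA_CLOUD_ACCESS_KEY_ID=...
    #   export ALIBABA_CLOUD_ACCESS_KEY_SECRET=...
    import os
    import oss2

    auth = oss2.Auth(os.environ['ALIBABA_CLOUD_ACCESS_KEY_ID'],
                     os.environ['ALIBABA_CLOUD_ACCESS_KEY_SECRET'])
    bucket = oss2.Bucket(auth, 'https://oss-ap-southeast-1.aliyuncs.com', 'demo1')
    print(bucket.get_bucket_info().name)  # quick check that the key pair works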


Now that all the prerequisites are set up, decide the zone where the cluster is to be located. For better network connectivity, keep all your Alibaba Cloud products in the same zone. As mentioned earlier, we will use "Singapore" throughout this article, so the OSS bucket and the EMR cluster are in the same location.


Now click on "Create Cluster". If Alibaba Cloud requests any other role authorization, continue setting it up, then proceed with the cluster creation steps.


Alibaba E-MapReduce provides four different cluster types as follows:

  • Hadoop clusters: Provide various big data tools, such as:
      • Hadoop, Hive, and Spark for distributed storage and processing
      • Spark Streaming, Flink, and Storm as stream processing systems
      • Oozie and Pig for processing and scheduling jobs
  • Druid clusters: Help with real-time interactive analysis and querying large amounts of data with low latency. In combination with EMR Hadoop, EMR Spark and OSS, they offer real-time solutions.
  • Data Science clusters: Better suited for data scientists, provisioned specially for big data and AI scenarios, and additionally provide TensorFlow models.
  • Kafka clusters: A distributed messaging system with high throughput and scalability, providing a complete service monitoring system.

Software Configuration

For now, we will create a Hadoop cluster. Select "Hadoop". You will see a set of required services with their versions mentioned. You can also select additional tools from the optional services.


High security mode: In this mode, you can set up authentication for the cluster; it is turned off by default. Once the software configurations are done, click on Next and move to the Hardware Configuration.

Hardware Configuration

  • In the Hardware settings tab, you can set up a few services required by the cluster, such as the Virtual Private Cloud (VPC), Virtual Switch (VSwitch) and security group.
  • Network type: When you select the zone, any VPC and VSwitch already created will be selected. Otherwise, create a new one.

Let's create a new VPC. Move to the VPC console and click on "Create VPC".


In the VPC and VSwitch wizard, specify the zone and the name of the VPC, then click OK.


Clicking OK will start creating the VPC and VSwitch; you will see "Creating" in place of the OK button. Once they are created, you will see a confirmation window.


Click on Complete. The VPC is now created; if you don't see it, click Refresh.
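The console is the easiest way to do this, but for completeness, a VPC can also be created with the aliyun Python SDK. Treat the sketch below purely as an illustration under assumptions: the module path, API version and setter names follow the SDK's code-generation conventions and may differ in your installed version.

    # Sketch only: creating a VPC with the aliyun Python SDK
    # (pip install aliyun-python-sdk-core aliyun-python-sdk-vpc).
    # Module path, API version and setters are assumptions based on the
    # SDK's conventions; check your SDK version before relying on this.
    import os
    from aliyunsdkcore.client import AcsClient
    from aliyunsdkvpc.request.v20160428.CreateVpcRequest import CreateVpcRequest

    client = AcsClient(os.environ['ALIBABA_CLOUD_ACCESS_KEY_ID'],
                       os.environ['ALIBABA_CLOUD_ACCESS_KEY_SECRET'],
                       'ap-southeast-1')  # Singapore region

    request = CreateVpcRequest()
    request.set_VpcName('demo1-vpc')
    request.set_CidrBlock('192.168.0.0/16')
    print(client.do_action_with_exception(request))  # JSON response with the new VPC ID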


Once the VPC is created, go back to the Hardware Configuration page and select the one you just created.

If you are creating a cluster for the first time, there will be no security group to select. Enter a name to create a new security group.


Since Hadoop is a master-slave model, select the configurations of the master and core instances.


Also select the number of core instances, which determines the number of data nodes. Here we set the number of core instances to 2, thus creating a multinode cluster.


Once this is done, you will see the estimated price at the bottom of the page. Based on this, you can still change the instance type and disk size. Finally, click Next.

Basic Configuration

  • On this tab, give the cluster a name and set the log path (pointing to the OSS bucket we set up earlier).

Also authorize the roles and set a password for the cluster, which we will use later to access it. Once everything is done, click on OK. Relax for a few seconds while your cluster is being created. Now move back to the EMR console, and there is your cluster.


Let's click on "Manage". You will see all the tools started by default. You can start, stop, restart and monitor the services at any time, add security, and add extra services if needed.
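Once the services are up, a quick way to exercise the cluster is to run a small job on it. The PySpark sketch below is only an illustration: it assumes you have already uploaded a text file to the demo1 bucket and that EMR's OSS integration resolves the oss:// path, so the file name is a placeholder.

    # wordcount_oss.py: run on the master node with "spark-submit wordcount_oss.py".
    # Assumes a text file already uploaded to the demo1 bucket (placeholder path)
    # and EMR's built-in OSS integration for the oss:// scheme.
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('oss-wordcount').getOrCreate()

    lines = spark.read.text('oss://demo1/test/hello.txt').rdd.map(lambda r: r[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    for word, count in counts.take(10):
        print(word, count)

    spark.stop()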


Best Practices for Building a Cluster

  • The data volume to be processed is the key factor in deciding the number of nodes and the memory capacity of each machine; a rough sizing sketch follows this list.
  • Run the jobs with the default configurations and observe the resources and time taken. Keep tuning the cluster based on this.
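As a starting point for the first point above, here is a rough sizing sketch; the replication factor is the HDFS default, while the growth headroom and per-node disk figures are assumptions you should replace with your own numbers.

    # Illustrative sizing estimate only; headroom and disk-per-node are assumptions.
    import math

    data_tb = 10                 # raw data volume to be processed
    growth_factor = 1.3          # headroom for intermediate/temporary data (assumption)
    replication = 3              # HDFS default replication
    disk_per_core_node_tb = 4    # usable disk per core node (assumption)

    required_tb = data_tb * replication * growth_factor
    core_nodes = math.ceil(required_tb / disk_per_core_node_tb)
    print(f'~{required_tb:.0f} TB of HDFS capacity -> at least {core_nodes} core nodes')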

The cluster is now ready for the big deal — get ready to play with Big Data!

In the next article, we will talk about data sources and various data formats to ingest the data into our big data environment.

"We are so obsessed with big data, we forget how to interpret it." – Danah Boyd

Reference: https://www.alibabacloud.com/blog/drilling-into-big-data-getting-started-with-oss-and-emr-2_594668?spm=a2c41.12760874.0.0
