How to Set Up a Hadoop Cluster on Ubuntu 16.04

By Hitesh Jethva, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud’s incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

Hadoop is a free, open-source, scalable, and fault-tolerant framework written in Java that provides an efficient way to run jobs across the nodes of a cluster. Hadoop can be set up on a single machine or on a cluster of machines, and you can easily scale it to thousands of machines on the fly. Its scalable architecture distributes the workload across multiple machines. Hadoop uses a master-slave architecture: there is a single master node and any number of slave nodes. The master node manages and monitors the slave nodes and stores the metadata, while the slave nodes store the data and do the actual work.

Hadoop is made up of three main components:

  • HDFS: The Hadoop Distributed File System is a distributed, reliable, scalable, and highly fault-tolerant file system for data storage.
  • MapReduce: A programming model designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. It distributes work across hundreds or thousands of servers in a Hadoop cluster.
  • YARN: Short for “Yet Another Resource Negotiator”, this is the resource management layer of Hadoop. It is responsible for managing computing resources in the cluster and using them to schedule users’ applications.

In this tutorial, we will learn how to set up Apache Hadoop as a single-node cluster on an Alibaba Cloud Elastic Compute Service (ECS) instance running Ubuntu 16.04.

Prerequisites

  • A fresh Alibaba Cloud ECS instance with Ubuntu 16.04 server installed.
  • A static IP address 192.168.0.104 configured on the instance.
  • A root password set up on the instance.

Launch Alibaba Cloud ECS Instance

First, log in to your Alibaba Cloud ECS Console (https://ecs.console.aliyun.com). Create a new ECS instance, choosing Ubuntu 16.04 as the operating system with at least 2GB RAM. Connect to your ECS instance and log in as the root user.

Once you are logged into your Ubuntu 16.04 instance, run the following command to update your base system with the latest available packages.
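For example, as the root user:

    apt-get update -y
    apt-get upgrade -y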

Getting Started

Hadoop is written in Java, so you will need to install Java on your server. You can install it by running the following command:
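On Ubuntu 16.04 the default JDK package installs OpenJDK 8, which Hadoop supports; for example:

    apt-get install default-jdk -y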

Once Java is installed, verify the Java version using the following command:
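For example:

    java -version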

Output:

Next, create a new user account for Hadoop. You can do this by running the following command:
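For example, to create a user named hadoop (you will be prompted for a password and basic details):

    adduser hadoop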

Next, you will also need to set up SSH key-based authentication. First, log in as the hadoop user with the following command:
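For example:

    su - hadoop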

Next, generate an RSA key pair using the following command:
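For example, accepting the default file location and an empty passphrase when prompted:

    ssh-keygen -t rsa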

Output:

Next, add the public key to the authorized keys file and give it the proper permissions:
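For example:

    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 640 ~/.ssh/authorized_keys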

Next, check key-based authentication using the following command:
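For example; accept the host key fingerprint if prompted, and you should get a shell without being asked for a password:

    ssh localhost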

Install Hadoop

Before starting, you will need to download the latest version of Hadoop from the official website. You can download it with the following command:
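The version below is only an example; check the Apache Hadoop releases page for the current one. For instance, to download Hadoop 3.1.1 from the Apache archive:

    wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz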

Once the download is completed, extract the downloaded file with the following command:
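For example, adjusting the file name to the version you downloaded:

    tar -xvzf hadoop-3.1.1.tar.gz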

Next, move the extracted directory to /opt with the following command:
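For example, renaming it to /opt/hadoop, which the rest of this tutorial assumes:

    mv hadoop-3.1.1 /opt/hadoop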

Next, change the ownership of the hadoop directory using the following command:
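For example:

    chown -R hadoop:hadoop /opt/hadoop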

Next, you will need to set environment variables for Hadoop. You can do this by editing the hadoop user's .bashrc file.

First, log in as the hadoop user:
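For example:

    su - hadoop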

Next, open the .bashrc file:
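For example, using nano:

    nano ~/.bashrc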

Add the following lines at the end of the file:
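The following is a typical set of Hadoop environment variables, assuming Hadoop is installed in /opt/hadoop as described above:

    export HADOOP_HOME=/opt/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin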

Save and close the file when you are finished. Then, apply the environment variables in the current session using the following command:
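For example:

    source ~/.bashrc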

Next, you will also need to set the Java environment variable for Hadoop. You can do this by editing the hadoop-env.sh file.

First, find the default Java path using the following command:
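For example, you can resolve the symlink behind the javac binary (this requires the JDK installed earlier):

    readlink -f /usr/bin/javac | sed "s:/bin/javac::"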

Output:

Now, open the hadoop-env.sh file and set JAVA_HOME to the path shown in the output above.

Make the following changes:
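Assuming the configuration files live under /opt/hadoop/etc/hadoop and the Java path found above is /usr/lib/jvm/java-8-openjdk-amd64 (adjust both to your system), open the file, for example with nano /opt/hadoop/etc/hadoop/hadoop-env.sh, and set:

    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64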

Save and close the file when you are finished.

Configure Hadoop

Next, you will need to edit several configuration files to set up the Hadoop infrastructure. First, log in as the hadoop user and create a directory for Hadoop file system storage:
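For example, creating namenode and datanode directories under the hadoop user's home directory (these paths are only an example and must match hdfs-site.xml below):

    su - hadoop
    mkdir -p ~/hadoopdata/hdfs/namenode
    mkdir -p ~/hadoopdata/hdfs/datanode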

First, you will need to edit the core-site.xml file. This file contains the Hadoop port number, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

Make the following changes:
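Open the file, for example with nano /opt/hadoop/etc/hadoop/core-site.xml (an assumed path), and add a property like the following inside the <configuration> element. The address below uses the instance IP 192.168.0.104 from the prerequisites; adjust it to your own IP, or use localhost:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.0.104:9000</value>
      </property>
    </configuration>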

Save the file, then open the hdfs-site.xml file. This file contains the replication factor and the namenode and datanode paths on the local file system.

Make the following changes:
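For example, with a replication factor of 1 (appropriate for a single-node cluster) and the storage directories created above:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
      </property>
    </configuration>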

Save the file, then open the mapred-site.xml file.

Make the following changes:
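For example, to tell MapReduce to run on YARN:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>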

Save the file, then open the yarn-site.xml file:

Make the following changes:
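For example, to enable the MapReduce shuffle service on the NodeManager:

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>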

Save and close the file when you are finished.

Format Namenode and Start Hadoop Cluster

Hadoop is now installed and configured. It's time to initialize the HDFS file system. You can do this by formatting the Namenode:
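For example, as the hadoop user:

    hdfs namenode -format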

You should see the following output:

Next, change to the /opt/hadoop/sbin directory and start the Hadoop cluster using the following commands:
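For example:

    cd /opt/hadoop/sbin
    ./start-dfs.sh
    ./start-yarn.sh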

Output:

Output:

Next, check the status of the services using the following command:
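For example, the jps tool (part of the JDK) lists the running Java processes; you should see the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager daemons:

    jps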

Output:

Access Hadoop Services

Hadoop is now installed and configured; it's time to access the different Hadoop services through a web browser.

By default, the Hadoop NameNode web interface listens on port 9870. You can access it by visiting the URL http://192.168.0.104:9870 in your web browser.


You can get information about the Hadoop cluster from the YARN ResourceManager by visiting the URL http://192.168.0.104:8088 in your web browser.


You can get details about the DataNode by visiting the URL http://192.168.0.104:9864 in your web browser.


You can also get details about the YARN NodeManager by visiting the URL http://192.168.0.104:8042/node in your web browser.


Test Hadoop

To test the Hadoop file system cluster, create a directory in HDFS and copy a file from the local file system to HDFS storage.

First, create a directory in the HDFS file system using the following command:
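For example, to create a directory named hadooptest (the same name used in the URL below):

    hdfs dfs -mkdir /hadooptest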

Next, copy a file named .bashrc from the local file system to HDFS storage using the following command:
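For example:

    hdfs dfs -put ~/.bashrc /hadooptest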

Now, verify the Hadoop distributed file system by visiting the URL http://192.168.0.104:9870/explorer.html#/hadooptest in your web browser, where you can browse the contents of the hadooptest directory.


Now, copy the hadooptest directory from the Hadoop distributed file system to the local file system using the following command:
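For example:

    hdfs dfs -get /hadooptest ./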

Output:

By default, Hadoop services do not start at system boot. You can enable them to start at boot time by editing the /etc/rc.local file:

Make the following changes:
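A minimal sketch, assuming Hadoop is installed in /opt/hadoop and runs as the hadoop user; the start commands must appear before the final exit 0 line of /etc/rc.local:

    #!/bin/sh -e
    su - hadoop -c "/opt/hadoop/sbin/start-dfs.sh"
    su - hadoop -c "/opt/hadoop/sbin/start-yarn.sh"
    exit 0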

Save and close the file when you are finished.

The next time you restart your system, the Hadoop services will start automatically.

Related Alibaba Cloud Products

Resource Orchestration Service (ROS) provides developers and system managers with a simple method to create and manage their Alibaba Cloud resources. Through ROS you can use text files in JSON format to define any required Alibaba Cloud resources, dependencies between resources, and configuration details.

ROS offers a template for resource aggregation and blueprint architecture that can be used as code for development, testing, and version control. Templates can be used to deliver Alibaba Cloud resources and system architectures. Based on the template, API, and SDK, you can then conveniently manage your Alibaba Cloud resources as code. ROS itself is free of charge for Alibaba Cloud users.

Reference:

https://www.alibabacloud.com/blog/how-to-setup-hadoop-cluster-ubuntu-16-04_593808?spm=a2c41.11767180.0.0
