Install a Multi-Node Hadoop Cluster on Alibaba Cloud

By Avi Anish, Alibaba Cloud Community Blog author.

In this tutorial, you will learn how to set up a multi-node Apache Hadoop cluster on Alibaba Cloud Elastic Compute Service (ECS) instances running Ubuntu 18.04. By a multi-node Hadoop cluster, we mean a cluster in which more than one DataNode is running.

Prerequisites

For this tutorial, we will be running a DataNode on each of two Alibaba Cloud ECS instances. You'll want the two ECS instances to be configured as follows:

  • The first instance (configured as the master machine/system) is hadoop-master with an IP address of 172.31.129.196.
  • The second instance (configured as the slave machine/system) is hadoop-slave with an IP address of 172.31.129.197.

Procedure

Install Java

Before we go into how to set up a Hadoop multi-node cluster, both systems should have Java 8 installed. To do this, first run the commands below on both the master and slave machines to install Java 8.
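A minimal sketch of the installation, assuming the OpenJDK 8 packages from the Ubuntu 18.04 repositories (any Java 8 JDK will do):

sudo apt-get update
sudo apt-get install -y openjdk-8-jdk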

Then, edit the /etc/hosts file on both machines to map the master and slave hostnames to their IP addresses.
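For example, using the hostnames and IP addresses from the Prerequisites section:

echo "172.31.129.196 hadoop-master" | sudo tee -a /etc/hosts
echo "172.31.129.197 hadoop-slave" | sudo tee -a /etc/hosts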

Install Hadoop

Next, you’ll want to install hadoop-2.7.3 on the multi node cluster. By using the below command, you will download the Hadoop tar file. First run the command below on the master machine.

Next, extract the downloaded Hadoop tar file with the following command:
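For example:

tar -xzf hadoop-2.7.3.tar.gz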

Next, run the ls command to check that the Hadoop package was extracted.
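You should see something like the following:

ls
# hadoop-2.7.3  hadoop-2.7.3.tar.gz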

Given that two machines, master and slave, are used in this tutorial, we will use an SSH key exchange to log in from the master system to the slave system. To do this, generate a public and a private key using the ssh-keygen command. Then, press ENTER when asked to give a file name.
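For example, accepting the defaults:

ssh-keygen -t rsa
# press ENTER at each prompt to accept the default file location and an empty passphrase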

Both keys will be stored in the .ssh directory. Now copy the public key (id_rsa.pub) to the slave machine, appending it to the authorized_keys file inside the slave's .ssh directory. Enter the slave machine password when asked.
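One way to do this, assuming the same username exists on both machines (ssh-copy-id appends the key to ~/.ssh/authorized_keys on the slave for you):

ssh-copy-id hadoop-slave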

Now you can log in to the slave machine from the master machine without entering a password.

Copy the Hadoop directory to the slave machine.
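For example, copying the extracted directory into the slave's home directory (again assuming the same username on both machines):

scp -r hadoop-2.7.3 hadoop-slave:~/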

From the master machine, log in to the slave machine using SSH.
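If the key exchange worked, this should not prompt for a password:

ssh hadoop-slave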

Exit from the slave machine to return to the master.

Configure Hadoop

Now, add the Hadoop and Java environment variables to the .bashrc file, as shown below, on both the master and slave systems.
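A sketch of the variables to append to ~/.bashrc; the exact paths are assumptions based on the layout used above (OpenJDK 8 on Ubuntu and Hadoop extracted into the home directory):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=$HOME/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin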

After this, you’ll want to run the below command to initialize the environment variables.

Now that all the paths are set, you can check the Java and Hadoop versions installed on both machines.
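For example:

java -version
hadoop version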

Next, create namenode and datanode directories on both the machines.
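For example, keeping them under the Hadoop directory (the locations are an assumption; they only need to match the hdfs-site.xml settings later in this tutorial):

mkdir -p ~/hadoop-2.7.3/hdfs/namenode
mkdir -p ~/hadoop-2.7.3/hdfs/datanode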

Next, inside the Hadoop main directory, go to the etc/hadoop directory, where all the Hadoop configuration files are kept.
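For example:

cd ~/hadoop-2.7.3/etc/hadoop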

Next, on the master system, edit the masters file:
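Here the masters file simply names the master host. You can write it with any text editor, or directly from the shell:

cat > masters <<'EOF'
hadoop-master
EOF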

Also, edit the slaves file:
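The exact contents are an assumption; listing both hosts makes the master run a DataNode as well, which matches the jps output later in this tutorial:

cat > slaves <<'EOF'
hadoop-master
hadoop-slave
EOF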

On the slave system, edit only the slaves file.
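On the slave, a minimal slaves file listing only the slave itself should do (again an assumption):

cat > slaves <<'EOF'
hadoop-slave
EOF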

Now, we will edit the Hadoop configuration files one by one on both master and slave systems. First, edit the core-site.xml file as mentioned below on both the machines. This file contains the configuration settings for Hadoop Core such as I/O settings that are common to HDFS and MapReduce.
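A minimal sketch of core-site.xml, assuming the NameNode listens on hadoop-master at port 9000 (written here with a shell here-document; you can just as well edit the file by hand):

cat > core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>
EOF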

Next, edit the hdfs-site.xml file as shown below on both machines. This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
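A minimal sketch of hdfs-site.xml. The replication factor of 2 matches the two DataNodes in this cluster, and the directory values are placeholders: replace them with the absolute paths of the namenode and datanode directories created earlier.

cat > hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- replace with the absolute path of the namenode directory created above -->
    <value>file:///home/your-user/hadoop-2.7.3/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <!-- replace with the absolute path of the datanode directory created above -->
    <value>file:///home/your-user/hadoop-2.7.3/hdfs/datanode</value>
  </property>
</configuration>
EOF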

Edit the mapred-site.xml file on both systems. This file contains the configuration settings for MapReduce daemons.
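Hadoop 2.7.3 ships only a mapred-site.xml.template, so create the file; pointing MapReduce at YARN is the usual choice here (an assumption):

cat > mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF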

Finally, edit the Hadoop environment configuration file (hadoop-env.sh) on both systems and set the JAVA_HOME path.
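In hadoop-env.sh, replace the default "export JAVA_HOME=${JAVA_HOME}" line with an explicit path; the path below assumes OpenJDK 8 on Ubuntu:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64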

So far, you’ve installed and configured Apache Hadoop on both master and slave systems.

Start the Multi Node Hadoop Cluster

Now, from the master machine, go to the Hadoop main directory and format the NameNode.
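For example:

cd ~/hadoop-2.7.3
bin/hdfs namenode -format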

You’ll want to start all the Hadoop daemons.

Because this script is deprecated, you can use start-dfs.sh and start-yarn.sh instead.
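For example:

sbin/start-dfs.sh
sbin/start-yarn.sh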

Run the jps command on the master machine to check that all the Hadoop daemons are running, including one DataNode.
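A sketch of the expected output (process IDs omitted; the DataNode appears here because the master is also listed in the slaves file):

jps
# NameNode
# SecondaryNameNode
# DataNode
# ResourceManager
# NodeManager
# Jps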

Next, because the other DataNode runs on the slave machine, log in to the slave machine from the master using SSH and run jps there as well.
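A sketch of what you should see on the slave:

ssh hadoop-slave
jps
# DataNode
# NodeManager
# Jps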

Now, a DataNode is running on both the master and slave machines. In other words, you've successfully installed a multi-node Apache Hadoop cluster on Alibaba Cloud ECS instances. To add more nodes to the cluster, repeat the same steps you followed for the slave machine here.


Follow me to keep abreast of the latest technology news, industry insights, and developer trends.
