Install a Multi-Node Hadoop Cluster on Alibaba Cloud

By Avi Anish, Alibaba Cloud Community Blog author.

In this tutorial, you will learn how to set up a multi-node Apache Hadoop cluster on Alibaba Cloud Elastic Compute Service (ECS) instances running Ubuntu 18.04. Here, a multi-node Hadoop cluster means a cluster that runs more than one DataNode.

Prerequisites

For this tutorial, we will run a DataNode on each of two Alibaba Cloud ECS instances. Configure the two ECS instances as follows:

  • The first instance (configured as the master machine/system) is hadoop-master with an IP address of 172.31.129.196.
  • The second instance (configured as the slave machine/system) is hadoop-slave with an IP address of 172.31.129.197.

Procedure

Install Java

Before setting up the multi-node Hadoop cluster, both systems need Java 8. Run the commands below on both the master and slave machines to install it.

root@hadoop-master:~# sudo add-apt-repository ppa:openjdk-r/ppa
root@hadoop-master:~# sudo apt-get update
root@hadoop-master:~# sudo apt-get install openjdk-8-jdk
root@hadoop-master:~# sudo update-java-alternatives --list
java-1.8.0-openjdk-amd64 1081 /usr/lib/jvm/java-1.8.0-openjdk-amd64
root@hadoop-master:~# java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)

Then, edit the /etc/hosts file on both machines so that it lists the hostnames and IP addresses of the master and slave machines.

root@hadoop-master:~# vi /etc/hosts
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.31.129.196 hadoop-master
172.31.129.197 hadoop-slave
127.0.0.1 localhost localhost
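
As a quick sanity check after editing the file, the sketch below tests whether given hostnames appear as entries in a hosts file. The function name hosts_has_entries is hypothetical and not part of any tool used in this tutorial; it only reads the file and assumes the usual "address name [aliases]" layout.

```shell
# Hypothetical helper (not part of Hadoop or Ubuntu): succeed only if every
# given hostname appears as a name or alias in the given hosts file.
hosts_has_entries() {
  hosts_file="$1"; shift
  for name in "$@"; do
    # Skip comment lines, stop at inline comments, and compare whole fields
    # after the address column.
    if ! awk -v n="$name" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) { if ($i ~ /^#/) break; if ($i == n) found = 1 } }
        END { exit !found }' "$hosts_file"; then
      echo "missing: $name" >&2
      return 1
    fi
  done
}

# Usage on each machine:
#   hosts_has_entries /etc/hosts hadoop-master hadoop-slave && echo "hosts file OK"
```

Running it on both machines helps catch a hostname that was added on one system but not the other.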

Install Hadoop

Next, install hadoop-2.7.3 on the cluster. First, run the command below on the master machine to download the Hadoop tar file.

root@hadoop-master:~# wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Next, extract the downloaded Hadoop tar file with the following command:

root@hadoop-master:~# tar -xzf hadoop-2.7.3.tar.gz

Next, run the ls command to check that the Hadoop package was extracted.

root@hadoop-master:~# ls
hadoop-2.7.3 hadoop-2.7.3.tar.gz

Because this tutorial uses two machines, master and slave, we will use an SSH key exchange to log in from the master system to the slave system. To do this, generate a public and private key pair with the ssh-keygen command. Press ENTER when asked for a file name.

root@hadoop-master:~# ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:OawYX8MzsF8iXeF1FhIJcdqtf7bMwaoPsL9Cq0t/Pi4 root@hadoop-master
The key's randomart image is:
+---[RSA 2048]----+
| +o=o+. |
| . *.= |
| . + . . |
| * o . |
| . o S.. . |
| + = Oo .. |
| . o.o... .oo|
| . .E.o. +oo|
| oo.**=o + |
+----[SHA256]-----+

Both keys are stored in the .ssh directory. Now append the public key (id_rsa.pub) to the authorized_keys file in the .ssh directory on the slave machine. Enter the slave machine's password when asked.

root@hadoop-master:~# cat .ssh/id_rsa.pub | ssh root@172.31.129.197 'cat >> .ssh/authorized_keys'
The authenticity of host '172.31.129.197 (172.31.129.197)' can't be established.
ECDSA key fingerprint is SHA256:XOA5/7EcNPfEu/uNU/os0EekpcFkvIhKowreKhLD2YA.
Are you sure you want to continue connecting (yes/no)? yes
root@172.31.129.197's password:

Now you can log in to the slave machine from the master machine without entering a password.
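
The append command above assumes that the .ssh directory already exists on the slave machine, and rerunning it adds the same key a second time. The sketch below is one possible way to make that step repeatable. It is meant to run on the slave machine, and the function name install_pubkey and its arguments are hypothetical. It also sets the restrictive permissions that sshd normally expects on these files.

```shell
# Hypothetical helper, intended to run on the slave machine: append a public
# key to an authorized_keys file only if it is not already present.
install_pubkey() {
  pubkey_file="$1"
  ssh_dir="${2:-$HOME/.ssh}"   # defaults to the current user's .ssh directory
  mkdir -p "$ssh_dir" || return 1
  chmod 700 "$ssh_dir" || return 1
  touch "$ssh_dir/authorized_keys" || return 1
  chmod 600 "$ssh_dir/authorized_keys" || return 1
  # -x matches whole lines and -F treats the key as a fixed string.
  if ! grep -qxF -- "$(cat "$pubkey_file")" "$ssh_dir/authorized_keys"; then
    cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
  fi
}

# Usage, after copying id_rsa.pub to the slave machine (path is an example):
#   install_pubkey /root/id_rsa.pub
```

The standard ssh-copy-id utility covers the same ground from the master side, if it is available on your system.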

Copy the Hadoop directory to the slave machine.

root@hadoop-master:~# scp -r hadoop-2.7.3 root@172.31.129.197:/root/

From the master machine, log in to the slave machine using ssh.

root@hadoop-master:~# ssh root@172.31.129.197
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Welcome to Alibaba Cloud Elastic Compute Service !
Last login: Mon Jul 29 22:16:16 2019 from 172.31.129.196

Run the ls command to check that the Hadoop directory was copied to the slave machine.

root@hadoop-slave:~# ls
hadoop-2.7.3

Exit from slave machine.

root@hadoop-slave:~# exit
logout
Connection to 172.31.129.197 closed.
root@hadoop-master:~#

Configure Hadoop

Now, add the Hadoop and Java environment variables to the .bashrc file on both the master and slave systems, as shown below.

root@hadoop-master:~# sudo vi .bashrc
export HADOOP_PREFIX="/root/hadoop-2.7.3"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
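
The JAVA_HOME path above is specific to the OpenJDK 8 package on Ubuntu. If your installation differs, a minimal sketch like the following can derive the value from the resolved location of the java binary. The function name java_home_from_binary is hypothetical, and the sketch assumes the binary lives in a bin directory directly under the Java home.

```shell
# Hypothetical helper: derive a JAVA_HOME value from the resolved path of the
# java binary by stripping the trailing /bin/java.
java_home_from_binary() {
  java_bin="$1"
  case "$java_bin" in
    */bin/java) printf '%s\n' "${java_bin%/bin/java}" ;;
    *) echo "unexpected java path: $java_bin" >&2; return 1 ;;
  esac
}

# Usage (readlink -f follows the /usr/bin/java alternatives symlinks):
#   java_home_from_binary "$(readlink -f "$(command -v java)")"
```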

After this, run the command below to load the environment variables into the current shell.

root@hadoop-master:~# source .bashrc

Now that all the paths are set, you can check the Java and Hadoop versions installed on both machines.

root@hadoop-master:~# java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
root@hadoop-master:~# hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /root/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
root@hadoop-master:~#

Next, create namenode and datanode directories on both the machines.

root@hadoop-master:~# mkdir hadoop-2.7.3/hdfs
root@hadoop-master:~# mkdir hadoop-2.7.3/hdfs/namenode
root@hadoop-master:~# mkdir hadoop-2.7.3/hdfs/datanode
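
The three mkdir commands can also be collapsed into a single call with the -p option, which creates missing parent directories and does not fail if the directories already exist. The wrapper name make_hdfs_dirs below is hypothetical and used only for illustration.

```shell
# Hypothetical helper: create the namenode and datanode directories under a
# Hadoop installation directory. -p creates parents and tolerates reruns.
make_hdfs_dirs() {
  hadoop_dir="$1"
  mkdir -p "$hadoop_dir/hdfs/namenode" "$hadoop_dir/hdfs/datanode"
}

# Usage from the home directory on each machine:
#   make_hdfs_dirs hadoop-2.7.3
```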

Next, go to the etc/hadoop directory inside the Hadoop main directory, where all the Hadoop configuration files are located.

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# cd
root@hadoop-master:~# cd hadoop-2.7.3/etc/hadoop/
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# ls
capacity-scheduler.xml kms-log4j.properties
configuration.xsl kms-site.xml
container-executor.cfg log4j.properties
core-site.xml mapred-env.cmd
hadoop-env.cmd mapred-env.sh
hadoop-env.sh mapred-queues.xml.template
hadoop-metrics2.properties mapred-site.xml
hadoop-metrics.properties mapred-site.xml.template
hadoop-policy.xml masters
hdfs-site.xml slaves
httpfs-env.sh ssl-client.xml.example
httpfs-log4j.properties ssl-server.xml.example
httpfs-signature.secret yarn-env.cmd
httpfs-site.xml yarn-env.sh
kms-acls.xml yarn-site.xml
kms-env.sh

Next, on the master system, edit the masters file:

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi masters
hadoop-master

Also, edit the slaves file:

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi slaves
hadoop-master
hadoop-slave

On the slave system, edit only the slaves file.

root@hadoop-slave:~/hadoop-2.7.3/etc/hadoop# vi slaves
hadoop-slave
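
Both files are plain lists with one hostname per line, so they can also be written without opening an editor. The sketch below shows one way to do that; the function name write_host_list is hypothetical, and it overwrites the target file.

```shell
# Hypothetical helper: overwrite a file with one hostname per line.
write_host_list() {
  target_file="$1"; shift
  printf '%s\n' "$@" > "$target_file"
}

# Usage on the master system, from the etc/hadoop directory:
#   write_host_list masters hadoop-master
#   write_host_list slaves hadoop-master hadoop-slave
# Usage on the slave system:
#   write_host_list slaves hadoop-slave
```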

Now, we will edit the Hadoop configuration files one by one on both the master and slave systems. First, edit the core-site.xml file as shown below on both machines. This file contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi core-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>

Next, edit hdfs-site.xml as shown below on both machines. This file contains the configuration settings for the HDFS daemons: the NameNode, the SecondaryNameNode, and the DataNode.

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/root/hadoop-2.7.3/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/root/hadoop-2.7.3/hdfs/datanode</value>
</property>
</configuration>

Edit the mapred-site.xml file on both systems. This file contains the configuration settings for MapReduce daemons.

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# cp mapred-site.xml.template mapred-site.xml
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi mapred-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Edit yarn-site.xml. This file contains the configuration settings for YARN.

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Finally, edit the Hadoop environment configuration file, hadoop-env.sh, on both systems and set the JAVA_HOME path.

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

So far, you’ve installed and configured Apache Hadoop on both master and slave systems.

Start the Multi Node Hadoop Cluster

Now, on the master machine, go to the Hadoop main directory and format the namenode.

root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# cd ../..
root@hadoop-master:~/hadoop-2.7.3# bin/hadoop namenode -format
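
Formatting initializes the directory configured as dfs.namenode.name.dir and discards any HDFS metadata already stored there, so it is normally done only once for a new cluster. A successful format writes a current/VERSION file into that directory. The sketch below checks for that file before deciding whether a format is still needed; the function name namenode_is_formatted is hypothetical.

```shell
# Hypothetical helper: succeed if the given namenode directory already
# contains the current/VERSION file written by a successful format.
namenode_is_formatted() {
  name_dir="$1"
  [ -f "$name_dir/current/VERSION" ]
}

# Usage on the master machine:
#   if namenode_is_formatted /root/hadoop-2.7.3/hdfs/namenode; then
#     echo "namenode already formatted; skipping format"
#   else
#     bin/hadoop namenode -format
#   fi
```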

Next, start all the Hadoop daemons.

root@hadoop-master:~/hadoop-2.7.3# ./sbin/start-all.sh

The start-all.sh script is deprecated; you can use start-dfs.sh and start-yarn.sh instead. The startup output looks like this:

Starting namenodes on [hadoop-master]
hadoop-master: starting namenode, logging to /root/hadoop-2.7.3/logs/hadoop-root-namenode-hadoop-master.out
hadoop-master: starting datanode, logging to /root/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop-master.out
hadoop-slave: starting datanode, logging to /root/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop-slave.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /root/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-hadoop-master.out
starting yarn daemons
starting resourcemanager, logging to /root/hadoop-2.7.3/logs/yarn-root-resourcemanager-hadoop-master.out
hadoop-slave: starting nodemanager, logging to /root/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop-slave.out
hadoop-master: starting nodemanager, logging to /root/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop-master.out

Run the jps command on the master machine to check that all the Hadoop daemons are running, including one datanode.

root@hadoop-master:~/hadoop-2.7.3# jps
4144 NameNode
4609 ResourceManager
4725 NodeManager
4456 SecondaryNameNode
4283 DataNode
5054 Jps

Next, because the other datanode runs on the slave machine, log in to the slave machine from the master machine using SSH.

root@hadoop-master:~/hadoop-2.7.3# ssh root@172.31.129.197
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Welcome to Alibaba Cloud Elastic Compute Service !
Last login: Mon Jul 29 23:12:51 2019 from 172.31.129.196

Run the jps command to check that the datanode is up and running.

root@hadoop-slave:~# jps
23185 DataNode
23441 Jps
23303 NodeManager
root@hadoop-slave:~#
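
The jps checks on the two machines can also be scripted. The sketch below reads jps output on standard input and prints the names of any expected daemons that are absent. The function name missing_daemons is hypothetical, and the expected daemon lists in the usage comments are taken from the jps output shown in this tutorial.

```shell
# Hypothetical helper: read `jps` output on stdin and print the expected
# daemon names (given as arguments) that do not appear in it.
missing_daemons() {
  jps_output=$(cat)
  missing=""
  for daemon in "$@"; do
    # jps prints "<pid> <name>"; compare the second field exactly so that
    # NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
      missing="$missing $daemon"
    fi
  done
  printf '%s\n' "${missing# }"
  # Exit status is 0 only when nothing is missing.
  [ -z "$missing" ]
}

# Usage on the master machine:
#   jps | missing_daemons NameNode SecondaryNameNode DataNode ResourceManager NodeManager
# Usage on the slave machine:
#   jps | missing_daemons DataNode NodeManager
```

If a daemon is reported as missing, the corresponding log file under the logs directory of the Hadoop installation is the first place to look.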

Now, a datanode is running on both the master and slave machines. In other words, you’ve successfully installed a multi-node Apache Hadoop cluster on Alibaba Cloud ECS instances. To add more nodes to the cluster, repeat the steps you performed for the slave machine.
