Install a Multi-Node Hadoop Cluster on Alibaba Cloud
By Avi Anish, Alibaba Cloud Community Blog author.
In this tutorial, you will learn how to set up a multi-node Apache Hadoop cluster on Alibaba Cloud Elastic Compute Service (ECS) instances running Ubuntu 18.04. By multi-node Hadoop cluster, we mean a cluster in which more than one DataNode is running.
Prerequisites
For this tutorial, we will run a DataNode on each of two Alibaba Cloud ECS instances. You'll want the two ECS instances to be configured as follows:
- The first instance (configured as the master machine/system) is hadoop-master with an IP address of 172.31.129.196.
- The second instance (configured as the slave machine/system) is hadoop-slave with an IP address of 172.31.129.197.
Procedure
Install Java
Before we get into setting up the Hadoop multi-node cluster, both systems need to have Java 8 installed. To do this, run the commands below on both the master and slave machines.
root@hadoop-master:~# sudo add-apt-repository ppa:openjdk-r/ppa
root@hadoop-master:~# sudo apt-get update
root@hadoop-master:~# sudo apt-get install openjdk-8-jdk
root@hadoop-master:~# sudo update-java-alternatives --list
java-1.8.0-openjdk-amd64 1081 /usr/lib/jvm/java-1.8.0-openjdk-amd64
root@hadoop-master:~# java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
Then, edit the /etc/hosts file on both machines, adding the master and slave hostnames and their IP addresses.
root@hadoop-master:~# vi /etc/hosts
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.31.129.196 hadoop-master
172.31.129.197 hadoop-slave
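Optionally, you can confirm that the new hostnames resolve before moving on; for example, from the master (this check is not part of the original walkthrough):
root@hadoop-master:~# ping -c 3 hadoop-slave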
Install Hadoop
Next, you'll want to install Hadoop 2.7.3 on the multi-node cluster. First, run the command below on the master machine to download the Hadoop tar file.
root@hadoop-master:~# wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Next, extract the downloaded Hadoop tar file with the following command:
root@hadoop-master:~# tar -xzf hadoop-2.7.3.tar.gz
Next, run the ls command to check that the Hadoop package was extracted.
root@hadoop-master:~# ls
hadoop-2.7.3 hadoop-2.7.3.tar.gz
Given that two machines, master and slave, are used in this tutorial, we will use SSH key-based login from the master system to the slave system. To do this, generate a public and private key pair using the ssh-keygen command. Then, press ENTER when asked for a file name.
root@hadoop-master:~# ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:OawYX8MzsF8iXeF1FhIJcdqtf7bMwaoPsL9Cq0t/Pi4 root@hadoop-master
The key's randomart image is:
+---[RSA 2048]----+
| +o=o+. |
| . *.= |
| . + . . |
| * o . |
| . o S.. . |
| + = Oo .. |
| . o.o... .oo|
| . .E.o. +oo|
| oo.**=o + |
+----[SHA256]-----+
Both keys are stored in the .ssh directory. Now copy the public key (id_rsa.pub) into the authorized_keys file in the .ssh directory on the slave machine. Enter the slave machine's password when asked.
root@hadoop-master:~# cat .ssh/id_rsa.pub | ssh root@172.31.129.197 'cat >> .ssh/authorized_keys'
The authenticity of host '172.31.129.197 (172.31.129.197)' can't be established.
ECDSA key fingerprint is SHA256:XOA5/7EcNPfEu/uNU/os0EekpcFkvIhKowreKhLD2YA.
Are you sure you want to continue connecting (yes/no)? yes
root@172.31.129.197's password:
Now you can log in to the slave machine from the master machine without entering a password.
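A quick way to verify this (not shown in the original output) is to run a remote command; it should print hadoop-slave without prompting for a password:
root@hadoop-master:~# ssh root@172.31.129.197 hostname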
Copy the Hadoop directory to the slave machine.
root@hadoop-master:~# scp -r hadoop-2.7.3 root@172.31.129.197:/root/
From the master machine, log in to the slave machine using SSH.
root@hadoop-master:~# ssh root@172.31.129.197
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Welcome to Alibaba Cloud Elastic Compute Service !
Last login: Mon Jul 29 22:16:16 2019 from 172.31.129.196
Run the ls command to check that the Hadoop directory was copied to the slave machine.
root@hadoop-slave:~# ls
hadoop-2.7.3
Exit from the slave machine.
root@hadoop-slave:~# exit
logout
Connection to 172.31.129.197 closed.
root@hadoop-master:~#
Configure Hadoop
Now, add the Hadoop and Java environment variables to the .bashrc file on both the master and slave systems, as shown below.
root@hadoop-master:~# sudo vi .bashrc
export HADOOP_PREFIX="/root/hadoop-2.7.3"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
After this, run the command below to load the new environment variables into the current shell.
root@hadoop-master:~# source .bashrc
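As a quick sanity check (not part of the original output), you can confirm that the variables are now visible in the shell; this should print /root/hadoop-2.7.3:
root@hadoop-master:~# echo $HADOOP_PREFIX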
Now that all the paths are set, you can check the Java and Hadoop versions installed on both machines.
root@hadoop-master:~# java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
root@hadoop-master:~# hadoop version
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
This command was run using /root/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar
root@hadoop-master:~#
Next, create the NameNode and DataNode directories on both machines.
root@hadoop-master:~# mkdir hadoop-2.7.3/hdfs
root@hadoop-master:~# mkdir hadoop-2.7.3/hdfs/namenode
root@hadoop-master:~# mkdir hadoop-2.7.3/hdfs/datanode
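The commands above show the master; the same directories must also exist on the slave. One convenient way to create them there (a suggestion, not part of the original steps) is over SSH from the master:
root@hadoop-master:~# ssh root@172.31.129.197 'mkdir -p hadoop-2.7.3/hdfs/namenode hadoop-2.7.3/hdfs/datanode'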
Next, go to the etc/hadoop directory inside the Hadoop main directory, where all the Hadoop configuration files are located.
root@hadoop-master:~# cd hadoop-2.7.3/etc/hadoop/
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# ls
capacity-scheduler.xml kms-log4j.properties
configuration.xsl kms-site.xml
container-executor.cfg log4j.properties
core-site.xml mapred-env.cmd
hadoop-env.cmd mapred-env.sh
hadoop-env.sh mapred-queues.xml.template
hadoop-metrics2.properties mapred-site.xml
hadoop-metrics.properties mapred-site.xml.template
hadoop-policy.xml masters
hdfs-site.xml slaves
httpfs-env.sh ssl-client.xml.example
httpfs-log4j.properties ssl-server.xml.example
httpfs-signature.secret yarn-env.cmd
httpfs-site.xml yarn-env.sh
kms-acls.xml yarn-site.xml
kms-env.sh
Next, on the master system, edit the masters file:
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi masters
hadoop-master
Also, edit the slaves file:
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi slaves
hadoop-master
hadoop-slave
On the slave system, edit only the slaves file.
root@hadoop-slave:~/hadoop-2.7.3/etc/hadoop# vi slaves
hadoop-slave
Now, we will edit the Hadoop configuration files one by one on both the master and slave systems. First, edit the core-site.xml file as shown below on both machines. This file contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi core-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:9000</value>
</property>
</configuration>
Next, edit the hdfs-site.xml file as shown below on both machines. This file contains the configuration settings for the HDFS daemons: the NameNode, the secondary NameNode, and the DataNodes.
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/root/hadoop-2.7.3/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/root/hadoop-2.7.3/hdfs/datanode</value>
</property>
</configuration>
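If you want to double-check later that the replication factor of 2 has been picked up, one quick check (not part of the original walkthrough; it works once the configuration files are in place and the environment variables from .bashrc are loaded) is the following, which should print 2:
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# hdfs getconf -confKey dfs.replication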
Edit the mapred-site.xml file on both systems. This file contains the configuration settings for the MapReduce daemons.
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# cp mapred-site.xml.template mapred-site.xml
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi mapred-site.xml
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit yarn-site.xml. This file contains the configuration settings for YARN.
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
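Note that the configuration above does not explicitly tell the NodeManagers where the ResourceManager runs. If the NodeManager on the slave fails to register with the ResourceManager, adding a property like the following to yarn-site.xml on both machines is a common fix (this property is not part of the original tutorial):
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>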
Finally, edit the Hadoop environment configuration file (hadoop-env.sh) on both systems and set the JAVA_HOME path.
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# vi hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
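If you are unsure of the exact JDK path on your instance, one way to find it (assuming the default Ubuntu alternatives layout) is to resolve the java symlink; the printed path, minus the trailing /bin/java, is the directory to use:
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# readlink -f /usr/bin/java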
So far, you’ve installed and configured Apache Hadoop on both master and slave systems.
Start the Multi-Node Hadoop Cluster
Now, from the master machine, go to the Hadoop main directory and format the NameNode.
root@hadoop-master:~/hadoop-2.7.3/etc/hadoop# cd ../..
root@hadoop-master:~/hadoop-2.7.3# bin/hadoop namenode -format
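If the format succeeds, the NameNode metadata directory is populated. A quick way to confirm this (not shown in the original) is to list it; it should contain a VERSION file and an fsimage:
root@hadoop-master:~/hadoop-2.7.3# ls hdfs/namenode/current/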
You’ll want to start all the Hadoop daemons.
root@hadoop-master:~/hadoop-2.7.3# ./sbin/start-all.sh
Because this script is deprecated, you can instead use start-dfs.sh and start-yarn.sh; the equivalent invocation is shown after the output below.
Starting namenodes on [hadoop-master]
hadoop-master: starting namenode, logging to /root/hadoop-2.7.3/logs/hadoop-root-namenode-hadoop-master.out
hadoop-master: starting datanode, logging to /root/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop-master.out
hadoop-slave: starting datanode, logging to /root/hadoop-2.7.3/logs/hadoop-root-datanode-hadoop-slave.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /root/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-hadoop-master.out
starting yarn daemons
starting resourcemanager, logging to /root/hadoop-2.7.3/logs/yarn-root-resourcemanager-hadoop-master.out
hadoop-slave: starting nodemanager, logging to /root/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop-slave.out
hadoop-master: starting nodemanager, logging to /root/hadoop-2.7.3/logs/yarn-root-nodemanager-hadoop-master.out
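For reference, the non-deprecated equivalent mentioned above is to run the two scripts separately from the same directory:
root@hadoop-master:~/hadoop-2.7.3# ./sbin/start-dfs.sh
root@hadoop-master:~/hadoop-2.7.3# ./sbin/start-yarn.sh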
Run the jps command on the master machine to see whether all the Hadoop daemons are running, including one DataNode.
root@hadoop-master:~/hadoop-2.7.3# jps
4144 NameNode
4609 ResourceManager
4725 NodeManager
4456 SecondaryNameNode
4283 DataNode
5054 Jps
Next, because the other DataNode runs on the slave machine, log in to the slave machine from the master machine using SSH.
root@hadoop-master:~/hadoop-2.7.3# ssh root@172.31.129.197
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Welcome to Alibaba Cloud Elastic Compute Service !
Last login: Mon Jul 29 23:12:51 2019 from 172.31.129.196
Run the jps command to check whether the DataNode is up and running.
root@hadoop-slave:~# jps
23185 DataNode
23441 Jps
23303 NodeManager
root@hadoop-slave:~#
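As an optional cross-check (not part of the original walkthrough), you can exit back to the master and ask HDFS itself how many DataNodes have registered; the report should list two live DataNodes. The NameNode web UI on port 50070 and the ResourceManager web UI on port 8088 show the same information.
root@hadoop-slave:~# exit
root@hadoop-master:~/hadoop-2.7.3# bin/hdfs dfsadmin -report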
Now, a DataNode is running on both the master and slave machines. In other words, you've successfully installed a multi-node Apache Hadoop cluster on Alibaba Cloud ECS instances. To add more nodes to the cluster, repeat the same steps you followed for the slave machine; a rough sketch follows.
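As a rough sketch (the hostname hadoop-slave2 and its IP address are hypothetical, not part of this setup): install Java 8 on the new instance, add its hostname and IP to /etc/hosts on all machines, copy the master's SSH public key and the hadoop-2.7.3 directory (including the configuration and the .bashrc entries) to it, add the new hostname to etc/hadoop/slaves on the master, and then start only the worker daemons on the new node:
root@hadoop-slave2:~/hadoop-2.7.3# sbin/hadoop-daemon.sh start datanode
root@hadoop-slave2:~/hadoop-2.7.3# sbin/yarn-daemon.sh start nodemanager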