Set Up a Single-Node Hadoop Cluster Using Docker

By Avi Anish, Alibaba Cloud Community Blog author.

In this article, we will look at how to use Docker to launch a single-node Hadoop cluster inside a Docker container on an Alibaba Cloud ECS instance.

Background Information

Before we get into the main part of this tutorial, let’s look at some of the major products discussed in this blog.

First, there’s Docker, a very popular containerization tool that lets you package an application together with the software and dependencies it needs to run. You may have heard of virtual machines; Docker containers are essentially a lightweight version of a virtual machine. Creating a Docker container to run an application is very easy, and you can launch containers on the fly.

Next, there’s Apache Hadoop, a core framework for storing and processing big data. The storage component of Hadoop is called the Hadoop Distributed File System (usually abbreviated HDFS), and the processing component is called MapReduce. Several daemons run inside a Hadoop cluster, including the NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager.

Main Tutorial

Now that you know a bit more about what Docker and Hadoop are, let’s look at how you can set up a single node Hadoop cluster using Docker.

First, for this tutorial, we will be using an Alibaba Cloud ECS instance running Ubuntu 18.04. Next, let’s assume that you already have Docker installed on this Ubuntu system.


As a preliminary step, you’ll want to confirm everything is up and running as it should be. First, to confirm that Docker is installed on the instance, run the command below to check the installed version.
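For example (the exact version string will vary with your installation):

    $ docker --version

The command simply prints the installed Docker version and build number.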

Next, to check that Docker is running correctly, launch a simple hello-world container.
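A typical invocation is:

    $ sudo docker run hello-world

If the image is not already present locally, Docker pulls it from Docker Hub first and then runs it, printing the welcome message described below.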

This message shows that your installation appears to be working correctly.

To generate this message, Docker will have completed the following operations in the background:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the “hello-world” image from the Docker Hub (amd64).
  3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
  4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

To try something more ambitious, you can also run an Ubuntu container with the command below:
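    $ sudo docker run -it ubuntu bash

This pulls the ubuntu image and drops you into an interactive Bash shell inside the container; type exit to leave it.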

Next, you can share images, automate workflows, and do more with a free Docker ID, which you can find at Docker’s official website. And, for more examples and ideas, check out Docker’s Get Started guide.


If you get the output described above, Docker is running properly on your instance, and we can now set up Hadoop inside a Docker container. To do so, run a pull command to fetch a Hadoop Docker image. A Docker image is a file made up of multiple layers, which you use to deploy containers.
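As an example, the widely used sequenceiq/hadoop-docker image on Docker Hub matches the Hadoop 2.7.0 setup used in the rest of this walkthrough (the specific image is an assumption here; any single-node Hadoop image will work similarly):

    $ sudo docker pull sequenceiq/hadoop-docker:2.7.0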

Next, run the command below to check the list of Docker images present on your system.
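For example:

    $ sudo docker images

The Hadoop image you just pulled should appear in the list, along with its tag, image ID, and size.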

After that’s out of the way, run the Hadoop Docker image inside a container.
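A minimal sketch, assuming the sequenceiq/hadoop-docker:2.7.0 image pulled above; its bundled /etc/bootstrap.sh script starts the Hadoop daemons and then opens an interactive shell:

    $ sudo docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash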

As the container starts, you can watch it bring up the Hadoop daemons one by one in the terminal output. Just to make sure all the daemons are up and running, run the jps command.
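From the shell inside the container:

    # jps

If everything came up cleanly, the listing should include the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager processes, plus Jps itself.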

If jps lists all of these daemons, you can be assured that the Hadoop cluster is running correctly. After that, run the Docker command shown below, from a second terminal on the host, to get the details of the running container.
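For example:

    $ sudo docker ps

This shows the running container’s ID, image name, command, uptime, and exposed ports.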

Run the command below to get the IP address on which the container is running.
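One way to do this is with docker inspect; note that <container-id> is a placeholder, so substitute the ID reported by docker ps:

    $ sudo docker inspect -f '{{ .NetworkSettings.IPAddress }}' <container-id>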

From the output, we know that the Docker container is running at 172.17.0.2. The Hadoop cluster’s web interface is served on port 50070, so open a browser on your Ubuntu machine (Mozilla Firefox, for example) and go to 172.17.0.2:50070. Your Hadoop overview interface will open. There you can also see that Hadoop is running at port 9000, which is the default port.


Now scroll down to the “Summary” section, which gives the details of the Hadoop cluster running inside the Docker container. “Live Nodes: 1” means one DataNode is up and running (a single node).


To access the Hadoop Distributed File System (HDFS), you can go to Utilities -> Browse the file system. You will find the user directory, which is present in HDFS by default.


Back in the terminal, you are inside the Docker container, so the prompt you see belongs to the container’s Bash shell, not the default Ubuntu terminal shell. Let’s run a wordcount MapReduce program on the Hadoop cluster running inside the container. This program takes text as input and produces key-value pairs as output, where each key is a word and each value is the number of occurrences of that word. The first thing you’ll want to do is go to the Hadoop home directory.
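In the sequenceiq image assumed in this walkthrough, Hadoop is installed under /usr/local/hadoop and the path is exported as $HADOOP_PREFIX, so:

    # cd $HADOOP_PREFIX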

Next, run the hadoop-mapreduce-examples-2.7.0.jar file, which comes with a pre-built wordcount program. The job prints its progress to the terminal as the map and reduce phases run.
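A sketch of the invocation, assuming the sequenceiq image’s directory layout and its preloaded HDFS input directory (which contains the Hadoop configuration files); the output directory must not already exist:

    # bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount input output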

After the MapReduce job has finished executing, run the command below to check the output.
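For example:

    # bin/hdfs dfs -cat output/*

Each line of the result is a word followed by its count, i.e. the key-value pairs produced by the wordcount job.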

Now you’ve successfully set up a single-node Hadoop cluster using Docker. You can check out other articles on Alibaba Cloud to learn more about Docker and Hadoop.
