Creating A Simple Hadoop Cluster With VirtualBox

I wanted to get familiar with the big data world and decided to try Hadoop. I started with Cloudera’s pre-built virtual machine, the Cloudera QuickStart VM, which ships with their full Hadoop suite pre-configured. It was an interesting and informative experience: the QuickStart VM is fully functional, and you can test many Hadoop services even though it runs as a single-node cluster.

I wondered what it would take to install a small 4-node cluster…

I did some research and found an excellent video on YouTube presenting a step-by-step explanation of how to set up a cluster with VMware and Cloudera. I adapted that tutorial to use VirtualBox instead, and this article describes the steps I used.

Overview

High-level diagram of the VirtualBox VM cluster running Hadoop nodes.


The overall approach is simple. We create a virtual machine and configure it with the parameters and settings required to act as a cluster node (especially the network settings). This reference virtual machine is then cloned as many times as there are nodes in the Hadoop cluster. Only two changes are then needed to make each clone operational: its hostname and its IP address.

In this article, I create a four-node cluster. The first node, which runs most of the cluster services, requires more memory (8 GB) than the other three nodes (2 GB each). In total we allocate 14 GB of memory, so make sure the host machine has enough; otherwise performance will suffer.
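The memory budget is simple arithmetic. Here is a minimal sketch for comparing it with what the host actually has, assuming a Linux host that exposes /proc/meminfo. The function names are illustrative and not part of any tool.

```shell
#!/bin/sh
# Memory plan from the article: one 8 GB node plus three 2 GB nodes.
required_mb() {
    big_nodes=1;   big_mb=8192
    small_nodes=3; small_mb=2048
    echo $(( big_nodes * big_mb + small_nodes * small_mb ))
}

# Total host memory in MB (assumption: Linux host with /proc/meminfo).
host_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

echo "Cluster needs $(required_mb) MB in total"
```

On a Linux host, compare the result with `host_mb`, keeping in mind that the host operating system and VirtualBox itself need memory on top of what the VMs use.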

Preparation

The only prerequisite for this tutorial is the latest version of VirtualBox (you can download it for free). We will be using the CentOS 6.5 Linux distribution (you can download the CentOS x86_64 DVD ISO image).

Base VM Image creation

VM creation

Create the reference virtual machine, with the following parameters:

  • Bridged network adapter
  • Enough disk space (more than 40 GB)
  • 2 GB of RAM
  • Set up the DVD drive to point to the CentOS ISO image
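The same parameters can also be set from the command line with VBoxManage. The sketch below only prints the commands so they can be reviewed before running them. The VM name, the bridge adapter name (eth0), the ISO path and the 50 GB disk size are placeholder assumptions, not values from this article. Recent VirtualBox releases document `createmedium disk` as the newer spelling of `createhd`.

```shell
#!/bin/sh
# Print (do not run) the VBoxManage commands for the reference VM.
# Arguments: VM name, host adapter to bridge to, path to the CentOS ISO.
# All three are placeholders; paths containing spaces are not handled.
base_vm_commands() {
    vm=$1; adapter=$2; iso=$3
    echo "VBoxManage createvm --name $vm --ostype RedHat_64 --register"
    # 2 GB of RAM and a bridged network adapter
    echo "VBoxManage modifyvm $vm --memory 2048 --nic1 bridged --bridgeadapter1 $adapter"
    # Disk size is given in MB: 51200 MB = 50 GB (more than 40 GB)
    echo "VBoxManage createhd --filename $vm.vdi --size 51200"
    echo "VBoxManage storagectl $vm --name SATA --add sata"
    echo "VBoxManage storageattach $vm --storagectl SATA --port 0 --device 0 --type hdd --medium $vm.vdi"
    # DVD drive pointing at the installation ISO
    echo "VBoxManage storagectl $vm --name IDE --add ide"
    echo "VBoxManage storageattach $vm --storagectl IDE --port 0 --device 0 --type dvddrive --medium $iso"
}

base_vm_commands base eth0 centos.iso
```

Once the printed commands look right for your host, they can be pasted into a terminal or piped to `sh`.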

When you install CentOS, you can specify the option ‘expert text’ for a faster OS installation with a minimal set of packages.

Network Configuration

Make the changes below in the following files. They allow all cluster nodes to interact.

/etc/resolv.conf

search example.com
nameserver 10.0.1.1

/etc/sysconfig/network

NETWORKING=yes
HOSTNAME=base.example.com
GATEWAY=10.0.1.1

/etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.0.1.200
NETMASK=255.255.255.0

/etc/selinux/config

SELINUX=disabled

/etc/yum/pluginconf.d/fastestmirror.conf

enabled=0

Disable the iptables firewall at boot, then restart the network services to apply the new settings:

$> chkconfig iptables off
$> /etc/init.d/network restart

Installation of VM Additions

You should now update all the packages and reboot the virtual machine:

$> yum update
$> reboot

In the VirtualBox menu, select Devices, and then Insert Guest…. The Guest Additions ISO image is now available in the DVD drive of the VM. Mount it with the following commands:

$> mkdir /media/VBGuest
$> mount -r /dev/cdrom /media/VBGuest

Follow instructions from this web page to install the additions.

Setup Cluster Hosts

Define all the hosts that will be part of the cluster in the /etc/hosts file. This simplifies access between nodes, especially if you do not have DNS set up. Add as many hosts as you have nodes in your cluster.

/etc/hosts

10.0.1.201 hadoop1.example.com hadoop1
10.0.1.202 hadoop2.example.com hadoop2
10.0.1.203 hadoop3.example.com hadoop3
10.0.1.204 hadoop4.example.com hadoop4
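For a different number of nodes, the entries can be generated rather than typed. This is a small sketch that follows the naming and addressing scheme above; the function name is illustrative.

```shell
#!/bin/sh
# Print /etc/hosts entries for nodes 1..N using the article's scheme.
# Single-digit node numbers only, since the address is built as 10.0.1.20N.
hosts_entries() {
    count=$1
    n=1
    while [ "$n" -le "$count" ]; do
        echo "10.0.1.20${n} hadoop${n}.example.com hadoop${n}"
        n=$(( n + 1 ))
    done
}

hosts_entries 4
```

As root, `hosts_entries 4 >> /etc/hosts` appends the lines to the file instead of printing them.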

Setup SSH

To further simplify access between hosts, install and set up SSH so that unknown host keys are accepted automatically.

$> yum -y install perl openssh-clients
$> ssh-keygen (type enter, enter, enter)
$> cd ~/.ssh
$> cp id_rsa.pub authorized_keys

Modify the SSH client configuration file. Uncomment the following line and change its value to no; this suppresses the host key confirmation prompt when connecting to a host over SSH.

/etc/ssh/ssh_config

StrictHostKeyChecking no
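Because the key pair is generated on the reference VM before cloning, every clone ends up with the same key and the same authorized_keys file, which is what lets each node log in to the others without a password. The key authorization step can be sketched as a small function. It appends rather than copies, so existing entries survive, and it sets the conventional permissions. The function name and the directory argument are illustrative assumptions.

```shell
#!/bin/sh
# Authorize this machine's own public key and tighten permissions.
# The directory defaults to ~/.ssh; it is a parameter only so that the
# function can be tried on a scratch directory first.
authorize_own_key() {
    dir=${1:-$HOME/.ssh}
    # Append instead of overwriting any keys that are already authorized.
    cat "$dir/id_rsa.pub" >> "$dir/authorized_keys" || return 1
    chmod 700 "$dir"
    chmod 600 "$dir/authorized_keys"
}
```

After running `ssh-keygen`, call `authorize_own_key` with no argument to apply it to ~/.ssh.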

Shutdown & Clone

The reference virtual machine is now ready to be cloned into the virtual machines that will become the cluster nodes.

Shut down the system with the following command:

$> init 0

In VirtualBox, clone the reference virtual machine using the ‘Linked Clone’ option, and name the nodes hadoop1, hadoop2, hadoop3 and hadoop4.

For the first node (hadoop1), change the memory setting to 8 GB. Most of the roles will be installed on this node, so it is important that it has sufficient memory available.
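The cloning can also be scripted with VBoxManage. The sketch below prints the commands rather than running them. It assumes the reference VM is registered under the name base, and the snapshot name base-snap is a placeholder; VBoxManage creates a linked clone from a snapshot of the source VM, so one is taken first.

```shell
#!/bin/sh
# Print (do not run) the VBoxManage commands that create linked clones.
# Arguments: reference VM name, snapshot name, number of nodes.
clone_commands() {
    base=$1; snap=$2; count=$3
    # A linked clone is based on a snapshot of the source VM.
    echo "VBoxManage snapshot $base take $snap"
    n=1
    while [ "$n" -le "$count" ]; do
        echo "VBoxManage clonevm $base --snapshot $snap --options link --name hadoop${n} --register"
        n=$(( n + 1 ))
    done
    # The first node hosts most roles, so it gets 8 GB (8192 MB).
    echo "VBoxManage modifyvm hadoop1 --memory 8192"
}

clone_commands base base-snap 4
```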

Clones Customization

For every node, perform the following operations.

Modify the hostname of the server by changing the following line in the file:

/etc/sysconfig/network

HOSTNAME=hadoop[n].example.com

Where [n] = 1..4 (up to the number of nodes)

Modify the fixed IP address of the server by changing the following line in the file:

/etc/sysconfig/network-scripts/ifcfg-eth0

IPADDR=10.0.1.20[n]

Where [n] = 1..4 (up to the number of nodes)
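Both edits can be applied with sed instead of an editor. This is a minimal sketch, assuming GNU sed (as shipped with CentOS) and the file contents shown earlier; the function name and the optional path arguments are illustrative, the latter so the edits can be rehearsed on copies of the files.

```shell
#!/bin/sh
# Apply the per-node hostname and IP address to a clone.
# Usage: customize_node N [network_file] [ifcfg_file]
customize_node() {
    n=$1
    network_file=${2:-/etc/sysconfig/network}
    ifcfg_file=${3:-/etc/sysconfig/network-scripts/ifcfg-eth0}
    # The 10.0.1.20N scheme only works for single-digit node numbers.
    case "$n" in
        [1-9]) ;;
        *) echo "node number must be 1..9" >&2; return 1 ;;
    esac
    sed -i "s/^HOSTNAME=.*/HOSTNAME=hadoop${n}.example.com/" "$network_file" &&
    sed -i "s/^IPADDR=.*/IPADDR=10.0.1.20${n}/" "$ifcfg_file"
}
```

For example, `customize_node 2` run as root on the second clone sets its hostname to hadoop2.example.com and its address to 10.0.1.202.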

Let’s restart the networking services and reboot the server so that the above changes take effect:

$> /etc/init.d/network restart
$> init 6

We have 4 running virtual machines with CentOS correctly configured.

Four Virtual Machines running on VirtualBox, ready to be setup in the Cloudera cluster.


Install Cloudera Manager on hadoop1

Download and run the Cloudera Manager Installer, which greatly simplifies the rest of the installation and setup process.

$> curl -O http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
$> chmod +x cloudera-manager-installer.bin
$> ./cloudera-manager-installer.bin

Use a web browser and connect to http://hadoop1.example.com:7180 (or http://10.0.1.201:7180 if you have not added the hostnames into a DNS or hosts file).

To continue the installation, select the Cloudera free license version. You then define which nodes will be part of the cluster: enter all the nodes you defined in the previous steps (e.g. hadoop1.example.com), separated by spaces, and click the “Search” button. You can then use the root password (or the SSH keys you generated) to automate connectivity to the different nodes. Install all packages and services on the first node.

Once this is done, you select additional service components; just keep everything selected by default. The installation then runs to completion.

Using the Hadoop Cluster

Now that we have an operational Hadoop cluster, there are two main interfaces that you will use to operate the cluster: Cloudera Manager and HUE.

Cloudera Manager

You can access Cloudera Manager on the first node at http://hadoop1.example.com:7180. The default login is ‘admin’ with the password ‘admin’.

Cloudera Manager Homepage, presenting cluster health dashboards


Hue

As with Cloudera Manager, you can access the Hue administration site at http://hadoop1.example.com:8888, where you can use the different services that you have installed on the cluster.

Hue interface, and here more specifically an Impala saved queries window.


Conclusions

I was able to create a small Hadoop cluster in probably less than an hour, largely thanks to the Cloudera Manager Installer, which reduces the installation to a few simple steps. It is now possible to run the various examples installed on the cluster, as well as to understand the interactions between the nodes.

I have written an article (Experimenting Hadoop with Real Datasets) that explains how I have used this cluster to process and visualize temperatures over a long period of time in the USA.

Comments and remarks are welcome!

Comments

  • Romain

    Nice step by step tutorial and great to see the Hue UI 🙂

  • Chris

    Thanks! The next article will be about using the Hue interface as well as using MapReduce with datasets!

    PS: Thanks for the typo correction

  • Nishant Bhardwaj

    Hi CJ,
    My host OS is Windows 7 home basic and am running VirtualBox on it. By updating the BIOS settings I was able to install 64-bit version of CentOS 7.0 . I have setup the Bridged network adapter and followed through the network settings as in the post above. But I am still unable to connect to internet from inside the guest OS. Do I need to customize any of the network setting to fit to my environment ? My host ipconfig details as follows:
    Wireless LAN adapter Wireless Network Connection:

    Connection-specific DNS Suffix . :
    Link-local IPv6 Address . . . . . : fe80::99ee:c6f6:ca4f:73de%13
    IPv4 Address. . . . . . . . . . . : 192.168.1.7
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    Default Gateway . . . . . . . . . : 192.168.1.1

    Ethernet adapter Local Area Connection:

    Media State . . . . . . . . . . . : Media disconnected
    Connection-specific DNS Suffix . :

    Ethernet adapter VMware Network Adapter VMnet1:

    Connection-specific DNS Suffix . :
    Link-local IPv6 Address . . . . . : fe80::10e7:b7c5:d2ea:67d0%19
    IPv4 Address. . . . . . . . . . . : 192.168.21.1
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    Default Gateway . . . . . . . . . :

    Ethernet adapter VMware Network Adapter VMnet8:

    Connection-specific DNS Suffix . :
    Link-local IPv6 Address . . . . . : fe80::78:7973:61c6:eba3%20
    IPv4 Address. . . . . . . . . . . : 192.168.132.1
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    Default Gateway . . . . . . . . . :

    Ethernet adapter VirtualBox Host-Only Network:

    Connection-specific DNS Suffix . :
    Link-local IPv6 Address . . . . . : fe80::492f:bdb4:1cce:3ed%27
    IPv4 Address. . . . . . . . . . . : 192.168.56.1
    Subnet Mask . . . . . . . . . . . : 255.255.255.0
    Default Gateway . . . . . . . . . :

  • Chris

    Hello,

    You might want to check the gateway IP address of your host. In the article, it is assumed to be 10.0.1.1.

    If I read correctly the output of your ipconfig, your gateway address is 192.168.1.1. You might want to adapt the IP address scheme accordingly.

    Regards,
    Christian

  • Nishant Bhardwaj

    Hi Chris, that solved my problem. The gateway was actually 192.168.1.1 so had to update the configuration files in CentOS guest. The internet is working fine now and I am able to ssh between the servers, hopefully this should work fine from here. For some reason the static ip addresses are not working, i am relying on the fact that dhcp allocation remains constant. I am trying different variations of ifcfg-enp0s3 file and that might yield some results.
    Thanks
    Nishaant

  • Natalie

    Hi Chris, I’m getting /etc/resolv.conf: permission denied. May I know how to resolve this?

  • Chris

    Hi, You should use root access, using the sudo command.
