Let me start by saying that Linux, Open Source in general, the JVM ecosystem and Hadoop rock!
Installing a new cluster using Cloudera 5.1.x
Note that IPs mentioned here are just some dummy sample IPs chosen for this post!
- Install CentOS 6.5 on all the nodes. We chose CentOS, but you could choose any other Linux flavour as well; just make sure it is supported by Cloudera.
- Loop in your IT/Networking team to assign static IPs to the nodes. The machine you intend to make the NameNode should have the following entries in its /etc/hosts file; this file is used by Cloudera and the other nodes to resolve IP addresses and host names. In our case we kept 172.16.2.37 as the NameNode -
172.16.2.37 n37.xyz.com n37
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.2.33 n33.xyz.com n33
172.16.2.36 n36.xyz.com n36
172.16.2.38 n38.xyz.com n38
172.16.2.39 n39.xyz.com n39
Note that for static IP setup you need to have at least the following entries in your /etc/sysconfig/network-scripts/ifcfg-eth0 -
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=static
IPADDR=172.16.2.33
PREFIX=25
GATEWAY=172.16.2.1
DEFROUTE=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME="System eth0"
NETMASK=255.255.255.128
DNS1=172.16.10.10
DNS2=172.16.10.11
Once such entries have been made, you will need to restart the network using the command service network restart.
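After the restart it is worth confirming that the interface really picked up the static address and that name resolution works. A minimal check from one of the data nodes, assuming the dummy addresses used in this post:
ip addr show eth0          # should show the static address, 172.16.2.33/25 in this example
ping -c 2 172.16.2.1       # the gateway should answer
ping -c 2 n37.xyz.com      # name resolution via /etc/hosts should work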
- The other 4 entries towards the end are for other data nodes (or slave machines). The data nodes should have the following entries in their /etc/hosts file. For example, our node 172.16.2.33 had the following -
172.16.2.37 n37.xyz.com n37
172.16.2.33 n33.xyz.com n33
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
Also, make sure that /etc/resolv.conf on all nodes (including the namenode) contains the following -
search xyz.com
nameserver 172.16.10.10
nameserver 172.16.10.11
- Setting up password-less logins - Next we set up ssh between the machines. Note that in our case we did everything by logging in as 'root'. Other than that, our user was 'cloudera' and, in our case, even the root password is 'cloudera'. One thing I want to point out upfront is that you should keep the same passwords on all machines, because the Cloudera setup might need the password to log into the different machines. Log into your name node as 'root'.
sudo yum install openssh-client (in Ubuntu’s case it will be ‘apt-get’ instead of ‘yum’)
sudo yum install openssh-server
ssh-keygen -t rsa -P "" -f ~/.ssh/id_dsa (generating the ssh key; note that the passphrase is an empty string)
ssh-copy-id -i $HOME/.ssh/id_dsa.pub username@slave-hostname (copy the public key over to the node/slave machines. So, in our case, one example would be root@172.16.2.33)
cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys (you need to do this if the machine needs to ssh into itself. We did this too.)
Note that the work is not done yet. We need to set up password-less login from the data nodes to the name node as well. At this point you will be able to log into the data nodes from the namenode with ssh root@datanodeName/IP, like root@n36.xyz.com. So, log into the data nodes one by one and follow the procedure above to set up password-less ssh login from each node to the name node. Once this is done, restart all machines using the command init 6. It's awesome when you control so many machines from one!
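Before moving on, it is worth confirming that key-based login really works without a password prompt. A minimal check run from the namenode, assuming the node names used in this post:
for host in n33 n36 n38 n39; do
  ssh -o BatchMode=yes root@$host.xyz.com hostname   # should print each remote hostname without asking for a password
done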
- Other configurations -
Disable SELinux: edit /etc/selinux/config (vi /etc/selinux/config) and set SELINUX=disabled.
Disable the firewall: chkconfig iptables off (a restart of the nodes will be needed after this).
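If you prefer to script these two changes rather than editing files by hand, something like this should work on CentOS 6 (a sketch; it assumes the stock config locations, and SELinux is only fully disabled after a reboot):
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config   # turn SELinux off permanently
setenforce 0                                                   # and put it in permissive mode for the current session
chkconfig iptables off                                         # keep the firewall from starting on boot
service iptables stop                                          # stop it right now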
We have the following machines: 172.16.2.37 as the namenode, and 172.16.2.33, 172.16.2.36, 172.16.2.38, 172.16.2.39 as datanodes.
- Java installation. Check what Java versions are supported by Cloudera. In our case, we chose to install the latest version of Java 1.7, which was 1.7.0_71.
Do the following on all nodes -
Download the Java RPM for 64-bit Java 1.7 on any one of the machines (172.16.2.37 in our case). Use the scp command like this to transfer the rpm to all the other machines -
scp /root/Downloads/jdk_xyz_1.7.0_71.rpm root@172.16.2.33:/root/Downloads/
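To avoid running scp once per node by hand, a small loop can push the RPM to every data node in one go (a sketch, assuming the dummy IPs from this post and that the file lives in /root/Downloads on the namenode):
for ip in 172.16.2.33 172.16.2.36 172.16.2.38 172.16.2.39; do
  scp /root/Downloads/jdk_xyz_1.7.0_71.rpm root@$ip:/root/Downloads/   # password-less ssh set up earlier makes this non-interactive
done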
On all nodes -
yum remove java (to remove earlier java versions)
rpm -ivh /root/Downloads/jdk_xyz_1.7.0_71.rpm
Running this rpm should install Java. Now we need to set the paths correctly. Note that Cloudera requires Java to be installed in the folder /usr/java/jdk1.7.0_nn (where nn is the minor version number, 71 in our case) -
Now, there is no point in setting the environment variable with a plain export JAVA_HOME=... on the command line. Why? You will see that this variable is reset in each new bash session. So, do it like this: make a java.sh file -
vi /etc/profile.d/java.sh
type the following in -
#!/bin/bash
JAVA_HOME=/usr/java/jdk1.7.0_71
PATH=$JAVA_HOME/bin:$PATH
export PATH JAVA_HOME
export CLASSPATH=.
save the file
make the file executable by -
chmod +x /etc/profile.d/java.sh
run the file by
source /etc/profile.d/java.sh
Now, check java -version and which java. Everything should be good.
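For a quick sanity check across the whole cluster, the same verification can be run over ssh from the namenode (a sketch, assuming the hostnames used in this post):
for host in n33 n36 n38 n39; do
  echo "== $host =="
  ssh root@$host.xyz.com 'source /etc/profile.d/java.sh; which java; java -version'   # expect /usr/java/jdk1.7.0_71/bin/java and version 1.7.0_71
done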
Refer to http://www.unixmen.com/install-oracle-java-jdk-1-8-centos-6-5/ and http://www.cloudera.com/content/cloudera/en/documentation/cdh5/latest/CDH5-Installation-Guide/cdh5ig_oracle_jdk_installation.html#topic_29_1
- Now we are ready to begin the next battle — actual Cloudera Hadoop installation!
If you use Linux on a regular basis then reading documentation should already be your second nature. In case it's not, make it your second nature.
No matter how complete any article is, you should always refer to the actual documentation. In this case, we are trying to set up CDH 5.1.x (Cloudera Distribution for Hadoop version 5.1.x) and the documentation is at http://www.cloudera.com/content/cloudera/en/documentation/cdh5/latest/CDH5-Installation-Guide/cdh5ig_cdh5_install.html . If it's not there, you would still be able to find it with a simple google search!
We would be going for the automated installation of CDH, which is also what Cloudera recommends if you read their documentation (it is documented as the 'Path A' installation): http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_install_path_A.html?scroll=cmig_topic_6_5
- You should now have Python 2.6 or 2.7 installed on your machines. Check if it's already there by typing which python and python -V. If it's not there, then:
su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
yum install python26
Download cloudera-manager-installer.bin from the cloudera site.
chmod u+x cloudera-manager-installer.bin (give permissions to the executable)
sudo ./cloudera-manager-installer.bin (run the executable)
When the installation completes, the full URL for the Cloudera Manager Admin Console is displayed, including the port number, which is 7180 by default. Press Return or Enter to choose OK to continue.
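Before opening the console at the URL below, you can confirm that the Cloudera Manager server and its embedded database actually came up (a sketch; the service names assume a standard Path A install on CentOS 6):
sudo service cloudera-scm-server-db status   # embedded PostgreSQL used by Cloudera Manager
sudo service cloudera-scm-server status      # the Cloudera Manager server itself
curl -sI http://localhost:7180 | head -1     # should return an HTTP status line once the web UI is listening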
http://localhost:7180
Username: admin Password: admin
- From this point onwards, you are on your own to set up the cluster. I will point out a few things here that I remember from our experience of the cluster setup:
The server_host entry in /etc/cloudera-scm-agent/config.ini should be the IP of your namenode on the data nodes, and 'localhost' on the namenode itself. In most cases, we found that the problem was either with our /etc/hosts file (which is why I listed the exact entries we had for our namenode and datanodes) or with JAVA_HOME.
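For reference, the relevant part of the agent configuration on one of our data nodes would look roughly like this (a sketch using the dummy IPs from this post; the port shown is the default):
# /etc/cloudera-scm-agent/config.ini (on a data node)
[General]
server_host=172.16.2.37   # IP or hostname of the machine running Cloudera Manager, our namenode
server_port=7182          # default port the agents use to reach the server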
We used the option of 'password' when Cloudera asked us how it would log into the nodes (this is why we mentioned making all the root passwords the same, 'cloudera' in our case). We also used the option of 'Embedded DBs', which means that we did not install any PostgreSQL or MySQL databases on our own; we let Cloudera do that for us and simply noted down the usernames, passwords, port numbers, etc. These are needed as metadata databases for things like Impala, Hive, etc. Other than this, we chose to install 'Custom Services', because we had faced a problem installing Spark and learned later that it was due to a known issue in which we had to transfer a jar file from the Spark folder into HDFS. So, we chose not to install Spark right away; we can always do it later. We went for the installation of HDFS, Impala, Sqoop2, Hive, YARN, ZooKeeper, Oozie, HBase and Solr.
One problem we faced was that even though we had told Cloudera that we have our own Java installed, it still went ahead, installed its own Java on our machines and used its path instead of ours. To fix this, go to the Cloudera Manager home page, click on the 'Hosts' tab, click on 'Configuration', go to 'Advanced'; there you will find an entry by the name 'Java Home Directory'. Change it to point to the path of your installation. In our case, we pointed it to '/usr/java/jdk1.7.0_71'.
Another thing I'd like to mention is that the CM web UI has some bugs. For example, when trying to re-configure the services we wanted to install, even though we had deselected a few services, it would still show them for installation. For this we simply deleted the cluster and restarted the whole process. In fact, we deleted and re-created the cluster many times during our setup! So, don't be afraid of this; go ahead and delete the whole cluster and redo it if you're facing issues. Also, don't hesitate to restart the machines and the cluster (especially when in doubt); you never know when something is not reflecting the latest changes.
Our namenode data directory points to /dfs/nn2 rather than the default /dfs/nn because, for some reason, our /dfs/nn was corrupted when we had tried to move a Spark assembly file to HDFS, and it prevented the namenode from getting formatted (even though we tried to delete this file).
I'd also like to mention that we changed the node-wise role assignment suggested by Cloudera. Basically, it was suggesting that we make n33.xyz.com our namenode (even though there was no reason for this), whereas Cloudera Manager had been installed on the machine from which we were running it, n37. So, to avoid any issues, we just made sure that n37.xyz.com got all the services that Cloudera was suggesting for n33. Basically, we reversed the roles of n33 and n37!
As of now, our cluster is up and running. There might still be some glitches because we haven't verified the cluster by running anything. I'll post the changes we make as we make them.
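When we do get around to verifying it, a quick smoke test along these lines should be enough to confirm that HDFS and YARN are basically healthy (a sketch; the example jar path assumes a parcel-based CDH 5 install):
sudo -u hdfs hdfs dfsadmin -report               # all datanodes should show up as live
sudo -u hdfs hadoop fs -mkdir -p /tmp/smoketest  # write something into HDFS
sudo -u hdfs hadoop fs -ls /tmp                  # and read it back
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 4 100   # small MapReduce job on YARN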