
Making a Hadoop Cluster using Cloudera CDH 5.1.x | Experience

Let me start by saying that Linux, Open Source in general, the JVM ecosystem and Hadoop rock!

 

Installing a new cluster using Cloudera 5.1.x

Note that the IPs mentioned here are just some dummy sample IPs chosen for this post!

  1. Install CentOS 6.5 on all the nodes. We chose CentOS, but you could choose any other Linux flavour as well; just make sure it is supported by Cloudera.
  2. Loop in your IT/Networking team to give static IPs to the nodes. Note that the machine you intend to make the NameNode should have the following entries in its /etc/hosts file. This file is used by Cloudera and the other nodes to resolve IP addresses and host names. In our case we kept 172.16.2.37 as the NameNode:

    172.16.2.37 n37.xyz.com n37
    127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
    172.16.2.33 n33.xyz.com n33
    172.16.2.36 n36.xyz.com n36
    172.16.2.38 n38.xyz.com n38
    172.16.2.39 n39.xyz.com n39

    Note that for static IP setup you need to have at least the following entries in your /etc/sysconfig/network-scripts/ifcfg-eth0

    DEVICE=eth0
    TYPE=Ethernet
    ONBOOT=yes
    NM_CONTROLLED=yes
    BOOTPROTO=static
    IPADDR=172.16.2.33
    PREFIX=25
    GATEWAY=172.16.2.1
    DEFROUTE=yes
    IPV4_FAILURE_FATAL=yes
    IPV6INIT=no
    NAME="System eth0"
    NETMASK=255.255.255.128
    DNS1=172.16.10.10
    DNS2=172.16.10.11

    Once such entries have been made you will need to restart the network using the command service network restart
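
    A quick way to confirm the interface came up with the intended settings (a sketch using the sample values from above; adjust the interface name and IPs to yours):

    service network restart            # re-reads ifcfg-eth0
    ip addr show eth0                  # should show inet 172.16.2.33/25
    ping -c 1 172.16.2.1               # the gateway should be reachable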

  3. The other 4 entries towards the end are for the other data nodes (or slave machines). The data nodes should have the following entries in their /etc/hosts file. For example, our node 172.16.2.33 had the following -
    172.16.2.37 n37.xyz.com n37
    172.16.2.33 n33.xyz.com n33
    127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

    Also, make sure that the entries in /etc/resolv.conf on all nodes (including the namenode) are -
    search xyz.com
    nameserver 172.16.10.10
    nameserver 172.16.10.11
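
    To sanity-check name resolution from any node, something like the following works (hostnames here are the sample ones above; nslookup needs the bind-utils package installed):

    getent hosts n37.xyz.com     # resolved via /etc/hosts
    nslookup n33.xyz.com         # resolved via the nameservers in /etc/resolv.conf
    ping -c 1 n36.xyz.com        # the node should be reachable by name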

  4. Setting up password-less logins. Next we set up ssh between the machines. Note that in our case we did everything by logging in as 'root'. Other than that, our user was 'cloudera' and in our case, even the root password is 'cloudera'. One thing I want to point out upfront is that you should keep the same password on all machines because the Cloudera setup might need it to log into different machines. Log into your name node as 'root'.

    sudo yum install openssh-clients (on Ubuntu it will be 'apt-get install openssh-client' instead of 'yum')
    sudo yum install openssh-server

    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa (generating the ssh key - note that the passphrase is the empty string)
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub username@slave-hostname (copy the public key over to the node/slave machines; in our case, one example would be root@172.16.2.33)
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys (you need to do this if the same machine will need to ssh into itself. We did this too.)

    Note that the work is not done yet. We need to set up password-less login from the data nodes to the name node as well. At this point you will be able to log into the data nodes from the namenode with ssh root@datanodeName/IP, like root@n36.xyz.com. So, log into the data nodes one by one and follow the procedure above to set up password-less ssh login from each node to the name node. Once this is done, restart all machines using the command init 6. It's awesome when you control so many machines from one!
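
    With many nodes, a small loop run on the namenode saves some typing; a minimal sketch, assuming root and the sample IPs above (you will be prompted for each node's root password once):

    for ip in 172.16.2.33 172.16.2.36 172.16.2.38 172.16.2.39; do
        ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@$ip
    done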

  5. Other configurations -
    Disable selinux: vi /etc/selinux/config and set SELINUX=disabled
    chkconfig iptables off (a restart of the nodes will be needed after this)
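
    A minimal sketch of both changes as commands, to be run on every node (the sed edit assumes the stock /etc/selinux/config layout):

    sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config   # takes effect after reboot
    service iptables stop                                          # stop the firewall now
    chkconfig iptables off                                         # keep it off across reboots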

    We have the following machines: 172.16.2.37 as the namenode, and 172.16.2.33, 172.16.2.36, 172.16.2.38, 172.16.2.39 as the datanodes.

  6. Java installation. Check which Java versions are supported by Cloudera. In our case, we chose to install the latest version of Java 1.7, which was 1.7.0_71.

    Do the following on all nodes -
    Download the Java rpm for 64-bit Java 1.7 on any one of the machines (172.16.2.37). Use the scp command like this to transfer the rpm to all machines -
    scp /root/Downloads/jdk_xyz_1.7.0_71.rpm root@172.16.2.33:/root/Downloads/
    On all nodes -
    yum remove java (to remove earlier Java versions)
    rpm -ivh /root/Downloads/jdk_xyz_1.7.0_71.rpm

    Running this rpm should install Java; now we need to set the paths correctly. Note that Cloudera requires Java to be installed in the folder /usr/java/jdk_xyz_1.7.0_nn (where nn is the version number - 71 in our case).
    Now, there is no point in just setting the environment variable with export JAVA_HOME=whatever. Why? You will see that this variable is reset in each session of bash. So do it like this - make a java.sh file as follows -

    vi /etc/profile.d/java.sh

    type the following in -

    #!/bin/bash
    JAVA_HOME=/usr/java/jdk1.7.0_71
    PATH=$JAVA_HOME/bin:$PATH
    export PATH JAVA_HOME
    export CLASSPATH=.
    save the file

    make the file executable by -
    chmod +x /etc/profile.d/java.sh

    run the file by
    source /etc/profile.d/java.sh

    Now, check java -version and which java. Everything should be good (smile)

    Refer to http://www.unixmen.com/install-oracle-java-jdk-1-8-centos-6-5/ and http://www.cloudera.com/content/cloudera/en/documentation/cdh5/latest/CDH5-Installation-Guide/cdh5ig_oracle_jdk_installation.html#topic_29_1
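
    To avoid repeating the rpm steps on every node by hand, a loop like this works once password-less ssh is in place (a sketch; the rpm filename is the placeholder used above):

    for ip in 172.16.2.33 172.16.2.36 172.16.2.38 172.16.2.39; do
        scp /root/Downloads/jdk_xyz_1.7.0_71.rpm root@$ip:/root/Downloads/
        ssh root@$ip "yum -y remove java; rpm -ivh /root/Downloads/jdk_xyz_1.7.0_71.rpm"
    done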

  7. Now we are ready to begin the next battle - the actual Cloudera Hadoop installation!
    If you use Linux on a regular basis then reading documentation should already be second nature to you. In case it's not, make it your second nature (smile)
    No matter how complete any article is, you should always refer to the actual documentation. In this case, we are trying to set up CDH 5.1.x (Cloudera Distribution for Hadoop version 5.1.x) and the documentation is at http://www.cloudera.com/content/cloudera/en/documentation/cdh5/latest/CDH5-Installation-Guide/cdh5ig_cdh5_install.html . If it's not there, you would still be able to find it with a simple google search!
    We went for the automated installation of CDH, which is also what Cloudera recommends if you read their documentation (it is documented as the 'Path A' installation) - http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_install_path_A.html?scroll=cmig_topic_6_5

  8. You should now have python 2.6 or 2.7 installed on your machine. Check if it's already there by typing in which python and python --version. If it's not there, then -
    su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'

    yum install python26
    Download cloudera-manager-installer.bin from the Cloudera site.
    chmod u+x cloudera-manager-installer.bin (give permissions to the executable)
    sudo ./cloudera-manager-installer.bin (run the executable)

    When the installation completes, the complete URL for the Cloudera Manager Admin Console is displayed, including the port number, which is 7180 by default. Press Return or Enter to choose OK to continue.
    http://localhost:7180
    Username: admin Password: admin
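
    If the console does not load, it is worth confirming from the namenode that the server is actually up and listening (a quick check, assuming the sample namenode IP from above):

    service cloudera-scm-server status
    curl -I http://172.16.2.37:7180    # expect an HTTP response once the server has started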

  9. From this point onwards, you are on your own to set up the cluster. I will point out a few things here that I remember from our experience of the cluster setup -

    The entry in /etc/cloudera-scm-agent/config.ini for server_host should be the IP of your namenode, and on the namenode itself it should be 'localhost'. In most cases, we found that the problem was either with our /etc/hosts file (which is why I mentioned the exact entries we had for our namenode and datanodes) or with JAVA_HOME.
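
    For reference, the relevant line on a data node would look roughly like this (same sample namenode IP as above; the rest of the file can stay at its defaults):

    # /etc/cloudera-scm-agent/config.ini on a data node
    [General]
    server_host=172.16.2.37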

    We used the option of 'password' when Cloudera asked us how it would log into the nodes (this is why we mentioned making all root passwords the same - 'cloudera' in our case). We also used the option of 'Embedded Databases', meaning that we did not install any PostgreSQL or MySQL databases on our own; we let Cloudera do that for us and simply noted down the usernames, passwords, port numbers, etc. These are needed to serve as the metadata-holding databases for things like Impala, Hive, etc. Other than this, we chose to install 'Custom Services' because we faced some problems installing Spark; we learned later that this was due to a known issue in which we had to transfer a jar file from the spark folder into HDFS. So we chose not to install Spark right away - we can always do it later. So, we went for the installation of HDFS, Impala, Sqoop2, Hive, YARN, ZooKeeper, Oozie, HBase and Solr.

    One problem we faced was that even though we had told Cloudera that we have our own Java installed, it still went ahead and installed its own Java on our machines and used its path instead of ours. To fix this, go to the Cloudera Manager home page, click on the 'Hosts' tab, click on 'Configuration' and go to 'Advanced'; there you will find an entry by the name 'Java Home Directory'. Change it to point to the path of your own installation. In our case, we pointed it to '/usr/java/jdk1.7.0_71'.

    Another thing I'd like to mention is that the CM web UI has some bugs. For example, when trying to re-configure the services we wanted to install, even though we had deselected a few services, it would still show them for installation. For this we had to simply delete the cluster and restart the whole process. In fact, we deleted and re-made the cluster many times during our setup! So don't be afraid of this: go ahead, delete the whole cluster and re-do it if you're facing issues. Also, restart the machines and the cluster whenever in doubt; you never know when something is not reflecting the latest changes.

    Our namenode directory points to /dfs/nn2 in HDFS rather than the default /dfs/nn because, for some reason, our /dfs/nn was corrupted when we tried to move a Spark assembly file to HDFS, and it prevented the namenode from getting formatted (even though we tried to delete this file).

    I'd also like to mention that we changed the node-wise configuration suggested by Cloudera. Basically, it was suggesting that we make n33.xyz.com our namenode (even though there was no particular reason for this), while Cloudera Manager itself was installed on the machine from which we were running it (n37). So, to avoid any issues, we just made sure that n37.xyz.com got all the services that Cloudera was suggesting for n33. Basically, we reversed the roles of n33 and n37!

    As of now, our cluster is up and running. There might still be some glitches because we haven't verified the cluster by running anything on it yet. I'll post the changes we make as we make them.
