Making a Hadoop Cluster using Cloudera CDH 5.1.x | Experience

Let me start by saying that Linux, Open Source in general, JVM ecosystem and Hadoop rock!

 

Installing a new cluster using Cloudera 5.1.x

Note that IPs mentioned here are just some dummy sample IPs chosen for this post!

  1. Install CentOS 6.5 on all the nodes. We chose CentOS, but you could choose any other Linux flavour as well; just make sure it is supported by Cloudera.
  2. Loop in your IT/Networking team to give static IPs to the nodes. Note that the machine you intend to make the NameNode should have the following entries in its /etc/hosts file. This file is used by Cloudera and the other nodes to resolve IP addresses and host names. In our case we kept 172.16.2.37 as the NameNode -

    172.16.2.37 n37.xyz.com n37
    127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
    172.16.2.33 n33.xyz.com n33
    172.16.2.36 n36.xyz.com n36
    172.16.2.38 n38.xyz.com n38
    172.16.2.39 n39.xyz.com n39

    Note that for a static IP setup you need to have at least the following entries in your /etc/sysconfig/network-scripts/ifcfg-eth0

    DEVICE=eth0
    TYPE=Ethernet
    ONBOOT=yes
    NM_CONTROLLED=yes
    BOOTPROTO=static
    IPADDR=172.16.2.33
    PREFIX=25
    GATEWAY=172.16.2.1
    DEFROUTE=yes
    IPV4_FAILURE_FATAL=yes
    IPV6INIT=no
    NAME="System eth0"
    NETMASK=255.255.255.128
    DNS1=172.16.10.10
    DNS2=172.16.10.11

    Once such entries have been made, you will need to restart the network using the command service network restart
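
The PREFIX and NETMASK entries in the file above say the same thing in two notations, so they have to agree. As a quick sanity check, here is a minimal sketch of a helper that converts a prefix length into the dotted netmask. The name prefix_to_netmask is made up for this illustration; it is not a CentOS tool.

```shell
# Convert a CIDR prefix length (0-32) into a dotted-quad netmask.
# prefix_to_netmask is a hypothetical helper name used only in this sketch.
prefix_to_netmask() {
    prefix=$1
    mask=""
    for i in 1 2 3 4; do
        # Each octet takes up to 8 of the remaining prefix bits.
        if [ "$prefix" -ge 8 ]; then
            bits=8
        else
            bits=$prefix
        fi
        prefix=$((prefix - bits))
        # An octet with its top $bits bits set is 256 - 2^(8 - bits).
        octet=$((256 - (1 << (8 - bits))))
        mask="${mask}${mask:+.}${octet}"
    done
    echo "$mask"
}

prefix_to_netmask 25   # prints 255.255.255.128, matching NETMASK above
```

If the printed value differs from the NETMASK line, one of the two entries is wrong.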

  3. The other 4 entries towards the end of the NameNode's /etc/hosts are for the data nodes (or slave machines). The data nodes should have the following entries in their /etc/hosts file. For example, our node 172.16.2.33 had the following -
    172.16.2.37 n37.xyz.com n37
    172.16.2.33 n33.xyz.com n33
    127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

    Also, make sure that the entries in /etc/resolv.conf on all nodes (including the namenode) are -
    search xyz.com
    nameserver 172.16.10.10
    nameserver 172.16.10.11
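
Since most of our later trouble came from /etc/hosts, it helps to check the file before going further. The following is only a sketch: missing_hosts is a hypothetical helper written for this post, and it simply lists the names that have no entry in a hosts-format file.

```shell
# List the host names that have no entry in a hosts-format file.
# Usage: missing_hosts FILE NAME...
# (missing_hosts is a made-up helper for illustration.)
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        # Skip comment lines, then look for NAME in any field after the IP address.
        if ! awk -v n="$name" '
                /^[ \t]*#/ { next }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }' "$file"; then
            echo "$name"
        fi
    done
}

# Example, run on a data node:
#   missing_hosts /etc/hosts n37.xyz.com n37 n33.xyz.com n33
```

No output means every name you passed in has an entry.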

  4. Setting up password-less logins. Next we set up ssh between the machines. Note that in our case we did everything logged in as ‘root’. Other than that, our user was ‘cloudera’, and in our case even the root password was ‘cloudera’. One thing I want to point out upfront is that you should keep the same passwords on all machines, because the Cloudera setup may need the password to get into the different machines. Log into your name node as ‘root’.

    sudo yum install openssh-clients (in Ubuntu’s case it will be ‘apt-get’ instead of ‘yum’, and the package is called openssh-client)
    sudo yum install openssh-server

    ssh-keygen -t rsa -P "" -f ~/.ssh/id_dsa (generates the ssh key; note that the passphrase is an empty string, and that the file is named id_dsa here even though it holds an RSA key)
    ssh-copy-id -i $HOME/.ssh/id_dsa.pub username@slave-hostname (copies the public key over to the node/slave machines. So, in our case, one example would be root@172.16.2.33)
    cat $HOME/.ssh/id_dsa.pub >> $HOME/.ssh/authorized_keys (you need to do this if the same machine needs to ssh into itself. We did this too.)

    Note that the work is not done yet. We also need to set up password-less login from the data nodes to the name node. At this point you will be able to log into the data nodes from the namenode with ssh root@datanodeName/IP, like root@n36.xyz.com. So, log into the data nodes one by one and follow the procedure above to set up password-less ssh login from each node to the name node. Once this is done, restart all the machines using the command init 6. It’s awesome when you control so many machines from one!
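
With four data nodes, the key copying gets repetitive, so a small loop helps. This is a sketch under assumptions: the node names are the dummy ones from this post, and run and setup_keys are helper names invented here. It only prints the commands unless DRY_RUN=0 is set, so you can review them first.

```shell
# Data nodes from this post; replace with your own host names.
NODES="n33.xyz.com n36.xyz.com n38.xyz.com n39.xyz.com"
# Same key file name as in the commands above.
KEY="$HOME/.ssh/id_dsa"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create the key pair if it does not exist yet, then copy the public key to every node.
setup_keys() {
    [ -f "$KEY" ] || run ssh-keygen -t rsa -P "" -f "$KEY"
    for node in $NODES; do
        run ssh-copy-id -i "$KEY.pub" "root@$node"
    done
}

setup_keys
```

Run it with DRY_RUN=0 on the name node once the printed commands look right. On each data node, the same idea applies with NODES set to just the name node.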

  5. Other configurations -
    Disable SELinux: vi /etc/selinux/config and set SELINUX=disabled
    Turn off the firewall at boot: chkconfig iptables off  -> a restart of the nodes will be needed after this

    We have the following machines: 172.16.2.37 as the namenode, and 172.16.2.33, 172.16.2.36, 172.16.2.38, 172.16.2.39 as the datanodes
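
If you would rather not edit the SELinux file by hand on every node, the change can be scripted. A minimal sketch, assuming GNU sed (which CentOS ships); disable_selinux is a name made up for this example:

```shell
# Set SELINUX=disabled in an SELinux config file.
# Usage: disable_selinux [FILE]   (FILE defaults to /etc/selinux/config)
# disable_selinux is a hypothetical helper, not a system command.
disable_selinux() {
    conf=${1:-/etc/selinux/config}
    # Only the SELINUX= line is rewritten; SELINUXTYPE= is left untouched.
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$conf"
}

# Example (as root), followed by the firewall step from above:
#   disable_selinux
#   chkconfig iptables off
```

The change still takes effect only after the reboot mentioned above.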

  6. Java installation. Check which Java versions are supported by Cloudera. In our case, we chose to install the latest version of Java 1.7, which was 1.7.0_71.

    Do the following -
    Download the rpm for 64-bit Java 1.7 on any one of the machines (172.16.2.37 in our case). Use the scp command like this to transfer the rpm to all the other machines -
    scp /root/Downloads/jdk_xyz_1.7.0_71.rpm root@172.16.2.33:/root/Downloads/
    On all nodes -
    yum remove java (to remove earlier java versions)
    rpm -ivh /root/Downloads/jdk_xyz_1.7.0_71.rpm
    Running this rpm should install Java; now we need to set the paths right. Note that Cloudera requires Java to be installed in the folder /usr/java/jdk_xyz_1.7.0_nn (where nn is the version number, 71 in our case).
    Now, there is no point in setting the environment variable at the prompt with something like export JAVA_HOME=whatever. Why? Because the variable is gone again in every new bash session. So, make a java.sh file like this -

    vi /etc/profile.d/java.sh

    type the following in -

    #!/bin/bash
    JAVA_HOME=/usr/java/jdk1.7.0_71
    PATH=$JAVA_HOME/bin:$PATH
    export PATH JAVA_HOME
    export CLASSPATH=.
    save the file

    make the file executable by -
    chmod +x /etc/profile.d/java.sh

    run the file by
    source /etc/profile.d/java.sh

    Now check java -version and which java. Everything should be good (smile)

    Refer to - http://www.unixmen.com/install-oracle-java-jdk-1-8-centos-6-5/  and http://www.cloudera.com/content/cloudera/en/documentation/cdh5/latest/CDH5-Installation-Guide/cdh5ig_oracle_jdk_installation.html#topic_29_1
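
Because a wrong JAVA_HOME bit us more than once, it is worth checking on every node that the directory really contains a JDK. Here is a small sketch; check_java_home is a helper name invented for this post:

```shell
# Check that a JDK directory looks usable: it must contain an executable bin/java.
# Usage: check_java_home [DIR]   (DIR defaults to $JAVA_HOME)
# check_java_home is a hypothetical helper for illustration.
check_java_home() {
    dir=${1:-${JAVA_HOME:-}}
    if [ -n "$dir" ] && [ -x "$dir/bin/java" ]; then
        echo "OK: $dir/bin/java"
    else
        echo "Missing or not executable: $dir/bin/java" >&2
        return 1
    fi
}

# Example, after sourcing /etc/profile.d/java.sh:
#   check_java_home
```

A non-zero exit status means JAVA_HOME is unset or points somewhere without a java binary.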

  7. Now we are ready to begin the next battle: the actual Cloudera Hadoop installation!
    If you use Linux on a regular basis then reading documentation should already be second nature to you. In case it’s not, make it so (smile)
    No matter how complete any article is, you should always refer to the actual documentation. In this case, we are trying to set up CDH 5.1.x (Cloudera Distribution for Hadoop version 5.1.x) and the documentation is at - http://www.cloudera.com/content/cloudera/en/documentation/cdh5/latest/CDH5-Installation-Guide/cdh5ig_cdh5_install.html . If it’s not there any more, you should still be able to find it with a simple Google search!
    We went for the automated installation of CDH, which is also what Cloudera recommends if you read their documentation. It is documented as the ‘Path A’ installation - http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-latest/Cloudera-Manager-Installation-Guide/cm5ig_install_path_A.html?scroll=cmig_topic_6_5

  8. You should now have python 2.6 or 2.7 installed on your machine. Check if it’s already there by typing in which python and python --version. If it’s not there, then

    su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
    yum install python26
    Download cloudera-manager-installer.bin from the Cloudera site.
    chmod u+x cloudera-manager-installer.bin  (give execute permission to the installer)
    sudo ./cloudera-manager-installer.bin  (run the installer)

    When the installation completes, the installer shows the complete URL for the Cloudera Manager Admin Console, including the port number, which is 7180 by default. Press Return or Enter to choose OK to continue.
    http://localhost:7180
    Username: admin Password: admin

  9. From this point onwards, you are on your own to set up the cluster. I will point out a few things here that I remember from our experience of the cluster setup -

    The entry for server_host in /etc/cloudera-scm-agent/config.ini should be the IP of your namenode (the machine running Cloudera Manager in our case), and on the namenode itself it should be ‘localhost’. In most cases, we found that the problem was either with our /etc/hosts file (which is why I mentioned the exact entries we had for our namenode and datanodes) or with JAVA_HOME.

    We used the ‘password’ option when Cloudera asked us how it should log into the nodes (this is why we said to make all root passwords the same, ‘cloudera’ in our case). We also used the ‘embedded DBs’ option, which means that we did not install any PostgreSQL or MySQL databases on our own; we let Cloudera do that for us and simply noted down the usernames, passwords, port numbers, etc. These databases hold the metadata for things like Impala, Hive, etc. Other than this, we chose to install ‘Custom Services’ because we faced a problem installing Spark, and we learned later that it was due to a known issue in which we had to transfer a jar file from the Spark folder into HDFS. So, we chose not to install Spark right away; we can always do it later. We went for the installation of HDFS, Impala, Sqoop2, Hive, YARN, Zookeeper, Oozie, HBase and Solr.

    One problem we faced was that even though we had told Cloudera that we had our own Java installed, it still went ahead and installed its own Java on our machines and used its path instead of ours. To fix this, go to the Cloudera Manager home page, click on the ‘Hosts’ tab, click on ‘Configuration’, go to ‘Advanced’, find the entry named ‘Java Home Directory’ and change it to point to the path of your installation. In our case, we pointed it to ‘/usr/java/jdk1.7.0_71’.

    Another thing I’d like to mention is that the CM web UI has some bugs. For example, when trying to re-configure the services we wanted to install, even though we had deselected a few services, it would still show them for installation. For this we simply had to delete the cluster and restart the whole process. In fact, we deleted and re-made the cluster many times during our setup! So, don’t be afraid of this: go ahead and delete the whole cluster and re-do it if you’re facing issues. Also, restart the machines and the cluster whenever in doubt; you never know when something is not reflecting the latest changes.

    Our namenode directory points to /dfs/nn2 rather than the default /dfs/nn because, for some reason, our /dfs/nn got corrupted when we tried to move a Spark assembly file to HDFS, and that prevented the namenode from getting formatted (even though we tried to delete this file).

    I’d also like to mention that we changed the node-wise configuration suggested by Cloudera. Basically, it was suggesting that we make n33.biginsights.com our namenode (even though there was no reason for this). But Cloudera Manager had been installed on n37, the machine we were running it from. So, to avoid any issues, we just made sure that n37.biginsights.com got all the services that Cloudera was suggesting for 33. Basically, we reversed the roles of 33 and 37!

    As of now, our cluster is up and running. There might still be some glitches because we haven’t verified the cluster by running anything. I’ll post the changes we make as we make them.
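
The server_host setting mentioned in this step has to be changed on every data node, so here is a sketch of how that edit could be scripted. set_server_host is a hypothetical helper, the config path is passed in rather than assumed, and GNU sed is assumed for the -i flag.

```shell
# Point the Cloudera Manager agent at the given server host.
# Usage: set_server_host HOST CONFIG_FILE
# (set_server_host is a made-up helper name for this sketch.)
set_server_host() {
    host=$1
    conf=$2
    # Rewrite only the server_host line, whatever its current value is.
    sed -i "s/^server_host[[:space:]]*=.*/server_host=$host/" "$conf"
}

# Example on a data node, with the namenode IP from this post:
#   set_server_host 172.16.2.37 /path/to/agent/config.ini
```

The agent has to be restarted afterwards to pick up the new value.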


Making your subdomain respect index.html, index.php, etc.

I am not sure about you, but I faced an issue when I was trying to add my resume as a separate subdomain on my domain. The problem was that when I added the index.html page to the subdomain folder and tried to browse the site, it just didn’t work! The main site panghal.com also had an index.html and it worked perfectly fine, but the subdomain was not respecting the special position that index.html is supposed to enjoy. After some searching on the web I found that I had to modify the .htaccess present in my root. I just added the following line -

DirectoryIndex index.html index.php

and then it all worked just fine.
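
If you manage the site over ssh, the same change can be made without opening an editor. A minimal sketch, where ensure_directory_index is a name made up for this post; it adds the line only when no DirectoryIndex line is present yet:

```shell
# Append a DirectoryIndex line to an .htaccess file unless one is already there.
# Usage: ensure_directory_index HTACCESS_FILE
# (ensure_directory_index is a hypothetical helper.)
ensure_directory_index() {
    htaccess=$1
    if ! grep -q '^[[:space:]]*DirectoryIndex' "$htaccess" 2>/dev/null; then
        echo 'DirectoryIndex index.html index.php' >> "$htaccess"
    fi
}

# Example:
#   ensure_directory_index /path/to/your/webroot/.htaccess
```

Running it twice is harmless, since the second call finds the line and does nothing.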

How to debug your tests without Re# in a jiffy

Resharper is a crutch


1. In Visual Studio, right click on the NUnit Test project, select the Properties option.
2. Go to the Debug tab, select the ‘Start External Program’ radio button. Browse to packages\NUnit.Runners.2.6.2\tools\nunit.exe
3. In the ‘Command line arguments’ enter the test DLL (example: XYZ.ABC.Tests.dll). Don’t forget the .dll
4. Make the test project the startup project by right-clicking on the project in the Solution Explorer and selecting Set as Startup Project.
Hit F5 to launch NUnit, select and run the appropriate test method to be debugged, and enjoy the debugging goodness.

Note: due to incompatibilities between NUnit versions and .NET versions, you may still not be able to debug. In that case, go to packages\NUnit.Runners.2.6.2\tools, find nunit.exe.config and edit the startup to be
<startup useLegacyV2RuntimeActivationPolicy="true">
<!-- Comment out the next line to force use of .NET 4.0 -->
<requiredRuntime version="4.0.30319" />
</startup>
and
edit nunit-agent.exe.config’s <startup> to be
<startup useLegacyV2RuntimeActivationPolicy="false">
<supportedRuntime version="v4.0.30319" />
</startup>

Home Page in Html5 and knowing a lot about your code style

www.panghal.com

So, by the time you read this, the homepage will have been hosted. Right now it’s just four static web pages, and all of it has been done in HTML5. This new ‘draft’ standard has introduced a slew of new JavaScript APIs, and it has become increasingly important to know and better understand efficient JavaScript coding practices.
Anyways, the little project was started just because I wanted to get my hands dirty with HTML5, and it has been worth the little effort that I put in. I learned a lot.
If there are two versions of the home page hosted, then it means I have not been lazy about making use of the newer and more beautiful template. As you might have guessed, at the time of writing this post I have not yet made use of the new template. It’s just four HTML pages. How much effort could it be!
So, without any further delay, here’s what I learned -

  • There was a lot of code stitching from Google and elsewhere. And that’s not such a bad thing if you learn from it, which I did!
  • Don’t look at my JavaScript. I mean, I’ve coded so badly. I don’t even code this badly at my workplace... lol. I used a lot of global variables even though I could have minimized their usage a lot. There’s been repetition of code, especially with statements like document.getElementById(). It’s simply because I did not follow any design constraints, which is a very bad thing, by the way. There were some more inconsistencies in my code.
  • Although I did not do so well, I also did not do very badly. I used the new HTML5 markup, played around with video and audio elements, made a Whiteboard using canvas and localStorage, got to know a lot more browser-specific stuff, learned about audio and video codecs, container formats, the importance of MIME types, and lots more. I also got to know a lot about CSS and stuff.
  • I have set a solid base for good JavaScript know-how. I already know what all I can do in my code if I set out to implement stuff like inner functions (scope chaining and stuff) or take advantage of prototypes.
  • It’s extremely important to code in your free time, and it’s also important to look at the code of other programmers. This is how you can know where you actually stand.
  • I learned that it becomes a blot on your character if you continue to repeat the pattern of starting projects and not finishing them. I had been following this pattern for the past few years. No more. All that rage is coming out in code.
  • Music is extremely important to code. Food is optional.
  • The more you code, the more you realize your own insignificance. It keeps you humble and down to earth.
  • With the introduction of HTML5 and its rampant usage these days, JavaScript has become the ubiquitous language of the client side, and with its introduction to server-side programming (node.js), I think it’s going to be the language of the year.
  • It’s not so easy to write catchy English at 3 am.
  • There are a lot of small JS apps I want to build and integrate with the homepage, but it’s important to ship minimal code first!

I also wanted to use Git, and a lot more had to be done, but I have to wrap up this month with a few more things I have in mind.

So how did you like the homepage? It’s not much. It didn’t have to be. It was just a simple learning exercise for HTML5, but when there were four pages to show I simply hosted them. And in the process I learned a lot about myself, my coding style, life in general... lol

Comments?