Minimal Gradle Post | What you need to know

I’ve been using Gradle for some time at the workplace and I find it to be a good build tool. In the past I’ve used SBT, Maven and NAnt as well as, well, MSBuild. It was only MSBuild that I grew really familiar with, back in my .NET days.

This is a minimal Gradle post for whoever wants to get started with it with little/no fuss. Let’s get right into it.

What’s Gradle used for – It’s a build management tool. What this means is that you can use Gradle to manage building your projects, manage dependencies within projects and external to them, manage how tests are run, write pre/post build tasks/targets, and so on. You get the idea.

Two main ideas in Gradle – Projects and tasks

A Gradle build can consist of one or more projects, and each project can have one or more tasks. Tasks are nothing but units of work that need to be done. A Gradle project needs to have at least one build.gradle file. If there is only one such file, it needs to be present in the root directory of the project. But you can additionally have a build.gradle for each sub-project in your project. Gradle build files are written in a DSL built on Groovy (or on Kotlin, in build.gradle.kts files), so they can also contain ordinary Groovy or Kotlin code. The DSL is basically a declarative kind of language.

At this point, let’s dive into code and see a few things first hand. I will be using IntelliJ IDEA to set up a Gradle project. You are free to use any editor of your choice as long as it has support for Gradle.

Download IntelliJ, install the Gradle plugin (it is checked by default) and set up a new ‘Gradle’ project. You will see a build.gradle file set up for you as well as a settings.gradle file.

Note that you will also see a few more things – the Gradle wrapper files: a gradle/wrapper directory and the gradlew and gradlew.bat scripts

So, what’s the Gradle wrapper?
Say you run multiple projects on your laptop. Some might be using SBT, some Maven, some Gradle, etc. The Gradle wrapper lets you use a Gradle that is local to your project: the wrapper scripts fetch the Gradle version the project asks for on first use, so Gradle doesn’t need to be available system wide. Meaning that you won’t need to install Gradle on your system. You can use the jar and scripts that come with the wrapper and that’s the end of the story. To use Gradle through the wrapper, you submit your Gradle commands to the scripts that ship with it – gradlew.bat for Windows and gradlew for *nix – much the same way as you would use the ‘gradle’ command from the command line.

So, for example, a normal Gradle command could look something like

gradle build

but if you don’t want to install Gradle on your system and would rather just use the wrapper, you will write something like

./gradlew build (or gradlew.bat build on Windows)

Project listing goes in settings.gradle
settings.gradle will contain at least one project name – the root project, by default it’s the name of the directory containing the main/root project of your setup. You can of course, change this name to anything of your liking. Over time, if you have more projects, their names also go in settings.gradle file.

The Gradle ecosystem is rich with plugins. All available plugins can be browsed on the Gradle Plugin Portal. Plugins provide additional functionality to your Gradle setup. They often ship pre-written tasks that you would otherwise have to write yourself, among many other things.
For example, if you want Scala to be the language for your project and you’d like Gradle to set up the src/test directories for it, you can apply the ‘scala’ plugin and ‘refresh’ your build.gradle, and you will see the Scala-related directories in your project structure.
(By the way, you can also force dependencies to be re-resolved by passing the --refresh-dependencies flag on the command line, e.g. gradle build --refresh-dependencies)

For instance, my build.gradle file for this exercise looks like this –

group 'org.practice'
version '1.0-SNAPSHOT' //when version is set you will have a jar file with 
//name like mylibName-1.0-SNAPSHOT.jar, in this case

apply plugin: 'scala'

sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    //example on how to target jars in a specific local directory
    //compile files('/home/vaibhav/spark/spark-2.1.0/core/target/spark-core_2.11-2.1.0.jar')
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.1.0'
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

Couple of things to note above –
1. You can see how to apply a plugin.
2. When you set up your sample project (as mentioned above) as a Gradle project, you would have been asked to supply a groupId and an artifactId. GroupId, ArtifactId and Version, also called GAV, is a pretty standard way in the Maven world to identify a library. Gradle follows the same Maven convention and, in fact, you will find that the conventional Gradle project structure is much like the Maven project structure.


3. There is a sourceCompatibility option, which tells Gradle which Java version your code is compiled for (1.8 here).
4. Under ‘repositories’, you list the repos Gradle will hit to search for the dependencies mentioned under ‘dependencies’. You can mention any number of repos by providing a url. Note that Gradle only searches the repositories listed here; mavenCentral() is a built-in shorthand for Maven Central.
You can also specify local file system paths to load dependencies from, something like

runtime files('lib/project1.jar', 'lib2/project2.jar')
There are many ways to specify how to pick, from where to pick, and what to exclude. I am sure you can find the right syntax when you need it.

5. You can see that // works for comments, as well as /* …. */
6. When mentioning what dependencies to download, you provide their GAV. By the way, if you don’t tell gradle where to download your dependencies, it will do so by default at ~/.gradle/caches/
7. You can group your dependencies – above you can see that I don’t need junit in my main project but I need it in my test project. So I use a pre-built group ‘testCompile’. Similarly, for the ‘compile’ group.

If you’d rather not use the Gradle wrapper, you can install Gradle using brew on Mac or apt-get install on Ubuntu, etc. You might need to set GRADLE_HOME if the installation does not already do that for you.
You can specify your Gradle-specific settings, like what JVM to use, whether to keep the Gradle daemon alive (recommended setting), etc., in a properties file. If you are using the Gradle wrapper, you should already have such a file set up for you.

Gradle Tasks
So far, we’ve talked about setting up gradle and about projects. Let’s take a look at tasks now.
A build.gradle file can have many tasks, these tasks can be grouped or not grouped. When not grouped, a task is considered a ‘private’ task and will show up under ‘Other tasks’ when seeing the output of command ‘gradle tasks’, which lists all tasks. Here’s an example of a grouped task and an ungrouped task-

task calledOnce {
    println("I will get printed first")
}

task helloworld {
    group 'some group name'
    description 'this is desc for task helloworld'

    doFirst {
        println("hello world task begins")
    }

    doLast {
        println("hello world task ends")
    }
}

> gradle helloworld

> I will get printed first
> hello world task begins
> hello world task ends

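doFirst and doLast add actions that run when the task itself executes. Statements in the task closure that are not inside either of them run during Gradle’s configuration phase: they are executed once, before any task actions, no matter whether you call that particular task or not. That is why calledOnce prints its line even though we only asked for helloworld.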

Tasks can depend on each other
This is a normal procedure for most build scripts. Execution order of tasks depends on which tasks depend on which ones.


defaultTasks 'compile' //since a default task has been specified, on the command line it will suffice to just type 'gradle'

apply plugin: 'scala'

task compile(dependsOn: 'compileScala') {
   doLast {
       println("compile task")
   }
}

//shorter syntax for adding doLast, doFirst to an existing task;
//the dependency itself is declared separately with dependsOn
compileScala.dependsOn 'clean'
compileScala.doLast {
    println("compileScala task given by scala plugin")
}

//'clean' already exists once the scala plugin is applied, so we extend it rather than redefine it
clean.doLast {
    println("clean task")
}
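
To see why the order comes out as clean, then compileScala, then compile, it can help to model dependsOn as a walk over a graph. The following is only a toy sketch in Java, not Gradle's actual scheduler: the TaskGraph class and its methods are made up for illustration, and it ignores the extra dependencies the scala plugin really adds to compileScala.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of "task A dependsOn task B" (hypothetical, not Gradle API).
class TaskGraph {
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();

    void task(String name, String... dependencies) {
        dependsOn.put(name, Arrays.asList(dependencies));
    }

    // Depth-first walk: every dependency is scheduled before the task that needs it.
    List<String> executionOrder(String requested) {
        Set<String> ordered = new LinkedHashSet<>();
        visit(requested, ordered, new LinkedHashSet<>());
        return new ArrayList<>(ordered);
    }

    private void visit(String name, Set<String> ordered, Set<String> inProgress) {
        if (ordered.contains(name)) {
            return; // already scheduled through another path
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("circular dependency at " + name);
        }
        for (String dependency : dependsOn.getOrDefault(name, Collections.emptyList())) {
            visit(dependency, ordered, inProgress);
        }
        inProgress.remove(name);
        ordered.add(name);
    }
}

class TaskGraphDemo {
    public static void main(String[] args) {
        TaskGraph graph = new TaskGraph();
        graph.task("clean");
        graph.task("compileScala", "clean");
        graph.task("compile", "compileScala");
        System.out.println(graph.executionOrder("compile")); // [clean, compileScala, compile]
    }
}
```

Asking for compile pulls in compileScala, which pulls in clean, so the deepest dependency runs first.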

Note that you can also create your own tasks by extending the DefaultTask class, but this is not something we are going to try in this post.

Multi-project build structure
Usually you would have a few sub-projects or modules in your project; they also need to be built and tested, and there might be dependencies among them. Of course, Gradle has you covered here. Let’s say you have a project structure like this –

– Web
– Core
– Util

In this case, your settings.gradle would contain ‘include’ something like this –

include 'Core', 'Web', 'Util' //telling gradle we have sub projects

Now, the build.gradle file needs to cater for these sub-projects as well. There will be some tasks that are specific to sub-projects, some tasks common for all, etc. Take a look –

//will be called once for each project, including root
allprojects {
    //groupId: ....
    //version: ....
}

//does not apply to root
subprojects {
    //apply specific plugins, etc.
}

//root project specific dependencies (':' is the path of the root project)
project(':').dependencies {
    compile project(':Core'), project(':Util')
    compile 'some lib'
}


allprojects {
    //as above
}

subprojects {
    //as above
}

//Core specific stuff
project(':Core') {
    dependencies {
    }
}

//Util specific stuff
project(':Util') {
}

//Web specific stuff
project(':Web') {
}
Note that sub-project specific stuff could also go under their own separate build.gradle files.

Publishing artifacts
Published files are called artifacts. Publishing is usually done with the help of some plugin, like maven.

apply plugin: 'maven'

uploadArchives { //maybe some nexus repo or local
    repositories {
        mavenDeployer {
            repository(url: "some url")
        }
    }
}

Hope the above gives you necessary information to get going with gradle! Thanks for reading!

All you need to know about Unit Testing | Minimal Post

Unit/Integration testing is something most developers do on a daily basis. I had always wanted to write a blog on this subject – something minimal but that covers the essentials and gives one the necessary terminology and tools to get going in the beginning. This post is targeted towards junior developers. So, here we go –

Recommended Book – The Art of Unit Testing, 2nd Edition

What is Integration Testing and what is Unit Testing?

Well, software testing comes in two flavors, if you will – Unit testing and Integration testing. What’s the basic difference, you ask? Let’s say you want to test some component X of the software. Mostly, this X is some method/function. X might depend on some db calls, file system, etc. If you use real external resources like a real db call, etc. to test X then you are doing Integration testing but instead, if you are faking or stubbing these external resources then you are doing Unit testing – basically testing just that unit X and faking most things it depends on. And it makes sense, you want to test just one single functionality and not depend on anything for that.

Using stubs to break dependencies

Your code X ———> depends on some external dependency (any object like file system, threads, db calls, etc.)

But when you want to Unit test X, it would be something like this –

Your code X ———> stub in place of external dependency

You might have heard about mocks, stubs, etc. Don’t get bogged down by terminology – stubs and mocks are both fakes. And, from The Art of Unit Testing – mocks are like stubs, except that you assert against a mock and you don’t assert against a stub. We’ll see all this soon.


public bool IsValidFileName(String fn)
{
    //read some conf file from file system
    //to see if extension is supported by your company or not.
}

Whatever code is there in place of comments above would qualify as external dependency. So, how do we test the above?
One way would be to actually read a conf file and use it for testing and then destroy it. Of course, this would be time consuming and would be an integration test and not a Unit test.

So, how to test without resorting to an integration test? The code at hand is directly tied to the external dependency. It is calling the fs directly; we need to first decouple this by putting a layer, a file system manager, between our code and the fs.

If we do this, our code would not directly depend on the fs but on something else that is directly dependent on the fs. This something else, the fileSystemManager in our code, is something that we can back with a real fs in the actual code and with a fake fs in tests.

Another terminology – Seams

Seams are opportunities in code where a different functionality can be plugged in. This can be done through interfaces (an IFileSystemManager for the fileSystemManager above) so that the behaviour can be overridden. Seams follow what’s called the Open-Closed principle – a class should be open for extension but closed for modification.

So, for example, from the above we can have –

    public class FileSystemManager: IFileSystemManager
    {
        public bool IsValid(String fileName)
        {
            //..some production code interacting with real fs
        }
    }

So, now your code looks something like this - 

    public bool IsValidFileName(String fileName)
    {
        IFileSystemManager mgr = new FileSystemManager();
        return mgr.IsValid(fileName);
    }

So, that is your code but what to use in tests? – a stub!

    public class FakeFileSystemManager: IFileSystemManager
    {
        public bool IsValid(String fileName)
        {
            return true; //no need to interact with real fs
        }
    }

So, ok, good, we have created a way to bifurcate between actual production code and tests code. But wait! Remember that we needed to test IsValidFileName method. The issue is the following line of code –

IFileSystemManager mgr = new FileSystemManager();
We need to remove this instantiating of a concrete class that’s using ‘new’ right now. Because if we don’t, no matter where we call IsValidFileName from(tests or actual code), it will always try to new up FileSystemManager. This is where DI containers come in but we won’t be going into details of DI here, so let’s take a look at something called ‘constructor level injection’.

    public class ClassUnderTest
    {
        private IFileSystemManager _mgr;
        public ClassUnderTest(IFileSystemManager mgr)
        {
            _mgr = mgr;
        }

        public bool IsValidFileName(String fileName)
        {
            return _mgr.IsValid(fileName); //so just like that we got rid of new! and this
            //code is ready for testing - we will send in fake fileSystemManager in case of tests and
            //'real' fileSystem in case of production code
        }
    }

So, the actual test now would look something like –

    public class ClassUnderTest_Tests
    {
        //the test name is illustrative
        public void IsValidFileName_FakeSaysValid_ReturnsTrue()
        {
            //declared as the fake type so that WillBeValid can be set
            FakeFileSystemManager myFakeManager =
                    new FakeFileSystemManager();
            myFakeManager.WillBeValid = true;
            ClassUnderTest obj = new ClassUnderTest(myFakeManager);
            bool result = obj.IsValidFileName("short.ext");
            Assert.IsTrue(result);
        }
    }
    internal class FakeFileSystemManager : IFileSystemManager
    {
        public bool WillBeValid = false;
        public bool IsValid(string fileName)
        {
            return WillBeValid;
        }
    }
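
For readers on the JVM, the same stub-plus-constructor-injection idea can be sketched in plain Java. This is a minimal sketch rather than a translation of the C# above: FileSystemManager and FileNameValidator are illustrative stand-ins for IFileSystemManager and ClassUnderTest, and the println in main stands in for a test framework's assert.

```java
// Seam: the validator depends on this interface, not on the real file system.
interface FileSystemManager {
    boolean isValid(String fileName);
}

// Hand-written stub: returns whatever the test configured and never touches the disk.
class FakeFileSystemManager implements FileSystemManager {
    boolean willBeValid = false;

    @Override
    public boolean isValid(String fileName) {
        return willBeValid;
    }
}

// Class under test: receives its dependency through the constructor.
class FileNameValidator {
    private final FileSystemManager manager;

    FileNameValidator(FileSystemManager manager) {
        this.manager = manager;
    }

    boolean isValidFileName(String fileName) {
        return manager.isValid(fileName);
    }
}

class StubDemo {
    public static void main(String[] args) {
        FakeFileSystemManager fake = new FakeFileSystemManager();
        fake.willBeValid = true;
        FileNameValidator validator = new FileNameValidator(fake);
        System.out.println(validator.isValidFileName("short.ext")); // true, because the stub says so
    }
}
```

Production code would pass in an implementation that really reads the configuration file; the validator itself does not change.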

Once you’ve understood the above basic concept, other stuff will come easily. So for instance, here’s one way to make your fakes return an exception –

    class FakeFileSystemManager: IFileSystemManager
    {
        public bool WillBeValid = false;
        public Exception WillThrow = null;

        public bool IsValid(String fileName)
        {
            if(WillThrow != null) //where WillThrow can be configured from the calling code.
                throw WillThrow;
            return WillBeValid;
        }
    }

Till now we have seen how to write our own fakes but let’s see how to avoid handwritten fakes.

But before that, I would like to broadly say that testing can be of three types (from Art of Unit Testing) –
1. Value based – When you want to check whether the value returned is good or not
2. State based – When you want to check whether the code being tested has expected state or not (certain variables have been set correctly or not)
3. Interaction based – When you need to know if code in one object called another object’s method correctly or not. You use it when calling another object is the end result of your unit being tested. This is something that you will encounter a lot in real life coding.
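
The three kinds of checks can be seen side by side on one small method. The sketch below is in Java and entirely hypothetical (Account and AuditLog are made-up names, not from the book): a single deposit call returns a value, changes state, and calls a collaborator, so each kind of test has something to look at.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Collaborator whose calls we observe in the interaction-based check.
interface AuditLog {
    void record(String entry);
}

class Account {
    private final AuditLog log;
    private int balance = 0;

    Account(AuditLog log) {
        this.log = log;
    }

    // Returns the new balance (value), updates the field (state), notifies the log (interaction).
    int deposit(int amount) {
        balance += amount;
        log.record("deposit:" + amount);
        return balance;
    }

    int balance() {
        return balance;
    }
}

class ThreeKindsDemo {
    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        Account account = new Account(entries::add); // the list acts as a recording fake

        int returned = account.deposit(50);

        System.out.println("value-based: " + (returned == 50));
        System.out.println("state-based: " + (account.balance() == 50));
        System.out.println("interaction-based: "
                + entries.equals(Collections.singletonList("deposit:50")));
    }
}
```

All three lines print true here; in a real test suite each check would normally live in its own test.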

Let’s make formal distinction between mocks, fakes and stubs now. I will be quoting from The Art of Unit Testing.
“mock object is a fake object in the system that decides whether the unit test has passed or failed. It does so by verifying whether the object under test called the fake object as expected”

“A fake is a generic term that can be used to describe either a stub or a mock object (handwritten or otherwise), because they both look like the real object. Whether a fake is a stub or a mock depends on how it’s used in the current test. If it’s used to check an interaction (asserted against), it’s a mock object. Otherwise, it’s a stub.”

Let’s consider an example,

//an interface
public interface IWebService
{
    void LogError(string message);
}

and a hand-written fake

//note that this will only be a mock once we use it in some assert in some test
public class FakeWebService:IWebService
{
    public string LastError;
    public void LogError(string message)
    {
        LastError = message;
    }
}

What we want to test – when the analyzer encounters an error, it should post the message to our web service. Our fake service records whatever it receives as ‘LastError’, so the test asserts on ‘LastError’ to verify that the analyzer called LogError with the right message.

 public void Analyze_TooShortFileName_CallsWebService()
 {
     FakeWebService mockService = new FakeWebService();
     LogAnalyzer log = new LogAnalyzer(mockService);
     string tooShortFileName="abc.ext";
     log.Analyze(tooShortFileName);
     StringAssert.Contains("Filename too short:abc.ext",
                        mockService.LastError); // now mockService is a 'mock'
 }
 public class LogAnalyzer
 {
    private IWebService service;
    public LogAnalyzer(IWebService service)
    {
        this.service = service;
    }
    public void Analyze(string fileName)
    {
        if(fileName.Length < 8)
        {
           service.LogError("Filename too short:"
           + fileName);
        }
    }
 }
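
The same interaction test, sketched in Java for comparison. The names mirror the C# above but the code is an illustrative rendering, and the explicit check in main stands in for a test framework's assert.

```java
// Dependency the analyzer reports errors to.
interface WebService {
    void logError(String message);
}

// Hand-written fake: remembers the last message so a test can assert against it.
class FakeWebService implements WebService {
    String lastError;

    @Override
    public void logError(String message) {
        lastError = message;
    }
}

class LogAnalyzer {
    private final WebService service;

    LogAnalyzer(WebService service) {
        this.service = service;
    }

    void analyze(String fileName) {
        if (fileName.length() < 8) {
            service.logError("Filename too short:" + fileName);
        }
    }
}

class MockDemo {
    public static void main(String[] args) {
        FakeWebService mockService = new FakeWebService();
        new LogAnalyzer(mockService).analyze("abc.ext");

        // Asserting against the fake is what makes it a mock in this test.
        if (!"Filename too short:abc.ext".equals(mockService.lastError)) {
            throw new AssertionError("web service was not called as expected");
        }
        System.out.println("analyzer reported: " + mockService.lastError);
    }
}
```

If the analyzer stopped calling the service, lastError would stay null and the check would fail, which is exactly the behaviour an interaction test is meant to pin down.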

Let’s take another example, a bit more involved.

Say your code logs error by calling service.LogError, as above. But now, let’s add some more complexity – if service.LogError throws some kind of exception you want to catch it and send an email to someone to alert that something is wrong with the service.

Something like this,


if(fileName.Length < 8)
{
    try
    {
        service.LogError("too short " + fileName);
    }
    catch(Exception e)
    {
        email.SendEmail("something wrong with service", e.Message);
    }
}

Now, there are two questions in front of us –
1. What do we want to test?
2. How do we test it?

Let’s answer them –
What do you want to test
1.1 When there is an error like short file name then mock web service should be called
1.2 Mock service throws an error when we tell it to
1.3 When error is thrown, an email is sent

See if the following will work –
Your test code calls your LogAnalyzer code injecting it a fake web service that throws an exception when you tell it to. Your logAnalyzer will also need an email service that should be called when the exception is thrown.

But how will we assert that the email service was called correctly? – Set something in the mock email that we can assert later!

public class EmailInfo
{
    public string Body;
    public string To;
    public string Subject;
}

public void Analyze_WebServiceThrows_SendsEmail()
{
    FakeWebService stubService = new FakeWebService();
    stubService.ToThrow = new Exception("fake exception");
    FakeEmailService mockEmail = new FakeEmailService();
    LogAnalyzer2 log = new LogAnalyzer2(stubService, mockEmail);
    string tooShortFileName = "abc.ext";

    EmailInfo expectedEmail = new EmailInfo {
                                       Body = "fake exception",
                                       To = "",
                                       Subject = "can’t log" };

    log.Analyze(tooShortFileName);

    //compare field by field, since EmailInfo does not override Equals
    Assert.AreEqual(expectedEmail.To, mockEmail.email.To);
    Assert.AreEqual(expectedEmail.Subject, mockEmail.email.Subject);
    Assert.AreEqual(expectedEmail.Body, mockEmail.email.Body);
}

public class FakeEmailService:IEmailService
{
    public EmailInfo email = null;
    public void SendEmail(EmailInfo emailInfo)
    {
        email = emailInfo;
    }
}

public interface IEmailService
{
    void SendEmail(EmailInfo emailInfo);
}

public class LogAnalyzer2
{
    public LogAnalyzer2(IWebService service, IEmailService email)
    {
         Email = email;
         Service = service;
    }
    public IWebService Service { get; set; }
    public IEmailService Email { get; set; }

    public void Analyze(string fileName)
    {
        if(fileName.Length < 8)
        {
            try
            {
                Service.LogError("Filename too short:" + fileName);
            }
            catch (Exception e)
            {
                Email.SendEmail(new EmailInfo {
                                    To = "",
                                    Subject = "can’t log",
                                    Body = e.Message });
            }
        }
    }
}

public class FakeWebService:IWebService
{
    public Exception ToThrow;
    public void LogError(string message)
    {
        if(ToThrow != null)
            throw ToThrow;
    }
}
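
Here is the stub-plus-mock arrangement again as a compact Java sketch. It assumes a three-argument sendEmail and a made-up recipient (ALERT_RECIPIENT is hypothetical); the point is only to show which fake is the stub and which one is the mock.

```java
interface WebService {
    void logError(String message);
}

interface EmailService {
    void sendEmail(String to, String subject, String body);
}

// Stub: its only job is to push the analyzer down the failure path.
class ThrowingWebService implements WebService {
    RuntimeException toThrow;

    @Override
    public void logError(String message) {
        if (toThrow != null) {
            throw toThrow;
        }
    }
}

// Mock: records the e-mail so the test can assert against it.
class RecordingEmailService implements EmailService {
    String to;
    String subject;
    String body;

    @Override
    public void sendEmail(String to, String subject, String body) {
        this.to = to;
        this.subject = subject;
        this.body = body;
    }
}

class LogAnalyzer2 {
    // Hypothetical recipient, used only for this sketch.
    static final String ALERT_RECIPIENT = "ops@example.invalid";

    private final WebService service;
    private final EmailService email;

    LogAnalyzer2(WebService service, EmailService email) {
        this.service = service;
        this.email = email;
    }

    void analyze(String fileName) {
        if (fileName.length() < 8) {
            try {
                service.logError("Filename too short:" + fileName);
            } catch (RuntimeException e) {
                email.sendEmail(ALERT_RECIPIENT, "can't log", e.getMessage());
            }
        }
    }
}

class TwoFakesDemo {
    public static void main(String[] args) {
        ThrowingWebService stubService = new ThrowingWebService();
        stubService.toThrow = new RuntimeException("fake exception");
        RecordingEmailService mockEmail = new RecordingEmailService();

        new LogAnalyzer2(stubService, mockEmail).analyze("abc.ext");

        System.out.println(mockEmail.subject + " / " + mockEmail.body); // can't log / fake exception
    }
}
```

Only the e-mail fake is asserted against, so there is still just one mock in the test even though two fakes take part.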

Rule of thumb – one mock per test. Otherwise it usually implies that you are testing more than one thing in a Unit test.

Let’s say you have some code like –

String connString = GlobalUtil.Configuration.DBConfiguration.ConnectionString;

and you want to replace connString with one of your own during testing, you could set up a chain of stubs returning your value – but that would not be so maintainable, would it!
On the other hand, you could also make your code more testable by refactoring it to be something like –

String connString = GetConnectionString();
public String GetConnectionString()
{
    return GlobalUtil.Configuration.DBConfiguration.ConnectionString;
}

Now, instead of having to fake a chain of methods, you would only need to do that for GetConnectionString() method.
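
One way to fake just that one method is to make it overridable and replace it in a test-only subclass. The Java sketch below assumes this "extract and override" approach; GlobalUtil, Configuration, DbConfiguration and ReportRepository are hypothetical stand-ins for the chain in the post.

```java
// Stand-ins for the deep configuration chain (all names hypothetical).
class DbConfiguration {
    String connectionString() {
        return "Server=prod;Database=main";
    }
}

class Configuration {
    DbConfiguration dbConfiguration() {
        return new DbConfiguration();
    }
}

class GlobalUtil {
    static Configuration configuration() {
        return new Configuration();
    }
}

class ReportRepository {
    String describe() {
        return "using " + getConnectionString();
    }

    // The chain is reached through one overridable method: this is the seam.
    protected String getConnectionString() {
        return GlobalUtil.configuration().dbConfiguration().connectionString();
    }
}

// Test-only subclass: replaces just the seam and leaves the rest of the class as is.
class TestableReportRepository extends ReportRepository {
    @Override
    protected String getConnectionString() {
        return "Server=localhost;Database=test";
    }
}

class ExtractAndOverrideDemo {
    public static void main(String[] args) {
        System.out.println(new ReportRepository().describe());         // using Server=prod;Database=main
        System.out.println(new TestableReportRepository().describe()); // using Server=localhost;Database=test
    }
}
```

The test never has to stub GlobalUtil, Configuration or DbConfiguration; it overrides a single method instead.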

Problem with handwritten mocks and stubs

1. Takes time to write
2. Difficult to write for big or complicated interfaces and classes
3. Maintenance of handwritten mocks and stubs as code changes
4. Hard to reuse in a lot of cases

Enter Isolation or Mocking frameworks

Definition – Frameworks to create and configure fake objects at runtime (dynamic stubs and mocks)

So, let’s use one! We will be using NSubstitute on VisualStudio for Mac Preview edition. This is part of VS IDE developed for .Net Core community.

The code examples mentioned below are self sufficient and easy to read. Please go through them in the following order-

1. LogAnalyzerTests.cs – Shows comparison between writing a handwritten fake and one using NSubstitute. Also shows how to verify the call made to a mock.
2. SimulateFakeReturns.cs – Shows how to make a stub return something we want whenever there is some particular or generic input.
3. LogAnalyzer2Example.cs – Again shows comparison between code written with and without a mocking framework. Shows how to throw exception using NSubstitute.

The above are simple examples showcasing API capabilities of a mocking framework – things we’ve been doing by handwritten code till now.

General statement on how mocking frameworks work – much the same way you write handwritten fakes, except that these frameworks generate the code at run time. At run time, they give us the opportunity to override interface methods or virtual methods, much the same way we’ve seen till now. Generating code at run time is not something new to the programming world. However, some frameworks allow much more than this – they even allow you to fake concrete classes and override their methods. How? In these cases the mocking frameworks inject the code that you want into the .class or .dll and use something like an #ifdef-style conditional to decide which code runs. Of course, this capability requires understanding how to inject or weave code at run time, which further requires understanding of the intermediate code targeting the runtime or VM.
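
To make "frameworks write the fake at run time" concrete, here is a minimal Java sketch built on the JDK's own java.lang.reflect.Proxy. It is not how NSubstitute is implemented; it only shows the general idea of producing an interface implementation at run time and recording calls on it. The recordingFake helper is a name made up for this example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

interface WebService {
    void logError(String message);
}

class RuntimeFakeDemo {
    // Builds an implementation of the given interface at run time and records every call made to it.
    @SuppressWarnings("unchecked")
    static <T> T recordingFake(Class<T> type, List<String> calls) {
        InvocationHandler handler = (proxy, method, args) -> {
            calls.add(method.getName() + (args == null ? "[]" : Arrays.toString(args)));
            return null; // fine for void methods; methods returning primitives would need a default value
        };
        return (T) Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type}, handler);
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        WebService fake = recordingFake(WebService.class, calls);

        fake.logError("Filename too short:abc.ext");

        System.out.println(calls); // [logError[Filename too short:abc.ext]]
    }
}
```

No FakeWebService class was written by hand here; the recorded call list plays the role that LastError played in the handwritten mock.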

Git Cheatsheet

This probably serves as a quick git reference or cheatsheet. I used to maintain this when I was new to git. Probably still helpful for beginners out there.

Discard all uncommitted (unstaged) changes to tracked files
git checkout .

Adding/Staging files
git add .

Committing with message
git commit -m "my message"

Make new branch (local)
git checkout -b new_branch

Then make that branch on server ready to push
git push -u origin feature_branch_name
Or simply
git push --all -u (to push all your local branches to server and set tracking for them too)

Change from one branch to another
git checkout another_branch_name

Merge branches
Suppose you have branches ‘master’ and ‘feature1’ and you want to bring the contents of ‘master’ into ‘feature1’, meaning you want to update your ‘feature1’ branch, then you do –

git checkout feature1
git merge master

if you want to bring in the contents of ‘feature1’ into ‘master’, when your ‘feature1’ work is done, then you do

git checkout master
git merge feature1

In fact, this 2 step process is the better way to merge your feature branches into master

In order to resolve conflicts, you might have to do
git mergetool

After all this is done, you commit the merge and push from master

Equivalent of hg outgoing
If you want to list commits that are on your local branch dev, but not on the remote branch origin/dev, do:
git fetch origin # Update origin/dev if needed
git log origin/dev..dev

Equivalent of hg incoming
git fetch origin # Update origin/dev if needed
git log dev..origin/dev

See the history of a file

gitk /pathtofile/

When you want to set up a new repo on GitHub and already have some code locally

1. Create the remote repository and get its URL; add a README or .gitignore or whatever you want
2. Locally, at the root directory of your source, git init
3. git pull {url from step 1}
4. git add . then git commit -m 'initial commit comment'
5. git remote add origin [URL From Step 1]
6. git pull origin master
7. git push origin master

Pull all branches to local

git fetch --all
git pull --all

List all branches

git branch -a – list all branches, local and remote
git branch -r – list only remote branches

and then do a simple git checkout fullbranchname to move into that branch

Pull certain branch from server

git pull origin branch-name

Setting up git mergetool on windows

download kdiff3 exe, install it

open gitconfig file in C:\Program Files (x86)\Git\etc

open gitconfig using cmd in Admin mode, by notepad gitconfig command

add the following there –

[merge]
tool = kdiff3

[mergetool "kdiff3"]
path = C:/Program Files/KDiff3/kdiff3.exe
keepBackup = false
trustExitCode = false

save it, close it

on git bash do
git config --global merge.tool kdiff3

Alternatively, just download and install kdiff3
And do the following

$ git config --global --add merge.tool kdiff3
$ git config --global --add mergetool.kdiff3.path "C:/Program Files/KDiff3/kdiff3.exe"
$ git config --global --add mergetool.kdiff3.trustExitCode false
$ git config --global --add diff.guitool kdiff3
$ git config --global --add difftool.kdiff3.path "C:/Program Files/KDiff3/kdiff3.exe"
$ git config --global --add difftool.kdiff3.trustExitCode false

Delete a branch from local

git branch -D branch_name

Delete a branch from remote

git push origin --delete <branchName>

Delete untracked file
git clean -f filenamewithpath
or git clean -f -d filenamewithpath


git clean -f -n to show what files will be removed
git clean -f to actually remove those

use git clean -fd to remove untracked directories

use git status to check whether something left untracked or not

Revert a pushed commit
git revert <commit’s hash> will create a new commit, which you’ll have to push

Cherry pick a commit
Let’s say you want to cherry pick commit 6a23b56 from a feature branch to master. You must be on master and then do
git cherry-pick -x 6a23b56
that’s all !

Removing Files
Say that you deleted a file from disk; it will now show as deleted in git status. How do you make that change on the server as well?
git rm <file path and name>
git commit
git push

Moving repo from one location to another (or duplicating repo to some new location)
git remote add new_repo_name new_repo_url
Then push the content to the new location
git push new_repo_name master
Finally remove the old one
git remote rm origin
After that, edit the .git/config file to change new_repo_name to origin (or run git remote rename new_repo_name origin).

If you don’t remove the origin (original remote repository), you can simply just push changes to the new repo with
git push new_repo_name master

If you delete a file in one branch and don’t commit or stash then those files will appear deleted on other branches
Switching branches carries uncommitted changes with you. Either commit first, run git checkout . to undo them, or run git stash before switching. (You can get your changes back with git stash apply.)

Revert a specific file to some earlier git version
Find the commit where you went wrong, either using git log or git log -p or gitk
Find the commit hash
git checkout commitcode filepath

Now commit again

Commit a single file
git commit -m 'comments' filepath

SSH setup between your machine’s repo and the server repo
cd to your home dir
ssh-keygen -t rsa -C ""
clip < ~/.ssh/

Save on server

How to see my last n commits
git log -n 5 --author=vaibhavk

How to see contents of a particular commit
git show hashvalue

What’s A, B and C in Kdiff3
A is the original file, before any merge conflicts happened
B is your current file (including any uncommitted changes)
C is the incoming file that caused merge conflict

How to see uncommitted changes for a specific file against earlier version(?)
git diff filepath

Undo git add
git reset filepath

See if a commit is in a branch, or which branches it is in
git branch -a --contains 4f08c85ad (remove -a to see only your local branches)

List git commits not pushed yet
git log origin/master..master
Or git log <since>..<until>
You can use this with grep to check for a specific, known commit:
git log <since>..<until> | grep <commit-hash>

Undo last commit

git commit -m “Something terribly misguided”
git reset HEAD~
<< edit files as necessary >>
git add <whatever>
git commit -c ORIG_HEAD

Making a Hadoop Cluster using Cloudera CDH 5.1.x | Experience

Let me start by saying that Linux, Open Source in general, JVM ecosystem and Hadoop rock!


Installing a new cluster using Cloudera 5.1.x

Note that IPs mentioned here are just some dummy sample IPs chosen for this post!

  1. Install CentOS 6.5 on all the nodes. We chose CentOS but you could choose any other Linux flavour as well – just make sure it is supported by Cloudera
  2. Loop in your IT/Networking team to give static IPs to the nodes. Note that the machine you intend to make the NameNode should have the following entries in its /etc/hosts file. This file is used by Cloudera and other nodes to resolve IP addresses and host names. In our case we kept n37 as the NameNode. Its /etc/hosts had an entry for n37 itself (static IP followed by host name), then the two localhost lines –

    127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

    and then one entry each for n33, n36, n38 and n39.

    Note that for static IP setup you need to have at least the following entries in your /etc/sysconfig/network-scripts/ifcfg-eth0

    NAME="System eth0"

    Once such entries have been made you will need to restart the network using the command service network restart

  3. The other 4 entries towards the end are for the other data nodes (or slave machines). The data nodes should have the following entries in their /etc/hosts file. For example, one of our data nodes had entries for n37 and n33 (static IP followed by host name), then the localhost lines –

    127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

    Also, make sure that the entries in /etc/resolv.conf on all nodes (including namenode) should be –

  4. Setting up password-less logins – Next we set up ssh between the machines. Note that in our case we did everything by logging in as ‘root’. Other than that, our user was ‘cloudera’ and in our case, even the root password is ‘cloudera’. One thing I want to point out upfront is that you should keep the same passwords for all machines because the Cloudera setup might want the password to get into different machines. Log into your name node as ‘root’.

    sudo yum install openssh-client (in Ubuntu’s case it will be ‘apt-get’ instead of ‘yum’)
    sudo yum install openssh-server

    ssh-keygen -t rsa -P "" -f ~/.ssh/id_dsa (generating ssh-key)   – note that the passphrase is empty string
    ssh-copy-id -i $HOME/.ssh/ username@slave-hostname (copy public key over to the node/slave machines. So, in our case, one example would be root@
    cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys (you need to do this if the same machine would need to ssh itself. We did this too.). Note that the work is not done yet. We need to setup password less login from data nodes to name node also. At this point you will be able to log into data nodes from namenode with ssh root@datanodeName/IP, like So, log into data nodes one by one and follow the procedure above to set password less ssh login from each node to name node. Once this is done, restart all machine using the command init 6. It’s awesome when you control so many machines from one!
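
    Copying the key to several data nodes by hand gets repetitive, so here is a small sketch of a loop. It is only an illustration: copy_key_to_nodes and the DRY_RUN variable are hypothetical names, the public key path is passed in as an argument, and by default the function only prints the commands so you can review them first.

```shell
# Print (or, with DRY_RUN=0, actually run) one ssh-copy-id command per node.
# Usage: copy_key_to_nodes PUBLIC_KEY_FILE NODE [NODE...]
copy_key_to_nodes() {
    key=$1
    shift
    for node in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): show what would be executed.
            echo "ssh-copy-id -i $key root@$node"
        else
            ssh-copy-id -i "$key" "root@$node"
        fi
    done
}
```

    With the key generated as above, ssh-keygen writes the public half next to the private key as ~/.ssh/id_dsa.pub, so a call could look like copy_key_to_nodes "$HOME/.ssh/id_dsa.pub" n33 n36 n38 n39. Run it with DRY_RUN=0 once the printed commands look right.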

  5. Other configurations –
    Disable SELinux: vi /etc/selinux/config and set SELINUX=disabled
    chkconfig iptables off in /etc  –> restart of the nodes will be needed after this

    We have the following machines – as namenode, datanodes –,,,
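
    If you would rather not open vi on every node, the SELinux change in this step can be sketched with sed. This is an assumption-laden sketch rather than what we actually ran: set_selinux_disabled is a hypothetical name, and the file path is an argument so you can try it on a copy of the config before touching the real one.

```shell
# Rewrite the SELINUX= line of the given config file to "disabled".
# sed keeps the original file alongside as FILE.bak.
set_selinux_disabled() {
    sed -i.bak 's/^SELINUX=.*/SELINUX=disabled/' "$1"
}
```

    The pattern is anchored on SELINUX= so the neighbouring SELINUXTYPE= line is left alone. On a real node the call would be set_selinux_disabled /etc/selinux/config, followed by the restart mentioned above.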

  6. Java installation. Check what Java versions are supported by Cloudera. In our case, we chose to install the latest version of Java 1.7. It was 1.7.0_71

    Do the following on all nodes –
    Download the java rpm for 64 bit java 1.7 on any of the machines. Use the scp command like this to transfer the rpm to all machines –
    scp /root/Downloads/jdk_xyz_1.7.0_71.rpm root@
    On all nodes –
    yum remove java (to remove earlier java versions)
    rpm -ivh /root/Downloads/jdk_xyz_1.7.0_71.rpm

    Running this rpm should install Java; now we need to set the paths right. Note that Cloudera requires Java to be installed in the folder /usr/java/jdk_xyz_1.7.0_nn (where nn is the version number, 71 in our case).
    Now, there is no point in setting the environment variable at the prompt with something like export JAVA_HOME=whatever. Why? You will see that this variable is reset in each new bash session. So, instead, make a file like this –

    vi /etc/profile.d/

    type the following in –

    export PATH JAVA_HOME
    export CLASSPATH=.
    save the file

    make the file an executable by –
    chmod +x /etc/profile.d/

    run the file by
    source /etc/profile.d/

    Now, check the java -version, which java. Everything should be good (smile)
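
    To make the idea concrete, here is a minimal sketch of what such a profile.d file can contain. It is an illustration under assumptions, not a copy of our file: java_profile_script is a hypothetical helper, java.sh is a hypothetical file name (on RHEL/CentOS, /etc/profile only picks up files in /etc/profile.d that end in .sh), and the JDK directory is whatever your rpm actually installed.

```shell
# Print the contents of a profile.d script that sets JAVA_HOME and puts its bin directory first on PATH.
# Usage: java_profile_script JDK_DIRECTORY
java_profile_script() {
    printf 'JAVA_HOME=%s\n' "$1"
    # Single quotes keep $JAVA_HOME and $PATH literal so they expand at login time, not now.
    printf 'PATH=$JAVA_HOME/bin:$PATH\n'
    printf 'export PATH JAVA_HOME\n'
    printf 'export CLASSPATH=.\n'
}
```

    For a JDK in /usr/java/jdk1.7.0_71, running java_profile_script /usr/java/jdk1.7.0_71 > /etc/profile.d/java.sh would create the file, after which the chmod and source steps above apply as before.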


  7. Now we are ready to begin the next battle – actual Cloudera Hadoop installation!
    If you use Linux on a regular basis then reading documentation should already be your second nature. In case it’s not, make it your second nature (smile)
    No matter how complete any article is, you should always refer to the actual documentation. In this case, we are trying to set up CDH 5.1.x (Cloudera Distribution for Hadoop version 5.1.x), so refer to Cloudera’s documentation for that version. If you can’t locate it, you would still be able to find it with a simple Google search! We would be going for the automated installation of CDH, which is also what Cloudera recommends if you read their documentation. It is documented as the ‘Path A’ installation!

  8. You should now have python 2.6 or 2.7 installed on your machine. Check if it’s already there by typing in which python and python -V. If it’s not there, then –

    su -c 'rpm -Uvh

    yum install python26
    Download cloudera-manager-installer.bin from the cloudera site.
    chmod u+x cloudera-manager-installer.bin  (give permissions to the executable)
    sudo ./cloudera-manager-installer.bin – run the executable

    When the installation completes, the installer shows the complete URL for the Cloudera Manager Admin Console, including the port number, which is 7180 by default. Press Return or Enter to choose OK to continue.
    Username: admin Password: admin

  9. From this point onwards, you are on your own to set up the cluster. I will point out a few things here that I remember from our experience of cluster setup –

    The server_host entry in /etc/cloudera-scm-agent/config.ini should be the IP of your namenode on the data nodes, and ‘localhost’ on the namenode itself. Whenever something went wrong, we found that in most cases the problem was either with our /etc/hosts file (which is why I mentioned the exact entries we had for our namenodes and datanodes) or with JAVA_HOME.

    We used the ‘password’ option when Cloudera asked us how it should log into the nodes (this is why we mentioned making all root passwords the same, ‘cloudera’ in our case). We also used the option of ‘embedded DBs’, which means that we did not install any PostgreSQL or MySQL databases on our own but let Cloudera do that for us, and we simply noted down the username, password, port numbers, etc. These databases hold the metadata for things like Impala, Hive, etc. Other than this, we chose to install ‘Custom Services’ because we faced some problem installing Spark, and we learned later that it was due to a known issue in which we had to transfer a jar file from the spark folder into HDFS. So, we chose not to install Spark right away; we can always do it later. We went for the installation of HDFS, Impala, Sqoop2, Hive, YARN, Zookeeper, Oozie, HBase and Solr.

    One problem we faced was that even though we had told Cloudera that we had our own Java installed, it still went ahead and installed its own Java on our machine and used its path instead of ours. To fix this, go to the Cloudera Manager home page, click on the ‘Hosts’ tab, click on ‘Configuration’, go to ‘Advanced’, find the entry named ‘Java Home Directory’ and change it to point to the path of your installation. In our case, we pointed it to ‘/usr/java/jdk1.7.0_71’.

    Another thing I’d like to mention is that the CM web UI has some bugs. For example, when trying to re-configure the services we wanted to install, even though we had deselected a few services, it would still show them for installation. For this we had to simply delete the cluster and restart the whole process. In fact, we deleted and recreated the cluster many times during our setup! So, don’t be afraid of this: go ahead and delete the whole cluster and redo it if you’re facing issues. Also, restart the machines and the cluster whenever you are in doubt; you never know when something is not reflecting the latest changes.

    Our namenode directory points to /dfs/nn2 in HDFS rather than the default /dfs/nn because, for some reason, our /dfs/nn got corrupted when we tried to move a Spark assembly file to HDFS, and that prevented the namenode from being formatted (even though we tried to delete this file).

    I’d also like to mention that we changed the node-wise configuration suggested by Cloudera. Basically, it was suggesting that we make 33 our namenode (even though there was no reason for this). Now, Cloudera had been installed on the machine from where we were running the Cloudera Manager. So, to avoid any issues, we just made sure that 37 has all the services that Cloudera was suggesting for 33. Basically, we reversed the roles for 33 and 37!

    As of now, our cluster is up and running. There might still be some glitches because we haven’t verified the cluster by running anything. I’ll post the changes we make as we make them.
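
    Since the agent’s server_host setting was one of our usual suspects, here is a small sketch for reading it back on each node. It assumes the setting is written as a plain server_host=value line in the agent’s config.ini; agent_server_host is a hypothetical helper name, and the config path is passed in as an argument.

```shell
# Print the value of the first server_host setting found in an agent config.ini-style file.
# Usage: agent_server_host CONFIG_FILE
agent_server_host() {
    # Drop everything up to and including "server_host =" and print only the lines that matched.
    sed -n 's/^[[:space:]]*server_host[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}
```

    On the namenode we would expect this to print localhost, and on each data node the namenode’s IP.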


Making your subdomain respect index.html, index.php, etc.

I am not sure about you but I faced an issue when I was trying to add my resume as a separate subdomain on my domain. The problem was that when I added the index.html page into the subdomain folder and tried to browse the site – it just didn’t work! Although the main site also had an index.html and it worked perfectly fine, the subdomain was not respecting the special position that’s to be enjoyed by index.html. After some searching on the web I found that I had to modify the .htaccess present in my root. I just added the following line –

DirectoryIndex index.html index.php

and then it all worked just fine.