Author Archives: Kumar Vaibhav

About Kumar Vaibhav

I am an engineer, who, like most engineers, likes to read, write and learn. My interests encompass advanced mathematics, probability theory, linear algebra, NLP, AI, IR, Data Science, open source technologies and Big Data (Hadoop ecosystem). I am a .Net engineer by my work experience. I believe that an engineer has not really accomplished much unless he's coded something that he is really proud of.

Why ROC? — Intuitive Explanation

I’ve often found that I understand concepts much better when I understand the geometrical interpretation behind them. This is especially true for a lot of Math, Machine Learning and AI (although there are definitely many abstract concepts in these fields that don’t have a geometrical interpretation associated with them).

There might be many posts out there on ROC and what it is and how it’s used to determine the probability threshold for a classification problem, but how the geometry of the problem changes when we pick another threshold value is something I’ve never encountered anywhere. So in this post, I am going to try to explain everything intuitively. ELI5, or what’s the point, right?

Problem Statement

Let’s start with a simple example that’s often used. Say we are faced with predicting whether a tumor is benign or malignant, and assume we have a bunch of features available to us for this task. For illustration purposes, I am going to keep the graphs and equations really simple, because the point is to understand the concept so we can apply it later in more complex scenarios.

So, we go ahead and build a classification model, and the line/curve it comes up with to separate the categories is shown below -

As you can see above, the line does a pretty good job of classification, and the accuracy is high too. So, do we care about anything more? Yep! In the case above, there are two malignant tumors that have been classified as benign, and that’s not a good thing for this particular problem. It really depends on the problem whether we want our criterion of success to be overall accuracy, true positive rate, false negative rate, etc. For a problem like the one above, we definitely want to classify as many malignant tumors correctly as possible, without caring much about overall accuracy or about misclassifying a few more benign ones as malignant. This is because the goal of such a classification problem should be to identify all malignant cases and intervene for patient care.

So, for problems where we consider both classes to be of equal importance, we can take 0.5 as our probability threshold (anything at or above 0.5 is one class and anything below is the other). But for problems like the one above, if we can choose a threshold such that it also covers the left-out malignant cases, that would be great. And as I will show, choosing a different threshold basically means moving our curve/line of separation a little bit.

But before that, here’s some math. You will note the following things from the image below -

  • The curve/line of separation corresponds to z = 0, which happens at a probability of 0.5. For ease of illustration, y = x has been chosen
  • The region above the line of separation is where z is positive and corresponds to the region where the probability is larger than 0.5. Illustrated by the point (3, 5)
  • The region below the line of separation is where z is negative and corresponds to the region where the probability is less than 0.5. Illustrated by the point (5, 3)

So, as seen above, a probability of 0.5 corresponds to the line y − x = 0.

What about a probability of 0.4 or less? What does that mean geometrically?
We know that anything less than 0.5 means we’re talking about the region below the line: a probability p < 0.5 means that z is negative (we can see from the sigmoid function that z goes negative when the probability drops below 0.5). For illustration purposes, let’s assume some probability p < 0.5 (maybe 0.4 or lower) such that the value of z is −0.7. What does z = −0.7 mean for the curve?
Well, it means y − x = −0.7, or y = x − 0.7, which is basically another line shifted down a bit from the original line, as shown below -

So, essentially, setting the threshold to anything but 0.5 shifts the line or curve. And now we can see the benefit of such shifting: the malignant tumors which were previously classified as benign are now being classified correctly. So, the idea is to shift the line/curve such that it captures what we want for our problem; in this case we wanted to increase the True Positive Rate, without worrying too much about overall accuracy. So, now the question is: how do we actually pick a good threshold? And that is exactly what the ROC (Receiver Operating Characteristic) curve lets us do.
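The claim above, that z = 0 corresponds to probability 0.5 and a negative z (like the −0.7 used earlier) to a probability below 0.5, is easy to check numerically. A quick sketch:

```python
import math

def sigmoid(z):
    # logistic function: maps z in (-inf, inf) to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# z = 0 sits exactly on the separating line and maps to probability 0.5
print(sigmoid(0.0))   # 0.5

# a negative z (e.g. the -0.7 from the text) maps below 0.5
p = sigmoid(-0.7)
print(round(p, 3))
```

So choosing a lower probability threshold is the same as accepting points with a more negative z, i.e. points further below the original line.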

ROC Curve

Now that we know why we are here, the idea is pretty simple: try out a range of probabilities (shift your curve/line well above z = 0 and below z = 0) and, for each probability, capture the True Positive Rate (generally you want this to be high) as well as the False Positive Rate (generally you want this to be low). Now, plot these out, something like below, and choose the value of the probability threshold that makes the most sense for your problem. You can also use the ROC curve to find the AUC (Area Under the Curve) to get a sense of how your model is doing. You can also plot out various models’ ROC curves and then see which model and which probability threshold makes the most sense for your problem -
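The threshold sweep just described can be tabulated by hand. Here is a minimal Python sketch; the scores and labels are invented for illustration, not from any real model:

```python
def roc_points(scores, labels, thresholds):
    """For each probability threshold, compute (FPR, TPR).

    scores: predicted probabilities of the positive class
    labels: 1 for positive (e.g. malignant), 0 for negative (benign)
    """
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        tpr = tp / (tp + fn)  # true positive rate
        fpr = fp / (fp + tn)  # false positive rate
        points.append((fpr, tpr))
    return points

# hypothetical model scores and true labels
scores = [0.9, 0.8, 0.7, 0.55, 0.45, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,    1,    0,   0,   0  ]
thresholds = [0.5, 0.42]
for t, (fpr, tpr) in zip(thresholds, roc_points(scores, labels, thresholds)):
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Note how, in this toy data, lowering the threshold from 0.5 to 0.42 captures the positive case that was previously missed (TPR goes from 0.75 to 1.0) without changing the FPR, which is exactly the kind of trade-off the ROC plot makes visible.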

Cross Validation — Intuitively Explained

Cross Validation is not a difficult topic. But when it comes to understanding how to get the tuning parameter using Cross Validation, a lot of people get confused. Hopefully, this blog will help out a little bit.

Let’s start from the beginning.

What is Cross Validation?

Cross Validation is a general technique used to identify the best-performing model out of a bunch of given models.

Let’s say we have some data and we divide it into train and test, something like this -


But why is only this 25% used to test? Why not the first 25%? Even if we randomly take a certain 25% of the data, why only that? The point is that if we train a model with a certain 75% of the data and use a certain 25% for testing, then we’ve introduced data bias into our model: it works well for ‘that’ 75% and ‘that’ 25%. It also begs the question: is only a certain 25% of the data good for testing, and only a certain 75% good for training?

Wouldn’t it be better if we could somehow leverage the whole data set for testing as well as training?

And this is where K-fold Cross Validation comes into play.

Basic Idea

Let’s say we have some data like so -

D = {(x1, y1), (x2, y2), …, (xn, yn)}

and some ML algorithms m1, m2, …, mc.
By the way, are these models yet? I guess not! These are not ready to make any predictions. They are simply a configuration of predictor and feature variables. They will only become models once they pass through a data set. So, these c algorithms can be any mix of algorithms that you think should solve your problem; for a classification problem these could be logistic regression, SVM, Neural Nets, etc.

Anyhow, so here’s what K-fold CV does -

Step 1 - Set some value of K. 5 or 10 are very common choices. Now, shuffle/permute the data randomly once.

Step 2 - Split the data into K folds (let’s say 5).

Step 3 - This is just for illustration purposes. Since there are K folds (5 in this case), each ML algorithm will go through K iterations of training and testing.






For each of the iterations shown above, we have a test set and the rest is the training set. Now, for each algorithm m in your set -

  1. Train the model on the Iteration 1 training set and get the error using its test set.
  2. Train the model on the Iteration 2 training set and get the error using its test set.

    … and so on for K iterations. Now, find the average error as (1/K) × (sum of all errors). Note that the errors will be misclassified cases for classification problems and residuals for regression problems. Alternatively, we can compute accuracy for classification problems.

We repeat the above for all our algorithms and choose the one that has the lowest average error (or highest accuracy).
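The steps above can be sketched in a few lines of Python. Here `mean_model` is a made-up stand-in for one of the algorithms m, just so the sketch runs end to end:

```python
import random

def k_fold_cv(data, k, train_and_score):
    """Shuffle once, split into k folds, and average the test error.

    data: list of (x, y) records
    train_and_score: callback that fits a model on the training split
                     and returns its error on the test split
    """
    data = data[:]
    random.shuffle(data)                    # Step 1: permute once
    folds = [data[i::k] for i in range(k)]  # Step 2: k folds
    errors = []
    for i in range(k):                      # Step 3: k train/test rounds
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        errors.append(train_and_score(train, test))
    return sum(errors) / k                  # average error across folds

# toy "model": predict the mean of y seen in training,
# scored by mean squared error on the test split
def mean_model(train, test):
    mean_y = sum(y for _, y in train) / len(train)
    return sum((y - mean_y) ** 2 for _, y in test) / len(test)

data = [(x, 2 * x) for x in range(20)]
print(k_fold_cv(data, 5, mean_model))
```

To compare algorithms, you would call `k_fold_cv` once per candidate and keep the one with the lowest average error.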

So, what do we gain by this exercise? Just this: if there was going to be some bias in our final model due to data selection bias, then we have gotten rid of it and have hopefully selected a better model.

How can K-Fold CV be used for the Complexity Parameter?

When we build a tree, we run the risk of building a big, overfitted tree. Think about it: if there’s nothing stopping tree growth, the tree will try to fit the training data as well as possible. Overfitting simply means that prediction might not be good on test data. So, how do we get around this problem? Well, we need to ‘prune’ our trees.

And what’s pruning? It’s the merging of nodes to make the tree shorter. As shown below, the tree is pruned at node t2.












As you can guess, the more pruned a tree is, the more Sum of Squared Error (for regression trees) or misclassification error (for classification trees) it will have: the tree with just one node will have the most error (the most underfit tree) and the largest tree will have the least error (the most overfit tree). The job of pruning is to find the balance here, so that we can identify a tree model that neither overfits nor underfits. And pruning does this by something called Cost Complexity Pruning.




In simple words, the idea is this: we add an extra term α|T| (the Tree Cost Complexity) to the total cost and seek to minimize the overall cost. |T| is the total number of terminal nodes in the tree and α is the complexity parameter. In other words, this term is basically penalizing big trees: when the number of leaves in the tree increases by one, the cost increases by α. Depending on the value of α (≥ 0), a complex tree that makes no errors may now have a higher total cost than a small tree that makes a number of errors! This is how this term enables us to find a good tree.
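A tiny numerical illustration of that flip; the error counts and leaf counts below are invented:

```python
def total_cost(error, num_leaves, alpha):
    # cost-complexity: error on the data plus a penalty of alpha per leaf
    return error + alpha * num_leaves

# a big tree that makes no errors vs a small tree that makes a few
big_tree = {"error": 0, "leaves": 20}
small_tree = {"error": 4, "leaves": 3}

for alpha in [0.0, 0.5]:
    big = total_cost(big_tree["error"], big_tree["leaves"], alpha)
    small = total_cost(small_tree["error"], small_tree["leaves"], alpha)
    print(f"alpha={alpha}: big tree cost={big}, small tree cost={small}")
```

At α = 0 the big, error-free tree wins (cost 0 vs 4); at α = 0.5 its 20 leaves cost it 10, and the small tree wins at 5.5.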

Also, convince yourself that for a given α, there can only be one tree T that will minimize the loss function.

But how do we find α?

And this is where CV comes in. Here’s what the APIs do, as an overview (of course, there will be differences in implementations and more subtleties, but the point here is to get the intuition right) -

We take a range of α’s we want to try out. For each of these α’s, a given data set and a pre-determined value of K (for CV), we do the following -

  • For α1, find the best tree for Fold 1
  • For α1, find the best tree for Fold 2
  • … and so on; for α1, find the best tree for Fold K
  • For α2, find the best tree for Fold 1
  • For α2, find the best tree for Fold 2
  • … and so on; for α2, find the best tree for Fold K
  • … and so on, for however many α’s you want to try out (typically trying out anything between 30 and 50 values is probably enough)

So, for each α, we find the average accuracy of the model (accuracy in the case of classification models, or RMSE/SSE in the case of regression models; let’s go with accuracy for illustration purposes). We can plot this out like below (in practice you won’t need to plot it; the CV API can tell you the best value of alpha), see for which value of α we get the highest accuracy, choose that α, plug it into our original tree’s cost function, and build a good tree! Note that CV is generally not used this way for Random Forests or GBM-style algorithms, because these algorithms already have a lot of randomness built into them, so the chance of overfitting is greatly reduced.

Also, note that the shape of the graph below is what it is because a lower value of α favors bigger trees, so accuracy will be high (if not the highest), and as its value keeps getting bigger it favors shorter and shorter trees, and the shorter the tree, the lower its accuracy. A sweet spot is, therefore, somewhere in the middle.

So, once we find our α, we can use it to build our new tree model rather than using arbitrary parameters like minbucket, etc. to limit our trees.

Think about what CV did here: we had a bunch of models built using different values of α, we used CV on them to find a good value of α, and we then used that α to find a good single tree model. So, this is just an application of what CV does in general - find a good model, given a bunch of models.

Understanding GBM intuitively

A few years back, I was assigned the task of predicting the probability of students dropping out of college so that there could be some intervention beforehand to help students at risk.
For this exercise I first had to build a person-matching algorithm between credit card data and the student data, to be able to find out the financial condition of a student’s family, since a lot of the time this is also a factor in dropping out. But that’s a topic for another day.

In this post, I am going to write a little bit about the main ML algorithm I used: GBM, or Gradient Boosting Machine. It’s also one of the top algorithms used in Kaggle competitions and is pretty reliable. When I used this algorithm, I was already well-versed in Decision Trees, Random Forests, etc., but I still had a hard time getting the intuition behind GBM: why it would work and why it was designed the way it is. In this post, I am going to try to break this down for the reader, but the post assumes that the reader is already familiar with how Decision Trees (Regression as well as Classification) are grown.

Before we dive into GBM, let’s discuss a little bit about how the world got here -

  1. Single Trees - First came Decision Trees. These are single trees built from training data and are only good for a visual representation of the problem. Never use them in production, because they are very sensitive to data: the predicted values change easily with even slight variations in training data. Also, definitely do not use simple trees when the data is skewed in favor of one class; they will almost always predict that same class.
  2. Bagging - Due to the problems arising from single trees, people started to look at - well, what do people look at when single things don’t work well? Create many models and average out the result. That’s called Bagging. It comes from the phrase ‘bootstrap aggregating’. There are two techniques involved here: bootstrapping (of data) and aggregating (averaging of models). In the bootstrapping phase, a number of training data sets are created by ‘sampling with replacement’. What does that mean? It simply means that a data set can be created from the original data set where records are randomly sampled, and since sampling excludes nothing (not even records drawn earlier), a lot of records may be repeated. And we create many such data sets; as you can see, this introduction of randomness into the data basically reduces the possibility of overfitting. That was for the data sets. Now, for the model: instead of choosing just one model, one can create a number of models with certain variations. These m models can then be fitted to the m new data sets and the results averaged. Random Forest is basically just this: to create the different models, when it builds each tree, only a certain number of features are chosen randomly for each split.
  3. Boosting - Random Forests work well. The randomness in the data and the randomness in the models mean that the chance of overfitting is greatly reduced. But does that necessarily mean a good model? Not really! And the reason is quite simple. A Random Forest is basically taking a majority vote (or the average value, in the case of a forest of regression trees) from a number of independently created models. That’s right: the trees in an RF are not related to each other at all, so there’s no chance that they are learning from each other’s mistakes. That’s where Boosting comes in! It’s a technique where a number of models are created and all are used to get the final vote (or value), but each new model is built by improving on the previous model. So, boosting is basically saying: ‘I have this stupid initial model, which is basically an average model. Now, let’s build another model/tree based on it such that it’s a little better. And let’s keep doing this till we build (say) 100 models!’
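The ‘sampling with replacement’ idea behind bagging takes only a couple of lines to see in Python (a sketch, not any library’s API):

```python
import random

def bootstrap_sample(data, rng):
    # sampling WITH replacement: the same record can be drawn multiple times
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(0)   # seeded for reproducibility
data = list(range(10))   # stand-in for 10 training records
sample = bootstrap_sample(data, rng)
print(sorted(sample))    # duplicates are likely, some records missing
```

Each bagged tree would be trained on a different such sample; a Random Forest additionally randomizes which features each split may use.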

In short, here’s what GBM does: it computes an initial single prediction value. Then it computes residuals between predicted and actual values, and builds a tree to predict those residuals. It then replaces the initial predictions with new ones, the new prediction value being: initial prediction value + predicted residual value. Once we get new prediction values, we again calculate new residuals and build another tree to predict those residuals. We keep doing this for M trees, where M can be a user input. As you can see, any tree that’s generated depends on the earlier prediction values. This is why GBM is said to be additive, and it’s an ensembling technique. This is also the main idea behind boosting: using previous predictions to come up with new models. In our case, if you look closely, we build our first tree based on the residuals from the first prediction, our second tree based on the residuals from the first tree, and so on.

Here’s what GBM looks like -

This can seem daunting at first, but it’s not.

Here’s step 0 - Let’s assume we have some training data available to us of the form (xi, yi), for i = 1 to n, where y is a continuous value (let’s just take regression as the example to understand GBM; classification is similar but needs some transformations between the steps). x here is not one feature but the values of all features for a single record, and n is the total number of records. Here’s a visual representation of this data -

Now, let’s get to step 1.
This step asks us to calculate f0(x). This is a single value; as a matter of fact, GBM starts with a single prediction value. If you were given a bunch of data and asked to give a single value that you would predict for that data, what would it be? You’d simply take the average of the y values from the training data, and for any incoming new data, this is what you’d predict. As we’ll see, f0(x) here, for regression, is basically the average of all y values in the data.

L is the loss function that we need to minimize. Nothing new here; all ML algorithms minimize a loss/cost function. The inputs to L are the y values from the training data and the predicted value γ. Note that since we only need one prediction value, it is going to be the same for all records. For regression, the loss function usually taken is ½ × (actual − predicted)².

There are other loss functions for regression that one could use, but let’s go with the sum of squared residuals, which is also very common. The loss function must be differentiable and convex so that we can apply gradient descent to it, and we know that the sum of squared residuals has these properties. Below is a rough graph of the loss function, for one feature variable, plotted against prediction values. The shape of the curve comes from the square term in the loss function. Note that we now need to take its derivative and set it to 0 to get to the point where the pink line touches the graph horizontally, since that’s where the value of the loss function is lowest.

If you compute the derivative, it comes out to −(actual − predicted). Now, let’s say we have 3 records where y was 1, 4, 5. Putting these into the equation and setting it to zero, we get: −(1 − predicted) − (4 − predicted) − (5 − predicted) = 0. Solving this for predicted, we get ->

predicted = (1 + 4 + 5) / 3
So, f0(x) is nothing but the average of all y values.
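You can also verify numerically that the average minimizes this loss; a quick sketch using the three y values above:

```python
def loss(y_values, predicted):
    # sum of half squared residuals, as in step 1
    return sum(0.5 * (y - predicted) ** 2 for y in y_values)

y = [1, 4, 5]
mean_y = sum(y) / len(y)   # (1 + 4 + 5) / 3

# the mean gives a lower loss than nearby candidate predictions
for candidate in [mean_y - 0.5, mean_y, mean_y + 0.5]:
    print(round(candidate, 3), round(loss(y, candidate), 3))
```

Moving the prediction away from the mean in either direction increases the loss, which is exactly what setting the derivative to zero told us.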

Let’s move to step 2.
In this step, M is the total number of trees we want to build. This is usually a user input, like 100 (or 500!).

All 2(a) is doing is calculating the residuals. The derivative you see is the same as in step 1. Note that calculating the residuals is simply plugging in values from the previous prediction (f = f of m−1), and the derivative of L comes out to be −(actual − predicted). Notice the minus sign applied to the overall value? That’s there to cancel the minus sign in −(actual − predicted). And what’s the ‘predicted’ value? It is simply f0(x) when m = 1, f1(x) when m = 2, and so on.

In 2(b), we are building a regression tree to predict the residuals from 2(a). The tree is built with a set number of nodes (or a set number of terminal leaves) so that we don’t end up growing super large trees. Note that j in this step is the leaf index and m is the tree index.

In 2(c): in 2(b) we built a regression tree, but it may well happen that some of the leaves we get contain more than one value. So what’s the single value for a given terminal region R? It’s simple: it’s the average of the values present in that terminal leaf/region! 2(c) simply updates the tree with the final terminal region values. This is why the equation in this step resembles the equation from step 1 - all it’s doing is averaging!

2(d): This makes the new predictions. All it’s doing is taking the previous model’s prediction value and adding to it the predicted residual value to get a new prediction value for each record.

Note that a learning rate (something between 0 and 1) is also usually applied to the residual value, so as to lessen the effect of variance.

Once 2(d) is done, we go back to 2(a), calculate residuals, then predict those residuals, and so on, till M.
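The whole loop can be sketched end to end if we let a one-split ‘stump’ stand in for the regression tree of 2(b). This is a toy under that assumption, not a faithful GBM implementation:

```python
def fit_stump(x, residuals):
    """One-split regression 'tree': pick the split on x that best
    separates the residuals, predicting the mean on each side."""
    best = None
    for split in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda xi: lm if xi <= split else rm

def gbm_fit(x, y, M=50, lr=0.1):
    f0 = sum(y) / len(y)                    # step 1: initial prediction
    preds = [f0] * len(y)
    stumps = []
    for _ in range(M):                      # step 2: M boosting rounds
        residuals = [yi - p for yi, p in zip(y, preds)]          # 2(a)
        stump = fit_stump(x, residuals)                          # 2(b) + 2(c)
        stumps.append(stump)
        preds = [p + lr * stump(xi) for p, xi in zip(preds, x)]  # 2(d)
    return lambda xi: f0 + lr * sum(s(xi) for s in stumps)       # step 3

# toy data: y jumps from 2 to 8 at x = 3
x = [1, 2, 3, 4, 5, 6]
y = [2, 2, 2, 8, 8, 8]
model = gbm_fit(x, y)
print(round(model(2), 2), round(model(5), 2))
```

With enough rounds the additive model walks the initial mean of 5 down toward 2 on the left of the jump and up toward 8 on the right, each stump correcting a fraction (the learning rate) of the remaining residual.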

Once step 2 finishes, our model (step 3) is ready for prediction.

It’s by no means an easy algorithm to grasp, but hopefully you can start with a very simple data set and work through these steps as explained to gain more understanding of the algorithm.

Simple Single Document Summarizer

A few years back I created a Single Document Summarizer: a statistical, heuristics-driven Java application (with a GUI) that takes in text and summarizes it. In this blog post, I am going to discuss some techniques I used for summarization. These are rudimentary techniques but they still work well.

Code can be found at - (might have a few repetitive directories as well)

Sentence extraction is done via regex, after which the normal NLP preprocessing is done: stemming (using Porter’s rule-based algorithm) and stop-word removal. After tokenization, we have a sentence-to-tokens mapping. This is all pretty straightforward. I also used the DUC 2002 (Document Understanding Conference) corpus to test the algorithm for its efficacy, but all that’s for the paper.

Now, let’s discuss the features used to give weights to sentences -

  • Topic Segmentation Words - If we are able to find important words that correspond to sub-topics in the text, then sentences containing those words are probably important and should be given more weight. This is also important for covering the topics written about in the text.

For finding such words we have used tf-isf (term frequency - inverse sentence frequency) along with a word density score dens(w).

The tf.isf score is -


The above equation calculates tf.isf for word w in sentence s. stf(w, s)/|s| is the normalized frequency of word w in sentence s, where |s| is the total number of words in s. The log term is the inverse sentence frequency. Here, Ns is the total number of sentences in the document and sf(w) is the number of sentences in which w occurs at least once. This gives us a handle on the distribution of the word throughout the document: if a word occurs very often throughout the document then its isf score will be low, and if it occurs in only a few places then its isf score will be high. The intuition is that words appearing in only a few places are better candidates for topic segmentation.
But there’s a problem. Even if a word appears in very few places (i.e., it has a good isf), if the places where it appears are very far apart in the text then that word is also not well suited for topic segmentation. The intuition here is that if a word is representative of a sub-topic then it should occur frequently in one specific region and infrequently elsewhere. For this purpose we calculate the word density as follows:


Here, occur(k) and occur(k+1) represent consecutive positions of w in the text, and the dist function calculates the distance between them in words. |w| is the total number of occurrences of w in the document. When we sum up these inverse distances we get a higher value for words that are dense in one region and lower values for dispersed words.

Combining these two, we have: tf.isf(w, s) × dens(w)
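A small Python sketch of these two scores, on invented toy sentences (tokenization, stemming and stop-word removal omitted):

```python
import math

def tf_isf(word, sentence, sentences):
    """tf.isf for a word in a sentence; sentences are lists of tokens."""
    stf = sentence.count(word) / len(sentence)     # normalized term frequency
    ns = len(sentences)                            # total sentences Ns
    sf = sum(1 for s in sentences if word in s)    # sentences containing word
    return stf * math.log(ns / sf)

def density(word, tokens):
    """Sum of inverse distances between consecutive occurrences of word."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    if len(positions) < 2:
        return 0.0
    return sum(1.0 / (b - a) for a, b in zip(positions, positions[1:]))

sentences = [
    ["roc", "curves", "help", "pick", "thresholds"],
    ["thresholds", "shift", "the", "separating", "line"],
    ["pruning", "controls", "tree", "size"],
]
tokens = [t for s in sentences for t in s]
w = "thresholds"
score = tf_isf(w, sentences[0], sentences) * density(w, tokens)
print(round(score, 4))
```

Here ‘thresholds’ appears in two of the three sentences and its two occurrences are adjacent in the token stream, so it gets a high density and a moderate isf, exactly the combination the combined score rewards.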

  • Sentence Location - The intuition here is that starting and ending sentences are probably important. So, we normalize the sentence positions between 0 and 1 and give more weight to starting and ending sentences.
  • Position of the next sentence - The position of a sentence may also have an effect. For example: “Sachin is an excellent batsman. He lives in Mumbai. He has played a lot of cricket. A lot of cricket is played in India”. Here we are talking about a player named “Sachin”. The first sentence is important because we continue to talk about “Sachin” in the second sentence. The second sentence also finds its relevance since we continue to talk about “Sachin” in the third sentence. Now, the third sentence is not as important as the first and second ones, since the fourth sentence does not talk about “Sachin”. So, let’s give importance to a sentence if the sentence following it refers back to it. We have used the approach of identifying cue words like ‘alternatively’, ‘although’, ‘accordingly’, etc. to find such sentences. The formula used is -

weight = number of cue phrases in the sentence / total number of words in the sentence

Also, if a sentence is followed, in the same paragraph, by a sentence that starts with a pronoun, then we add 0.1 to the weight of the sentence.

  • Title words - If a sentence contains words that are used in the title (excluding stop words), then the sentence may be indicative of what the text is about. As such, we give more weight to such sentences.
  • Theme words - We try to find words that are representative of the themes present in the text. For this, we sort the normalized term frequencies and take the 4 words with the highest frequencies. The sentence weight for this feature is -

weight = number of theme words in the sentence / total number of words in the sentence

What are normalized term frequencies? For a given term t1, the normalized term frequency is tf1 = total frequency of t1 / max term frequency.

  • Proper Nouns - It can be argued that sentences containing proper nouns have more information to convey. So, we can use this to add weight to sentences too.

weight = number of proper nouns in the sentence / total number of words in the sentence

  • Sentence Length - Lengthier sentences contain more information.

weight = number of words in the sentence / max sentence length in the document

  • Punctuation - Certain punctuation marks help identify important sentences. For example, an exclamation mark (!) may signal some sudden thought or emotion. Similarly, a question followed by an answer should also carry good information.

weight = total punctuation in the sentence / total words in the sentence

We’ve omitted some punctuation marks, and ? and ! have been given more importance (adding 1.0 for each of these).

  • Numeric Data - Sentences containing numerical data can be important. They can carry important statistics.

weight = total numerical data in the sentence / total words in the sentence

After we have these individual weights for each sentence, we combine them using a linear combination to find the total sentence weight: α(sentence location weight) + β(weight due to next sentence) + γ(title word weight) + δ(term frequency weight) + ε(theme word weight) + ζ(weight due to proper nouns) + η(weight due to cue phrases) + θ(weight due to topic segmentation words) + ι(weight due to sentence length) + κ(weight due to punctuation) + λ(weight due to numeric data), where the Greek letters are weights between 0 and 1 and can be tweaked to influence the contribution of each feature.

Once this is done, we can select the top x% of sentences by rank and then re-arrange them in the summary in the same order they appeared in the original text. Here, x can be a user input.
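That selection step could look like this (a sketch; the sentence weights below are invented):

```python
def summarize(sentence_weights, x_percent):
    """Pick the top x% of sentences by weight, then restore original order.

    sentence_weights: list of (original_index, combined_weight)
    """
    n = max(1, round(len(sentence_weights) * x_percent / 100))
    top = sorted(sentence_weights, key=lambda p: p[1], reverse=True)[:n]
    return [i for i, _ in sorted(top)]   # re-sort by position in the text

# hypothetical combined weights for 6 sentences
weights = [(0, 0.9), (1, 0.2), (2, 0.7), (3, 0.1), (4, 0.8), (5, 0.3)]
print(summarize(weights, 50))   # -> [0, 2, 4]
```

The final re-sort by index is what keeps the summary reading in document order rather than in weight order.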

Minimal Gradle Post | What you need to know

I’ve been using Gradle for some time at the workplace and I find it to be a good build tool. In the past I’ve used SBT, Maven and NAnt, as well as MSBuild. It was only MSBuild that I grew very familiar with, back in my .NET days.

This is a minimal Gradle post for whoever wants to get started with it with little or no fuss. Let’s get right into it.

What’s Gradle used for - It’s a build management tool. This means you can use Gradle to manage building your projects, manage dependencies within projects and external to them, manage how tests are run, write pre/post-build tasks/targets, and so on. You get the idea.

Two main ideas in Gradle - Projects and tasks

A Gradle build can consist of one or more projects, and each project can have one or more tasks. Tasks are nothing but units of work that need to be done. A Gradle project needs at least one build.gradle file. If there is only one such file, it needs to be present in the root directory of the project, but you can additionally have a build.gradle for each project in your setup. Gradle build files contain DSL code and can also contain Groovy or Kotlin code. The DSL is basically a declarative kind of language.

At this point, let’s dive into code and see a few things first-hand. I will be using IntelliJ IDEA to set up a Gradle project. You are free to use any editor of your choice as long as it has support for Gradle.

Download IntelliJ, install the Gradle plugin (it should be checked by default), and set up a new ‘Gradle’ project. You will see a build.gradle file set up for you, as well as a settings.gradle file.

Note that you will also see a few more things: the Gradle wrapper files, including the gradlew and gradlew.bat scripts.

So, what’s the Gradle wrapper?
Say you run multiple projects on your laptop. Some might be using SBT, some Maven, some Gradle, etc. The Gradle wrapper lets you use a Gradle local to your project by providing Gradle jars that don’t need to be available system-wide, meaning you won’t need to install Gradle on your system. You can use the Gradle jars and scripts that come with the wrapper, and that’s it. To use Gradle through the wrapper, you submit your Gradle commands to the scripts that ship with it (gradlew.bat for Windows and gradlew for *nix) in much the same way as you would use the ‘gradle’ command from the command line.

So, for example, a normal Gradle command could look something like

gradle build

but if you don't want to install Gradle on your system and would rather just use the wrapper, you will write something like

./gradlew build (or gradlew.bat build on Windows)

Project listing goes in settings.gradle
settings.gradle will contain at least one project name, the root project; by default it's the name of the directory containing the main/root project of your setup. You can, of course, change this name to anything of your liking. Over time, if you have more projects, their names also go into the settings.gradle file.
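For example, a minimal settings.gradle might look like this (the project name here is just a placeholder):

```groovy
//settings.gradle
rootProject.name = 'my-root-project' //defaults to the name of the containing directory
```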

The Gradle system is rich with plugins. All plugins available can be seen at this url. Plugins provide additional functionality to your Gradle setup. Plugins may ship pre-written tasks that you then don't have to write yourself, among other functionality.
For example, if you want Scala to be the language for your project and you'd like Gradle to make src/test directories for it, you can use the 'scala' plugin and 'refresh' your build.gradle, and you will see the Scala-related directories in your project structure.
(By the way, you can also do a 'refresh' by running gradle with the --refresh-dependencies option from the command line)

For instance, my build.gradle file for this exercise looks like this:

group 'org.practice'
version '1.0-SNAPSHOT' //when version is set you will have a jar file with
                       //a name like mylibName-1.0-SNAPSHOT.jar, in this case

apply plugin: 'scala'

sourceCompatibility = 1.8

repositories {
    mavenCentral()
}

dependencies {
    //example of how to target jars in a specific local directory
    //compile files('/home/vaibhav/spark/spark-2.1.0/core/target/spark-core_2.11-2.1.0.jar')
    compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.1.0'
    testCompile group: 'junit', name: 'junit', version: '4.11'
}
Couple of things to note above:
1. You can see how to apply a plugin.
2. When you set up your sample project (as mentioned above) as a Gradle project, you would have been asked to supply a groupId and an artifactId. GroupId, ArtifactId and Version, also called GAV, is a pretty standard way in the mvn world to identify a library. Gradle follows the same mvn convention and, in fact, you will find that the Gradle project structure is, by convention, much like the mvn project structure.


3. There is a sourceCompatibility option, meaning that you compile your code to be compatible with a certain version of the Java runtime.
4. Under 'repositories', you can list the repos Gradle will hit to search for the dependencies mentioned under 'dependencies'. You can mention any number of repos by providing a URL. A mavenCentral() entry is typically already declared in the generated build file.
You can also specify a local file system path to load dependencies from, something like

runtime files('lib/project1.jar', 'lib2/project2.jar')

There are many ways to specify how to pick, from where to pick, and what to exclude. I am sure you can find the right syntax when you need it.

5. You can see that // works for comments, as well as /* .... */
6. When mentioning which dependencies to download, you provide their GAV. By the way, if you don't tell Gradle where to cache your downloaded dependencies, it will do so by default at ~/.gradle/caches/
7. You can group your dependencies: above you can see that I don't need junit in my main sources but I do need it in my test sources, so I use the pre-built 'testCompile' group. Similarly for the 'compile' group.
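As a side note, Gradle also accepts the GAV coordinates as a single colon-separated string, a more compact equivalent of the map-style notation used in the build file above:

```groovy
dependencies {
    compile 'org.apache.spark:spark-core_2.11:2.1.0' //group:artifact:version
    testCompile 'junit:junit:4.11'
}
```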

If you'd rather not use the Gradle wrapper, you can install Gradle using brew on Mac or apt-get on Ubuntu, etc. You might need to set GRADLE_HOME if the installation does not already do that for you.
You can specify your Gradle-specific settings, like which JVM to use, whether to keep the Gradle daemon alive (recommended setting), etc., in a gradle.properties file. If you are using the Gradle wrapper, a gradle-wrapper.properties file (holding the wrapper's own configuration) is already generated for you.
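A gradle.properties file with a couple of such settings might look like this (the values are illustrative):

```properties
# gradle.properties
# keep the Gradle daemon alive between builds (recommended)
org.gradle.daemon=true
# JVM arguments for the Gradle build process
org.gradle.jvmargs=-Xmx1g
```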

Gradle Tasks
So far, we've talked about setting up Gradle and about projects. Let's take a look at tasks now.
A build.gradle file can have many tasks, and these tasks can be grouped or ungrouped. When not grouped, a task is considered a 'private' task and will show up under 'Other tasks' in the output of the command 'gradle tasks', which lists all tasks. Here's an example of a grouped task and an ungrouped task:

task calledOnce {
    println("I will get printed first")
}

task helloworld {
    group 'some group name'
    description 'this is desc for task helloworld'

    doFirst {
        println("hello world task begins")
    }

    doLast {
        println("hello world task ends")
    }
}

> gradle helloworld

> I will get printed first
> hello world task begins
> hello world task ends

doFirst and doLast are action blocks within a task. Statements in the task closure that are not inside either block go into the configuration phase by default and are executed once, before any task actions, no matter whether you call that particular task or not (which is why "I will get printed first" appears above even though calledOnce was never invoked).

Tasks can depend on each other
This is normal procedure for most build scripts. The execution order of tasks depends on which tasks depend on which ones.


defaultTasks 'compile' //since a default task has been specified, typing
                       //just 'gradle' on the command line will suffice

apply plugin: 'scala'

task compile(dependsOn: 'compileScala') {
    doLast {
        println("compile task")
    }
}

//shorter syntax for adding a doLast action to an existing task
compileScala.dependsOn 'clean'
compileScala.doLast {
    println("compileScala task given by scala plugin")
}

//the scala plugin already provides a 'clean' task, so add an action to it
clean.doLast {
    println("clean task")
}
Note that you can also create your own tasks by extending the DefaultTask class, but this is not something we are going to try in this post.

Multi-project build structure
Usually you will have a few sub-projects or modules in your project; they also need to be built and have their tests executed, and there might be dependencies among these sub-projects. Of course, Gradle gives you options here. Let's say you have a project structure like this:

— Web
— Core
— Util

In this case, your settings.gradle would contain an 'include' line, something like this:

include 'Core', 'Web', 'Util' //telling gradle we have sub projects

Now, the build.gradle file needs to cater for these sub-projects as well. There will be some tasks that are specific to sub-projects, some tasks common to all, etc. Take a look:

//will be called once for each project, including root
allprojects {
    //groupId: ....
    //version: ....
}

//does not apply to root
subprojects {
    //apply specific plugins, etc.
}

//root_project specific dependencies
project(':org.practice.root_project').dependencies {
    compile project(':org.practice.Core'), project(':Util')
    compile 'some lib'
}

//overall layout of such a multi-project build.gradle
allprojects {
}

subprojects {
}

//Core specific stuff
project(':Core') {
    dependencies {
    }
}

//Util specific stuff
project(':Util') {
}

//Web specific stuff
project(':Web') {
}
Note that sub-project specific stuff could also go under their own separate build.gradle files.
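For instance, a separate build.gradle inside the Core directory might look something like this (the plugin and the dependency shown are illustrative assumptions, not part of the project above):

```groovy
//Core/build.gradle -- applies only to the Core sub-project
apply plugin: 'scala'

dependencies {
    compile project(':Util') //Core depends on the Util sub-project
}
```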

Publishing artifacts
Published files are called artifacts. Publishing is usually done with the help of some plugin, like maven.

apply plugin: 'maven'

uploadArchives { //maybe some nexus repo or local
    repositories {
        mavenDeployer {
            repository(url: "some url")
        }
    }
}
Hope the above gives you the necessary information to get going with Gradle! Thanks for reading!