
Why ROC? — Intuitive Explanation

I’ve often found that I understand concepts much better when I understand the geometrical interpretation behind them. This is especially true for a lot of Math, Machine Learning and AI (although there are definitely plenty of abstract concepts in this field that don’t have a geometrical interpretation associated with them).

There are many posts out there on ROC, what it is, and how it’s used to determine the probability threshold for a classification problem. But how the geometry of the problem changes when we pick another threshold value is something I’ve never encountered anywhere. In this post, I am going to try to explain everything intuitively. ELI5, or what’s the point, right?

Problem Statement

Let’s start with a simple example that’s often used. Say we are faced with predicting whether a tumor is benign or malignant, and assume we have a bunch of features available to us for the task. For illustration purposes, I am going to keep the graphs and equations really simple, because the point is to understand the concept so we can apply it later in more complex scenarios.

So, we go ahead and build a classification model, and the line/curve it comes up with to separate the categories is shown below -

So, as you can see above, the line does a pretty good job of classification. The accuracy is high too. So, do we care about anything more? Yep! In the case above, there are two malignant tumors that have been classified as benign, and that’s not a good thing for this particular problem. Whether our criterion of success should be overall accuracy, true positive rate, false negative rate, etc. really depends on the problem. For a problem like the one above, we definitely want to classify as many malignant tumors correctly as possible, without caring much about overall accuracy or whether we misclassify a few more benign ones as malignant. This is because the goal of such a classification problem should be to identify all malignant cases and intervene for patient care.

So, for problems where we consider both classes to be of equal importance, we can take 0.5 as our probability threshold (anything at or above 0.5 is one class and anything below is the other). But for problems like the one above, if we can choose a threshold such that it also covers the left-out malignant cases, that would be great. And as I will show, choosing a different threshold basically means moving our curve/line of separation a little bit.

But before that, here’s some math. You will note the following things from the image below -

  • The curve/line of separation corresponds to z = 0, which happens at a probability of 0.5. For ease of illustration, y = x has been chosen
  • The region above the curve/line of separation is where z is positive and corresponds to the region where the probability is larger than 0.5. Illustrated by the point (3, 5)
  • The region below the curve/line of separation is where z is negative and corresponds to the region where the probability is less than 0.5. Illustrated by the point (5, 3)

So, as seen above, a probability of 0.5 corresponds to the line y - x = 0.

What about a probability of 0.4 or less? What does that mean geometrically?
We know that anything less than 0.5 means we’re talking about the region below the line: a probability p < 0.5 means that z is negative (we can see from the sigmoid function that z goes negative when the probability drops below 0.5). For illustration purposes, let’s pick some probability p < 0.5 (maybe 0.4 or less) such that the value of z is -0.7. What does z = -0.7 mean for the curve?
Well, it means y - x = -0.7, or y = x - 0.7, which is just another line, shifted down a bit from the original, as shown below -

So, essentially, setting the threshold to anything but 0.5 shifts the line or curve. And now we can see the benefit of such shifting: the malignant tumors which were previously classified as benign are now being classified correctly. The idea is to shift the line/curve so that it captures what we want for our problem; in this case we wanted to increase the True Positive Rate without worrying too much about overall accuracy. So now the question is: how do we actually pick a good threshold? And that is exactly what the ROC (Receiver Operating Characteristic) curve lets us do.
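The correspondence between a probability threshold and the z offset can be checked numerically. Here’s a minimal sketch, assuming z = y - x and the standard sigmoid (the numbers are illustrative, not from any real model):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # Inverse of the sigmoid: the z value at which sigmoid(z) == p
    return math.log(p / (1.0 - p))

# With z = y - x, the default threshold p = 0.5 gives the boundary y - x = 0
assert abs(logit(0.5)) < 1e-12

# A lower threshold corresponds to a negative z, i.e. the line shifts down.
# z = -0.7 means we classify as positive anything with predicted
# probability above sigmoid(-0.7), roughly 0.33
print(round(sigmoid(-0.7), 3))
```

So asking "what threshold covers those two malignant points?" is the same question as "how far down do I slide the line?", with the logit translating between the two.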

ROC Curve

Now that we know why we have reached this point, the idea is pretty simple: try out a range of probabilities (shift your curve/line above and below z = 0), and for each probability capture the True Positive Rate (generally you want this to be high) as well as the False Positive Rate (generally you want this to be low). Now plot these out, something like below, and choose the probability threshold that makes the most sense for your problem. You can also use the ROC curve to compute the AUC (Area Under the Curve) to get a sense of how your model is doing, and you can plot various models’ ROC curves and then see which model and which probability threshold make the most sense for your problem -
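The threshold sweep described above can be sketched in a few lines of Python; the labels and scores here are made-up stand-ins, just to show the mechanics:

```python
# Hypothetical predicted probabilities and true labels (1 = malignant)
y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]

def tpr_fpr(y_true, y_score, threshold):
    # Everything scored at or above the threshold is predicted positive
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

# Sweep thresholds from 0.0 to 1.0 and collect the ROC points
for t in range(11):
    tpr, fpr = tpr_fpr(y_true, y_score, t / 10)
    print(f"threshold={t / 10:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Plotting FPR on the x-axis against TPR on the y-axis for these points gives the ROC curve; each point is one candidate threshold, so picking a point on the curve is picking a threshold.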

Cross Validation — Intuitively Explained

Cross Validation is not a difficult topic. But when it comes to understanding how to get the tuning parameter using Cross Validation, a lot of people get confused. Hopefully, this blog will help out a little bit.

Let’s start from the beginning.

What is Cross Validation?

Cross Validation is a general technique used to identify the better performing model out of a bunch of given models.

Let’s say we have some data and we divide it into train and test, something like this -

But why only this 25% for testing? Why not the first 25%? Even if we randomly take a certain 25% of the data, why only that one? The point is that if we train a model with a certain 75% of the data and use a certain 25% for testing, then we’ve introduced data selection bias into our model: it works well for ‘that’ 75% and ‘that’ 25%. It also begs the question: is only a certain 25% of the data good for testing, and only a certain 75% good for training?

Wouldn’t it be better if we could somehow leverage the whole data set for testing as well as training?

And this is where K-fold Cross Validation comes into play.

Basic Idea

Let’s say we have some data like so -

D = {(x1, y1), (x2, y2), …, (xn, yn)}

and some ML models m1, m2, …, mc.
By the way, are these models yet? I guess not! They are not ready to make any predictions yet; they are simply a configuration of predictors and feature variables. They only become models once they pass through a data set. So, these c algorithms can be any mix of algorithms that you think should solve your problem; for a classification problem, these could be logistic regression, SVM, Neural Nets, etc.

Anyhow, here’s what K-fold CV does -

Step 1 - Set some value of K; 5 or 10 are very common choices. Now shuffle/permute the data randomly once.

Step 2 - Split the data into K folds (let’s say 5).

Step 3 - This is just for illustration purposes. Since there are K folds (5 in this case), each ML algorithm will go through K iterations of training and testing.

For each of the iterations shown above, we have a test set, and the rest is the training set. Now, for each algorithm m in your set -

  1. Train the model on the Iteration 1 training set and get the error using its test set.
  2. Train the model on the Iteration 2 training set and get the error using its test set.

    … and so on for K iterations. Now find the average error as (1/K) × (sum of all errors). Note that the errors will be misclassified cases for classification problems and residuals for regression problems. Alternatively, we can compute accuracy for classification problems.

We repeat the above for all our algorithms and choose the one with the lowest average error (or highest accuracy).
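The steps above can be sketched in plain Python. The "model" here is a hypothetical stand-in that just predicts the mean of the training y’s; any real learner with a train-and-score routine slots into the same loop:

```python
import random

def k_fold_cv(data, k, train_and_score):
    """Average error of a model over k folds.

    data: list of (x, y) pairs; train_and_score(train, test) -> error on test.
    """
    data = data[:]                            # Step 1: shuffle a copy once
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]    # Step 2: split into k folds
    errors = []
    for i in range(k):                        # Step 3: k train/test iterations
        test = folds[i]
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        errors.append(train_and_score(train, test))
    return sum(errors) / k                    # (1/K) x (sum of all errors)

# Toy stand-in model: "predict the mean of the training y's"; returns MSE
def mean_model(train, test):
    mean_y = sum(y for _, y in train) / len(train)
    return sum((y - mean_y) ** 2 for _, y in test) / len(test)

data = [(x, 2 * x) for x in range(20)]
print(k_fold_cv(data, 5, mean_model))
```

Running `k_fold_cv` once per candidate algorithm and keeping the lowest average error is exactly the model-selection step described above.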

So what do we gain by this exercise? Just this: if there was going to be some bias in our final model due to data selection bias, then we have gotten rid of it and have hopefully selected a better model.

How can K-Fold CV be used for the Complexity Parameter?

When we build a tree, we run the risk of building a big, overfitted tree. Think about it: if there’s nothing stopping tree generation, it will try to fit the training data as well as possible. Overfitting simply means that prediction might not be good on test data. So how do we get around this problem? Well, we need to ‘prune’ our trees.

And what’s pruning? It’s the merging of nodes to make the tree shorter. As shown below, the tree is pruned at node t2.

As you can guess, the more pruned a tree is, the more Sum of Squared Error (for regression trees) or misclassification error (for classification trees) it will have: a tree with just one node would have the most error (the most underfit tree), and the largest tree would have the least error (the most overfit tree). The job of pruning is to find the balance here, so that we can identify a tree model that neither overfits nor underfits. And pruning does this through something called Cost Complexity Pruning.

In simple words, the idea is this: we add an extra term α|T| (Tree Cost Complexity) to the total cost, and we seek to minimize the overall cost. |T| is the total number of terminal nodes in the tree and α is the complexity parameter. In other words, this term penalizes big trees: when the number of leaves in the tree increases by one, the cost increases by α. Depending on the value of α (≥ 0), a complex tree that makes no errors may now have a higher total cost than a small tree that makes a number of errors! This is how this term enables us to find a good tree.
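In symbols (writing R(T) for the tree’s total training error, a common notation; the original describes this only in words), the penalized cost being minimized is:

```latex
R_\alpha(T) = R(T) + \alpha\,|T|
```

Minimizing this over subtrees T of the full tree trades error against size: α = 0 returns the full tree, while a large α collapses the tree toward a single node.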

Also, convince yourself that for a given α, there can only be one tree T that minimizes this loss function.

But how do we find α?

And this is where CV comes in. Here’s what the APIs do, as an overview (of course, there will be differences in implementations and more subtleties, but the point here is to get the intuition right) -

We take a range of α’s we want to try out. For each of these α’s, a given data set, and a pre-determined value of K (for CV), we do the following -

  • For α1, find the best tree for Fold 1
  • For α1, find the best tree for Fold 2
  • …and so on: for α1, find the best tree for Fold K
  • For α2, find the best tree for Fold 1
  • For α2, find the best tree for Fold 2
  • …and so on: for α2, find the best tree for Fold K
  • …and so on, for however many α’s you want to try out (typically anywhere between 30 and 50 is probably enough)

So, for each α, we find the average accuracy of the model (accuracy for classification models, or RMSE/SSE for regression models; let’s go with accuracy for illustration purposes). We can plot this out like below (in practice you won’t need to plot it; the CV API can tell you the best value of alpha) and see for what value of α we get the highest accuracy. We choose that α, plug it into our original tree’s cost function, and build a good tree! Note that CV is not generally used this way for Random Forests or GBM-type algorithms; it won’t be as effective there, since those algorithms already have a lot of randomness built into them, so the chance of overfitting is greatly reduced.

Also, note that the shape of the graph below is what it is because when the value of α is lower, it favors bigger trees, and accuracy will be high (if not highest); as its value keeps getting bigger, it favors shorter and shorter trees, and the shorter the tree, the lower its accuracy. The sweet spot is, therefore, somewhere in the middle.

So, once we find our α, we can use it to build our new tree model rather than using arbitrary parameters like minbucket, etc. to limit our trees.

Think about what CV did here: we had a bunch of models built using different values of α, we used CV on them to find a good value of α, and then eventually used that α to find a good single tree model. So this is just an application of what CV does in general: finding a good model, given a bunch of models.
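As one concrete realization of the API-level overview above, here’s a sketch with scikit-learn (assumed available; exact details may differ across versions). `cost_complexity_pruning_path` enumerates the candidate α’s, and `cross_val_score` does the per-fold scoring:

```python
# Sketch: choose ccp_alpha for a decision tree via K-fold CV (scikit-learn)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas: the alphas at which pruning actually changes the tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]   # drop the last alpha (it prunes to the root)

# 5-fold CV accuracy for each alpha
mean_acc = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                    X, y, cv=5).mean()
    for a in alphas
]

# Pick the alpha with the highest average accuracy and build the final tree
best_alpha = alphas[int(np.argmax(mean_acc))]
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, max(mean_acc))
```

Plotting `alphas` against `mean_acc` reproduces the accuracy-vs-α curve discussed above.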

Simple Single Document Summarizer

A few years back I created a Single Document Summarizer: a statistical-heuristics-driven Java application (with GUI) that takes in some text and summarizes it. In this blog post, I am going to discuss some techniques I used for summarization. These are rudimentary techniques, but they still work well.

Code can be found at - (might have a few repetitive directories as well)

Sentence extraction is done via regex, after which the usual NLP preprocessing is done: stemming (using Porter’s rule-based algorithm) and stop-word removal. After tokenization, we have a sentence-to-tokens mapping. This is all pretty straightforward. I also used the DUC 2002 (Document Understanding Conference) corpus to test the algorithm for its efficacy, but all that’s for the paper.

Now, let’s discuss the features used to give weights to sentences -

  • Topic Segmentation Words - If we can find important words that correspond to sub-topics in the text, then sentences containing those words are probably important and should be given more weight. This also helps cover the topics written about in the text.

For finding such words, we use tf-isf (term frequency - inverse sentence frequency) along with a word density score dens(w).

The tf.isf score is -

tf.isf(w, s) = (stf(w, s) / |s|) × log(Ns / sf(w))

The above equation calculates tf.isf for word w in sentence s. stf(w, s)/|s| is the normalized frequency of word w in sentence s, where |s| is the total number of words in s. The log term is the inverse sentence frequency: Ns is the total number of sentences in the document, and sf(w) is the number of sentences in which w occurs at least once. This gives us a handle on the distribution of the word throughout the document. If a word occurs very often throughout the document, its isf score will be low; if it occurs in only a few places, its isf score will be high. The intuition here is that words appearing in only a few places are better candidates for topic segmentation.
But there’s a problem. Even if a word appears in very few places (i.e., it has a good isf score), if those places are very far apart in the text, then the word is still not well suited for topic segmentation. The intuition is that if a word is representative of a sub-topic, it should occur frequently within a specific region and infrequently elsewhere. For this purpose we calculate the word density as follows:

dens(w) = (1/|w|) × Σ 1/dist(occur(k), occur(k+1)), with the sum running over consecutive occurrence pairs k = 1, …, |w| − 1

Here, occur(k) and occur(k+1) represent consecutive positions of w in the text, and the dist function calculates the distance between them in terms of words. |w| is the total number of occurrences of w in the document. When we sum up these inverse distances, we get a higher value for words that are dense in one region and lower values for dispersed words.

Combining these two, we have: tf.isf(w, s) × dens(w)
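The two scores above can be sketched in Python. The tokenized sentences are hypothetical stand-ins (in the real application they come out of the stemming/stop-word pipeline), and this sketch assumes the density sum is normalized by the occurrence count |w|:

```python
import math

def tf_isf(word, sentence, sentences):
    """tf.isf(w, s) = (stf(w, s) / |s|) * log(Ns / sf(w))."""
    stf = sentence.count(word) / len(sentence)    # normalized frequency in s
    ns = len(sentences)                           # total sentences Ns
    sf = sum(1 for s in sentences if word in s)   # sentences containing w
    return stf * math.log(ns / sf)

def density(word, tokens):
    """Sum of inverse distances between consecutive occurrences of word,
    normalized by its total occurrence count |w|."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    if len(positions) < 2:
        return 0.0                                # a single occurrence has no spread
    inv_dists = sum(1.0 / (b - a) for a, b in zip(positions, positions[1:]))
    return inv_dists / len(positions)

sentences = [
    ["roc", "curves", "help", "pick", "thresholds"],
    ["thresholds", "shift", "the", "decision", "line"],
    ["the", "line", "separates", "the", "classes"],
]
tokens = [w for s in sentences for w in s]

# Combined topic-segmentation score for "thresholds" in the first sentence
score = tf_isf("thresholds", sentences[0], sentences) * density("thresholds", tokens)
print(score)
```

A word concentrated in one region gets a large density and a large isf, so the product singles out good sub-topic markers.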

  • Sentence Location - The intuition here is that starting and ending sentences are probably important. So we normalize the sentence positions between 0 and 1 and give more weight to starting and ending sentences.
  • Position of the next sentence - The position of a sentence may also have an effect. For example: “Sachin is an excellent batsman. He lives in Mumbai. He has played a lot of cricket. A lot of cricket is played in India.” Here we are talking about a player named “Sachin”. The first sentence is important because we continue to talk about “Sachin” in the second sentence. The second sentence also finds its relevance since we continue to talk about “Sachin” in the third sentence. But the third sentence is not as important as the first and second, since the fourth sentence does not talk about “Sachin”. So, let’s give importance to a sentence if the sentence following it refers to it. We have used the approach of identifying cue words like ‘alternatively’, ‘although’, ‘accordingly’, etc. to find such sentences. The formula used is -

weight = number of cue phrases in the sentence / total number of words in the sentence

Also, if a sentence is followed, in the same paragraph, by a sentence which starts with a pronoun, then we add 0.1 to the weight of the sentence.

  • Title words - If a sentence contains words that are used in the title (excluding stop words), then the sentence may be indicative of what the text is about. As such, we give more weight to such sentences.
  • Theme words - We try to find words that are representative of the themes present in the text. For this, we sort the normalized term frequencies and take the 4 words with the highest frequencies. The sentence weight for this feature is -

weight = number of theme words in the sentence / total number of words in the sentence

What are normalized term frequencies? For a given term t1, the normalized term frequency tf1 = frequency of t1 / maximum term frequency in the document.

  • Proper Nouns - It can be argued that sentences containing proper nouns have more information to convey. So we can use this to add weight to sentences too.

weight = number of proper nouns in the sentence / total number of words in the sentence

  • Sentence Length - Lengthier sentences contain more information.

weight = number of words in the sentence / maximum sentence length in the document

  • Punctuation - Certain punctuation marks help identify important sentences. For example, an exclamation mark (!) may signal some sudden thought or emotion. Similarly, a question followed by an answer should also carry good information.

weight = total punctuation marks in the sentence / total words in the sentence

We’ve omitted some punctuation marks, and ? and ! have been given more importance (adding 1.0 for each of them).

  • Numeric Data - Sentences containing numerical data can be important; they may carry important statistics.

weight = total numerical data items in the sentence / total words in the sentence

After we have these individual weights for each sentence, we combine them using a linear combination to find the total sentence weight: α(sentence location weight) + β(weight due to next sentence) + γ(title word weight) + δ(term frequency weight) + ε(theme word weight) + ζ(weight due to proper nouns) + η(weight due to cue phrases) + θ(weight due to topic segmentation words) + ι(weight due to sentence length) + κ(weight due to punctuation) + λ(weight due to numeric data), where the Greek letters are coefficients between 0 and 1 that can be tweaked to influence the contribution of each feature.

Once this is done, we can select the top-ranking x% of sentences and then re-arrange them in the summary in the same order they appeared in the original text. Here, x can be user input.
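The final ranking step can be sketched as below. The feature extractors, weights, and coefficients here are hypothetical stand-ins; only the combine-rank-reorder flow mirrors the description above:

```python
def summarize(sentences, feature_weights, coeffs, x_percent=30):
    """sentences: sentence strings in document order.
    feature_weights: one dict {feature: weight} per sentence.
    coeffs: dict {feature: coefficient in [0, 1]} (the Greek letters above)."""
    # Linear combination of per-feature weights for each sentence
    totals = [
        sum(coeffs.get(f, 0.0) * w for f, w in feats.items())
        for feats in feature_weights
    ]
    n_keep = max(1, round(len(sentences) * x_percent / 100))
    # Indices of the top-scoring sentences
    top = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)[:n_keep]
    # Re-arrange the selected sentences in their original order
    return [sentences[i] for i in sorted(top)]

sents = ["First point.", "An aside.", "Second point.", "Closing remark."]
feats = [
    {"location": 1.0, "title": 0.5},
    {"location": 0.2},
    {"location": 0.4, "title": 0.8},
    {"location": 0.9},
]
print(summarize(sents, feats, {"location": 0.6, "title": 0.9}, x_percent=50))
```

Tweaking the coefficients shifts which features dominate the ranking, which is exactly the tuning knob the Greek letters provide.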