
Why ROC? — Intuitive Explanation

I’ve often found that I understand concepts much better when I understand the geometric interpretation behind them. This is especially true for a lot of math, machine learning and AI (although there are definitely many abstract concepts in these fields that don’t have a geometric interpretation associated with them).

There are many posts out there on ROC, what it is and how it’s used to determine the probability threshold for a classification problem, but how the geometry of the problem changes when we pick another threshold value is something I’ve never seen explained anywhere. So in this post I am going to try to explain everything intuitively. ELI5, or what’s the point, right?

Problem Statement

Let’s start with a simple example that’s often used. Say we are faced with predicting whether a tumor is benign or malignant, and assume we have a bunch of features available to us for this task. For illustration purposes I am going to keep the graphs and equations really simple, because the point is to understand the concept so we can apply it later in more complex scenarios.

So, we go ahead and build a classification model, and the line/curve it comes up with to separate the categories is shown below -

As you can see above, the line does a pretty good job of classification, and the accuracy is high too. So, do we care about anything more? Yep! In the case above, there are two malignant tumors that have been classified as benign, and that’s not a good thing for this particular problem. It really depends on the problem whether we want our criterion of success to be overall accuracy, true positive rate, false negative rate, etc. For a problem like the one above, we definitely want to classify as many malignant tumors correctly as possible, without caring much about overall accuracy or whether we misclassify a few more benign ones as malignant. This is because the goal of such a classification problem should be to identify all malignant cases and intervene for patient care.

For problems where we consider both classes to be of equal importance, we can use 0.5 as our probability threshold (anything at or above 0.5 is one class and anything below is the other). But for problems like the one above, if we can choose a threshold such that it also covers the left-out malignant cases, that would be great. And as I will show, choosing a different threshold basically means moving our curve/line of separation a little bit.

But before that, here’s some math. You will note the following things from the image below -

  • The curve/line of separation corresponds to z = 0, which happens at a probability of 0.5. For ease of illustration, y = x has been chosen
  • The region above the curve/line of separation is where z is +ve and corresponds to the region where the probability is greater than 0.5. Illustrated by the point (3, 5)
  • The region below the curve/line of separation is where z is -ve and corresponds to the region where the probability is less than 0.5. Illustrated by the point (5, 3)

So, as seen above, a probability of 0.5 corresponds to the line y - x = 0.

What about a probability of 0.4 or less? What does that mean geometrically?
We know that anything less than 0.5 means we’re talking about the region below the line: a probability p < 0.5 means that z is negative. For illustration purposes, let’s just assume some probability p < 0.5 (maybe it’s 0.4 or lower) such that the value of z is -0.7 (we can see from the sigmoid function that z is negative when the probability goes below 0.5). What does z = -0.7 mean for the curve?
Well, of course it means y - x = -0.7, or y = x - 0.7, which is basically another line shifted down a bit from the original line, as shown below -

So, essentially, setting the threshold to anything but 0.5 shifts the line or curve. And now we can see the benefit of such shifting: the malignant tumors that were previously classified as benign are now being classified correctly. So, the idea is to shift the line/curve such that it captures what we want for our problem; in this case we wanted to increase the True Positive Rate, without being too fussy about overall accuracy. So, now the question is: how do we actually pick a good threshold? And that is exactly what the ROC (Receiver Operating Characteristic) curve lets us do.
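
To make the link between the threshold and the shift concrete, here is a minimal sketch. It assumes the usual logistic setup, where the predicted probability is sigmoid(z) and, in our toy example above, z = y - x; in that case choosing a threshold p is the same as moving the boundary to y - x = ln(p / (1 - p)) -

    import math

    def boundary_offset(threshold):
        """z at which sigmoid(z) equals the chosen probability threshold (the logit)."""
        return math.log(threshold / (1 - threshold))

    for p in (0.5, 0.4, 0.33):
        print(f"threshold {p:.2f} -> decision boundary y - x = {boundary_offset(p):+.3f}")

    # Prints (roughly):
    #   threshold 0.50 -> decision boundary y - x = +0.000   (the original line)
    #   threshold 0.40 -> decision boundary y - x = -0.405
    #   threshold 0.33 -> decision boundary y - x = -0.708   (close to the z = -0.7 used above)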

ROC Curve

Now that we know how we got here, the idea is pretty simple: try out a range of probability thresholds (shift your curve/line above and below z = 0) and, for each threshold, capture the True Positive Rate (generally you want this to be high) as well as the False Positive Rate (generally you want this to be low). Now plot these out, something like below, and choose the probability threshold that makes the most sense for your problem. You can also use the ROC curve to compute the AUC (Area Under the Curve) to get a sense of how your model is doing, and you can plot various models’ ROC curves and then see which model and which probability threshold make the most sense for your problem -
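
If you want to compute these quantities yourself, here is a small sketch using scikit-learn; the dataset and the logistic regression model are just placeholders for illustration -

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class

    # One (FPR, TPR) pair per candidate threshold: these points trace out the ROC curve.
    fpr, tpr, thresholds = roc_curve(y_test, probs)
    print("AUC:", roc_auc_score(y_test, probs))

    # One simple way to pick a threshold: maximize TPR - FPR. There are others,
    # e.g. requiring TPR above some minimum, as in the tumor example.
    best = (tpr - fpr).argmax()
    print("threshold:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])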

Cross Validation — Intuitively Explained

Cross Validation is not a difficult topic. But when it comes to understanding how to choose a tuning parameter using Cross Validation, a lot of people get confused. Hopefully this post helps a little bit.

Let’s start from the beginning.

What is Cross Validation?

Cross Validation is a general technique used to identify the best-performing model out of a bunch of given models.

Let’s say we have some data and we divide it into train and test, something like this -

    [ training data (75%) | test data (25%) ]

But why is only this particular 25% used for testing? Why not the first 25%? Even if we randomly pick a certain 25% of the data, why only that? The point is that if we train a model on a certain 75% of the data and use a certain 25% for testing, we’ve introduced selection bias into our evaluation: it works well for ‘that’ 75% and ‘that’ 25%. It also begs the question: is only a certain 25% of the data good for testing and only a certain 75% good for training?

Wouldn’t it be better if we could somehow leverage the whole data set for testing as well as training?

And this is where K-fold Cross Validation comes into play.
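
As a quick sketch of what the K-fold split looks like in code (scikit-learn is used here just as one convenient implementation; the toy data is made up) -

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)    # toy data: 10 records, 2 features
    y = np.arange(10)

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
        print(f"Iteration {i}: train on rows {train_idx}, test on rows {test_idx}")
    # Across the 5 iterations, every record appears in a test set exactly once.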

Basic Idea

Let’s say we have some data like so -

D = {(x1, y1), (x2, y2), …, (xn, yn)}

and some ML models m1, m2, …, mc.
By the way, are these models yet? I’d say not! They are not ready to make any predictions yet; they are simply a configuration of predictor and feature variables. They only become models once they are fit to a data set. So, these c algorithms can be any mix of algorithms that you think should solve your problem; for a classification problem these could be logistic regression, SVM, Neural Nets, etc.

Anyhow, so here’s what K-fold CV does -

Step 1 — Set some value of K. 5 or 10 are very common choices. Now, shuffle/permute the data randomly once.

Step 2 — Split the data into K folds (let’s say 5).

Step 3 — This is just for illustration purposes. Since there are K folds (5 in this case), each ML algorithm will go through K iterations of training and testing, as sketched below -

    Iteration 1:  [ Test  ][ Train ][ Train ][ Train ][ Train ]
    Iteration 2:  [ Train ][ Test  ][ Train ][ Train ][ Train ]
    Iteration 3:  [ Train ][ Train ][ Test  ][ Train ][ Train ]
    Iteration 4:  [ Train ][ Train ][ Train ][ Test  ][ Train ]
    Iteration 5:  [ Train ][ Train ][ Train ][ Train ][ Test  ]

For each of the iterations shown above, we have a test set and the rest is the training set. Now, for each algorithm m in your set -

  1. Train the model on the Iteration 1 training set and get the error using its test set.
  2. Train the model on the Iteration 2 training set and get the error using its test set.

    … and so on for K iterations. Now find the average error as (1/K) × (sum of all errors). Note that these errors will be misclassified cases for classification problems and residuals for regression problems. Alternatively, we can compute accuracy for classification problems.

We repeat the above for all our algorithms and choose the one with the lowest average error (or highest accuracy).
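
In code, this whole loop usually collapses into one call per candidate algorithm. Here is a minimal sketch with scikit-learn; the three candidate models and the dataset are just examples -

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    candidates = {
        "logistic regression": LogisticRegression(max_iter=5000),
        "SVM": SVC(),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }

    # K = 5: each model is trained 5 times, each time scored on the held-out fold.
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(f"{name}: mean accuracy = {scores.mean():.3f}")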

So, what do we gain from this exercise? Just this: if there was going to be some bias in our final model due to data selection, we have gotten rid of it and have hopefully selected a better model.

How can K-Fold CV be used for the Complexity Parameter?

When we build a tree, we run the risk of building a big, overfitted tree. Think about it: if there’s nothing stopping tree growth, it will try to fit the training data as well as possible. Overfitting simply means that predictions might not be good on test data. So, how do we get around this problem? Well, we need to ‘prune’ our trees.

And what’s pruning? It’s the merging of nodes to make the tree shorter. As shown below, the tree is pruned at node t2.

As you can guess, the more pruned a tree is, the higher its Sum of Squared Errors (for regression trees) or misclassification error (for classification trees): a tree with just the root node would have the most error (the most underfit tree) and the largest tree would have the least error (the most overfit tree). The job of pruning is to find the balance here, so that we are able to identify a tree model that neither overfits nor underfits. And pruning does this through something called Cost Complexity Pruning.

In simple words, the idea is this: we add an extra term α|T| (the tree cost complexity) to the total cost and we seek to minimize the overall cost. |T| is the total number of terminal nodes in the tree and α is the complexity parameter. In other words, this term penalizes big trees: when the number of leaves in the tree increases by one, the cost increases by α. Depending on the value of α (≥ 0), a complex tree that makes no errors may now have a higher total cost than a small tree that makes a number of errors! This is how this term enables us to find a good tree.
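
Written out, this is the standard cost-complexity criterion used by CART-style trees -

    R_\alpha(T) = R(T) + \alpha \, |T|

where R(T) is the tree’s error on the training data (SSE for regression trees, misclassification error for classification trees), |T| is the number of terminal nodes, and α ≥ 0 is the complexity parameter.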

Also, convince yourself that for a given α, there can only be one tree T that will minimize the loss function.

But how do we find α?

And this is where CV comes in. Here’s an overview of what the APIs do (of course, there will be differences in implementations and more subtleties, but the point here is to get the intuition right) -

We take a range of α’s we want to try out. For each of these α’s, a given data set and a pre-determined value of K (for CV), we do the following -

  • For α1, find the best tree for Fold 1
  • For α1, find the best tree for Fold 2
  • …and so on: for α1, find the best tree for Fold K
  • For α2, find the best tree for Fold 1
  • For α2, find the best tree for Fold 2
  • …and so on: for α2, find the best tree for Fold K
  • …and so on, for however many α’s you want to try out (typically anywhere between 30 and 50 is probably enough)

So, for each α, we find the average accuracy of the model (accuracy in the case of classification models, or RMSE/SSE in the case of regression models; let’s go with accuracy for illustration purposes). We can plot this out like below (in practice you won’t need to plot it; the CV API can tell you the best value of alpha), see for which value of α we get the highest accuracy, choose that α, plug it into our original tree’s cost function and build a good tree! Note that this kind of CV-driven pruning is generally not needed for Random Forests or GBM-style algorithms; it would not be as effective there because those algorithms already have a lot of randomness built into them, so the chance of overfitting is greatly reduced.

Also, note that the shape of the graph below is what it is because a lower value of α favors bigger trees, for which accuracy will be high (if not the highest), while larger and larger values of α favor shorter and shorter trees, and the shorter the tree, the lower its accuracy. The sweet spot is, therefore, somewhere in the middle.

So, once we find our α, we can use it to build our new tree model rather than using arbitrary parameters like minbucket, etc. to limit our trees.
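
For example, scikit-learn exposes this directly through cost-complexity pruning; here is a rough sketch (the dataset is a placeholder, and R’s rpart does the equivalent through its cp table) -

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate alphas: the alphas at which the optimally pruned tree changes.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    alphas = path.ccp_alphas

    # Cross-validate a pruned tree for each alpha and keep the best one.
    mean_scores = [
        cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
        for a in alphas
    ]
    best_alpha = alphas[int(np.argmax(mean_scores))]
    print("best alpha:", best_alpha)

    final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)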

Think about what CV did here: we had a bunch of models built using different values of α, we used CV on them to find a good value of α and then eventually used that α to find a good single tree model. So, this is just an application of what CV does in general: finding a good model, given a bunch of models.

Understanding GBM intuitively

A few years back, I was assigned the task of predicting the probability of students dropping out of college, so that there could be some intervention beforehand to help students at risk.
For this exercise I first had to build a person-matching algorithm between credit card data and the students’ data, to be able to gauge the financial condition of a student’s family, since a lot of the time this is also a factor in dropping out. But that’s a topic for another day.

In this post, I am going to write a little bit about the main ML algorithm I used: GBM, or Gradient Boosting Machine. It’s also one of the top algorithms used in Kaggle competitions and is pretty reliable. When I used this algorithm I was already well-versed in Decision Trees, Random Forests, etc., but I still had a hard time getting the intuition behind GBM: why it would work and why it was designed the way it is. In this post I am going to try to break this down for the reader, but the post assumes that the reader is already familiar with how Decision Trees (regression as well as classification) are grown.

Before we dive into GBM, we can discuss a little bit about how the world got here -

  1. Single Trees — First came Decision Trees. These are single trees built from training data and are mostly good for a visual representation of the problem. Never use them in production, because they are very sensitive to the data and the predicted values change easily with even slight variations in the training data. Also, definitely don’t use simple trees when the data is heavily skewed toward one class; they will almost always predict that same class.
  2. Bagging — Because of the problems with single trees, people started to look at… well, what do people do when single things don’t work well? Create many models and average out the results. That’s called Bagging. It comes from the phrase ‘bootstrap aggregating’. There are two techniques involved here: bootstrapping (of the data) and aggregating (averaging of the models). In the bootstrapping phase, a number of training data sets are created by ‘sampling with replacement’. What does that mean? It simply means that a data set can be created from the original data set by randomly sampling records, and since sampling never excludes anything (even records drawn earlier), many records can appear more than once. We create many such data sets, and this injection of randomness into the data reduces the possibility of overfitting. That was the data. Now for the models: instead of choosing just one model, one can create a number of models with certain variations. These m models can then be fitted to the m new datasets and the results averaged. Random Forest is basically just this; to create the different models, when it builds each tree it randomly chooses only a subset of the features to consider at each split.
  3. Boosting — Random Forests work well. The randomness in the data and the randomness in the models mean that the chance of overfitting is greatly reduced. But does that necessarily mean a good model? Not really! And the reason is quite simple. A Random Forest basically takes a majority vote (or the average value, in the case of a forest of regression trees) from a number of independently created models. That’s right: the trees in an RF are not related to each other at all, so there is no chance they are learning from each other’s mistakes. That’s where Boosting comes in! It’s a technique where a number of models are created and all are used to get the final vote (or value), but each new model is built by improving on the previous one. So, boosting is basically saying: ‘I have this simple initial model, which is basically an average model. Now let’s build another model/tree on top of it such that it’s a little better. And let’s keep doing this until we have built (say) 100 models!’

In short, here’s what GBM does: it computes an initial single prediction value. Then it computes the residuals between the predicted and actual values, and builds a tree to predict those residuals. Then it replaces the initial predictions with new ones, where each new prediction is the initial prediction plus the predicted residual. Once we have new prediction values, we again calculate new residuals and build another tree to predict them. We keep doing this for M trees, where M can be a user input. As you can see, every tree that is generated depends on the earlier prediction values; this is why GBM is said to be additive, and it’s an ensembling technique. This is also the main idea behind boosting: using previous predictions to come up with new models. In our case, if you look closely, we build our first tree on the residuals from the initial prediction, our second tree on the residuals left after adding the first tree, and so on.

Here’s what GBM looks like -

This can seem daunting at first, but it’s not.

Here’s step 0 — let’s assume we have some training data available to us of the form (xi, yi), for i = 1 to n, where y is the response we want to predict (let’s just take regression as the example to understand GBM; classification is similar but needs some transformations between the steps). x here is not one feature but the values of all the features for a single record, and n is the total number of records. Here’s a visual representation of this data -

Now, let’s get to step 1.
This step asks us to calculate f0(x). This is a single value; as a matter of fact, GBM starts with a single prediction value. If you were given a bunch of data and asked for a single value that you would predict for all of it, what would it be? You’d simply take the average of the y values in the training data, and for any incoming new data this is what you’d predict. As we’ll see, f0(x) here, for regression, is exactly the average of all the y values in the data.

L is the loss function that we need to minimize. Nothing new here; all ML algorithms minimize a loss/cost function. The inputs to L are the y values from the training data and the predicted value γ. Note that since we only need one prediction value, it is going to be the same for all records. For regression, the loss function usually taken is 1/2 × (actual - predicted)^2.

There are other loss functions for regression that one could use, but let’s go with the sum of squared residuals, which is also very common. The loss function must be differentiable and convex so that we can apply gradient descent to it, and we know that the sum of squared residuals has these properties. Below is a rough graph of the loss function against the prediction value; the shape of the curve comes from the square term in our loss function. Note that we now take its derivative and set that to 0, to get to the point where the pink line touches the graph horizontally, since that’s the point where the value of the loss function is lowest.
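
Putting step 1 in symbols (this is just the squared-error case described above) -

    f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma),
    \qquad L(y, \gamma) = \tfrac{1}{2}\,(y - \gamma)^2,
    \qquad \frac{\partial L}{\partial \gamma} = -(y - \gamma)

Setting the sum of derivatives to zero gives γ = (y1 + y2 + … + yn) / n, i.e. the mean of the y values, which is exactly the f0(x) described here.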

If you compute the derivative, it comes out to be -(actual - predicted). Now, let’s say we have 3 records where y is 1, 4 and 5. Plugging these into the equation and setting it to zero, we get -(1 - predicted) + (-(4 - predicted)) + (-(5 - predicted)) = 0. Solving this for ‘predicted’, we get -

predicted = (1 + 4 + 5) / 3
So, f0(x) is nothing but the average of all the y values.

Let’s move to step 2.
In this step, M is the total number of trees we want to build. This is usually a user input, like 100 (or 500!).

All 2(a) is doing is calculating the residuals. The derivative you see is the same as in step 1. Note that to calculate the residuals we simply plug in the values from the previous prediction (f = f of m-1), and the derivative of L comes out to be -(actual - predicted). Note the minus sign in front of the overall value? That’s there to cancel the minus sign in -(actual - predicted). And what’s the ‘predicted’ value? It is simply f0(x) when m = 1, f1(x) when m = 2, and so on.

In 2(b), we build a regression tree to predict the residuals from 2(a). The tree is built with a set number of nodes (or a set number of terminal leaves) so that we don’t end up growing super large trees. Note that j in this step is the leaf index and m is the tree index.

In 2(c): so, in 2(b) we made a regression tree. But it may happen that some of the leaves we get have more than one value in them. So what’s the single value for a given terminal region R? It’s simple: it’s the average of the values present in that terminal leaf/region! 2(c) simply updates the tree with the final terminal region values. This is why the equation in this step resembles the equation from step 1, because all it’s doing is averaging!

2(d): This makes the new prediction. All it’s doing is taking the previous model’s prediction value and adding the predicted residual value to it, to get a new prediction value for each record.

Note that a learning rate (something between 0 and 1) is also usually applied to the residual value, so as to lessen the effect of variance.
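
With a learning rate ν thrown in, the update in 2(d) is usually written as -

    f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} \, I(x \in R_{jm}), \qquad 0 < \nu \le 1

where γ_jm is the value attached to terminal region R_jm of the m-th tree, and I(·) is 1 when x falls in that region and 0 otherwise.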

Once 2(d) is done, we go back to 2(a), calculate the residuals, build another tree to predict them, and so on, until we have M trees.
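
To tie all of this together, here is a deliberately minimal sketch of the loop for squared-error regression. The tree depth, learning rate, M and the made-up data are arbitrary choices for illustration, and real implementations add a lot more -

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbm_fit(X, y, M=100, learning_rate=0.1, max_depth=2):
        # Step 1: the initial prediction is just the mean of y.
        f0 = y.mean()
        pred = np.full_like(y, f0, dtype=float)
        trees = []
        for _ in range(M):
            residuals = y - pred                      # 2(a): negative gradient for squared loss
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals)                    # 2(b)/2(c): small tree fit to the residuals
            pred += learning_rate * tree.predict(X)   # 2(d): shrunken additive update
            trees.append(tree)
        return f0, trees

    def gbm_predict(X, f0, trees, learning_rate=0.1):
        # Must use the same learning rate that was used during fitting.
        return f0 + learning_rate * sum(t.predict(X) for t in trees)

    # Tiny usage example with made-up data
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
    f0, trees = gbm_fit(X, y)
    print("train MSE:", np.mean((gbm_predict(X, f0, trees) - y) ** 2))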

Once step 2 finishes, our model (step 3) is ready for prediction.

It’s by no means an easy algorithm to grasp, but hopefully you can start with a very simple data set and work through the steps as explained here to get a better understanding of the algorithm.