Understanding GBM intuitively

A few years back, I was assigned the task of Pre­dict­ing the prob­a­bil­i­ty of stu­dents drop­ping out of col­lege so that there could be some inter­ven­tion before­hand to help stu­dents at risk.
For this exer­cise I had to first build a Per­son Match­ing Algo­rithm between cred­it card data and the stu­dents data to be able to find out the finan­cial con­di­tion of the stu­den­t’s fam­i­ly — since a lot of times this is also a fac­tor for drop­ping out. But that’s a top­ic for anoth­er day.

In this post, I am going to write a lit­tle bit about the main ML algo­rithm I used — GBM or Gra­di­ent Boost­ed Machine learn­ing. It’s also one of the top algo­rithms used in Kag­gle com­pe­ti­tions and is pret­ty reli­able. When I used this algo­rithm, I was already well-versed with Deci­sion Trees, Ran­dom Forests, etc. but I still had a hard time get­ting the intu­ition behind GBM and why it would work and why it was designed the way it is. In this post, I am going to try to break this down for the read­er, but the post assumes that the read­er is already famil­iar with how Deci­sion Trees (Regres­sion as well as Clas­si­fi­ca­tion) are grown.

Before we dive into GBM, we can dis­cuss a lit­tle bit about how the world got here -

  1. Sin­gle Trees — First came the Deci­sion Trees. These are sin­gle trees made from train­ing data and are only good for visu­al rep­re­sen­ta­tion of the prob­lem. Nev­er use them in pro­duc­tion because they are very sen­si­tive to data and the pre­dic­tion val­ues change very eas­i­ly for even slight vari­a­tions in train­ing data. Also, def­i­nite­ly not use sim­ple trees when data is skewed in favor of one class — it will almost always pre­dict that same class.
  2. Bag­ging — Due to prob­lems aris­ing from sin­gle trees, peo­ple start­ed to look at — well, what do peo­ple look at when sin­gle things don’t work well? Cre­ate many mod­els and aver­age out the result. That’s called Bag­ging. It comes from the phrase ‘boot­strap aggregating’. There are two tech­niques involved here — boot­strap­ping (of data) and aggre­gat­ing (or aver­ag­ing of mod­els). In the boot­strap­ping phase, a num­ber of train­ing data sets are cre­at­ed by ‘sam­pling and replace­ment’! What does it mean? Well, it sim­ply means that a data set can be cre­at­ed from orig­i­nal data set for train­ing where a per­cent­age of records will be ran­dom­ly sam­pled and since sam­pling is not exclud­ing any­thing (even records drawn ear­li­er), there could be a lot of records that are repeat­ed. And we cre­ate many such data sets — as you can see this intro­duc­tion of ran­dom­ness in data basi­cal­ly reduces the pos­si­bil­i­ty of over­fit­ting. That was for the datasets. Now, for the mod­el — instead of choos­ing just one mod­el, one can cre­ate a num­ber of mod­els with cer­tain vari­a­tions. Now, these m num­ber of mod­els can be fit­ted with m num­ber of new datasets and result aver­aged. Ran­dom for­est is basi­cal­ly just this. To cre­ate the dif­fer­ent mod­els, when it builds each tree, for each split only a cer­tain num­ber of fea­tures are cho­sen ran­dom­ly.
  3. Boost­ing — Ran­dom forests work well. The ran­dom­ness in data and ran­dom­ness in mod­els mean that the chance of over­fit­ting is great­ly reduced. But does that nec­es­sar­i­ly mean a good mod­el? Not real­ly! And the rea­son is quite sim­ple. Ran­dom for­est is basi­cal­ly tak­ing a major­i­ty vote (or aver­age val­ue in case of a ‘regres­sion trees’ for­est) from a num­ber of inde­pen­dent­ly cre­at­ed mod­els. That’s right. The trees in RF are not relat­ed to each oth­er at all. So, there’s no chance that they are learn­ing from each oth­er’s mis­takes. That’s where Boost­ing comes in! It’s a tech­nique where a num­ber of mod­els are cre­at­ed and all are used to get the final vote (or val­ue), but the each new mod­el is built by improv­ing on the pre­vi­ous mod­el. So, boost­ing is basi­cal­ly say­ing — ‘I have this stu­pid ini­tial mod­el, which is basi­cal­ly an aver­age mod­el. Now, let’s build anoth­er model/tree based on this such that it’s a lit­tle bet­ter. And let’s keep doing this till we (let’s say) build 100 mod­els’!

In short, here’s what GBM does — it com­putes an ini­tial sin­gle pre­dic­tion val­ue. Then it com­putes resid­u­als for pre­dict­ed and actu­al val­ues, then it builds a tree to pre­dict the resid­u­als. Now, it replaces the ini­tial pre­dic­tions with new ones — with new pre­dic­tion val­ue being: ini­tial pre­dic­tion val­ue + pre­dict­ed resid­ual val­ue. Once we get new pre­dic­tion val­ues, we again cal­cu­late new resid­u­als and build anoth­er tree to pre­dict these resid­u­als. We keep doing this for M num­ber of trees, where M can be user input. As you can see, any tree that’s gen­er­at­ed depends on ear­li­er pre­dic­tion val­ue. This is why GBM is said to be addi­tive and it’s an ensem­bling tech­nique. This is also the main idea behind boost­ing — using pre­vi­ous pre­dic­tions to come up with new mod­els. In our case, if you see close­ly, we build our first tree based on the resid­u­als from the first pre­dic­tion, we build our sec­ond tree based on the resid­u­als from first tree, and so on.

Here’s what GBM looks like -

This can seem daunt­ing at first, but it’s not.

Here’s step 0 — Let’s assume we have some train­ing data avail­able to us of the form (x, y), for i=1 to n, where the y is some regres­sor (let’s just take regres­sion as an exam­ple to under­stand GBM. Clas­si­fi­ca­tion is sim­i­lar but needs some trans­for­ma­tions between steps). x here is not one fea­ture but all val­ues for all fea­tures for a sin­gle record and n is the total num­ber of records. Here’s a visu­al rep­re­sen­ta­tion of this data -

Now, let’s get to step 1.
This step is ask­ing us to cal­cu­late f0(x). This is a sin­gle val­ue. As a mat­ter of fact, GBM starts with a sin­gle pre­dic­tion val­ue. If you were giv­en a bunch of data and asked to give a sin­gle val­ue that you would pre­dict for that data, what would it be? You’ll sim­ply take aver­age of y val­ues from train­ing data and for any incom­ing new data, this is what you’ll pre­dict. As we’ll see, f0(x) here, for regres­sion is basi­cal­ly the aver­age of all y val­ues in the data.

L is the loss func­tion that we need to min­i­mize. Noth­ing new here, all ML algo­rithms min­i­mize a loss/cost func­tion. Input for L is y val­ues from train­ing data and the pre­dict­ed val­ue γ. Note that since we only need one pre­dic­tion val­ue, this is going to be same for all for all records. Now, for regres­sion loss func­tion usu­al­ly tak­en is — 1/2 x (actu­al — predicted)^2

There are oth­er Loss func­tions too for regres­sion that one could use but let’s go with sum of squares of resid­u­als, which is also very com­mon. The Loss func­tion must be dif­fer­en­tiable and must be con­vex so that we can apply gra­di­ent descent to it. And we know that sum of squares of resid­u­als fol­lows these prop­er­ties. Below is a rough graph of one fea­ture vari­able based loss func­tion against pre­dic­tion val­ues. The shape of the curve is such because of the square term that we have in our loss func­tion. Note that now we need to take deriv­a­tive of it and set that to 0, to get to the point where we have the pink line touch­ing the graph hor­i­zon­tal­ly, since that’s the point where the val­ue of loss func­tion would be low­est.

If you com­pute the derivate, it will come out to be -> -(actu­al — pre­dict­ed). Now, let’s say we have 3 records where y was 1, 4, 5. Putting these in the equa­tion and set­ting it to zero we get, we get -> -(1 — pre­dict­ed) + (-(4 — pre­dict­ed)) + (-(5 — pre­dict­ed)) = 0. Solv­ing this for pre­dict­ed, we get ->

pre­dict­ed = (1 + 4 + 5) / 3
So, f0(x) is noth­ing but aver­age of all y val­ues.

Let’s move to step 2.
In this step, M is the total num­ber of trees we want to build. This is usu­al­ly a user input, like 100 (or 500!).

All 2(a) is doing is cal­cu­lat­ing the resid­u­als. The deriv­a­tive you see is the same as in step 1. Note that to cal­cu­late the resid­u­als its sim­ply plug­ging in val­ues from pre­vi­ous pre­dic­tion (f = f of m‑1) and deriv­a­tive of L comes out to be -(actu­al — pre­dict­ed). Note the minus sign for the over­all val­ue? That’s to get rid of the minus sign in -(actu­al — pre­dict­ed). And what’s ‘pre­dict­ed’ val­ue? It is sim­ply f0(x) when m = 1. It will be f1(x) when m = 2 and so on.

In 2(b), we are build­ing a regres­sion tree to pre­dict the resid­u­als from 2(a). The tree is built with a set num­ber of nodes (or set num­ber of ter­mi­nal leaves) so that we don’t end up grow­ing super large trees. Note that j in this step is leaf index and m is tree index.

In 2©: So, in 2(b) we made a regres­sion tree. But, it might hap­pen that in a num­ber of cas­es the leaves we get have more than one val­ues in them. So, then what’s the one val­ue for a giv­en ter­mi­nal region R? Its sim­ple — it’s the aver­age of the val­ues present in that ter­mi­nal leaf/region! 2© sim­ply updates the tree with final ter­mi­nal region val­ues. This is why the equa­tion in this step resem­bles the equa­tion from step 1, because all it’s doing is aver­ag­ing!

2(d): This does the new pre­dic­tion. All it’s doing is — tak­ing pre­vi­ous mod­el’s pre­dic­tion val­ue and adding to it the resid­ual val­ue to get new pre­dic­tion val­ues for each records.

Note that a learn­ing rate is also usu­al­ly applied to resid­ual val­ue (some­thing between 0 and 1), so as to lessen the effect of vari­ance.

Once 2(d) is done, we go back to 2(a) and cal­cu­late resid­u­als and then pre­dict those resid­u­als and so on, till M.

Once step 2 fin­ish­es, our mod­el (step 3) is ready and pre­dic­tion.

By no means it’s an easy algo­rithm to grasp, but hope­ful­ly you can start with a a very sim­ple data set and work through these steps as explained to get more under­stand­ing of the algo­rithm.

Leave a Reply

Your email address will not be published. Required fields are marked *