A few years back, I was assigned the task of Predicting the probability of students dropping out of college so that there could be some intervention beforehand to help students at risk.

For this exercise I had to first build a Person Matching Algorithm between credit card data and the students data to be able to find out the financial condition of the student’s family — since a lot of times this is also a factor for dropping out. But that’s a topic for another day.

In this post, I am going to write a little bit about the main ML algorithm I used — GBM or Gradient Boosted Machine learning. It’s also one of the top algorithms used in Kaggle competitions and is pretty reliable. When I used this algorithm, I was already well-versed with Decision Trees, Random Forests, etc. but I still had a hard time getting the intuition behind GBM and why it would work and why it was designed the way it is. In this post, I am going to try to break this down for the reader, but the post assumes that the reader is already familiar with how Decision Trees (Regression as well as Classification) are grown.

Before we dive into GBM, we can discuss a little bit about how the world got here -

- Single Trees — First came the Decision Trees. These are single trees made from training data and are only good for visual representation of the problem. Never use them in production because they are very sensitive to data and the prediction values change very easily for even slight variations in training data. Also, definitely not use simple trees when data is skewed in favor of one class — it will almost always predict that same class.
- Bagging — Due to problems arising from single trees, people started to look at — well, what do people look at when single things don’t work well? Create many models and average out the result. That’s called Bagging. It comes from the phrase ‘
**b**ootstrap**agg**regat**ing**’. There are two techniques involved here — bootstrapping (of data) and aggregating (or averaging of models). In the bootstrapping phase, a number of training data sets are created by ‘sampling and replacement’! What does it mean? Well, it simply means that a data set can be created from original data set for training where a percentage of records will be randomly sampled and since sampling is not excluding anything (even records drawn earlier), there could be a lot of records that are repeated. And we create many such data sets — as you can see this introduction of randomness in data basically reduces the possibility of overfitting. That was for the datasets. Now, for the model — instead of choosing just one model, one can create a number of models with certain variations. Now, these m number of models can be fitted with m number of new datasets and result averaged. Random forest is basically just this. To create the different models, when it builds each tree, for each split only a certain number of features are chosen randomly. - Boosting — Random forests work well. The randomness in data and randomness in models mean that the chance of overfitting is greatly reduced. But does that necessarily mean a good model? Not really! And the reason is quite simple. Random forest is basically taking a majority vote (or average value in case of a ‘regression trees’ forest) from a number of independently created models. That’s right. The trees in RF are not related to each other at all. So, there’s no chance that they are learning from each other’s mistakes. That’s where Boosting comes in! It’s a technique where a number of models are created and all are used to get the final vote (or value), but the each new model is built by improving on the previous model. So, boosting is basically saying — ‘I have this stupid initial model, which is basically an average model. Now, let’s build another model/tree based on this such that it’s a little better. And let’s keep doing this till we (let’s say) build 100 models’!

In short, here’s what GBM does — it computes an initial single prediction value. Then it computes residuals for predicted and actual values, then it builds a tree to predict the residuals. Now, it replaces the initial predictions with new ones — with new prediction value being: initial prediction value + predicted residual value. Once we get new prediction values, we again calculate new residuals and build another tree to predict these residuals. We keep doing this for M number of trees, where M can be user input. As you can see, any tree that’s generated depends on earlier prediction value. This is why GBM is said to be additive and it’s an ensembling technique. This is also the main idea behind boosting — using previous predictions to come up with new models. In our case, if you see closely, we build our first tree based on the residuals from the first prediction, we build our second tree based on the residuals from first tree, and so on.

Here’s what GBM looks like -

This can seem daunting at first, but it’s not.

Here’s step 0 — Let’s assume we have some training data available to us of the form (x, y), for i=1 to n, where the y is some regressor (let’s just take regression as an example to understand GBM. Classification is similar but needs some transformations between steps). x here is not one feature but all values for all features for a single record and n is the total number of records. Here’s a visual representation of this data -

Now, let’s get to step 1.

This step is asking us to calculate f0(x). This is a single value. As a matter of fact, GBM starts with a single prediction value. If you were given a bunch of data and asked to give a single value that you would predict for that data, what would it be? You’ll simply take average of y values from training data and for any incoming new data, this is what you’ll predict. As we’ll see, f0(x) here, for regression is basically the average of all y values in the data.

L is the loss function that we need to minimize. Nothing new here, all ML algorithms minimize a loss/cost function. Input for L is y values from training data and the predicted value γ. Note that since we only need one prediction value, this is going to be same for all for all records. Now, for regression loss function usually taken is — 1/2 x (actual — predicted)^2

There are other Loss functions too for regression that one could use but let’s go with sum of squares of residuals, which is also very common. The Loss function must be differentiable and must be convex so that we can apply gradient descent to it. And we know that sum of squares of residuals follows these properties. Below is a rough graph of one feature variable based loss function against prediction values. The shape of the curve is such because of the square term that we have in our loss function. Note that now we need to take derivative of it and set that to 0, to get to the point where we have the pink line touching the graph horizontally, since that’s the point where the value of loss function would be lowest.

If you compute the derivate, it will come out to be -> -(actual — predicted). Now, let’s say we have 3 records where y was 1, 4, 5. Putting these in the equation and setting it to zero we get, we get -> -(1 — predicted) + (-(4 — predicted)) + (-(5 — predicted)) = 0. Solving this for predicted, we get ->

predicted = (1 + 4 + 5) / 3

So, f0(x) is nothing but average of all y values.

Let’s move to step 2.

In this step, M is the total number of trees we want to build. This is usually a user input, like 100 (or 500!).

All 2(a) is doing is calculating the residuals. The derivative you see is the same as in step 1. Note that to calculate the residuals its simply plugging in values from previous prediction (f = f of m‑1) and derivative of L comes out to be -(actual — predicted). Note the minus sign for the overall value? That’s to get rid of the minus sign in -(actual — predicted). And what’s ‘predicted’ value? It is simply f0(x) when m = 1. It will be f1(x) when m = 2 and so on.

In 2(b), we are building a regression tree to predict the residuals from 2(a). The tree is built with a set number of nodes (or set number of terminal leaves) so that we don’t end up growing super large trees. Note that j in this step is leaf index and m is tree index.

In 2©: So, in 2(b) we made a regression tree. But, it might happen that in a number of cases the leaves we get have more than one values in them. So, then what’s the one value for a given terminal region R? Its simple — it’s the average of the values present in that terminal leaf/region! 2© simply updates the tree with final terminal region values. This is why the equation in this step resembles the equation from step 1, because all it’s doing is averaging!

2(d): This does the new prediction. All it’s doing is — taking previous model’s prediction value and adding to it the residual value to get new prediction values for each records.

Note that a learning rate is also usually applied to residual value (something between 0 and 1), so as to lessen the effect of variance.

Once 2(d) is done, we go back to 2(a) and calculate residuals and then predict those residuals and so on, till M.

Once step 2 finishes, our model (step 3) is ready and prediction.

By no means it’s an easy algorithm to grasp, but hopefully you can start with a a very simple data set and work through these steps as explained to get more understanding of the algorithm.