I’ve often found myself understanding concepts much better when I understand the geometrical interpretation behind that concept. This is especially true for a lot of Math, Machine Learning and AI (although there are definitely a lot of abstract concepts in this field too that often don’t have a geometrical interpretation associated with them).
There might be many posts out there on RoC, what it is, and how it’s used to determine the probability threshold for a classification problem, but how the geometry of the problem changes when we pick another threshold value is something I’ve never encountered anywhere. In this post, I am going to try to explain everything intuitively. ELI5, or what’s the point, right?
Let’s start with a simple example that’s often used. Let’s say we are faced with predicting whether a tumor is benign or malignant, and assume we have a bunch of features available to us to do this task. For illustration purposes, I am going to keep the graphs and equations really simple, because the point is to understand the concept well enough to apply it later in more complex scenarios.
So, we go ahead and build a classification model and the line/curve it comes up with to separate the categories is shown below -
So, as you can see above, the line does a pretty good job of classification. The accuracy is high too. So, do we care about anything more? Yep! In the case above, there are two malignant tumors that have been classified as benign. And that’s not a good thing for this particular problem. It really depends on the problem whether our criterion of success should be overall accuracy, true positive rate, false negative rate, etc. For a problem like the one above, we definitely want to classify as many malignant tumors correctly as possible, without caring much about overall accuracy or whether we misclassify a few more benign ones as malignant. This is because the goal of such a classification problem should be to identify all malignant cases and intervene for patient care.
So, for problems where we consider both classes to be of equal importance, we can use a probability of 0.5 as our threshold value (anything above or equal to 0.5 is one class and anything below is the other class). But for problems like the one above, it would be great if we could choose a threshold that also covers the malignant ones that were left out. And as I will show, choosing a different threshold basically means moving our curve/line of separation a little bit.
But before that here’s some math. You will note the following things from the image below -
- Curve/line of separation corresponds to z = 0, which happens at probability of 0.5. For ease of illustration, the line y = x has been chosen, so z = y - x
- Region above curve/line of separation is where z is +ve and corresponds to region where probability is larger than 0.5. Illustrated by point (3, 5)
- Region below curve/line of separation is where z is ‑ve and corresponds to region where probability is less than 0.5. Illustrated by point (5, 3)
So, as seen above, a probability of 0.5 corresponds to the line y - x = 0.
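To make the three observations above concrete, here is a minimal sketch. It assumes the toy score z = y - x from the illustration and the standard sigmoid; the helper names `sigmoid` and `score` are mine, not from any library.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def score(x, y):
    # Assumed toy decision function: its zero level is the line y = x.
    return y - x

# One point above the line, one below it, and one exactly on it.
for point in [(3, 5), (5, 3), (4, 4)]:
    z = score(*point)
    print(point, "z =", z, "p =", round(sigmoid(z), 3))
```

The point (3, 5) gets a positive z and a probability of roughly 0.88, (5, 3) gets a negative z and roughly 0.12, and any point on the line itself gets exactly 0.5.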
What about probability of 0.4 or less? What does that mean geometrically?
So, we know that anything less than 0.5 means we’re talking about the region below the line. A probability p < 0.5 means that z is negative (we can see from the sigmoid function that z goes negative when the probability drops below 0.5). For illustration purposes, let’s just pick a probability p < 0.5 such that the value of z is -0.7; plugging z = -0.7 into the sigmoid gives p ≈ 0.33. What does z = -0.7 mean for the curve?
Well, it means y - x = -0.7, or y = x - 0.7, which is just another line shifted down a bit from the original line, as shown below -
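The same relationship can be run in reverse: given a probability threshold, the inverse of the sigmoid (the logit) tells us how far the line shifts. This is a small sketch under the same toy assumption z = y - x; `logit` and `predict_malignant` are hypothetical helper names, and the test point (5, 4.5) is made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the score z at which the probability equals p."""
    return math.log(p / (1.0 - p))

def predict_malignant(x, y, threshold=0.5):
    # Assumed toy score z = y - x.
    # p >= threshold is the same as y - x >= logit(threshold),
    # so the boundary is the line y = x + logit(threshold).
    return (y - x) >= logit(threshold)

print(round(sigmoid(-0.7), 3))  # probability on the line y = x - 0.7
print(round(logit(0.4), 3))     # shift of the line for a threshold of 0.4

# A made-up point just below the original line y = x.
print(predict_malignant(5.0, 4.5, threshold=0.5))
print(predict_malignant(5.0, 4.5, threshold=sigmoid(-0.7)))
```

With the default threshold the point (5, 4.5) falls on the benign side; once the threshold is lowered to the probability that corresponds to z = -0.7, the shifted line puts it on the malignant side.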
So, essentially, setting the threshold to anything but 0.5 shifts the line or curve. And now we can see the benefit of such shifting: the malignant tumors that were previously classified as benign are now being classified correctly. So, the idea is to shift the line/curve such that it captures what we want to do for our problem. In this case we wanted to increase the True Positive Rate, without being too fussy about overall accuracy. So, now the question is: how do we actually pick a good threshold? And that is exactly what the RoC (Receiver Operating Characteristic) curve lets us do.
Now that we know how we got here, the idea is pretty simple: try out a range of probability thresholds (shift your curve/line well above z = 0 and well below z = 0) and, for each threshold, capture the True Positive Rate (generally you want this to be high) as well as the False Positive Rate (generally you want this to be low). Lowering the threshold can only keep or raise both rates, so there is a trade-off to balance. Now, plot these out, something like below, and choose the probability threshold that makes the most sense for your problem. Also, you can use the RoC curve to compute the AuC (Area Under the Curve) to get a sense of how your model is doing overall. You can also plot out various models’ RoC curves and then see which model and which probability threshold make the most sense for your problem -
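The threshold sweep described above can be sketched in a few lines of plain Python. The labels and probabilities here are made up purely for illustration, and `roc_points` and `auc` are hypothetical helpers; in practice a library routine such as scikit-learn’s `roc_curve` does the same job.

```python
def roc_points(labels, probs):
    """Return (threshold, FPR, TPR) triples, one per candidate threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start above every probability so the curve begins at (0, 0).
    thresholds = [float("inf")] + sorted(set(probs), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for l, p in zip(labels, probs) if p >= t and l == 1)
        fp = sum(1 for l, p in zip(labels, probs) if p >= t and l == 0)
        points.append((t, fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the RoC curve using the trapezoidal rule."""
    area = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up data: 1 = malignant, 0 = benign, with predicted probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.45, 0.35, 0.6, 0.3, 0.2, 0.1]

points = roc_points(labels, probs)
for t, fpr, tpr in points:
    print("threshold", t, "FPR", fpr, "TPR", tpr)
print("AuC:", auc(points))
```

On this toy data, a threshold of 0.5 would catch only half of the malignant cases, while dropping it to 0.35 catches all four at the same False Positive Rate of 0.25. That is the kind of trade-off the curve is meant to expose.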