Monday, November 30, 2015

Machine Learning with Spark

While playing with my current pet project, I implemented a Support Vector Machine using Spark's Machine Learning library, just following the documentation without fully understanding what it was doing.

But the documentation is somewhat sparse and I got more understanding by going elsewhere, mainly this excellent book. Here I learned that SVMs are a classification algorithm, that is "a supervised learning method that predicts data into buckets". This contrasts with a regression algorithm which is "the supervised method that predicts numerical target values." Furthermore, it can be a non-parametric algorithm allowing it to handle non-linear data. "Parametric models makes assumptions about the structure of the data. Non-parametric models don’t." [1]

OK, that's some terminology defining the algorithm's characteristics out of the way. But then I hit the Spark documentation, which talks about the AUC (Area Under Curve) of the ROC (Receiver Operating Characteristic). Hmm.

The concepts are quite easy but I found a dearth of material on the web [addendum: since writing this, I found this excellent post]. So, here is what I gleaned.

First, we start with something called the "Confusion Matrix". This is a fairly simple 2x2 matrix of true/false positive/negative rates.

For example, imagine a test for cancer. Some people will have it, some won't. Some people will be told they have it, some won't. Sometimes we tell people they have it when they don't. Sometimes we don't tell people they have it when they do. Oops.

The confusion matrix would look like this:

      | Positive                                          | Negative
True  | Patient is told he has cancer and indeed he does. | Patient does not have cancer and is told so.
False | Patient is told he has cancer when he does not.   | Patient is not told he has cancer when he does.
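
As a rough illustration of how the four cells get filled in, here is a minimal Python sketch. The function name confusion_counts and the eight-patient data set are made up for this example; the labels TP, FP, TN and FN are the usual abbreviations for true/false positives/negatives.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count the four cells of a 2x2 confusion matrix.

    actual[i] is True if patient i really has the disease;
    predicted[i] is True if the test told patient i they have it.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    cells = Counter()
    for has_it, told_has_it in zip(actual, predicted):
        if told_has_it:
            cells["TP" if has_it else "FP"] += 1
        else:
            cells["FN" if has_it else "TN"] += 1
    return {name: cells[name] for name in ("TP", "FP", "TN", "FN")}

# Made-up data for eight hypothetical patients.
actual    = [True, True, True, False, False, False, False, True]
predicted = [True, True, False, True, False, False, False, False]
counts = confusion_counts(actual, predicted)
print(counts)
```

Every patient lands in exactly one cell, so the four counts always add up to the total number of patients.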

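If the total number of patients is N, the number who actually have cancer is X and the number told they have cancer is Y, then writing TP, FP, TN and FN for the number of patients in each of the four cells above (so TP + FN = X, FP + TN = N - X and TP + FP = Y), the rates look like this:

      | Positive      | Negative
True  | TP / X        | TN / (N - X)
False | FP / (N - X)  | FN / X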
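
To make the rates concrete, here is a small Python sketch that turns the four cell counts into rates. It assumes the standard definitions, where each rate is normalised by the number of patients who actually do (or do not) have the condition; the function name confusion_rates and the counts are invented for illustration.

```python
def confusion_rates(tp, fp, tn, fn):
    """Turn the four confusion-matrix counts into rates between 0 and 1."""
    actual_positives = tp + fn   # everyone who really has the condition
    actual_negatives = fp + tn   # everyone who really does not
    if actual_positives == 0 or actual_negatives == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    return {
        "true_positive_rate": tp / actual_positives,    # a.k.a. sensitivity
        "false_negative_rate": fn / actual_positives,
        "true_negative_rate": tn / actual_negatives,    # a.k.a. specificity
        "false_positive_rate": fp / actual_negatives,
    }

# Made-up counts: 50 patients with the condition, 50 without.
example = confusion_rates(tp=40, fp=5, tn=45, fn=10)
for name, value in example.items():
    print(name, round(value, 3))
```

Note that the true positive and false negative rates sum to 1, as do the true negative and false positive rates, so two numbers are enough to describe the whole matrix.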

We can plug the real numbers in and get a matrix with 4 cells, each between 0 and 1. Depending on how high our threshold is for the cancer test, these numbers will jump around. So, if we varied our threshold between a minimum of 0 and a maximum of 1, we could plot the true positive rate against the false positive rate. This graphic should illustrate:

where the threshold is on the z-axis and the ROC is on the x-y plane.

The axis labelled 'true positive' is the sensitivity, TP / (TP + FN), and the axis labelled 'false positive' is 1 - specificity, that is FP / (FP + TN), where the specificity is TN / (TN + FP) - see here for further details.
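
Sweeping the threshold can be sketched in a few lines of Python. This is only an illustration: the scores, labels and thresholds are made up, roc_points is a hypothetical helper rather than anything from Spark, and it assumes a patient is told "positive" when their score is at or above the threshold.

```python
def roc_points(scores, labels, thresholds):
    """Return one (false_positive_rate, true_positive_rate) pair per threshold."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for t in thresholds:
        # Anything scoring at or above the threshold is predicted positive.
        tp = sum(1 for s, label in zip(scores, labels) if s >= t and label)
        fp = sum(1 for s, label in zip(scores, labels) if s >= t and not label)
        points.append((fp / negatives, tp / positives))
    return points

# Made-up classifier scores and the true labels that go with them.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]
thresholds = [1.0, 0.75, 0.5, 0.25, 0.0]

points = roc_points(scores, labels, thresholds)
for t, (fpr, tpr) in zip(thresholds, points):
    print(t, fpr, tpr)
```

Lowering the threshold can only add predicted positives, so both rates climb from (0, 0) towards (1, 1) as the threshold falls.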

Incidentally, the graph was plotted using everybody's favourite maths tool, R. The code was:


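library(rgl)  # provides plot3d, segments3d and decorate3d

step <- 0.01
f <- function(v) v ^ 3
x <- f(seq(0,1,step))   # false positive axis: an illustrative curve, x = y^3
y <- seq(0,1,step)      # true positive axis
h <- function(x,y) y    # height of the curve above the x-y plane (the threshold axis)
z <- seq_along(x)       # one index per point, so the colours line up with x and y

# bin the indices into 64 groups and give each group a rainbow colour
bins <- cut(z, breaks=64)
cols <- rainbow(64)[as.numeric(bins)]

# interleave the two columns of a matrix row by row: a1, b1, a2, b2, ...
pairwiseFlatten <- function(binded) {
  bound <- c()
  for (i in 1:nrow(binded)) bound <- c(bound, binded[i,1], binded[i,2])
  return (bound)
}

plot3d(x, y, h(x,y), add=TRUE, col=cols)      # the curve itself
plot3d(x, y, rep(0, length(x)), add=TRUE)     # its projection onto the x-y plane: the ROC
# vertical segments joining each point on the curve to its projection
segments3d(pairwiseFlatten(cbind(x,x)), y = pairwiseFlatten(cbind(y,y)), z = pairwiseFlatten(cbind(h(x,y),0)))

decorate3d(xlab="false positive", ylab="true positive", zlab="threshold", main="ROC curve")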

Anyway, we ideally want to maximize the true positive rate while keeping the false positive rate low, no matter what threshold value we have. That is, the curve should hug the top left part of the x-y plane. A curve that achieved this maximum would have the greatest area beneath it in the x-y plane of all the curves. And this is where our AUC for the ROC comes in: it ranges from 0 to 1, a model that guesses at random scores about 0.5, and the higher the value, the better our model.
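
For a feel of what that area looks like numerically, here is a minimal Python sketch that approximates the AUC with the trapezoidal rule. The helper name auc is made up, and this is not necessarily how Spark computes its figure; the last curve reuses the illustrative x = y^3 shape from the R plot above.

```python
def auc(points):
    """Approximate the area under a curve of (x, y) points with the trapezoidal rule."""
    ordered = sorted(points)  # walk the curve from left to right
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Random guessing: the diagonal from (0, 0) to (1, 1).
diagonal = [(0.0, 0.0), (1.0, 1.0)]

# A perfect classifier: straight up the left edge, then along the top.
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# The illustrative curve from the plot: false positive = (true positive)^3.
cubic = [((i / 100) ** 3, i / 100) for i in range(101)]

print(auc(diagonal))          # 0.5
print(auc(perfect))           # 1.0
print(round(auc(cubic), 2))   # roughly 0.75
```

The closer the curve hugs the top left corner, the closer the area gets to 1.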

[1] Real World Machine Learning - Manning.
