Friday, April 6, 2018

Bigger is not necessarily better in Neural Nets

Using again just the subjects from the "20 Newsgroups" public domain data, we can predict which message belongs to which newsgroup using a TensorFlow neural net. (Code for this blog post can be found here).

But neural networks are so flexible, how do I know which architecture to use? So, naively, I chose a modest size network with 2 hidden layers of size 100 and 20. My accuracy looked like this:
2 Layer ANN. Red dots are training accuracy, blue are testing and the blue line is accuracy for all data.
Pretty bad, roughly the monkey-score.

Here's some ML training advice from Stanford University.
Typical learning curve for high variance [has a] large gap between training and test error. 
Typical learning curve for high bias [has a] small gap between training and test error [and] even the training error is unacceptably high.
So, it looks like my neural net is exhibiting high bias since there is only 0.3% difference between training and test accuracy and they are both unacceptably low (accuracy = 1 - error).

The same source recommends:

  • What fixes high variance: "Try getting more training examples [and] try a smaller set of features"
  • What fixes high bias: "try a larger set of features"

But before I started playing with the features, I read this: "From the description of your problem it seems that a feedforward network with just a single hidden layer in should be able to do the trick" (from this answer on StackOverflow with some more useful hints at improving things). And indeed it worked:
ANN with 1 hidden layer. Red dots are training accuracy, blue are testing and the blue line is accuracy for all data.
Much better! The training accuracy is plateaus at about 98%, the test accuracy about 75% and when running the model on all data it achieves an accuracy of about 87%.

Why the improvement? "Essentially, your first hidden layers learn much slower than last. Normally your network should learn the correct weights. Here, however, most likely the weights in the first layer change very little and the error propagates to the next layers. It's so large that the subsequent layers can't possibly correct for it. Check the weights." (from StackOverflow)

But now my model appears to be exhibiting high variance according to the advice above from Stanford. Further improvement is a future exercise.

The Loss Values

It's interesting to note that the value from the loss function drops precipitously even in the pathological case. A cost function might calculate a better score for predictions that are closer to the true value but if they're still wrong, the accuracy will not change. For instance, if you increase your exam grade from an F to a D, a loss function might indeed reflect that you got better but if the pass grade is C, you still failed. Ergo, a better loss value with exactly the same accuracy. See this StackOverflow answer.

"Lower cost function error [does] not means better accuracy. The error of the cost function represents how well your model is learning/able to learn with respect to your training examples... It can show very low learning curves error but when you actually testing the results (Accuracy) you are getting wrong detections." (from StackOverflow)

No comments:

Post a Comment