Friday, March 23, 2018

Hands on with Gradient Boosting


Boosting is a meta-algorithm, that is, "an algorithm that exists to manipulate some other algorithm".

It goes like this (from [1]):

  1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
  2. Draw a second random training subset d2 without replacement from D, add 50% of the samples that were misclassified by C1, and use this to train a weak learner C2.
  3. Find the training samples d3 in D on which C1 and C2 disagree, and use these to train a weak learner C3.
  4. Combine the answers from C1, C2, and C3 by majority voting (a rough code sketch of this recipe follows).
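
To make the recipe concrete, here is a minimal Python sketch of it, using scikit-learn decision stumps as the weak learners. The dataset, subset sizes and stump depth are my own illustrative assumptions, not part of [1].

```python
# Rough sketch of the three-learner boosting recipe above.
# Weak learners are decision stumps; subset sizes are arbitrary choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, random_state=0)
n = len(y)

# 1. Train C1 on a random subset d1 drawn without replacement from D.
idx1 = rng.choice(n, size=n // 3, replace=False)
C1 = DecisionTreeClassifier(max_depth=1).fit(X[idx1], y[idx1])

# 2. Train C2 on a second random subset d2 plus half of the samples C1 misclassified.
idx2 = rng.choice(n, size=n // 3, replace=False)
mis = np.where(C1.predict(X) != y)[0]
mis_half = rng.choice(mis, size=len(mis) // 2, replace=False)
C2 = DecisionTreeClassifier(max_depth=1).fit(X[np.union1d(idx2, mis_half)],
                                             y[np.union1d(idx2, mis_half)])

# 3. Train C3 on the samples d3 where C1 and C2 disagree.
idx3 = np.where(C1.predict(X) != C2.predict(X))[0]
C3 = DecisionTreeClassifier(max_depth=1).fit(X[idx3], y[idx3])

# 4. Combine the three weak learners by majority vote (binary labels 0/1 here),
#    and sanity-check the vote on the training set.
preds = np.stack([C1.predict(X), C2.predict(X), C3.predict(X)])
majority = (preds.sum(axis=0) >= 2).astype(int)
print("ensemble accuracy:", (majority == y).mean())
```

AdaBoost, which appears in the later sketches, refines this idea by reweighting the whole training set at every round rather than drawing fixed subsets.
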
Bagging vs. Boosting

"Bagging is a simple ensembling technique in which we build many independent predictors/models/learners and combine them using some model averaging techniques. (e.g. weighted average, majority vote or normal average)...

"Boosting is an ensemble technique in which the predictors are not made independently, but sequentially." (from Prince Grover)

So, as a mnemonic, think booSting is Sequential and bAgging is pArallel.
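
To make the distinction concrete, here is a minimal scikit-learn comparison (my own illustration, not from the workflow described below): the bagging ensemble fits its trees independently on bootstrap samples and averages them, while the boosted ensemble fits its trees one after another, each round concentrating on the previous round's mistakes. Dataset and hyper-parameters are arbitrary.

```python
# Bagging vs. boosting with scikit-learn; settings are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.05, random_state=42)

# Bagging: 50 trees fitted independently on bootstrap samples, then averaged.
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50)

# Boosting: 50 stumps fitted sequentially, each focusing on earlier errors.
boosting = AdaBoostClassifier(n_estimators=50)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```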

Any time, any place, anywhere

Because boosting is a meta-algorithm, it can be used with lots of classifiers in Knime.

For instance, I used Knime's Palladian nodes to classify the "20 Newsgroups" data set. Palladian's algorithm makes extensive use of n-grams. With a min/max n-gram size of 3/10, it gave an overall classification accuracy of 88.3%.

Although Palladian's classifier nodes are being used here, they could be any classifier, such as Naive Bayes.
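
As a rough Python analogue of that idea (a sketch only, not the actual Knime/Palladian workflow), character n-grams can feed a Naive Bayes learner, optionally wrapped in AdaBoost. The vectoriser settings and feature cap below are my own assumptions.

```python
# Sketch: character n-grams + Naive Bayes on 20 Newsgroups, with and
# without an AdaBoost wrapper. Not the Palladian algorithm itself.
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

def evaluate(classifier, ngram_range=(3, 10)):
    # Character n-grams roughly mimic an n-gram based text classifier.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=ngram_range,
                        max_features=50000),
        classifier,
    )
    model.fit(train.data, train.target)
    return accuracy_score(test.target, model.predict(test.data))

print("plain NB:      ", evaluate(MultinomialNB()))
print("boosted NB x10:", evaluate(AdaBoostClassifier(MultinomialNB(), n_estimators=10)))
```

The ngram_range argument is where the min/max n-gram sizes plug in, and n_estimators is the number of boosting iterations; note that wide character n-gram ranges make this naive version slow and memory-hungry.
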
So, I boosted the results 10 times... And only got 85.9%. Boosting 100 times gave 78.5%. Hmm, what gives?

This StackOverflow post gives a hint. "In general, boosting error can increase with the number of iterations, specifically when the data is noisy (e.g. mislabeled cases)... Basically, boosting can 'focus' on correctly predicting cases that contain misinformation, and in the process, deteriorate the average performance on other cases that are more substantive."

The post also has an interesting chat between data scientists about boosting and overfitting. "I think it's interesting that you've rarely seen gradient boosting overfit. Over the four or so years that I've been using it, I've seen the opposite -- too many trees leads to overfitting".

Indeed, Raschka writes that boosting algorithms are known for the "tendency to overfit the training data".
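
One way to watch this happen for yourself (my own sketch, not from the post or the book): inject label noise into a synthetic problem and track held-out accuracy after each boosting round using AdaBoostClassifier's staged_predict. The noise level and model settings are arbitrary.

```python
# Sketch: test accuracy per boosting round on noisy labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# flip_y injects label noise, the situation the StackOverflow answer describes.
X, y = make_classification(n_samples=2000, flip_y=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(n_estimators=200).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each boosting round,
# so we can see whether more rounds help or hurt on held-out data.
for rounds, pred in enumerate(model.staged_predict(X_te), start=1):
    if rounds in (10, 50, 100, 200):
        print(rounds, "rounds -> test accuracy", (pred == y_te).mean().round(3))
```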

Features! Features! Features!

What seemed to have the most impact on Palladian's algorithm was the choice of min/max n-gram size. A value of 15/15 gave only 67% accuracy, which compares poorly with the 88% achieved with 3/10.

[1] Python Machine Learning, Sebastian Raschka
