Friday, March 23, 2018

Hands on with Gradient Boosting


Boosting is a meta-algorithm, that is, "an algorithm that exists to manipulate some other algorithm".

It goes like this (from [1]):

  1. Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
  2. Draw a second random training subset d2 without replacement from D, add 50% of the samples that were misclassified by C1, and use this to train a weak learner C2.
  3. Find the training samples d3 in D on which C1 and C2 disagree, and use these to train a weak learner C3.
  4. Combine the answers from C1, C2, and C3 by majority voting (a rough code sketch of this recipe follows).
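
To make the recipe concrete, here is a minimal Python sketch of it, using scikit-learn decision stumps as the weak learners. The dataset, subset sizes and stump depth are my own illustrative assumptions, not part of [1].

```python
# Rough sketch of the three-learner boosting recipe above.
# Weak learners are decision stumps; subset sizes are arbitrary choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, random_state=0)
n = len(y)

# 1. Train C1 on a random subset d1 drawn without replacement from D.
idx1 = rng.choice(n, size=n // 3, replace=False)
C1 = DecisionTreeClassifier(max_depth=1).fit(X[idx1], y[idx1])

# 2. Train C2 on a second random subset d2 plus half of the samples C1 misclassified.
idx2 = rng.choice(n, size=n // 3, replace=False)
mis = np.where(C1.predict(X) != y)[0]
mis_half = rng.choice(mis, size=len(mis) // 2, replace=False)
C2 = DecisionTreeClassifier(max_depth=1).fit(X[np.union1d(idx2, mis_half)],
                                             y[np.union1d(idx2, mis_half)])

# 3. Train C3 on the samples d3 where C1 and C2 disagree.
idx3 = np.where(C1.predict(X) != C2.predict(X))[0]
C3 = DecisionTreeClassifier(max_depth=1).fit(X[idx3], y[idx3])

# 4. Combine the three weak learners by majority vote (binary labels 0/1 here),
#    and sanity-check the vote on the training set.
preds = np.stack([C1.predict(X), C2.predict(X), C3.predict(X)])
majority = (preds.sum(axis=0) >= 2).astype(int)
print("ensemble accuracy:", (majority == y).mean())
```

AdaBoost, which appears in the later sketches, refines this idea by reweighting the whole training set at every round rather than drawing fixed subsets.
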
Bagging vs. Boosting

"Bagging is a simple ensembling technique in which we build many independent predictors/models/learners and combine them using some model averaging techniques. (e.g. weighted average, majority vote or normal average)...

"Boosting is an ensemble technique in which the predictors are not made independently, but sequentially." (from Prince Grover)

So, as a mnemonic, think booSting is Sequential and bAgging is pArallel.
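
To make the distinction concrete, here is a minimal scikit-learn comparison (my own illustration, not from the workflow described below): the bagging ensemble fits its trees independently on bootstrap samples and averages them, while the boosted ensemble fits its trees one after another, each round concentrating on the previous round's mistakes. Dataset and hyper-parameters are arbitrary.

```python
# Bagging vs. boosting with scikit-learn; settings are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.05, random_state=42)

# Bagging: 50 trees fitted independently on bootstrap samples, then averaged.
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50)

# Boosting: 50 stumps fitted sequentially, each focusing on earlier errors.
boosting = AdaBoostClassifier(n_estimators=50)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```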

Any time, any place, anywhere

Because boosting is a meta-algorithm, it can be used with lots of classifiers in Knime.

For instance, I used Knime's Palladian nodes to classify the "20 Newsgroups" data set. Palladian's algorithm makes extensive use of n-grams. With a min/max n-gram size of 3/10, it gave an overall classification accuracy of 88.3%.

Although Palladian's classifier nodes are being used here, they could be any classifier, such as Naive Bayes.
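
As a rough Python analogue of that idea (a sketch only, not the actual Knime/Palladian workflow), character n-grams can feed a Naive Bayes learner, optionally wrapped in AdaBoost. The vectoriser settings and feature cap below are my own assumptions.

```python
# Sketch: character n-grams + Naive Bayes on 20 Newsgroups, with and
# without an AdaBoost wrapper. Not the Palladian algorithm itself.
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

def evaluate(classifier, ngram_range=(3, 10)):
    # Character n-grams roughly mimic an n-gram based text classifier.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=ngram_range,
                        max_features=50000),
        classifier,
    )
    model.fit(train.data, train.target)
    return accuracy_score(test.target, model.predict(test.data))

print("plain NB:      ", evaluate(MultinomialNB()))
print("boosted NB x10:", evaluate(AdaBoostClassifier(MultinomialNB(), n_estimators=10)))
```

The ngram_range argument is where the min/max n-gram sizes plug in, and n_estimators is the number of boosting iterations; note that wide character n-gram ranges make this naive version slow and memory-hungry.
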
So, I boosted the results 10 times... And only got 85.9%. Boosting 100 times gave 78.5%. Hmm, what gives?

This StackOverflow post gives a hint. "In general, boosting error can increase with the number of iterations, specifically when the data is noisy (e.g. mislabeled cases)... Basically, boosting can 'focus' on correctly predicting cases that contain misinformation, and in the process, deteriorate the average performance on other cases that are more substantive."

The post also has an interesting chat between data scientists about boosting and overfitting. "I think it's interesting that you've rarely seen gradient boosting overfit. Over the four or so years that I've been using it, I've seen the opposite -- too many trees leads to overfitting".

Indeed, Raschka writes that boosting algorithms are known for the "tendency to overfit the training data".
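
One way to watch this happen for yourself (my own sketch, not from the post or the book): inject label noise into a synthetic problem and track held-out accuracy after each boosting round using AdaBoostClassifier's staged_predict. The noise level and model settings are arbitrary.

```python
# Sketch: test accuracy per boosting round on noisy labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# flip_y injects label noise, the situation the StackOverflow answer describes.
X, y = make_classification(n_samples=2000, flip_y=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(n_estimators=200).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each boosting round,
# so we can see whether more rounds help or hurt on held-out data.
for rounds, pred in enumerate(model.staged_predict(X_te), start=1):
    if rounds in (10, 50, 100, 200):
        print(rounds, "rounds -> test accuracy", (pred == y_te).mean().round(3))
```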

Features! Features! Features!

What seemed to have the most impact on Palladian's algorithm was the choice of min/max n-gram size. A value of 15/15 gave only 67% accuracy, which compares poorly with the 88% achieved with 3/10.

[1] Python Machine Learning, Sebastian Raschka
