Supervised Learning: Ensemble Learning and AdaBoost

swyx - Jan 27 '19 - Dev Community

This is the 6th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.

Intuition

The purpose of ensemble learning is to learn weights for a bunch of features that have some unknown relationship to whatever you are trying to classify.

For example, classifying email into spam/not spam could use some factors like the mention of Viagra or cryptocurrency or money from Nigeria. Ensemble models take all these little factors and learn the weights in order to spit out the broader result.

This can be helpful when the thing we are trying to classify presents in different ways; for example, financial spam looks different from NSFW spam. In other words, some rules work better on some subsets of the data than on others, but at the end of the day we just want one overall rule that throws out spam. In that case, our ensemble would learn something close to OR logic. If the rules have contingent dependencies (say, the mail is spam only if Rule 1 and Rule 5 are both active), then the ensemble approximates AND logic. And so on.
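Here is a toy sketch of that weighted-vote idea (my own illustration, not from the lecture; the rules, weights, and threshold values are all made up):

```python
# Three hand-written spam "rules" combined by a weighted vote.
# Depending on the threshold, the same weighted vote behaves like OR or AND.

def mentions_viagra(email):  return "viagra" in email.lower()
def mentions_crypto(email):  return "cryptocurrency" in email.lower()
def mentions_nigeria(email): return "nigeria" in email.lower()

rules = [mentions_viagra, mentions_crypto, mentions_nigeria]
weights = [1.0, 1.0, 1.0]  # a real ensemble would learn these

def is_spam(email, threshold):
    score = sum(w * rule(email) for w, rule in zip(weights, rules))
    return score >= threshold

msg = "Get rich quick with cryptocurrency!"
print(is_spam(msg, threshold=1.0))  # OR-like: any one rule firing is enough -> True
print(is_spam(msg, threshold=3.0))  # AND-like: all rules must fire -> False
```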

I've portrayed the rules here as something consciously picked. But to go one level deeper, we should apply learners to subsets of the data and let them pick the rules.

Bagging/Bootstrap Aggregation

Picking subsets is actually a bigger assumption than it seems. How we pick and separate subsets is up for debate, and even choosing to work with subsets at all assumes the data is separable that way.

Picking data uniformly at random actually works pretty well (see random forests vs the decision trees we have already studied). Take the standard UCI Housing Data example: a simple average ensemble of 3rd order polynomials, each fit on a random subset of 5 data points (5 subsets in all), beats a single 3rd order polynomial trained over the whole dataset ("beats" here means it does better on the out-of-sample test/validation set). The intuition is better generalization and less overfitting: each small fit picks up different noise, so averaging the fits cancels much of that noise out. This is called Bagging, or Bootstrap Aggregation.
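A minimal sketch of the idea with numpy (the noisy-sine toy data is my own stand-in for the housing data, and whether the ensemble wins on any given run depends on the random draw):

```python
# Bagging sketch: average five 3rd-order polynomial fits, each trained on
# 5 random points, versus one 3rd-order polynomial fit on all training points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)  # noisy target

perm = rng.permutation(len(x))          # random train/test split
train, test = perm[:30], perm[30:]

# Bagged ensemble: fit each model on a small random subset, then average.
preds = []
for _ in range(5):
    subset = rng.choice(train, size=5, replace=False)
    coeffs = np.polyfit(x[subset], y[subset], deg=3)
    preds.append(np.polyval(coeffs, x[test]))
bagged = np.mean(preds, axis=0)

# Single model trained on all the training data at once.
single = np.polyval(np.polyfit(x[train], y[train], deg=3), x[test])

print("bagged ensemble MSE:", np.mean((bagged - y[test]) ** 2))
print("single model MSE:   ", np.mean((single - y[test]) ** 2))
```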

However, you do tend to lose interrelationships that you might have picked up on otherwise, and the rules from each subset have correspondingly lower statistical power because of the smaller sample size.

Boosting/AdaBoost

Instead of uniformly randomly picking subsets, we should focus each round of subset-picking on the "hardest" examples.

Instead of combining multiple rules by simple average, we can use a "weighted" mean.

To define what we mean by "hard" and how we "weight", we need to revisit the definition of error. Typically this is the number of wrong classifications (in the discrete classification case) or the squared deviation from the model's predictions (in the continuous regression case). However, some kinds of examples don't show up often, so we don't get much opportunity to train on them; the dominance of the common cases can lead the learner to overlook important, rarer ones. If 99% of your email is spam, a simple spam classifier could just learn to classify ALL your mail as spam and still boast a 1% error rate.
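A better notion of error weights each mistake by how much we currently care about that example. Formally (this matches Schapire's setup, with $D$ a weight distribution over the training examples):

$$\varepsilon \;=\; \Pr_{i \sim D}\big[h(x_i) \neq y_i\big] \;=\; \sum_{i:\, h(x_i) \neq y_i} D(i)$$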

So we should, in our ensemble learning, try to progressively focus on the examples that are "harder" to get right as compared to the subsets of data that are well taken care of.

Focusing on the classification case, this means modifying each round's distribution to focus on where the model disagrees with our training data (aka the "hardest" examples). Formally, you can represent the update like this:

$$D_{t+1}(i) \;=\; \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}$$

(where $Z_t$ is a normalization constant so that $D_{t+1}$ remains a probability distribution)

Here $y_i$ and $h_t(x_i)$ represent the training label and the model's prediction, each either +1 or -1. Multiplying them together with a positive per-round weight $\alpha_t$ lets us raise (or boost) the importance $D_t$ assigns to an example when label and prediction disagree (the exponent becomes positive), and lower it when they agree. For a fuller explanation of the adaptive boosting (AdaBoost) algorithm, read Robert Schapire's overview paper here. (alt link)

The final hypothesis sums up the $\alpha_t$-weighted results of all the models trained so far and just takes the sign of the total.
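Putting the pieces together, here is a compact sketch of the algorithm (my own illustration, not the course's reference code; the decision-stump weak learner, the `fit_stump` helper, and the toy data are all my choices):

```python
# AdaBoost sketch following Schapire's description, with labels in {-1, +1}.
import numpy as np

def fit_stump(X, y, D):
    """Find the stump (feature, threshold, polarity) with lowest weighted error."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for polarity in (+1, -1):
                pred = np.where(X[:, j] >= thresh, polarity, -polarity)
                err = np.sum(D[pred != y])  # error measured under distribution D
                if err < best[0]:
                    best = (err, (j, thresh, polarity))
    return best

def stump_predict(stump, X):
    j, thresh, polarity = stump
    return np.where(X[:, j] >= thresh, polarity, -polarity)

def adaboost(X, y, rounds=10):
    n = len(y)
    D = np.full(n, 1.0 / n)  # start with a uniform distribution
    ensemble = []
    for _ in range(rounds):
        err, stump = fit_stump(X, y, D)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # per-round weight
        pred = stump_predict(stump, X)
        D *= np.exp(-alpha * y * pred)  # boost disagreements, shrink agreements
        D /= D.sum()                    # renormalize (the Z_t step)
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    # Final hypothesis: sign of the alpha-weighted sum of all weak hypotheses.
    scores = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(scores)

# Toy usage: learn a noisy-free 1-D threshold concept.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.where(X[:, 0] > 0.5, 1, -1)
ensemble = adaboost(X, y, rounds=5)
print("train accuracy:", np.mean(predict(ensemble, X) == y))
```

In practice you would reach for a library implementation such as scikit-learn's AdaBoostClassifier, but the update loop above is the whole trick.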

Weak Learners

A weak learner is a learner that will do better than chance when it tries to learn your hypothesis rules on any data distribution. This means it has some level of information gain (as we defined in our Decision Trees chapter), compared to chance, which has none. This includes Decision Trees, Perceptrons, and any other classification/regression model you can imagine, which makes Boosting a meta-machine-learning algorithm built on top of existing algorithms.
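Formally (this is the standard statement, with $\gamma$ denoting the weak learner's "edge" over chance): a weak learner guarantees some fixed $\gamma > 0$ such that, for every distribution $D$ over the examples,

$$\Pr_{i \sim D}\big[h(x_i) \neq y_i\big] \;\leq\; \frac{1}{2} - \gamma$$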

The emphasis on any data distribution is important: it means your rules need to carve up the space of possible states in a way that lets a weak learner beat chance under any distribution (see this answer for more). So despite the name, this is a fairly demanding condition; however, it is still weak enough for our purposes.

So boosting relies on a weak learner at each step of the way to supply a hypothesis. It doesn't need to cleanly slice the dataset in half each time; it just needs to do better than random at classifying some part of the dataset, after which we can refine and change the distribution to focus on the remaining disagreements. AdaBoost is thus an algorithm for constructing a "strong" classifier as a linear combination of "simple" "weak" classifiers.

The actual boosting - the upweighting of disagreements - doesn't strictly have to happen for this method to work, but it does speed up convergence by disproportionately focusing each round on the disagreements.

We have done a lot of theory here - so please check this visual example to make sure you nail the underlying intuition in boosting and hypothesis forming.

As you can see there, ensemble learning gets its power from composing very simple rules to map out complex spaces. In some ways this is analogous to binary search reducing an O(N) problem to an O(log N) problem: by slicing up the dataset progressively each round, we ensure that the final error rate (the examples we still classify wrongly) is small, and it can almost always be improved (by more than random) by adding one more weak learner.

Boosting Rarely Overfits

Overfitting is a constant worry in models (we've discussed it many times before): training error continues to go down, but the test/validation error starts going up. Curiously enough, with Boosting models this often doesn't happen; test error can keep improving even after training error has flattened out. We'll discuss this more in the next chapter after our SVM discussion, but Schapire's various talks and papers (here and here) offer some theoretical backing for it.

Next in our series

Schapire has also extended Boosting to the multi-class case, which has its own unique challenges; see this lecture.

Further notes on this topic can be found here, with a specific slide deck on AdaBoost here.

Hopefully that was a good introduction to Ensemble Learning and AdaBoost. I am planning more primers and would love your feedback and questions.
