Gradient Boosting Machine Learning
Professor Hastie takes us through ensemble learners, such as decision trees, Random Forests, and boosting, for classification problems, and discusses how tree depth affects these methods.
Talking Points:
- Information on Trees
- How does Bagging Work?
- How do Random Forests work?
- How Does Boosting Work?
- Stagewise Additive Modeling
- Comparing and contrasting Bagging and Random Forests
- Tree Size
- Learning Ensembles
- Backing Off the Residual Analysis
- Explainability for Forests
- Problems Boosted Trees Will Overfit
- Do Deeper Trees Fit Better?
- Random Forests and Boosting Considering Tree Depth
- Uncertainty of Results
Speakers:
Trevor Hastie, Professor of Statistics, Stanford University
Trevor Hastie:
Good morning, everybody. It's a big pleasure to be here. This is a talk that normally takes an hour, but she told me to do it in half an hour, so I am going to have to talk twice as fast, like a computer scientist instead of a statistician. She also asked me to talk about ensemble learners, which covers two of the tools included in the H2O suite. These are very powerful tools that I am very fond of myself, so I am going to tell you about them: Random Forests and Boosting. These are not brand new ideas, but they are still very effective tools that everybody who does data science should be familiar with and should have in their bag of tools. They both build on trees, which go back a long way.
Information on Trees
I am going to tell you a little bit about trees first. To do that, I will start with a little toy example. Here we have a red class and a green class. You can see the red class is in the middle and the green class is on the outside. There is some overlap, and we want to classify red from green given the X1 and X2 coordinates. Shown in black is the ideal decision boundary, the one you would use if you knew the population from which the data came; it's called the Bayes decision boundary. We can use this as a little toy example in two and higher dimensions. Here is the same problem, but with no overlap. Sometimes you have classification problems of this kind too: the red is entirely contained inside the circle and the green is entirely outside, but you only get given a set of points and you have to estimate this decision boundary.
That's another version of the problem. So this is a classification tree for trying to solve that problem. I think you are probably all familiar with these kinds of trees. What it does is ask a series of questions. You come in with a pair of coordinates for X1 and X2 and it asks you a question: is X2 less than 1.06? If yes, you go left. If no, you go right. If you go right, it asks some more questions. Eventually you get down to a terminal node and it says whether you're inside or outside, whether you are red or green. So it asks these questions.
This can be fairly effective for building a classifier. It decides which variables to ask the questions of; in this case there are only two, so it does not have too much choice, and it can build a flexible decision boundary. Here is the decision boundary for this data. This is the noiseless case, and you see it has made a boxy decision boundary to approximate the circle. That box comes from asking these coordinate-wise questions: are we left, are we right, and so on. At the end of the day, all the points that the tree classifies as red are inside the box and all those outside are green. So it does not do a perfect job; it's noisy. But this is only a two-dimensional problem, and it does pretty well.
It makes 7% errors, but in fact you could get 0% errors if you knew the truth, because there is no noise in this problem. Okay. Now take the same problem and turn it into a 10-dimensional problem: imagine a ball in 10 dimensions with the red inside and the green outside. Then this tree would make 30% errors. So the problem gets harder as you go into higher dimensions, and if you add some noise variables it gets even worse. But a tree still seems fairly flexible; it's non-parametric and can find automatic rules fairly easily. So, some of the properties of trees: they are very good at handling very big data sets and they can handle mixed predictors.
That means if you have got continuous variables and categorical variables, trees handle them equally well. They can ignore redundant variables: if you have redundant variables that don't really have anything to say, the tree will never choose them for any of the split variables. Trees can handle missing data fairly elegantly. Small trees are easy to interpret; this one is a relatively small tree. The data are rather vanilla in this case, but we could interpret this tree by seeing which variables the splits had cut on. On the other hand, large trees are hard to interpret, and often we need a very bushy tree before we get good performance. Those are really hard to interpret, right? Because there are so many variables involved in the splits. But the most damaging thing about trees is that the prediction performance is often poor, and it turns out that's largely because of high variance. So trees don't give very good prediction performance for general problems.
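As a rough illustration of that single-tree behavior, here is a minimal sketch (my own illustration, not the talk's code; the sample sizes, radius, and tree size are arbitrary choices) that grows one classification tree on a noiseless two-dimensional "circle inside a square" problem like the toy example above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy nested-class data in 2-D: class 1 (red) inside a circle, class 0 (green) outside.
X_train = rng.uniform(-1, 1, size=(2000, 2))
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 < 0.5).astype(int)
X_test = rng.uniform(-1, 1, size=(2000, 2))
y_test = (X_test[:, 0] ** 2 + X_test[:, 1] ** 2 < 0.5).astype(int)

# A single tree approximates the circular boundary with a boxy one.
tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X_train, y_train)
print("single-tree test error:", (tree.predict(X_test) != y_test).mean())
```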
The rest of the talk is about ways of leveraging trees to improve their performance. There are two main ideas here. One is "Bagging" and later "Random Forests", which are due to Leo Breiman. The idea there is that an individual tree is very noisy; it's subject to variance. So if you can find a way of averaging many different trees, you can bring down the variance through averaging. "Bagging" and "Random Forests" are ways of doing that, and Random Forests is better than Bagging. So that's a very effective method. And then there is Boosting, which is also a way of averaging trees, but it does it in a way that also learns from the errors of the previous trees, whereas Random Forests just averages trees grown independently to the data.
I think this is an accurate statement of dominance: Boosting dominates Random Forests, which dominates Bagging, and they all dominate a single tree. But Boosting takes considerably more effort; it's slightly more to quite a bit more work to train, tune, and fiddle with, whereas Random Forests is more or less out of the box. Okay. Here is Bagging. There are lots of details, by the way; these little pictures here, which you probably can't see, are pictures of our book, and I am referring to pages and sections of the book. If you notice, the banner at the top indicates that this talk is actually part of a course that Rob Tibshirani and I teach; I've taken the slides from the course. So if any of you were thinking of signing up for the course, you can ask for a discount because you have already seen this part of it.
How does Bagging Work?
Good luck with that, because my wife is the course administrator. So how does Bagging work? You want to get different trees so that you can average them, but the trees are all trying to solve the same problem. One way to make them different is to shake the data up a little bit. A way to shake the data is to take a random sample of the training data rather than using the same training data every time. If you change the data, the tree that gets grown will be different, and if the trees are different enough, you can average them. When you average things that are different, you usually bring the variance down. Leo Breiman's idea was to do bootstrap sampling of the training data. That means if you have got n training points, you sample n points with replacement from the training data; that gives you a different sample.
Some data points will appear more than once, some not at all, and that shakes the data up. Grow many trees to bootstrap samples (thousands of trees), then average them and use that as your predictor. Now, you can average them in different ways. The way I prefer these days: if you're doing a classification problem, each tree, at any given terminal node, gives you an estimate of the probability of class one versus class two. So each tree gives you an estimate of the probability at the particular point where you want to make the prediction. You just average those probabilities and you get a better probability, with less variance.
So this is just a cartoon showing the tree grown to the original data and then trees grown to bootstrap samples. There are a few of them there, but as I said, you are going to grow thousands. Well, that cleans things up on this little problem to a certain extent in the two-dimensional case. Notice that the boxy decision boundary has become much smoother now, because it's the average of many little boxy decision boundaries. It's smoother and has gotten rid of some of the variability. Of course, the classification performance improves as well: we're now down to 0.03 from 0.07, so it's half the misclassification error.
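A minimal sketch of that recipe (again my own illustration, with an arbitrary synthetic dataset and number of trees): grow one unpruned tree per bootstrap sample and average the class-probability estimates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

def bagged_trees(X, y, n_trees=200):
    """Bagging: grow one unpruned tree per bootstrap sample of the training data."""
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # sample n points with replacement
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return trees

def bagged_proba(trees, X):
    """Average the per-tree probability estimates of class 1."""
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)

trees = bagged_trees(X, y)
print("bagged training accuracy:", ((bagged_proba(trees, X) > 0.5) == y).mean())
```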
How do Random Forests work?
It turns out that the bootstrap sampling wasn't enough. If you want to get the benefit of variance reduction by averaging, let's just say random variables, you would like them to be uncorrelated. If they are completely uncorrelated, when you average them the variance goes down as one over the number that you're averaging. If they are correlated, that limits the amount by which you can reduce the variance. The key is to have these trees, in some sense, uncorrelated with each other; the more uncorrelated, the more you bring down the variance. Leo Breiman had another great idea, which was a way of decorrelating these trees some more. That was to introduce additional randomness when you grow the tree. Normally, each time you make a split, you would search over all the variables and look for the best variable and the best split point to make the split.
A Random Forest does the following: it says, I am going to limit you to only m variables out of the full p. So let's say p is a hundred variables and I limit you to five. You pick five of the variables at random and you search for the splitting variable amongst those five. Each time you come to a new place to split, you do the same thing. So there is this randomness in which variables get used for splitting, and that decorrelates the trees more. It means you get a much better reduction in variance from the averaging. This is an ROC curve drawn in Hastie style; usually they go the other way, right? It shows that Random Forests (in red) outperforms, in this case, a support vector machine (in green), and it also outperforms a single tree, in classifying some spam data.
I had another picture of Random Forests; well, that will come in a bit. Random Forests have some other nice features as well. For example, in growing the Random Forest you get for free what's called the out-of-bag error rate, which behaves like leave-one-out cross-validation. You get it completely for free during the same process as growing the forest. So it's one pass through the data, growing these trees, and you're done; it's really an out-of-the-box method. This is just a little picture showing how the correlation between trees goes down as the number of variables you pick at each split gets smaller. So if you randomly pick one variable out of the pool, the correlation between the trees is the smallest.
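In scikit-learn terms (an illustrative sketch, not the software used in the talk), the max_features argument plays the role of m, the number of candidate variables at each split, and oob_score=True gives the free out-of-bag error estimate mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# max_features is "m": how many of the p variables are candidates at each split.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)   # comes for free from the same fit
```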
How Does Boosting Work?
The more you pick, the higher it goes. This is just to show how the correlation works; there is a chapter in our book that describes all of this in more detail. So that's Random Forests. I will show you some further pictures on Random Forests in a bit. Then, finally, we are going to talk about Boosting. Boosting is another way of averaging trees, but it's a little different. This is the original description of AdaBoost, which was proposed by Freund and Schapire back in the late nineties. You see it's again an average of trees: we sum them up and put coefficients in front of them, and that gives us our classifier. Boosting works slightly differently. Each time, Boosting will look at how well it's doing, reweight the data to give more weight to areas where it's not doing so well, and then grow a tree to fix that up.
It's not growing independent, identically distributed trees; it's actually trying to go after the places where the current model is deficient. Boosting appears to do better than Random Forests and Bagging. So here are some performance pictures on those nested-sphere examples. What I am showing you is test error on the left, which you want to be small, against the number of terms, which in this case is the number of trees. We see Bagging, in red, on the test data drops down and then sort of levels off. Here is the performance of Boosting, which in this case does quite a bit better: this is 20%, this is 10%. This is the nested spheres in 10 dimensions, where a single tree got 30% errors; Bagging gets 20%, Boosting gets 10% errors. Okay. AdaBoost came with a very detailed algorithm for how you reweight the data. The idea is that you grow some trees, and then, based on the performance of the model so far, you reweight the observations. They give very specific recipes for how you do the re-weighting, and then you grow a tree to the weighted data. They also gave a recipe for how to compute those weights, alpha. So the details are all there; it's fairly intuitive. We won't dwell on that now.
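A bare-bones sketch of that reweighting scheme (AdaBoost.M1 with stumps; this is a simplified illustration, not the exact recipe on the slide), assuming labels recoded as -1/+1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=2000, n_features=10, random_state=0)
y = 2 * y01 - 1                                  # recode labels as -1 / +1

n = len(y)
w = np.full(n, 1.0 / n)                          # start with equal observation weights
stumps, alphas = [], []
for _ in range(200):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = stump.predict(X) != y
    err = np.sum(w * miss) / np.sum(w)           # weighted error of this stump
    alpha = np.log((1 - err) / (err + 1e-12))    # its coefficient in the final vote
    w *= np.exp(alpha * miss)                    # upweight the points it got wrong
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
print("training error:", (np.sign(score) != y).mean())
```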
Now, one of the remarkable things: with Random Forests you tend to grow quite bushy trees, so that they don't have much bias but they do have high variance, and then you get rid of the variance by averaging. With Boosting, depending on the problem, you might actually grow very shallow trees. In this case, for this problem, we're using Boosting with stumps. A stump is a tree with a single split; that's about the simplest tree you can get, right? One split, two terminal nodes, it's called a stump. Here we see the performance of Boosting with stumps on this nested-spheres problem, and it does really well. This is AdaBoost with stumps; it's down to 1%. So there is something really special going on with Boosting. Something else at first shocked the community. What I am showing you is training error in green and test error in red. What you see is that the training error goes down and, at some point, hits zero and stays zero, but the test error continues going down.
That makes it clear that Boosting is not using training error as the basis for learning its trees; it must be using something else, because once the training error is zero there would be nothing left to do, right? So that's interesting. Here is another interesting plot, because the early Boosting people would say that Boosting never overfits. It's not hard to make problems where Boosting does overfit, and this gives an example: the training error keeps on going down, but at some point the test error starts increasing, which means you are overfitting.
Stagewise Additive Modeling
It turns out what Boosting is doing is building a rather special kind of model. We call it stagewise additive modeling. This is generic notation: you can think of each of these b's as being a tree, which is a function of your variables x. It's got some parameters, gamma, which would be the splitting variables, the split points, and perhaps the values in the terminal nodes. Then we add these trees together and give them coefficients beta. Sometimes these b's are called weak learners; they are like primitive predictors. When you think of it from this point of view, this is similar to models we fit in statistics all the time. For example, generalized additive models are special additive models where each function is a function of a single coordinate.
Whereas here, each term is a function of the whole vector x of predictors. There are things called basis functions, for example polynomial expansions, which are of this form. Okay. There are many others. One big distinction is that when we have models of this kind in statistics, we optimize all the parameters jointly: you have got this big set of parameters and you try to optimize them jointly. Neural networks are like that too; we optimize all the parameters jointly. The thing about Boosting is that we optimize the parameters in what's called a stagewise fashion. Each time, we optimize the parameters of the next tree, or whatever it is we are growing, while all the rest is held fixed. So it's called stagewise fitting, and it slows down the rate at which you overfit by not going back and adjusting what you did in the past. That might sound like a handicap or a limitation, but it turns out to be a very reasonable way of regularizing the rate at which you overfit the data.
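In symbols (a reconstruction of the notation described above, with L standing for a generic loss function), the additive expansion and its stagewise fitting step are:

```latex
% Generic additive expansion: a sum of weak learners b(x; \gamma_m) with coefficients \beta_m
f(x) = \sum_{m=1}^{M} \beta_m \, b(x;\, \gamma_m)

% Stagewise fitting: with the previous fit f_{m-1} held fixed, the m-th step solves
(\beta_m, \gamma_m) = \arg\min_{\beta,\,\gamma} \; \sum_{i=1}^{N} L\!\left(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\, \gamma)\right),
\qquad f_m(x) = f_{m-1}(x) + \beta_m\, b(x;\, \gamma_m).
```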
Even though Boosting was first proposed for the classification problem, it's really easier to understand for a regression problem. In that case, what it really amounts to is repeatedly fitting the residuals, and that's about as easy as it gets. Let's say we have a regression problem with a continuous response. We start off with no terms in the model, so our model is zero and the residual is just the response vector y. Now you go in a loop, and at each stage you fit a regression tree to the residuals, giving you some little function g of x. Then maybe you shrink it down a little bit; we'll talk about this shrinkage in a minute. Then you update your model by adding the term you have just created, and you update your residuals.
So each time you take the residuals and grow a little tree to fix up what you're missing. You add that to your function, subtract it from your residuals, and keep going; you constantly fit the residuals. Each time, the thing you use to fit them is some primitive little tree, but it's trying to fix up the places in the feature space where you're doing badly. It is constantly fixing up the residuals, and AdaBoost was doing that in its own way for classification problems. This shrinkage is actually important too; it slows down the rate of overfitting even more. You grow a little tree to the residuals, but instead of just using that tree, you shrink it down towards zero by some amount epsilon, and epsilon can be 0.01, right? That updates the model by only a small amount. Now you grow a tree to the residuals again, but you might get a slightly different tree now. It gives you an opportunity to get lots and lots of slightly different little trees to fix up the residuals, and not to use up all your degrees of freedom in big chunks. This is a cartoon, but it shows that by slowing down, with epsilon at 0.01, it takes longer to overfit. This is meant to be test error.
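Here is a minimal sketch of that loop for squared-error regression (an illustration only; the depth, the shrinkage epsilon, and the dataset are arbitrary choices): fit a small tree to the residuals, shrink it, add it in, and refresh the residuals.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

eps, trees = 0.01, []
residual = y.astype(float)                  # start with model = 0, so residual = y
for _ in range(500):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # fit what is left over
    residual -= eps * tree.predict(X)       # shrink the update, then refresh residuals
    trees.append(tree)

def predict(trees, X, eps=0.01):
    return eps * sum(t.predict(X) for t in trees)

print("training RMSE:", np.sqrt(np.mean((y - predict(trees, X)) ** 2)))
```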
This next part I am just going to skim over. There is a section in our book which explains what AdaBoost is actually fitting if you study it from the right point of view. From a particular point of view, you can see that it is fitting an additive logistic regression model. So AdaBoost is modeling the log-odds of class one versus class minus one by a sum of trees. It's using an unusual loss function, the exponential loss, as the basis for optimization, as opposed to zero-one error, i.e., misclassification error, which we saw got down to zero; this is more like a likelihood. When you look at it from that point of view, there is a picture in which you can compare different loss functions: for example, this exponential loss, which I believe is the blue curve, to the binomial log-likelihood, which is the red or the green curve. That puts it in the same framework as logistic regression with the binomial deviance as the loss function.
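For reference (a reconstruction using the usual plus/minus-one coding of the response), the two losses being compared can be written as functions of the margin y f(x):

```latex
% Exponential loss (what AdaBoost minimizes) and binomial deviance (logistic regression),
% with y \in \{-1, +1\}; here f(x) is half the log-odds, the population minimizer of both.
L_{\exp}\bigl(y, f(x)\bigr) = \exp\!\left(-y\, f(x)\right),
\qquad
L_{\mathrm{dev}}\bigl(y, f(x)\bigr) = \log\!\left(1 + e^{-2\, y\, f(x)}\right).
```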
Understanding Boosting as minimizing a loss function in this stagewise fashion sets the stage for creating a more general class of Boosting algorithms. That's what's actually implemented in the GBM functions in R and in H2O. The general model is as follows. It's stagewise fitting: you have got a loss function, you have got a response, and you have your current model, F with m minus one terms already in it. Now, what you try to do at each stage is figure out the best little improvement to make to your current model. In this case we are required to find a little function b, which is a function of the features and some parameters, with a coefficient beta in front of it. That is all we need to optimize.
We need to minimize with respect to beta and gamma to find this piece. Then we update our function by adding that piece in, or we might shrink it down before updating the function. So that's stagewise additive modeling: each time, these might be fairly simple functions that are fairly easy to fit. Gradient Boosting uses small trees to represent these functions; that's what happens in GBM. Gradient Boosting was introduced by Friedman in 2001. It works with a variety of different loss functions and covers classification, the Cox model, Poisson regression, logistic regression, binomial, and so on. What it does is look at the loss function and evaluate the gradient of the loss function at the observations; then it approximates that gradient by a tree and uses that as the basis for estimating this function.
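A compact sketch of that gradient step for one particular loss, the binomial deviance with a 0/1 response (a simplified illustration: a full GBM implementation also does a line search or Newton step in each terminal node, which is omitted here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)  # y in {0, 1}

eps, F, trees = 0.1, np.zeros(len(y)), []
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-F))                 # current probabilities from log-odds F
    neg_grad = y - p                             # negative gradient of the deviance
    tree = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)  # approximate it by a tree
    F += eps * tree.predict(X)                   # shrunken stagewise update
    trees.append(tree)

print("training error:", ((F > 0) != (y == 1)).mean())
```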
Comparing and contrasting Bagging and Random Forests
So there are details that you can read about in our book and elsewhere. The one thing I wanted to tell you is that tree size turns out to be an important parameter in the story; I will explain that in a few slides. First, Boosting on the spam data: that's the green curve. This is test error against the number of trees, and we go up to 2,500 trees. This is gradient Boosting with five-node trees, that's four splits. Random Forest does slightly worse, and this is Bagging. The one nice thing about Random Forests is that it does not overfit, so you can't have too many trees in a Random Forest; it just stabilizes, because the trees are all identically distributed. At some point you don't get any benefit from averaging more, but you don't lose anything.
With Boosting, you certainly can overfit; you might even argue that here we are starting to overfit. There is more tinkering with Boosting. Here are some details on the spam data. Both Random Forests and Boosting give you something called a variable importance plot; this is for the spam data again. There are 57 variables originally, and the plot shows you how important each of the variables is. Some are very important, some less so, which is useful, and some variables are left out completely.
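Both methods expose such importance scores in software. As a sketch (illustrative only, with a synthetic dataset standing in for the spam data), you could pull them out of a fitted gradient boosting model like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=300, max_depth=2,
                                 learning_rate=0.1, random_state=0).fit(X, y)

# One relative importance score per predictor; unused variables score near zero.
for j in np.argsort(gbm.feature_importances_)[::-1][:5]:
    print(f"feature {j}: importance {gbm.feature_importances_[j]:.3f}")
```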
Tree Size
Tree size: this is showing, on the spam data, the performance of Boosting with different numbers of splits, so deeper and deeper trees. The remarkable thing is that stumps actually do the best. Here are 10-node trees; here are 100-node trees. This one was actually AdaBoost.
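A sketch of how one might rerun that kind of depth comparison (not the slide's actual experiment; the dataset and the grid of depths are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_depth=1 gives stumps (an additive fit); deeper trees allow higher-order interactions.
for depth in (1, 2, 4):
    gbm = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.05,
                                     max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth {depth}: test error {1 - gbm.score(X_te, y_te):.3f}")
```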
The way to understand that is this: if you fit stumps, each tree involves only a single variable. If you look at the whole ensemble of trees, you can take all those trees that split on variable X1 and clump them together, and all those on X2 and clump them together. What you will see is that you have got an additive model, with no interactions. On the other hand, if you allow two splits, each tree can involve at most two variables, so that gives you at most second-order interaction models. The depth of the tree controls the interaction order of your model. I bet you all know this by now. So why did stumps do so well on the nested-spheres problem? Well, the decision boundary for nested spheres is the surface of a sphere.
Learning Ensembles
A simple additive quadratic equation describes the surface of a sphere, and stumps fit an additive model; in fact, if you collect all those terms, they look like quadratic functions of each coordinate. Right? The last thing I am going to talk about, in my overtime, is the concept called learning ensembles. Both Random Forests and Boosting, and some other methods, build a collection of trees and then average them together or add them together with coefficients. You can think of it as two phases: one is building up the dictionary of trees, using whichever mechanism you use, and the other is adding them together. Random Forests just gives the trees equal weight and adds them together; Boosting does that too, or modifies the weights with which you add them together. This suggests something one can do to go beyond these methods.
That is to take the collection of trees that either of these methods gives you and try to combine them in a slightly cleverer way, as a post-processing step. One way to do that is with the Lasso. The Lasso was designed for linear regression: if you have got a bunch of variables and you want to shrink the coefficients and set some of them to zero, the Lasso is a very popular method for doing that. What you do is put an L1 penalty on the coefficients. The idea is that you take your collection of trees from Boosting, and you can think of each tree as a transformation of the predictors, because if you take your training observations, the x's, and pass each one down the tree, it gives you a number.
That gives you a single vector over the training observations: one tree is the evaluation of each of those training observations by that tree. If you have got a thousand trees, you're going to have a thousand vectors. Now you can think of those as variables in a linear regression model, and you want to see how to weight them optimally, possibly leaving a lot out. A lot of these trees are going to be very similar, so you don't need the whole forest to do as well. The Lasso is a really nice way of doing that post-processing, and this is just a little figure in our book which shows how you can benefit from it. One real advantage is that you can often actually get slightly better performance than either the Random Forests or the Boosting model.
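As a sketch of that post-processing idea in a regression setting (my own illustration, not the book's experiment): treat each fitted tree, evaluated on the training data, as a column of a new design matrix and run the Lasso on those columns.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, max_depth=2,
                                learning_rate=0.05, random_state=0).fit(X, y)

# Each tree, evaluated on the training observations, becomes one "variable".
Z = np.column_stack([t[0].predict(X) for t in gbm.estimators_])

# The Lasso re-weights the trees and sets most of their coefficients exactly to zero.
post = LassoCV(cv=5).fit(Z, y)
kept = np.flatnonzero(post.coef_)
print(f"trees kept: {kept.size} out of {Z.shape[1]}")
```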
But the big advantage is that you can get rid of the vast majority of the trees. If you're actually going to use this model in production, you don't have to have a gazillion trees around; you can have a much smaller subset. I will just end with a slide which says there is nice software for fitting Random Forests and Boosting, both in R and in H2O. H2O can handle very large data sets; it scales very well. Thanks for your attention. I am happy to take questions if you want; otherwise I will be hanging around afterwards.
Audience Member:
A question on the analysis algorithm: is it possible to back off the original analysis?
Trevor Hastie:
To back off of the what?
Audience Member:
So you have got this Boosting algorithm, a set of trees that are correlated with each other because they are fit to the residuals. Has it been explored to back off the residual analysis, keeping the gross structure of the trees, and just adjust the further analysis of the residuals, so that you can get some efficiencies, say, in real-time processing when fitting fresh data?
Trevor Hastie:
It's a little hard for me to hear you because the mic's not working.
Audience Member:
Okay. Can you hear that okay?
Trevor Hastie:
That's better.
Audience Member:
Okay, great. Sorry about that.
Trevor Hastie:
So you're going to back off in some way.
Backing Off the Residual Analysis
Audience Member:
So, you have done the analysis on your data, your training set, and you have got stumps that are somewhat correlated with each other, as they are doing analysis on the residuals. Has there been any exploration of being able to back off the residual analysis up to a certain point that is grossly similar between certain data sets or certain applications of this model? You would then be able to do the analysis only on the further residuals, the idea being that you could get some efficiencies in post-analysis, so that it's more efficient for real-time processing.
Trevor Hastie:
These models are very open to tinkering; there are lots of ways you can tinker with them. So, if I understand correctly, you want to stop the Boosting and then do something else to the residuals. One thing you can do is first fit a traditional model to your data, a linear model, to get some strong effects, then take the residuals from that model and run Boosting on those. I am not sure if that's getting close to what you suggested.
Audience Member:
I guess my point is that if you have a gross model, and your analysis begins to vary from the new input data that you're actually applying the model to, you would want to be able to adjust on the fly. So you don't have to go back and start from a root analysis; you could back it off to a certain portion and then reanalyze only the residuals, to a certain extent, to allow more efficient application in real-time processes.
Trevor Hastie:
No, absolutely. That kind of adaptivity makes sense if you have special data problems where things are changing, where there is drift or things change with time. In principle, you could just carry on learning: as the residuals change, the method is inherently capable of adapting in that way, I think.
Audience Member:
Thank you very much.
Explainability for Forests
Audience Member:
In a lot of our commercial applications, we choose decision trees over other methods because of their explainability, the explainability you mentioned. Can you comment on what's been done to provide explainability for the forests?
Trevor Hastie:
That's a little harder. Let's see. This is one of the main interpretability tools, this variable importance plot. It tells you which variables are the most important. I can't actually see it from here, but it looks like the exclamation mark is one of the biggest predictors of spam, right? The dollar sign is the next.
Audience Member:
I was thinking more along the lines of when you have a specific prediction.
Trevor Hastie:
Yes.
Audience Member:
Being able to explain to a user why the model made that prediction.
Trevor Hastie:
That's always hard, because you have got a function of many variables here. What you're actually trying to do is explain the structure of a high-dimensional surface, and Boosting and Random Forests can fit fairly complex surfaces. There are things called partial dependence plots, which can say, to first order, how the surface changes with age, or how it changes with price, and things like that. Those are sort of the main-effect terms. What you're doing is post-processing the surface to try to summarize it in some crude way. Otherwise it's inherently hard to interpret; but deep trees are also hard to interpret, by the way, because they are complicated.
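A sketch of how one might draw such first-order summaries with recent versions of scikit-learn (illustrative only; the model, dataset, and the two plotted features are placeholders, not the questioner's application):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=300, max_depth=2, random_state=0).fit(X, y)

# Average effect of features 0 and 1 on the prediction, averaging over the other features.
PartialDependenceDisplay.from_estimator(gbm, X, features=[0, 1])
plt.show()
```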
Problems Boosted Trees Will Overfit
Trevor Hastie:
That's a good question: what sort of problems will boosted trees tend to overfit more, versus not? I think the cases where they tend not to overfit are cases where you can do really well, where the Bayes error rate is close to zero. That's a case where the classes are potentially well separated, and as long as you can find the right classifier, Boosting tends not to overfit so much. If you're in a fairly noisy situation, where the best you can do is get 20% errors and no procedure could do better than that, then you are liable to overfit, because if you train for longer and longer, the Boosting algorithm will fit the noise and, on the training error, do much better than you're supposed to do.
Trevor Hastie:
It's still amazingly resilient to overfitting, but it certainly can overfit, and it's not hard to construct examples, like we did, where it does. So when you train with Boosting you have to watch out for that. There are various tuning parameters: the number of trees, that's an important tuning parameter; the depth of the trees; and also the amount by which you shrink. Those are three important tuning parameters, and you have to take care to use something like cross-validation or a left-out data set to make sure you tune them properly. Whereas Random Forests has only one tuning parameter, which is m, the number of variables you randomly select for splitting at each step. That's pretty much it.
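A sketch of tuning those three boosting knobs by cross-validation (an illustration with arbitrary grid values and a synthetic dataset, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# The three tuning parameters mentioned above: number of trees, depth, shrinkage.
grid = {"n_estimators": [200, 500, 1000],
        "max_depth": [1, 2, 4],
        "learning_rate": [0.01, 0.05, 0.1]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```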
Do Deeper Trees Fit Better?
Audience Member:
I unfortunately had to step out right when you talked about stumps, so maybe you answered this. I've talked to a number of people who use GBM in a number of ways, and most of them seem to think that deeper trees fit better. Your slides hinted that a large number of stumps seems to do better on the actual test data versus the training data. Do you have any comments on why there is this tension between the observations of practitioners and what you're seeing?
Trevor Hastie:
That's a good point. Well, of course, we had a very fake example where the optimal solution actually required just an additive model, which is what stumps can deliver. In a case like that, where you really want an additive model, going beyond stumps, say to depth two, just fits second-order interactions, which is not going to help much; it's just going to add noise to the solution. In real problems, of course, that's not necessarily the case. In real problems, you might need fairly high-order interactions to solve the problem.
Audience Member:
Thanks.
Trevor Hastie:
You don't know in advance, and controlling the depth is another way of controlling the variance of your solution.
Random Forests and Boosting Considering Tree Depth
Audience Member:
What I hear is that for Random Forests, deeper trees can perform better, whereas boosted trees can do better with shallower depth. Can you describe the insight; why is this the case?
Trevor Hastie:
Random Forests has no way of doing bias reduction, because all the trees it fits are IID, right? The only way it can benefit from fitting many trees is to reduce the variance of the fit by averaging. That's why, if you need a model that's got quite a bit of complexity, you had better grow very deep trees so they can capture that complexity, even though they will do so with very high variance; you then get rid of the variance by averaging. On the other hand, Boosting is also going after bias. It is, by definition, looking for areas where it hasn't done so well, and it's going to fix things up there. That's why you can have shallow trees with Boosting: it can wait for later trees to fix up the places where it hasn't done well. This has to be the last question, I am told.
Uncertainty of Results
Audience Member:
So in the Boosting method, is there a framework for quantifying the uncertainty of the results based on the uncertainty of the inputs?
Trevor Hastie:
Yes, so, uncertainty analysis. There has been some recent work on quantifying the variance, and therefore getting confidence intervals, for predictions from Random Forests, and there is still ongoing work on Boosting. That was something that was missing before; it's easier with Random Forests. There is a reference in the talk on my web pages: a student in our department, Stefan Wager, has a very nice method for getting standard errors for Random Forests predictions with no extra work.
Audience Member:
Thank you.
Trevor Hastie:
Okay. Well, thank you very much.