Ensembles, from Data Science Foundations: Fundamentals, with Barton Poulson.
- [Voiceover] One of the most important developments in machine learning is the use of ensembles, or ensemble modeling. You can think of this as the statistical version of the wisdom of the crowd. The idea here is that you combine estimates, that is, you take the average of many different estimates of your particular outcome. The reason you want to do this is that the combined estimate is often more accurate than any one individual estimate. So for instance, in contests where people are asked to guess the number of marbles in a jar, if you take the average of everybody's estimates, because some people overestimate and some underestimate, that average is usually closer to the true number than any one person's guess.
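As a rough sketch of that averaging idea, here is a toy simulation in R; the true count and the spread of the guesses are made up for illustration, not taken from the video.

    # Toy sketch: 100 noisy guesses at the number of marbles in a jar.
    set.seed(123)                          # repeatable randomness
    true_count <- 500                      # assumed true number of marbles
    guesses <- rnorm(100, mean = true_count, sd = 100)
    mean(abs(guesses - true_count))        # typical error of a single guess
    abs(mean(guesses) - true_count)        # error of the crowd's average guess

The average of the 100 guesses will typically land far closer to the true count than an individual guess does, which is exactly the effect ensembles exploit.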
Now, when combining estimates, there's one other thing to remember: diversity helps, and I'll explain a little more about that. There are a few different methods of combining estimates. Three of the most common are, first, bagging, which is short for bootstrap aggregating. This is where you use randomly drawn datasets, you build a model on each one, you get the predictions, and then you combine the predictions, through something like a voting process, from each of those models. Boosting is where each classifier puts greater weight on the previous classifier's errors, the cases that it couldn't categorize well.
So those cases get more emphasis. The third one is blending, also known as stacking. This is where you use a second-order model to combine the results of the first-order models, and so you can see how things get a little more sophisticated here, although, interestingly, bagging tends to be a very useful approach. Let me give you an example of this in R. I'm going to use a package in R called randomForest, and if you don't have that installed, you can install it. I'm also going to use the built-in datasets package, so I'm going to load both of those.
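The setup commands would look roughly like this; this is a sketch of the on-screen script, not an exact copy.

    # install.packages("randomForest")    # one-time install, if needed
    library(randomForest)                 # ensembles of decision trees
    library(datasets)                     # built-in datasets, including iris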
We get a little bit of news there about the package as it loads. I'll scroll down. I'm going to use the same iris data that I used before, so let's take a look at the head: four quantitative measurements, and then the species of iris. What I'm going to do now, and this is an important part of this, is split the data into two parts. I'm going to create a training set with two thirds of the data, and a testing set with one third. I'm going to set a random seed here so I can get consistency, or repeatability, in my randomness. And now I've got a split, and I'm going to create the two datasets.
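In code, the split is something like the following; the seed value and the exact sampling pattern are assumptions based on common randomForest examples, so the video's script may differ.

    head(iris)                            # four measurements plus Species

    set.seed(42)                          # assumed seed; pick any fixed value
    ind <- sample(2, nrow(iris), replace = TRUE, prob = c(2/3, 1/3))
    trainData <- iris[ind == 1, ]         # about two thirds for training
    testData  <- iris[ind == 2, ]         # about one third for testing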
There's the training set and there's the testing set. Then I'm going to create the actual random forest of decision trees, and again I'll set a random seed. Then I use the randomForest function. I'm going to feed the training data through, I'm going to grow 500 trees, and we'll compute the proximity matrix. Now let's take a look at the results of that random forest. First off, let's print the results of the classification. First we have the call, the actual command; then the type, classification; then 500 trees; then two variables tried at each split.
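In code, this step is roughly the following; the formula interface and argument names are standard randomForest usage, and the seed is again an assumption.

    set.seed(42)                          # assumed seed for the forest
    rf <- randomForest(Species ~ ., data = trainData,
                       ntree = 500,       # grow 500 trees
                       proximity = TRUE)  # also compute proximities
    print(rf)                             # call, type, ntree, mtry,
                                          # OOB error, confusion matrix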
We have a very low error rate, 3.67%, and then you can see the confusion matrix, which shows the misclassified cases. The setosa were very easy to distinguish; of the versicolor, two were predicted as virginica, and two of the virginica were predicted as versicolor. Let's go back to our script here, and let's make a plot of the error by the number of trees. I'll just click right here, and here's the plot; I'll zoom in on it. What you want to see here is a low line, because lower means less error.
You can see it's sometimes stable, but because the algorithm is drawing random data and combining the results, things do fluctuate; after about 350 trees, though, it's smooth from then on out. That's one of the advantages of doing this many times over, even though it's a random process. Another thing we can do is look at the relative importance of the predictor variables. I'm going to use the importance function, and there it is, and then I'm going to get a plot of that same information, which I'll zoom in on.
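The plotting and importance commands are roughly these; plot, importance, and varImpPlot are all standard functions in the randomForest package.

    plot(rf)                              # error rate vs. number of trees
    importance(rf)                        # numeric importance of each predictor
    varImpPlot(rf)                        # dot plot of that same information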
In that importance plot you can see that Petal Length was the most important predictor in terms of the mean decrease in one common measure of node impurity, the Gini index. Then came Petal Width, then Sepal Length was less important, and Sepal Width made almost no difference. And then we can take that model and apply it to the testing data that we set off to the side. I'll apply it right here, and then we'll get a table for the new predictions. Here you can see, once again, the setosa were very easy to categorize: they're all correct. Of the versicolor in the second column, two of the eight were miscategorized, and of the virginica, one of the thirteen was miscategorized.
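Sketched in code, that last step is approximately the following; predict and table are standard functions, and the object names follow the earlier sketch.

    predictions <- predict(rf, newdata = testData)  # score the held-out third
    table(predictions, testData$Species)            # confusion table on test data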
And these results are pretty consistent with what we've had with our other analyses of the iris data. So, what are our conclusions about ensemble models, including random forests? First, many estimates combined often beat a single estimate in terms of accuracy. Second, diversity, meaning both the random selection of the data and the variety of models, produces more accurate estimates. Finally, randomness, in the selection of cases and in the selection of variables at each split, plays an important role in actually getting that diversity, which makes for a more predictive and accurate ensemble of models.