Learn how to run a regression with scikit-learn: load the Boston housing dataset, use a random forest regressor to predict house prices, and explore the common API that most scikit-learn estimators share.
- [Instructor] We are going to run a regression on the Boston housing dataset. Regression is the process of learning to predict continuous values. The Boston dataset comes with scikit-learn, as do several other datasets, to help us learn and understand algorithms. Let's start a new notebook: click New, Python 3, and rename the notebook boston. Now let's import. From sklearn, which is how scikit-learn is imported, dot datasets, we'll import load_boston.
And then boston is load_boston(). Let's see what we've loaded. We'll use the built-in type function to see the type of boston. It's a Bunch. All toy datasets you load from sklearn come as Bunch objects, and they share common attributes. Bunches also behave like Python dictionaries, and that's how we're going to treat them. So let's look inside the dictionary with boston.keys(). You'll see we have data, target, feature_names, and a description.
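The loading steps above can be sketched as follows. One caveat: load_boston was removed from scikit-learn in version 1.2, so this sketch substitutes load_diabetes, which returns the same kind of Bunch object and supports the identical workflow.

```python
# load_boston was removed in scikit-learn 1.2; load_diabetes is used here
# as a stand-in -- it returns the same kind of Bunch object.
from sklearn.datasets import load_diabetes

dataset = load_diabetes()
print(type(dataset))              # a sklearn.utils.Bunch
print(isinstance(dataset, dict))  # Bunch subclasses dict, so this is True
print(sorted(dataset.keys()))     # includes 'data', 'target', 'feature_names', 'DESCR'
```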
Let's see what's in the data. Type boston.data. This is a NumPy array containing the features we'd like to learn from. If you'd like to know what these features are, we can look at boston.feature_names. And if you'd like to know what each feature name means, we can look at the description. Let's print out boston.DESCR. We can see that CRIM is the per capita crime rate by town, RM is the average number of rooms per dwelling, et cetera, et cetera.
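Inspecting the features and the description looks like this; again load_diabetes stands in for the removed load_boston, so the feature names differ from the transcript's CRIM and RM:

```python
from sklearn.datasets import load_diabetes

dataset = load_diabetes()
print(dataset.data.shape)      # NumPy array of features, one row per sample
print(dataset.feature_names)   # names of the columns in dataset.data
print(dataset.DESCR[:300])     # start of the human-readable description
```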
The last key we saw is target, and this is the value we'd like our algorithm to learn. In our case, it's the price of the houses in thousands of dollars. There are several regression algorithms in scikit-learn. Let's go with RandomForestRegressor. So from sklearn.ensemble import RandomForestRegressor. We import from the ensemble submodule, since a random forest combines several learners. There are other ensemble algorithms in this package, such as AdaBoost, for example.
Next, we'll create a model: clf = RandomForestRegressor(). The variable is usually called clf, short for classifier. In our case it's a regressor, but that's okay. Next we train our model. We'll give it the features, which are in the data key, and the labels, which are in target: clf.fit(boston.data, boston.target).
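Creating and training the model can be sketched like this (load_diabetes substituted for the removed load_boston; random_state is my addition for reproducibility):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

dataset = load_diabetes()

# conventionally named clf ("classifier"), even though this is a regressor
clf = RandomForestRegressor(random_state=0)
clf.fit(dataset.data, dataset.target)  # features and labels
```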
And we see that our model learnt something. How well did it do? We can use the score method: clf.score(boston.data, boston.target). What does this number mean? Is it good or bad? Well, it depends on the algorithm. Each algorithm has a scoring method that makes sense for it. In our case, we can check the documentation for clf.score. We see that it's the coefficient of determination, which is very common in regression, and our score is almost perfect.
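Scoring the trained model looks like this; note that, as in the transcript, we score on the same data we trained on, so the R² value is optimistic (with load_diabetes standing in for load_boston):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

dataset = load_diabetes()
clf = RandomForestRegressor(random_state=0)
clf.fit(dataset.data, dataset.target)

# For regressors, score() returns the coefficient of determination (R^2).
# Scoring on the training data itself gives an optimistic estimate.
print(clf.score(dataset.data, dataset.target))
```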
You might see a slightly different result. This is because, as the name suggests, RandomForestRegressor uses some randomization. Scikit-learn has many other estimators you can use if you think another one is better suited to your case. What did our model learn? In scikit-learn, the convention is that all attributes that end with an underscore are learned. Let's take a look. We'll use the built-in dir function, and we see that we have several attributes that end with a single underscore.
For example, clf.n_features_ is the number of features. Let's see if it matches: boston.data.shape. And we see that we have 13 features here as well. Once we've trained our model, we'd like to use it to make predictions. Let's pick a row and see what our model predicts. We're going to use the predict method. Predict can work on several rows, so if we grab just one row, we'll need to reshape it.
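Listing the learned attributes can be sketched like this. One caveat: the transcript's n_features_ was renamed in current scikit-learn releases; the equivalent attribute today is n_features_in_ (load_diabetes again stands in for load_boston, so it has 10 features rather than 13):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

dataset = load_diabetes()
clf = RandomForestRegressor(random_state=0)
clf.fit(dataset.data, dataset.target)

# Convention: attributes ending with a single underscore are learned in fit().
learned = [a for a in dir(clf) if a.endswith("_") and not a.startswith("_")]
print(learned)

# n_features_in_ (formerly n_features_) matches the width of the data.
print(clf.n_features_in_, dataset.data.shape[1])
```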
row = boston.data[70], taking row number 70. Let's take a look at the shape. We'll need to reshape it, so we'll do row.reshape(-1, 13). This way we'll get one row. So let's look at the prediction: clf.predict(row.reshape(-1, 13)). And we got 16.46.
Let's compare it to the actual price: boston.target[70]. So not that far off. That covers most of the basic functionality of scikit-learn. Most of the models and algorithms behave the same: you call fit to train them, check performance with score, and use predict to predict the outcome.
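The single-row prediction above can be sketched as follows (load_diabetes stands in for load_boston, so it has 10 features rather than 13 and the predicted values differ from the transcript's 16.46):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

dataset = load_diabetes()
clf = RandomForestRegressor(random_state=0)
clf.fit(dataset.data, dataset.target)

row = dataset.data[70]
print(row.shape)  # one-dimensional, e.g. (10,) -- predict() wants 2-D input

# reshape(-1, n_features) turns the row into a single-row 2-D array
prediction = clf.predict(row.reshape(-1, dataset.data.shape[1]))
print(prediction[0], dataset.target[70])  # predicted vs. actual value
```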
- Working with Jupyter notebooks
- Using code cells
- Extensions to the Python language
- Markdown cells
- Editing notebooks
- NumPy basics
- Broadcasting, array operations, and ufuncs
- Folium and Geo
- Machine learning with scikit-learn
- Plotting with matplotlib and bokeh
- Branching into Numba, Cython, deep learning, and NLP