Learn about instance-based learning with k-nearest neighbor.
- [Narrator] K-nearest neighbor classification is a supervised machine learning method that you can use to classify instances based on the distance between their feature values in a labeled data set. In the coding demonstration for this segment, you're going to see how to predict whether a car has an automatic or manual transmission based on features like its fuel economy, horsepower, and weight. K-nearest neighbor works by memorizing observations within a labeled training set to predict classification labels for new, incoming, unlabeled observations. The algorithm makes predictions based on how similar training observations are to the new, incoming observations.
The more similar the observations' values, the more likely they are to be classified with the same label. Popular use cases for the k-nearest neighbor algorithm are stock price prediction, recommendation systems, predictive trip planning, and credit risk analysis. The k-nearest neighbor model has a few assumptions. Those are that the data set has little noise, that it's labeled, that it contains only relevant features, and that it has distinguishable subgroups. You want to avoid using the k-nearest neighbor algorithm on large data sets, because it predicts by computing distances to every stored observation, and that gets slow fast.
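The idea described above can be sketched in a few lines of plain NumPy. This toy `knn_predict` helper (a hypothetical name for illustration, not part of the upcoming demo) classifies a new point by majority vote among its k nearest training observations, using Euclidean distance:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every stored training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training observations
    nearest = np.argsort(dists)[:k]
    # majority vote among the labels of those k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# two small, well-separated clusters with labels 0 and 1
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.1])))  # 1
```

Because every prediction scans the whole training set, the cost grows linearly with the number of stored observations, which is exactly why large data sets are a poor fit.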
Let's use Python to apply the k-nearest neighbor algorithm. We'll start by importing our libraries. So we're going to need NumPy, Pandas, and SciPy for this demonstration. We'll also import matplotlib, for our data visualization. And to read our data in, we're going to use urllib. So we'll say import urllib, and then for the modeling itself we'll use scikit-learn. So we'll say import sklearn, and from sklearn's neighbors module, we want to import KNeighborsClassifier.
So we'll say import sklearn.neighbors, and then we want to import KNeighborsClassifier. Let's be sure to import the neighbors module itself. So we'll say from sklearn import neighbors. For preprocessing of our data, we want to import the preprocessing module, so from sklearn import preprocessing, and we also want to import the train_test_split function.
I'm going to show you how to use this to split your data into test and training sets. So we say from sklearn import model_selection. That's the module that has the tool (in older scikit-learn releases it lived in the since-removed cross_validation module), and we want to import train_test_split. And to evaluate our model, we'll import the scikit-learn metrics. So, from sklearn import metrics, and run that.
On the fifth line up from the bottom it should be from sklearn.neighbors import KNeighborsClassifier. And when we run that we've got all our libraries. Now let's set our plotting parameters for the Jupyter notebook. Like I said, we're going to use our mtcars data set, so we'll load that like we have been throughout this course, and then to use k-nearest neighbor, you should have a labeled data set.
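Collected in one place, the imports from this walkthrough look like the sketch below, written against current module paths (train_test_split moved from the old cross_validation module to model_selection in scikit-learn 0.18):

```python
import numpy as np
import pandas as pd
import scipy
import urllib.request            # Python 3 home of urllib's URL reader
import matplotlib.pyplot as plt

import sklearn
from sklearn import neighbors, preprocessing, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
```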
We do. We're going to use the am variable as our target. This variable labels a car as having either an automatic or a manual transmission. For this analysis, we're going to use the variables mpg, displacement, horsepower, and weight as predictive features in our model. We're going to build a model that predicts a car's transmission type based on values in these four fields. I picked these variables because they each hold information that's relevant to whether a car has an automatic or a manual transmission, and because they each have distinguishable subgroups.
So we'll call our subset X_prime, and we'll set that equal to cars.iloc (the .ix indexer has since been removed from pandas, so we use the positional indexer), and we'll use it to select and retrieve our columns, in this case columns one, three, four, and six, and then we'll say .values, because we want to access the values in those columns. We also need to set our target variable. We'll call that y, and we'll say y is equal to cars.iloc, and the am variable is the column with the index number nine.
So we'll say nine, and then .values. Let me run that. Before we can implement the k-nearest neighbor algorithm we need to scale our variables. So we'll create a scaled data set, and we'll call it X, and then we'll use scikit-learn's preprocessing tools. We'll use the scale function. So preprocessing.scale, and then we'll pass in our X_prime object. That standardizes our variables. Now I'm going to split the data into test and training sets.
We use the training set for training the model, and the test set for evaluating the model's performance. To do this, we'll use scikit-learn's model selection tools, and we'll use the train_test_split function. The train_test_split function randomly splits each array we pass in into a training subset and a test subset. So we'll say train_test_split, and we'll pass in our X data and our y data, and for the function's outputs, those are going to be X_train, X_test, y_train, and y_test.
We also need to specify some model parameters. We'll say test_size equal to .33. This tells the function that we want to split our data so that 33% of it goes into the test set, and 67% of it goes into the training set. And let's also pass in the parameter random_state equal to 17. Since the function splits the data randomly, we need to set the seed by passing this argument in, and that will allow you to reproduce the same results as you see here on my computer.
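A minimal sketch of that call, on made-up arrays rather than the cars data, shows the split proportions and the effect of the seed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 toy observations with 2 features each, and alternating 0/1 labels
X = np.arange(60).reshape(30, 2).astype(float)
y = np.array([0, 1] * 15)

# test_size=.33 holds out a third for testing; random_state=17 fixes the
# shuffle so the same rows land in the same subsets on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.33, random_state=17)

print(len(X_train), len(X_test))  # 20 10
```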
Okay, and we run that. Now, let's build our model. The first thing we need to do is instantiate a k-nearest neighbor object. We'll call it clf, and we'll set that equal to neighbors.KNeighborsClassifier(). Next we call the fit method off of the model, and pass in X_train as our training data, and y_train as our target variable. We say clf.fit and then X_train, y_train.
Then let's just print it out. I missed an i in classifier, so let me add that, and now what we see here is our model parameters all printed out. Now let's evaluate the model's predictions against the test data set. Just to make this easier to explain, I'm going to rename our y_test set to y_expect, representing our expected label values. So y_expect = y_test.
Then I'm going to create another variable called y_pred. This variable is going to contain the labels that our model predicts for the y variable. So we'll say y_pred, and then we'll write the name of our model, call the predict method off of it, and pass in our test data set, so X_test. To score the model I'll use scikit-learn's classification_report function. That's part of the metrics module. So we'll say metrics.classification_report, and we'll pass in y_expect and y_pred, and then let's just print this whole thing out.
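Put together, the modeling steps look like the sketch below. It runs end to end on synthetic two-class data standing in for the cars (the cluster means and sizes are my choices, not from the demo), so the exact report numbers will differ from the ones discussed next:

```python
import numpy as np
from sklearn import metrics, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data: two clusters of 40 observations, 4 features each
rng = np.random.RandomState(17)
X_raw = np.vstack([rng.normal(0.0, 1.0, size=(40, 4)),
                   rng.normal(3.0, 1.0, size=(40, 4))])
y = np.array([0] * 40 + [1] * 40)

# scale, then split, exactly as in the walkthrough
X = preprocessing.scale(X_raw)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.33, random_state=17)

# instantiate and fit the classifier (n_neighbors defaults to 5)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

# predict on the held-out test set and score against the expected labels
y_expect = y_test
y_pred = clf.predict(X_test)
print(metrics.classification_report(y_expect, y_pred))
```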
So we'll call the print function on the whole thing, and there we have some model results. Now I'm going to take you into the other screen to show you what those mean. As you remember from the k-means demonstration, recall is a measure of a model's completeness. What these results are saying is that of all the test observations that truly belong to label one, only 67% were identified as such, and across the entire test set, 82% of the labels the model returned were correct.
High precision and low recall generally means that fewer results are returned, but most of the labels that are predicted are correct. In other words, high accuracy, but low completeness. That's it for instance-based learning. Hold on, because next I'm going to show you how to use Python for network analysis.