Join Barton Poulson for an in-depth discussion in this video, k-Nearest Neighbors (kNN), part of Data Science Foundations: Fundamentals.
[Voiceover] One very common method for classifying cases is k-Nearest Neighbors. The idea is simply to use the neighboring cases as predictors of how you should classify a particular case. k-Nearest Neighbors, or k-NN, where k is the number of neighbors, is an example of instance-based learning: you look at the instances, or examples, that surround a particular case. Because it uses those instances directly and doesn't rely on estimated parameters, it's sometimes called a lazy learner.
Really, what that means is that it's a very simple algorithm: easy to describe conceptually, not difficult to implement, and surprisingly effective. The first step in finding neighbors is to choose a distance metric. How far apart are the cases from each other? If you have a certain number of predictor variables, call that number j, then one thing you can do is calculate the Euclidean distance in j-dimensional space. That's the usual choice when you have quantitative variables.
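Here is a minimal sketch in R of what that distance calculation looks like; the choice of cases and columns is purely illustrative and isn't part of the course script.

```r
# Euclidean distance between the first two iris cases, using the
# four quantitative predictors (columns 1 through 4)
data(iris)
x <- iris[1:2, 1:4]
dist(x)                          # pairwise distances (Euclidean by default)
sqrt(sum((x[1, ] - x[2, ])^2))   # the same value, computed by hand
```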
On the other hand, if you have categorical variables, you might use an overlap metric such as the Hamming distance. You also have to worry about something called the combinatorial explosion: a large number of variables gives you a lot of dimensions, so you may want to reduce the dimensionality before running k-Nearest Neighbors, using something like factor analysis or principal component analysis.
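As a rough sketch of both ideas, assuming base R's prcomp for the principal components (the video doesn't show this step):

```r
# Overlap (Hamming) distance for categorical variables: count how
# many attributes disagree between two cases
a <- c("red", "small", "round")
b <- c("red", "large", "round")
sum(a != b)   # 1 mismatching attribute

# Reducing dimensions with principal components before running k-NN
pc <- prcomp(iris[, 1:4], scale. = TRUE)
head(pc$x[, 1:2])   # keep just the first two components
```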
Next, how do you choose k, the number of neighbors you're going to work with? More neighbors gives you a smoother model and irons out a lot of the kinks, but more neighbors also risks including random noise, which can actually increase the probability of misclassification, so there are trade-offs here. One option is to weight the neighbors, so that closer cases have more influence and cases that are further away have less. Or you can use one of the many variations on k-Nearest Neighbors, things like extended nearest neighbors (ENN) or condensed nearest neighbors (CNN).
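One possible way to weight neighbors, not shown in the video, is the kknn package, which offers kernel-weighted k-NN; treat the package, arguments, and ad hoc split below as an illustrative assumption rather than the course's method.

```r
# install.packages("kknn")
library(kknn)

set.seed(42)
idx <- sample(nrow(iris), 100)   # quick train/test split for illustration only
fit <- kknn(Species ~ ., train = iris[idx, ], test = iris[-idx, ],
            k = 7, kernel = "triangular")   # triangular kernel down-weights distant neighbors
table(fitted(fit), iris$Species[-idx])      # predicted versus actual species
```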
There are a lot of options available, but let's take a look at a very simple example in R. In this example, I'm going to use the iris data again, and I'm going to use a package called class; that's for classification, so you'll want to load that, and I'll also use the iris dataset, so load that as well. The first thing is to take a quick look at your data, so I'm going to pull up the first six cases of the iris data. Now, if your variables are on very different scales, if their ranges are quite different, it's a good idea to normalize them. That puts them in similar ranges.
It standardizes the variables. With the iris data, though, where everything is already pretty close in scale, that isn't necessary right now, so I'm going to skip that step. As I've done before, I'm going to split the data into a training set with two-thirds of the cases and a testing set with one-third, using a random seed for reproducibility. Then I'm going to keep just the four quantitative variables and remove the actual species from those cases; that's what the column selection, the 1 to 4 you see at the end, is doing.
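A sketch of those setup steps in R; the object names (iris_train, iris_test), the seed, and the normalize helper are assumptions and may differ from the course's exercise files.

```r
library(class)   # provides the knn() function used below
data(iris)
head(iris)       # quick look at the first six cases

# Optional min-max normalization (skipped here: the iris variables
# are already on similar scales)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
# iris_norm <- as.data.frame(lapply(iris[, 1:4], normalize))

# Two-thirds training, one-third testing, with a seed for reproducibility
set.seed(123)
train_rows <- sample(nrow(iris), size = round(2/3 * nrow(iris)))
iris_train <- iris[train_rows, 1:4]    # the four quantitative predictors only
iris_test  <- iris[-train_rows, 1:4]   # Species dropped from both
```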
That creates my training and my test sets. I do still need the species labels for classification, though, so I'm going to save those as separate vectors. There we go. Now I'm going to actually build the classifier. You get to choose the value of k; generally you want an odd number, which avoids ties, and you can try several different values of k and see how each looks. We'll try a few here. So I'm going to come down to my prediction step, where I'm using the knn function, and I'm going to use an initial value of three.
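A sketch of those two steps, with the object names carried over from the setup sketch above (again, they may not match the course's exercise files):

```r
# Class labels saved separately for the training and testing cases
train_labels <- iris$Species[train_rows]
test_labels  <- iris$Species[-train_rows]

# k-NN classifier from the class package, using the 3 nearest neighbors
pred_k3 <- knn(train = iris_train, test = iris_test, cl = train_labels, k = 3)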
That just looks at the three closest neighbors, so I've got that model, and if I want to see how well it classifies, I make a table. What I see is that all 17 of the setosa irises were classified correctly, two of the versicolors, out of ten, were misclassified, and one of the virginicas was misclassified.
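The table described here would look something like this with the assumed names:

```r
# Confusion table: predicted species (rows) versus actual species (columns)
table(pred_k3, test_labels)
```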
Let's try changing the value of k, run this again, and compare the results. I'm going to scroll back up and change this to a bigger number, maybe nine. I'll run the command again, highlight the whole thing, and then do the table over again to see how it compares. I'll make this bigger so you can look at the two side by side. What we've got here is a slight improvement in classification: one less versicolor is misclassified. Now, I've done this before, and I know that increasing k further does not improve the classification with this particular dataset, and based on our previous experience with the iris data, we know there's a certain amount of inherent misclassification. But this gives us a good idea of how the algorithm works.
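And the re-run with the larger neighborhood, as a sketch:

```r
# Same call with k = 9; compare this table to the k = 3 one above
pred_k9 <- knn(train = iris_train, test = iris_test, cl = train_labels, k = 9)
table(pred_k9, test_labels)
```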
So, based on this small exercise, what can we conclude about k-NN, the k-Nearest Neighbors model? First, k-NN is conceptually simple: it's not difficult to describe, and it's a nonparametric classification method, because it simply looks at the cases around a point and uses those as its data. On the other hand, it's important to remember that the choice of k, as well as the choice of distance metric, affects the results, so look at some of the options, compare the results, and make the best choice for your data.