From the course: Building Recommender Systems with Machine Learning and AI

K-nearest neighbors (KNN) and content recs

- [Instructor] The other thing we said we wanted to consider was the release year of each movie. Extracting this from the MovieLens data is a little tricky, but the information we want is in the movie titles. They include the year at the end of every title in parentheses, so we just have to do a little bit of string wrangling in our code to extract that data.

How do we assign a similarity score based on release years alone? Well, this is where some of the art of recommender systems comes in. You have to think about the nature of the data you have and what makes sense. How far apart would two movies have to be for their release dates alone to signify they are substantially different? A decade seems like a reasonable starting point; sci-fi movies from the 70s look pretty different from sci-fi movies from the 80s, for example. So, we start with the difference in release years between two movies, just the absolute value of the difference. It doesn't matter which one came first. Then we need some mathematical function that smoothly scales that difference into the range zero to one. I chose an exponential decay function, and it ends up looking like this. Just look at the right side of the graph, since we're taking the absolute values of the year differences, which makes them all positive. At a year difference of zero, we get a similarity score of one on the y-axis, which is what we want, and the similarity score decays exponentially, getting pretty small at around a difference of 10 years and almost nothing at 20. The choice of this function is completely arbitrary, but it seems like a reasonable starting point. In the real world, you would test many variations of this function to see what really produces the best recommendations with real people.

So, how do we turn these similarities between movies, based on their attributes, into actual rating predictions? Remember, our recommendation algorithms in surpriselib have one job: predict a rating for a given user, for a given movie. One way to do this is through a technique called k-nearest neighbors. Quite honestly, it's an unnecessarily fancy name for a really simple idea. We start by measuring the content-based similarity between everything a given user has rated and the movie we want to predict a rating for. Next, we select some number, call it k, of the nearest neighbors to the movie whose rating we're trying to predict. You can define nearest however you like; in our case, we'll say the nearest neighbors are the ones with the highest content-based similarity scores to the movie we're making a prediction for. So, we could, for example, select the 40 movies whose genres and release dates most closely match the movie we want to evaluate for this user. That's really all there is to the concept of k-nearest neighbors: selecting some number of things that are close to the thing you're interested in, that is, its neighbors, and predicting something about that item based on the properties of its neighbors. To turn those top 40 closest movies into an actual rating prediction, we just take a weighted average of their similarity scores to the movie whose rating we're trying to predict, weighting them by the rating the user gave them. That's all there is to it.
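To make that concrete, here's a rough sketch of the two pieces just described: pulling the release year out of a MovieLens title string, and turning a year difference into a similarity score with exponential decay. The function names, the regular expression, and the 10-year decay constant are assumptions chosen for illustration, not necessarily the exact choices in the course code.

    import math
    import re

    def extract_year(title):
        # MovieLens titles end with the release year in parentheses,
        # for example "Toy Story (1995)".
        match = re.search(r'\((\d{4})\)\s*$', title)
        return int(match.group(1)) if match else None

    def year_similarity(year_a, year_b):
        # Exponential decay on the absolute difference in release years:
        # identical years score 1.0, and the score falls toward zero as the gap grows.
        diff = abs(year_a - year_b)
        return math.exp(-diff / 10.0)

The decay constant (10 years here) controls how quickly the similarity falls off; as noted above, in practice you would test variations of this curve to see what produces the best recommendations.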
Let's turn that concept into code. This is the meaty part of our prediction function, which takes in a user, u, and an item, i, that we want to predict a rating for. We start with a list called neighbors and go through every movie the user has rated, populating it with a content-based similarity score between each of those movies and the movie we're trying to predict. We precomputed these similarity scores in the self.similarities array. Next, we use heapq.nlargest to quickly and easily pull out the top k movies with the highest similarity scores to the movie in question. After that, it's just a matter of computing the weighted average of those top k similar movies, weighted by the ratings the user gave them. Assuming we had some data to work with there, we return that as our rating prediction for this user and item. So, let's actually play around and run this thing, and generate some real recommendations just using content-based filtering. We're just going to recommend movies that are similar to movies each user liked, based only on their genre and release date. Let's see how well that actually works.
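Before running it, here's a minimal sketch of what a prediction method along these lines could look like, written as a Surprise AlgoBase subclass. The class name is made up for illustration, and it assumes the self.similarities matrix mentioned above (item-to-item content similarities, indexed by inner item ids) has already been precomputed elsewhere, such as during training; treat it as a sketch of the idea rather than the course's exact code.

    import heapq

    from surprise import AlgoBase, PredictionImpossible

    class ContentKNN(AlgoBase):
        # Illustrative content-based k-nearest-neighbors recommender. It relies on
        # self.similarities, an item-to-item similarity matrix assumed to have been
        # precomputed (for example, from genre and release-year similarity) in fit().
        def __init__(self, k=40):
            AlgoBase.__init__(self)
            self.k = k

        def estimate(self, u, i):
            if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
                raise PredictionImpossible('User and/or item is unknown.')

            # Pair the content similarity between item i and each movie this user
            # rated with the rating the user gave that movie.
            neighbors = []
            for item_id, rating in self.trainset.ur[u]:
                neighbors.append((self.similarities[i, item_id], rating))

            # Keep the k movies most similar to the one we're predicting for.
            k_neighbors = heapq.nlargest(self.k, neighbors, key=lambda t: t[0])

            # Average the neighbors' ratings, weighted by their similarity scores,
            # so the result lands back on the rating scale.
            sim_total = weighted_sum = 0.0
            for sim_score, rating in k_neighbors:
                if sim_score > 0:
                    sim_total += sim_score
                    weighted_sum += sim_score * rating

            if sim_total == 0:
                raise PredictionImpossible('No neighbors')

            return weighted_sum / sim_total

With k set to 40, this matches the "top 40" example above; raising or lowering k is one of the knobs you would tune when testing with real users.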
