From the course: Building Recommender Systems with Machine Learning and AI

User-based collaborative filtering

- [Instructor] Let's talk about one specific implementation of neighborhood-based collaborative filtering: user-based collaborative filtering. It's the easiest one to wrap your head around, so it seems like a good place to start. The idea behind user-based collaborative filtering is pretty simple: start by finding other users similar to yourself, based on their ratings history, and then recommend stuff they liked that you haven't seen yet. So, let's say that Ann and Bob here both loved the first Star Wars movie, and Ann also watched the next Star Wars movie, The Empire Strikes Back. But Bob lives under a rock and hasn't seen it yet. So, it seems kind of obvious to recommend The Empire Strikes Back to Bob. If Ann loved the same movies Bob loves, and Ann loves The Empire Strikes Back, then there's a good chance Bob will love it, too.

So, how do we do this? Well, the first step is to collect the data we need; that is, a table of all of the ratings for everyone in our system. We can think of this as a 2-D array, with movies on one axis, users on the other, and rating values in each cell. So, we can see here that Ann loved both of the first two Star Wars films, Bob only saw the first one and loved it, and they each saw other stuff as well that they don't have in common. Ann might enjoy Indiana Jones from the looks of it, since she has that connection to Bob through Star Wars. Now, it's not too much of a stretch to imagine this data as describing a vector for each user in item space. If these were the only five movies in existence, we could describe Bob with a five-dimensional vector of (4, 5, 0, 0, 0), and Ann with (0, 5, 5, 5, 0), where zero indicates a missing value. That gives us what we need in order to compute a cosine similarity score between any two users. Of course, we can experiment with different similarity metrics, too, but let's start with cosine.

So, now we can build up another 2-D matrix, one that maps the cosine similarity score between any two users in our system. If our only three users are Bob, Ted, and Ann, it might look like this. Take a close look, so you can understand what's going on here. This all just falls out of the algorithm for measuring cosine similarity, but you can see that it makes intuitive sense looking at the results. Everyone is 100% similar to themselves: if we look up Bob's similarity to Bob, for example, we get 1.0, and that's true for anyone who has rated at least one thing. Bob and Ted have no movies in common at all that they both rated, so they end up with a similarity score of zero. Note there is some symmetry here: the similarity of Bob to Ted is the same as the similarity of Ted to Bob, so we can exploit that symmetry to avoid computing half of the values in this table. Bob and Ann are the more interesting example. While they rated different sets of movies, when we measure similarity, we only look at the movies they have in common. In this case, they have just one movie in common, Star Wars; and since they both gave it the same rating of five stars, they, too, get a similarity score of 1.0, or 100%.

I'd like to point out that saying two users are 100% similar doesn't necessarily mean they love the same things. It can also mean that they hate the same things. If Bob and Ann had both rated Star Wars one star, they would still be 100% similar.
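To make this concrete, here's a minimal sketch in Python of that similarity computation, using the toy data from our example. The lesson doesn't name Ted's movies or Ann's third rated title, so Casablanca and Gone with the Wind are made-up stand-ins; note that the cosine function here only considers the movies a pair of users has in common, just as described above.

```python
from math import sqrt

# Toy ratings table: user -> {movie: rating}. Names follow the lesson's
# example; Ted's movie and Ann's third title are made-up stand-ins.
ratings = {
    "Bob": {"Indiana Jones": 4, "Star Wars": 5},
    "Ann": {"Star Wars": 5, "The Empire Strikes Back": 5, "Gone with the Wind": 5},
    "Ted": {"Casablanca": 3},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over only the movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

# Build the user-to-user similarity matrix. Scores are symmetric, so we
# compute each pair once and mirror it, skipping half the work.
users = list(ratings)
sims = {u: {} for u in users}
for i, u in enumerate(users):
    for v in users[i:]:
        score = cosine_similarity(ratings[u], ratings[v])
        sims[u][v] = score
        sims[v][u] = score

print(sims["Bob"]["Bob"])  # 1.0 -- everyone is 100% similar to themselves
print(sims["Bob"]["Ann"])  # 1.0 -- one movie in common, same 5-star rating
print(sims["Bob"]["Ted"])  # 0.0 -- no movies in common
```

Mirroring each score into both sims[u][v] and sims[v][u] is exactly the symmetry trick mentioned above; we only ever call the cosine function once per pair.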
In fact, the math behind cosine similarity works out such that, if two users have only one movie in common, they end up 100% similar no matter what: with a single shared dimension, the two vectors always point in the same direction. Even if Bob loved Star Wars and Ann hated it, in a sparse data situation, they'd both end up 100% similar. I'm just reinforcing that sparse data is a huge problem with collaborative filtering in general, and it can lead to weird results. Sometimes you need to enforce a minimum threshold on how many movies users have in common before you consider them at all, to avoid weird stuff like that.

So, let's say we want to generate recommendations for Bob now. We can use this handy matrix that we precomputed to quickly look up how similar he is to everybody else. Then we can sort that list by similarity score and pick off the top-N neighbors from that list. In this case, we end up with Ann at the top of the list, and we'd probably discard Ted by enforcing some minimum similarity score threshold at this stage. We also skip over the case of comparing Bob with himself. In a less sparse data situation, you'd have more good results to work with besides just Ann, of course. So, now we can pull all of the movies rated by Ann, and by everybody else in Bob's top-N similar users, and consider these our recommendation candidates. Still with me? In this simple example, Ann is the only user who is similar to Bob, but if there were other similar users, we'd throw their ratings into this pile as well.

Next, we need to figure out which of these recommendation candidates are the best ones to present to Bob, so we need to score them somehow. There are different ways to do this, but it seems reasonable to take the rating assigned to each candidate as part of it. We want to recommend things that similar users loved, not things that similar users hated. So, one component could be some sort of normalized rating score; maybe we translate a five-star rating to 1.0, just to keep everything on the same scale. We should probably also take into account the similarity of the user who generated each rating, so maybe we multiply that rating score of 1.0 by Ann's similarity score to Bob, which is also 1.0. So, we end up assigning a candidate score of 1.0 to both of these items. You might also encounter the same movie more than once, if more than one similar user rated it. In this case, you'd probably want to strengthen the relationship to that movie by adding it into the final score again. Again, there are variants on how to do this. Maybe instead of normalizing rating scores on a zero-to-one scale, you actually assign negative scores to things rated one or two stars, to actively push them down in the final results. There are a lot of different ways to score recommendation candidates, and there's no standard way of doing it. It's another case where you need to experiment to see what works best for the data you have.

Next, we just sort the recommendation candidates by their final scores; in our simple example, we have a two-way tie at 1.0. The last step is to filter out stuff Bob already rated, since there's no point in recommending movies he's already seen. You might filter things out in other ways, too. Maybe there's some minimum candidate score below which you won't present a result to a user no matter what, or maybe you want to filter out results that might be offensive based on their content. That leaves us with The Empire Strikes Back, which hopefully makes Bob a happy user of our user-based collaborative filtering system.
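Here's how those neighbor-selection, scoring, and filtering steps might look in code, picking up the ratings and sims structures from the earlier sketch. The rating / 5.0 normalization, the min_sim cutoff of 0.2, and the top-N size of 10 are arbitrary illustrative choices; as noted, there's no standard scoring scheme.

```python
def recommend(target, ratings, sims, n=10, min_sim=0.2):
    # Look up the target's precomputed similarities, skip the self-comparison
    # and weak matches, sort by score, and keep the top-N neighbors.
    neighbors = sorted(
        ((sim, other) for other, sim in sims[target].items()
         if other != target and sim >= min_sim),
        reverse=True,
    )[:n]

    # Pool everything those neighbors rated as recommendation candidates,
    # scoring each by (normalized rating) x (neighbor's similarity), and
    # summing when the same movie shows up from more than one neighbor.
    scores = {}
    for sim, other in neighbors:
        for movie, rating in ratings[other].items():
            scores[movie] = scores.get(movie, 0.0) + (rating / 5.0) * sim

    # Finally, filter out anything the target has already rated, then rank.
    seen = ratings[target]
    return sorted(
        ((movie, score) for movie, score in scores.items() if movie not in seen),
        key=lambda pair: -pair[1],
    )

print(recommend("Bob", ratings, sims))
# [('The Empire Strikes Back', 1.0), ('Gone with the Wind', 1.0)]
```

Because the made-up Gone with the Wind rating also works out to 1.0, it ties with The Empire Strikes Back in this sketch; with the lesson's exact data, Empire is the headline result.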
So, to recap, the steps involved in user-based collaborative filtering are these: Start by building up a lookup table from each user to all of the items they rated and those rating values. Then build up another 2-D matrix of similarity scores between every pair of users. At this point, you've done all of the computationally intensive work, and you can reuse these tables to generate recommendations quickly for anybody. When we want recommendations for a specific user, we look up the top similar users to that user. Next, we generate recommendation candidates by pooling together all of the stuff those similar users rated. We then score all of those candidates by looking at how the similar users rated them, how similar each user who rated them was to the target user, and whatever else you want to factor in. Finally, we filter out stuff the user has already seen, and we're done.
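As a quick illustration of that pipeline shape, using the sketches above: the ratings table and similarity matrix are the expensive precomputed pieces, and serving any individual user afterward is just lookups and a sort.

```python
# Precompute once (ratings table + similarity matrix), then serve anybody.
for user in ratings:
    print(user, recommend(user, ratings, sims))
# Bob gets The Empire Strikes Back (plus the stand-in title); Ann gets
# Indiana Jones through her Star Wars connection to Bob; Ted, who shares
# nothing with anyone, gets no recommendations at all.
```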
