From the course: Building Recommender Systems with Machine Learning and AI

Top-N hit rate: Many ways

- [Instructor] So, in hindsight, maybe Netflix should have based the Netflix Prize on a metric that's more focused on top-N recommenders. It turns out there are a few of them. One is called hit rate, and it's really simple. You generate top-N recommendations for all of the users in your test set. If one of the recommendations in a user's top-N list is something they actually rated, you consider that a hit. You actually managed to show the user something they found interesting enough to watch on their own already, so we'll consider that a success. Just add up all of the hits in your top-N recommendations for every user in your test set, divide by the number of users, and that's your hit rate.

Hit rate itself is easy to understand, but measuring it is a little bit tricky. We can't use the same train/test or cross-validation approach we used for measuring accuracy, because we're not measuring the accuracy of individual rating predictions. We're measuring the accuracy of top-N lists for individual users. Now, you could do the obvious thing and not split things up at all, and just measure hit rate directly on top-N recommendations created by a recommender system that was trained on all of the data you have. But, technically, that's cheating. You generally don't want to evaluate a system using data that it was trained with. I mean, think about it: you could just recommend the actual top ten movies rated by each user, using the training data, and achieve a hit rate of 100%.

So a clever way around this is called leave-one-out cross-validation. What we do is compute the top-N recommendations for each user in our training data, and intentionally remove one of those items from that user's training data. We then test our recommender system's ability to recommend that left-out item in the top-N results it creates for that user in the testing phase. So, for each user, we measure our ability to include the item that was left out of the training data in their top-N list. That's why it's called leave-one-out. The trouble is, it's a lot harder to get one specific movie right while testing than to just get one of the N recommendations right. So hit rate with leave-one-out tends to be very small and difficult to measure, unless you have a very large data set to work with. But it's a much more user-focused metric when you know your recommender system will be producing top-N lists in the real world, which most of them do.

A variation on hit rate is average reciprocal hit rank, or ARHR for short. This metric is just like hit rate, but it accounts for where in the top-N list your hits appear. So you end up getting more credit for successfully recommending an item in the top slot than in the bottom slot. Again, this is a more user-focused metric, since users tend to focus on the beginning of lists. The only difference is that instead of summing up the number of hits, we sum up the reciprocal rank of each hit. So if we successfully predict a recommendation in slot three, that only counts as one-third. But a hit in slot one of our top-N recommendations receives the full weight of 1.0. Whether this metric makes sense for you depends a lot on how your top-N recommendations are displayed. If the user has to scroll or paginate to see the lower items in your top-N list, then it makes sense to penalize good recommendations that appear too low in the list, where the user has to work to find them.
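To make those two metrics concrete, here is a minimal Python sketch, not taken from the course code, of hit rate and ARHR computed against leave-one-out hold-outs. The `top_n` and `left_out` dictionaries are assumed data structures for illustration: one maps each user to their ordered top-N list, the other maps each user to the single item withheld from training.

```python
# A minimal sketch (not the course's implementation) of hit rate and ARHR
# under leave-one-out cross-validation. Assumed inputs:
#   top_n:    dict mapping user_id -> ordered list of recommended item_ids
#   left_out: dict mapping user_id -> the single item_id withheld from training

def hit_rate(top_n, left_out):
    """Fraction of users whose left-out item appears anywhere in their top-N list."""
    hits = 0
    for user_id, held_out_item in left_out.items():
        if held_out_item in top_n.get(user_id, []):
            hits += 1
    return hits / len(left_out)

def average_reciprocal_hit_rank(top_n, left_out):
    """Like hit rate, but a hit in slot k only earns credit of 1/k."""
    total = 0.0
    for user_id, held_out_item in left_out.items():
        recs = top_n.get(user_id, [])
        if held_out_item in recs:
            rank = recs.index(held_out_item) + 1   # slots are 1-based
            total += 1.0 / rank
    return total / len(left_out)

# Toy example:
top_n = {"alice": ["m1", "m7", "m3"], "bob": ["m2", "m5", "m9"]}
left_out = {"alice": "m7", "bob": "m4"}
print(hit_rate(top_n, left_out))                     # 0.5  (only alice's left-out item was hit)
print(average_reciprocal_hit_rank(top_n, left_out))  # 0.25 (hit in slot 2 -> 1/2, averaged over 2 users)
```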
Another twist is cumulative hit rate, or cHR for short. Sounds fancy, but all it means is that we throw away hits if our predicted rating is below some threshold. The idea is that we shouldn't get credit for recommending items to a user that we think they won't actually enjoy. So in this example, if we had a cutoff of three stars, we'd throw away the hits for the second and fourth items in these test results, and our hit rate metric wouldn't count them at all.

Yet another way to look at hit rate is to break it down by predicted rating score. It can be a good way to get an idea of the distribution of how good your algorithm thinks the recommended movies are that actually get a hit. Ideally, you want to recommend movies that users actually liked, and breaking down the distribution gives you some sense of how well you're doing in more detail. This is called rating hit rate, or rHR for short.

So those are all different ways to measure the effectiveness of top-N recommenders offline. The world of recommender systems would probably be a little bit different if Netflix had awarded the Netflix Prize based on hit rate instead of RMSE. It turns out that small improvements in RMSE can actually result in large improvements to hit rates, which is what really matters. But it also turns out that you can build recommender systems with great hit rates but poor RMSE scores, and we'll see some of those later in this course. So RMSE and hit rate aren't always related.
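Here is a similarly hedged sketch of cumulative hit rate and rating hit rate, again using assumed data structures rather than the course's own code. This version assumes `left_out` now also carries the predicted rating for each held-out item, so hits can be filtered by a cutoff or bucketed by score.

```python
# A minimal sketch (not the course's implementation) of cHR and rHR. Assumed inputs:
#   top_n:    dict mapping user_id -> ordered list of recommended item_ids
#   left_out: dict mapping user_id -> (held_out_item_id, predicted_rating)
from collections import defaultdict

def cumulative_hit_rate(top_n, left_out, rating_cutoff=3.0):
    """Hit rate, but a hit only counts when its predicted rating clears the cutoff."""
    hits = 0
    for user_id, (item_id, predicted_rating) in left_out.items():
        if predicted_rating >= rating_cutoff and item_id in top_n.get(user_id, []):
            hits += 1
    return hits / len(left_out)

def rating_hit_rate(top_n, left_out):
    """Break hit rate down by predicted rating score: one hit rate per whole-star bucket."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for user_id, (item_id, predicted_rating) in left_out.items():
        bucket = round(predicted_rating)       # bucket by whole-star score
        totals[bucket] += 1
        if item_id in top_n.get(user_id, []):
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}
```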