Join Lillian Pierson, P.E. for an in-depth discussion in this video Evaluating similarity based on correlation, part of Introduction to Python Recommendation Systems for Machine Learning.
- [Instructor] The next type of recommendation system to look at is correlation-based recommendation systems. These recommenders offer a basic form of collaborative filtering. That's because with correlation-based recommendation systems, items are recommended based on similarities in their user reviews. In this sense, they do take user preferences into account. In these systems, you use the Pearson R correlation to recommend an item that is most similar to the item a user has already chosen.
In other words, to recommend an item that has a review score that correlates with another item that a user has already chosen, based on similarity between user ratings. Just to refresh on Pearson R, the Pearson R correlation coefficient is a measure of linear correlation between two variables, or in this case, two items' ratings. The Pearson correlation coefficient is represented by the symbol R, and with an R value that's close to one or negative one, you know you have a strong linear relationship between two variables.
As R values get closer to zero, you know that the two variables are not linearly correlated. Correlation-based recommenders use item-based similarity. That is, they recommend an item based on how well it correlates with other items with respect to user ratings. Let's look at the logic of this. Check out our mystery shopper here, shopper D. We see that she has already chosen and reviewed the camera. She gave it a rating of four stars. Now let's see who else reviewed the camera.
It looks like users A, B, and C also reviewed the camera, but now let's take a closer look. Look at the ratings each of these users gave. User A gave four stars, user B gave four stars, and user C gave 2.5 stars. Based on correlations between user ratings, we'd say that user A's and user B's ratings are more similar to, or more highly correlated with, user D's ratings. Now let's look at what other items user A and user B liked.
They both gave pretty good ratings to the printer. So based on how well user A's and user B's review scores correlate with user D's review scores of the camera, and based on the shared preferences user A and user B have for the printer, we would recommend the printer to user D as well. Let me show you how to do this in the Jupyter notebook. Now let's practice making recommendations based on the Pearson correlation. What you are about to see is an example of an item-based recommendation system, because the recommender will compare items based on user reviews.
Actually though, in the dataset we are going to use, the items are different places to eat and the users are restaurant goers. Making recommendations based on correlation is a simple form of collaborative filtering, or user-to-user filtering, because items are recommended based on similarities in user reviews. You'll see what I mean in a second, but first let's import our libraries. So we're going to import numpy as np and import pandas as pd.
With that, we have our libraries. Now, the datasets that we're going to use actually come from Mexico. These datasets are hosted at the University of California, Irvine Machine Learning Repository, but they were originally published by Blanca Vargas et al., and you can see the citation here. So the first thing we need to do is read the datasets into our Jupyter notebook, and we'll do that by calling the read_csv function.
So we'll call the first one frame and we'll say pd.read_csv, and we'll pass in a string with our file name. The first file is going to be rating_final.csv. Now, you're going to get these datasets with the download for the course. The second data frame we're going to create is called cuisine. We need to pass in the name of the file, which is kind of a strange one: chefmozcuisine.csv. And our last data frame will be called geodata, equal to pd.read_csv, with the file named geoplaces2.csv. Okay, when I run this, we have our data frames.
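The three read calls the instructor types can be sketched like this. Since the course download files aren't reproduced here, tiny inline samples stand in for them; the rows below are invented for illustration, and the column names (userID, placeID, rating, Rcuisine, name) are assumed to match the course files:

```python
import io

import pandas as pd

# In the notebook you would read the course download files directly, e.g.:
#   frame = pd.read_csv('rating_final.csv')
# Here tiny inline samples stand in for the files so the sketch runs anywhere.
# The rows are invented; the column names are assumed from the course data.
frame = pd.read_csv(io.StringIO("""userID,placeID,rating
U1001,135085,2
U1002,135085,1
U1001,132754,1
"""))

cuisine = pd.read_csv(io.StringIO("""placeID,Rcuisine
135085,Fast_Food
132754,Mexican
"""))

geodata = pd.read_csv(io.StringIO("""placeID,name
135085,Tortas Locas Hipocampo
132754,Cabana Huasteca
"""))

print(frame.head())
```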
Now let's take a quick look at the first few records in the frame data frame. To do that we say frame.head. This is just a small sample. So let me explain that each of the places in the dataset gets a rating of either zero, one, or two, where two is the best and zero is the worst rating. And looking at the head here you can see that user IDs appear in duplicate. That happens when a user has reviewed more than one place. Now let's check out our geodata data frame.
So we'll say geodata.head. The reason that we want this dataset is that it provides a name for each of the unique places that's been reviewed, but since we don't need all of the attributes in this data frame, let's subset it down to only place ID and name. We'll call this subset places. Places is equal to geodata, and then let's just select these two columns.
The first one being place ID and then the second one will be name. Now let's look at the head of this. Places, okay so now we have each of our place IDs and the name of the restaurant that goes with that place ID. Lastly I want to check out the cuisine dataset. Just to see a sample of what that looks like. So we call the head function, and then here we go.
Okay so we have place ID and cuisine type. Great. Now let's look at the ratings these places are getting. To do that, we will look at the mean value of all the ratings that are given to each place. So let's create a new data frame. We'll call the data frame constructor. And let's call this new data frame rating. Now we want the rating data frame to be generated from our frame data frame, but we want to take our frame data frame and group it by place ID.
So we'll say group by place ID, and then for each place ID we want to look at the rating column, and we want to generate the mean value for each of the ratings that was given to each place. And let's print out the head of this to see what it looks like. Great, so we've got each of our places and then the average rating that each of the places was given. In addition to the mean value we also want to look at how popular each of these places was.
So to do this, let's add a column called rating count, and then within that column we'll generate counts for how many reviews each place got. So we'll say rating and then add a column called rating count. Add the T here, and then we're going to call the data frame constructor. And we're going to say frame.groupby.
We want to group by place ID again. And then for the rating column, this time we want to take a count of how many ratings were given. And let's print this out. What we've got here is each of the place IDs with their average rating and then the rating count, the number of ratings that each of these places got.
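The two group-by steps just described can be sketched as follows, using a small invented frame in place of the course's rating data:

```python
import pandas as pd

# Invented sample ratings in the same shape as the frame data frame
frame = pd.DataFrame({
    'userID':  ['U1', 'U2', 'U3', 'U1', 'U2'],
    'placeID': [135085, 135085, 135085, 132754, 132754],
    'rating':  [2, 2, 1, 0, 2],
})

# Mean rating given to each place
rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())

# Popularity: the number of ratings each place received
rating['rating_count'] = frame.groupby('placeID')['rating'].count()

print(rating.head())
```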
Now let's look at a statistical description of this rating data frame. To do that, we'll use the describe method. So we'll say rating.describe. And when we run that, we've got a statistical description, and what I want to point out here is that for the count, taking a count of the rating data frame we get 130, and that indicates that there are 130 unique places that have been reviewed in the rating data frame, and also I want to point out here that you see the max value for rating count comes out to 36.
What this means is that the most popular place in the dataset has got a total of 36 reviews. To see what place that is, all we have to do is sort our dataset in descending order. So we'll say rating.sort_values and then we want to sort it according to rating count and we pass in an argument that says ascending equal to false.
So that we get our results in descending order. And then let's just look at the first five records. So we'll call the head method off of that and we run this, and we see that our most popular place has got a place ID of 135085. That's kind of an obscure way to refer to a restaurant. So let's find the name of this place. In order to do that, I'm going to create a filter, and what this filter is going to do is find a true value where the place ID is equal to 135085, and then we're going to filter our places data frame to return only the record where that's true.
So let's create our filter first. We're going to say places where place ID is equal to 135085, and where this expression is returned as true we want to get that record from the places data frame. So we say places and we run it, and here we go. We've got the name of the place.
It's called Tortas Locas Hipocampo, and I'm going to refer to this as Tortas for short. Let's also look at the type of cuisine this place serves. We'll use the same filtering process. So we'll say cuisine where place ID is equal to 135085, and we're filtering this from our cuisine data frame.
So rewrite that. And when we run this, it looks like I forgot one of the equal signs. So I need to add that, and I run it, and we can see here that Tortas, the restaurant Tortas serves fast food. Okay, so that's good to know. The next thing we need to do is to build a user by item utility matrix. To do that we're going to call the pivot table function. This function will cross tabulate each user against each place, and output a matrix.
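Before moving on, the sorting and filtering lookups above can be sketched like so; the miniature data frames are invented stand-ins for rating, places, and cuisine:

```python
import pandas as pd

# Invented stand-ins for the rating, places, and cuisine data frames
rating = pd.DataFrame(
    {'rating': [1.2, 1.0], 'rating_count': [36, 12]},
    index=pd.Index([135085, 132754], name='placeID'),
)
places = pd.DataFrame({
    'placeID': [135085, 132754],
    'name': ['Tortas Locas Hipocampo', 'Cabana Huasteca'],
})
cuisine = pd.DataFrame({
    'placeID': [135085, 132754],
    'Rcuisine': ['Fast_Food', 'Mexican'],
})

# Most-reviewed places first
top = rating.sort_values('rating_count', ascending=False).head()

# Boolean filters: look up the name and cuisine for placeID 135085
name_row = places[places['placeID'] == 135085]
cuisine_row = cuisine[cuisine['placeID'] == 135085]

print(top)
print(name_row)
print(cuisine_row)
```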
Let's call this matrix places_crosstab, and then we're going to call the pivot table function. So that's PD.pivot_table, and our data is going to be the frame data frame. The values we're interested in are the values from the rating column, and our index is going to be our user ID.
So that's index equal to user ID, and let's name our columns place ID. Now let's look at the first five records of places cross tab. So we say places_crosstab.head. Now the first thing you'll notice about this cross tab is that it's full of null values.
That's because people never review that many places. Just a few people review just a few places. Hence the sparsity of this matrix. You do see some numbers here, and these numbers are the ratings that each user gave to the respective place, in cases where they did make a restaurant review. You might be thinking that this matrix can't be very useful because it's got so many null values, but let me show you how we can use it to find places that are correlated.
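The pivot_table call that builds the sparse utility matrix can be sketched with a few invented ratings:

```python
import pandas as pd

# Invented ratings: one row per (user, place) pair the user actually reviewed
frame = pd.DataFrame({
    'userID':  ['U1', 'U1', 'U2', 'U3'],
    'placeID': [135085, 132754, 135085, 132754],
    'rating':  [2, 1, 0, 2],
})

# User-by-item utility matrix: one row per user, one column per place;
# (user, place) pairs with no review come out as NaN, hence the sparsity
places_crosstab = pd.pivot_table(
    data=frame, values='rating', index='userID', columns='placeID'
)
print(places_crosstab.head())
```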
Before we do that, we need to first isolate the user ratings for our restaurant called Tortas. So we'll say Tortas_ratings. We're going to create a series here, and we'll create that from the places cross tab. We want to select the column that's indexed with the number 135085. Let's also filter Tortas ratings so that we can see only the non-null values.
As you recall, Tortas is the most popular place with 36 ratings. So let's get a look at what those ratings are. So we'll say Tortas_ratings and then create a filter where Tortas ratings are greater than or equal to zero. And when we run that, here we've got 36 review scores and they range between zero and two, perfect.
Now to find correlation between each of the places and the Tortas restaurant, what we'll do is call the corrwith method off of our places cross tab, and then pass it the Tortas ratings series. What this will do is generate a Pearson R correlation coefficient between Tortas and each other place that's been reviewed in the dataset. Keep in mind that this correlation is based on similarities in the user reviews that were given to each place. So we'll say places_crosstab and then we want to call the corrwith method, and we pass in Tortas ratings.
And then let's call this whole thing similar to Tortas 'cause we're looking for the places that are similar to Tortas. Similar to Tortas is going to be returned as a series, and we want to convert it to a data frame. So let's call the data frame constructor, pd.DataFrame, and then we'll pass in similar_to_Tortas, and let's name our column PearsonR.
So we'll say columns equal to, and write PearsonR here. And then let's call this whole thing corr Tortas. And we don't want to see all the null values, so let's drop those. The way to do that is to call the dropna method and pass in the argument inplace equal to true. So I'll show you that now. We'll say corr_Tortas.dropna and then inplace equal to true.
Lastly we'll print the head. So corr_Tortas.head, and as you can see here, Python is returning a runtime warning, but it doesn't affect our results at all, so we'll just ignore it. Looking at the head of corr Tortas, we see that we have a data frame that contains each place ID and a Pearson R correlation coefficient that indicates how well each place correlates with Tortas based on user ratings.
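The corrwith step can be sketched end to end on a small invented utility matrix; the columns for 132754 and 135045 are contrived so that one correlates positively with Tortas and the other negatively:

```python
import numpy as np
import pandas as pd

# Invented utility matrix: rows are users, columns are placeIDs
places_crosstab = pd.DataFrame(
    {
        135085: [2, 1, 0, 2, np.nan],   # Tortas
        132754: [2, 1, 0, np.nan, 1],   # moves with Tortas -> positive correlation
        135045: [0, 2, 2, np.nan, 0],   # moves against Tortas -> negative correlation
    },
    index=['U1', 'U2', 'U3', 'U4', 'U5'],
)

# Isolate the Tortas ratings as a series
Tortas_ratings = places_crosstab[135085]

# Pearson R between Tortas and every other place, computed over the
# users each pair of places has in common
similar_to_Tortas = places_crosstab.corrwith(Tortas_ratings)

corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
corr_Tortas.dropna(inplace=True)
print(corr_Tortas)
```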
But let's think about this for a minute here. If we've found some places that were really well correlated with Tortas but that had only, say, two ratings total, then those places probably wouldn't really be all that similar to Tortas. I mean maybe those places got similar ratings as Tortas, but they wouldn't be very popular. Therefore, that correlation really wouldn't be significant. We also need to take stock of how popular each of these places is, in addition to how well the review scores correlate with the ratings that were given to other places in the dataset.
So to do that, let's join our corr Tortas data frame with the rating data frame. So we're going to say corr_Tortas.join and we want to join it to the rating data frame, but we're only interested in rating count here. So we say rating_count, and we'll call this the Tortas corr summary: Tortas_corr_summary.
And run this. Let's create a filter now so that we can see only the places from the data frame that have at least 10 user reviews, and for those places, let's look at the Pearson R correlation coefficient sorted in descending order. So we'll do that by first creating the filter. So we'll say Tortas_corr_summary, and we're interested in only the records that have a rating count that's greater than or equal to 10.
So we have our filter here. And we're retrieving records from the Tortas corr summary data frame. So let me just write the name of that data frame here. Great, and then from the records that are returned, we want to sort the values in descending order according to the Pearson R column. So we're going to say sort values.
We'll use the sort values method and then we pass in the column name PearsonR, and we want this to be in descending order, so we're going to pass in an argument that says ascending equal to false. And then let's look at only the first 10 records. So we'll call the head method and we'll pass in the number 10. Since we sorted the data frame in descending order by correlation, we now have a list of top reviewed places that are most similar to Tortas.
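The join-filter-sort chain just described can be sketched like this, again on invented stand-ins for corr_Tortas and rating:

```python
import pandas as pd

# Invented correlation results and popularity counts, both indexed by placeID
corr_Tortas = pd.DataFrame(
    {'PearsonR': [1.0, 0.9, 0.8]},
    index=pd.Index([135085, 132754, 135045], name='placeID'),
)
rating = pd.DataFrame(
    {'rating': [1.2, 1.0, 0.8], 'rating_count': [36, 12, 3]},
    index=pd.Index([135085, 132754, 135045], name='placeID'),
)

# Attach the popularity counts to the correlations
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])

# Keep places with at least 10 reviews, best-correlated first
top10 = (
    Tortas_corr_summary[Tortas_corr_summary['rating_count'] >= 10]
    .sort_values('PearsonR', ascending=False)
    .head(10)
)
print(top10)  # 135045 is dropped: only 3 reviews
```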
I want to point out these places here that have a Pearson R value of one though. These Pearson R values of one aren't meaningful here. The reason you're seeing these is because for those places, there was only one user who gave a review to both places. That user gave both places the same score. Which is why you're seeing a Pearson R value of one. But a correlation that's based on similarities between only one review rating, that's not meaningful. The places need to have more than one reviewer in common.
So we'll throw those places out. So now let's take the top seven correlated results that remain and see if any of these places also serve fast food. So what we're going to do is we're going to create a data frame and we're going to call that places_corr_Tortas and then let's call the data frame constructor, then we pass in a series of numbers that are the place IDs for the top correlated places.
So as we can read off of the top, the first one is going to be 135085. That's Tortas. And then the next place, we're going to throw out the value that ends in 66 and then the next place it's going to be 132754. 135045, 135062, 135028, 135042, 135046, okay great.
Now the next thing we need to do is set our index. So we're going to say index is equal to, and then we're going to use the numpy arange function. So we'll say np.arange and pass in the value seven. Lastly we're going to name our columns. So we're going to say columns equal to, and we're going to call it place ID. The next thing I want to do is to create a summary data table.
So we're going to call this data table summary and it's going to be based on the merge between places corr Tortas and cuisine. 'Cause basically I'm trying to create a summary of each of the top correlated place IDs and the types of food they serve. So let's call the merge function. We'll say pd.merge and pass in the name of the data frames we want to merge. places_corr_Tortas, and cuisine.
Then we pass in an argument that says on equal to place ID, because we want to merge these data frames on the place ID field. And then print that out. Looks like I made a typo. Add the A here. When we print this out, we only get five results, and we included seven place IDs in this data frame. But the reason why you're only seeing five places here is that not all of the places were listed in the cuisine dataset.
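The hand-built data frame and the merge can be sketched with four placeIDs instead of seven; the cuisine rows are invented, with one placeID deliberately missing so the inner merge drops it, just as happens in the demo:

```python
import numpy as np
import pandas as pd

# Hand-picked top correlated placeIDs (a shortened, invented list)
places_corr_Tortas = pd.DataFrame(
    [135085, 132754, 135045, 135046],
    index=np.arange(4),          # np.arange, analogous to np.arange(7) in the demo
    columns=['placeID'],
)

# Invented cuisine table; note 135045 is deliberately absent
cuisine = pd.DataFrame({
    'placeID': [135085, 135046, 132754],
    'Rcuisine': ['Fast_Food', 'Fast_Food', 'Mexican'],
})

# pd.merge is an inner join by default, so places missing from the
# cuisine table simply drop out of the result
summary = pd.merge(places_corr_Tortas, cuisine, on='placeID')
print(summary)
```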
Places that weren't in the cuisine dataset were not able to be returned in this merged output table. Nonetheless, what we are seeing here is that among the top six places that were most correlated with Tortas, at least one of these places also serves fast food. Let's get a name for this place so we don't have to refer to it as a number. So we'll create a filter again. We'll say places where place ID is equal to 135046, and we want to return only that record from the places data frame, and when we run this we see that that place is actually called Restaurante El Reyecito.
So we'll call it Reyecito. To evaluate how relevant the similarity metric really is though, let's consider the entire set of possibilities, meaning how many cuisine types are served at places in this dataset. To do that we'll use the describe method. So we'll say cuisine, then select the Rcuisine column, and then we'll call the describe method off of this. And when we run this, we can see that according to our cuisine data frame, there are 59 unique types of cuisines that are served.
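The describe call on a categorical column reports count, unique, top, and freq; a sketch on an invented cuisine table:

```python
import pandas as pd

# Invented cuisine table; the real one reports 59 unique cuisine types
cuisine = pd.DataFrame({
    'placeID': [1, 2, 3, 4],
    'Rcuisine': ['Fast_Food', 'Mexican', 'Fast_Food', 'Bar'],
})

# describe on an object (string) column gives count, unique, top, and freq
desc = cuisine['Rcuisine'].describe()
print(desc)
```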
So in the final analysis, what we got back were six top places that were similar to Tortas based on correlation and popularity. Of these six places, one other place also serves fast food. Considering that there are 59 total cuisine types that could have been offered, and that we got back another fast food place in our top six most similar places, it looks like our correlation-based recommendation system is on track. In this case, we'd be safe recommending the place Restaurante El Reyecito to users who also like the restaurant Tortas.
That was a bit complicated, right? But don't worry, we'll use machine learning algorithms for the rest of the course, and the code in most of the upcoming demonstrations will be a lot simpler.