Learn how all users' product reviews are stored in one large, two-dimensional table.
- [Instructor] Our movie review dataset contains one row for each rating. This is the format that review data is typically collected in, but in order to build a recommendation system from this data, we want to create a matrix, or two-dimensional array, that shows which movies have been rated by which users. The matrix will have one row for each user and one column for each movie. Let's take a look at the code in create_review_matrix.py. First, we're going to use the pandas read_csv function to load the movie_ratings_data_set.csv file. This file has one row for each individual movie review.
To turn this into a matrix that summarizes all reviews across all movies, we need to use the pandas pivot_table function. A pivot table takes a list of data and summarizes it with one row for each unique user and one column for each unique movie in our dataset. If you have used pivot tables in spreadsheet software like Microsoft Excel, it works the same way here. First, we pass in the data frame containing the data we want to summarize. Then, we need to tell pandas to use the user ID field as the rows, or index, of the pivot table, and the movie ID field as the columns.
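As a rough sketch of the two steps just described, the snippet below loads a few ratings and pivots them into a user-by-movie matrix. The inline CSV text stands in for movie_ratings_data_set.csv so the example runs on its own, and the column names user_id, movie_id, and value are assumptions for illustration; the actual file may use different names.

```python
import io

import pandas as pd

# Hypothetical stand-in for movie_ratings_data_set.csv. The column names
# user_id, movie_id and value are assumed, not taken from the course file.
csv_text = """user_id,movie_id,value
1,9,4
1,12,5
2,9,3
3,7,2
"""

# One row per individual rating, the way review data is usually collected.
raw_ratings = pd.read_csv(io.StringIO(csv_text))

# One row per user and one column per movie. User/movie pairs with no
# rating are filled with NaN.
ratings_matrix = pd.pivot_table(
    raw_ratings,
    values="value",
    index="user_id",
    columns="movie_id",
)

print(ratings_matrix)
```

With a real file, the io.StringIO(csv_text) argument would simply be replaced by the path to the CSV.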
When summarizing data with a pivot table, it's possible that we'll have duplicates. This can happen if the same user reviewed the same movie twice and gave it two different ratings. So we have to decide how to resolve duplicates by telling pandas which function to use to aggregate duplicate data. We'll pass in the parameter aggfunc=np.max. This tells pandas to use NumPy's max function to handle duplicates. The max function returns the highest number, so if a single user rated the same movie twice, we'll take the higher rating. If you wanted the user's average rating instead, you could pass in np.mean.
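To see how the aggregation function changes the result, here is a small sketch with one deliberately duplicated rating. The data and column names are made up for illustration. It passes the string names "max" and "mean", which pandas also accepts for aggfunc, in place of np.max and np.mean.

```python
import pandas as pd

# Hypothetical data: user 1 rated movie 9 twice, first a 3 and then a 5.
raw_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [9, 9, 9],
    "value": [3, 5, 4],
})

# Keep the highest rating when a user rated the same movie more than once.
highest = pd.pivot_table(
    raw_ratings, values="value", index="user_id", columns="movie_id",
    aggfunc="max",
)

# Alternative: keep the average of the duplicate ratings instead.
average = pd.pivot_table(
    raw_ratings, values="value", index="user_id", columns="movie_id",
    aggfunc="mean",
)

print(highest.loc[1, 9])  # 5
print(average.loc[1, 9])  # 4.0
```

User 2 has only one rating for the movie, so both tables hold the same value for that cell.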
Finally, we'll convert this table to HTML and open it in our browser. Let's run the code and take a look. Right-click and choose Run. This table is a summary of all reviews across all movies. The users are listed down the left and the movies across the top. For example, we can see that user number one rated movie number nine a four. The blank spaces are movies that haven't been rated yet by that user. If we scroll through the whole dataset, we can see that no single user has rated every single movie. In fact, most of the array is blank.
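One possible way to do the HTML step is sketched below; it is not necessarily how create_review_matrix.py does it. The tiny matrix, the helper names write_matrix_html and open_in_browser, and the output file name are all hypothetical. Passing na_rep="" to to_html renders unrated cells as blanks rather than the text NaN.

```python
import tempfile
import webbrowser
from pathlib import Path

import pandas as pd

# Hypothetical two-user, two-movie matrix; NaN marks a movie not yet rated.
ratings_matrix = pd.DataFrame(
    {9: [4.0, 3.0], 12: [5.0, float("nan")]},
    index=pd.Index([1, 2], name="user_id"),
)
ratings_matrix.columns.name = "movie_id"


def write_matrix_html(matrix, path):
    """Write the matrix as an HTML table and return the file path."""
    path = Path(path)
    # na_rep="" leaves unrated cells blank in the rendered table.
    path.write_text(matrix.to_html(na_rep=""), encoding="utf-8")
    return path


def open_in_browser(path):
    """Open a local HTML file in the default web browser."""
    webbrowser.open(Path(path).resolve().as_uri())


html_path = write_matrix_html(
    ratings_matrix, Path(tempfile.mkdtemp()) / "review_matrix.html"
)
print(html_path)
# Call open_in_browser(html_path) to view the table.
```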
We only have a relatively small amount of good data to work from. This is called a sparse dataset. Sparse datasets are normal for recommendation systems. Most users will only review a small number of products so there will always be a lot of blank data, but this is enough information for us to work with.
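To put a number on how sparse a review matrix is, a short sketch like the following counts the filled and blank cells. The matrix here is a made-up example with three users and four movies, not data from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical review matrix: rows are users, columns are movies,
# and NaN marks a movie the user has not rated.
ratings_matrix = pd.DataFrame(
    [
        [4, np.nan, np.nan, 5],
        [np.nan, 3, np.nan, np.nan],
        [np.nan, np.nan, np.nan, 2],
    ],
    index=[1, 2, 3],
    columns=[7, 9, 12, 15],
)

# Number of cells that actually contain a rating.
rated = int(ratings_matrix.notna().sum().sum())

# Total number of user/movie pairs in the matrix.
total = ratings_matrix.size

# Fraction of the matrix that is blank.
sparsity = 1 - rated / total

# How many movies each user has rated.
ratings_per_user = ratings_matrix.notna().sum(axis=1)

print(f"{rated} of {total} cells are rated; sparsity is {sparsity:.0%}")
print(ratings_per_user)
```

In this toy example two thirds of the cells are blank, and no user has rated more than two of the four movies.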
Recommendation systems are a key part of almost every modern consumer website. The systems help drive customer interaction and sales by helping customers discover products and services they might never find on their own. The course uses the free, open source tools Python 3.5, pandas, and NumPy. By the end of the course, you'll be equipped to use machine learning yourself to solve recommendation problems. What you learn can then be directly applied to your own projects.
- Building a machine learning system
- Training a machine learning system
- Refining the accuracy of the machine learning system
- Evaluating the recommendations received