From the course: Building Recommender Systems with Machine Learning and AI

DSSTNE in action

- [Instructor] DSSTNE is open source, and you'll find it on GitHub. It's made for Ubuntu systems that have one or more GPUs attached. A GPU is the processor at the heart of a 3D video card, and it can also be used to run neural networks. I don't have a system like that handy, so I'm going to use Amazon's EC2 service to rent time on one instead. The kind of machine you need for this isn't cheap, so I'm not going to ask you to follow along here, as it would cost you real money. What we're doing does not fall under Amazon's free usage tier for people who are just starting out and experimenting. If you want to see all of the steps involved in getting the machine itself set up with all the necessary hardware and software, it's all detailed on the setup page for the project on GitHub here. But I'm going to skip past the steps for actually procuring a machine and installing the necessary software, and just jump straight to playing around with this example. Instructions for running the example are also here on GitHub. They point out that, generally, using DSSTNE involves three steps: converting your data, training, and prediction. First, let's go to the directory that contains the MovieLens example. We'll start by downloading the MovieLens 20 million ratings dataset from GroupLens. All we actually need is the ratings data, so we will extract that from the zip archive and save it as ml-20m_ratings.csv. DSSTNE, however, can't deal with CSV files, nor does it want data that's organized as one rating per line. Ultimately, DSSTNE requires its data in NetCDF format. They provide a tool that can convert files into it, but that conversion tool also has its own requirements. So, we need to convert our ratings data into a format that can then be converted into NetCDF. This first step requires arranging our data so that every row represents a single user, followed by a list of all the rating inputs for that user. What we want is for each line to contain a user ID, followed by a tab, followed by a colon-separated list of each item ID that user rated. If you think about it, that's basically a sparse representation of the input layer for each individual user, one user per line. So, this input format gets us very close to what we need to train our neural network with. Amazon provides an awk script to convert our CSV file into that intermediate format. Let's take a quick look with cat convert_ratings.awk. It takes advantage of the fact that all of the ratings for a given user are grouped together in the raw MovieLens data. It just collects all the ratings for each user, then spits out a line for each user when it's done. Let's go ahead and run that conversion script. And if we take a peek at the output, we can see that it does contain training data, one user per line, with the items that user rated in a sparse format. Now, we need to convert this into the NetCDF format DSSTNE requires. This actually generates three files: the NetCDF file itself, an index file with the indices of each neuron, and an index file with the indices of all samples. We also need to generate the output layer's data file using the same sample indices that we generated for the input file. Now, we're ready to rock. Let's take a quick peek at config.json again, which defines the topology of our neural network and how it is trained. It's the same file we looked at in the slides and already covered, but again, there at the end is the meat of it: the three layers of our neural network. It's pretty amazing how simple this looks.
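If you'd like to see the conversion logic spelled out in something other than awk, here is a minimal Python sketch of the same idea, based on the format described above. It assumes the extracted ratings file has the standard MovieLens header (userId,movieId,rating,timestamp), that each user's ratings appear on consecutive rows as they do in the raw ml-20m data, and that the intermediate output file name (ml-20m_ratings) matches what this example expects; this is an illustration, not the project's actual script.

```python
# Sketch: convert one-rating-per-line CSV into the intermediate format
#   userId<TAB>itemId:itemId:itemId   (one line per user)
# Relies on each user's ratings being grouped together, as in ml-20m.
import csv

with open("ml-20m_ratings.csv", newline="") as src, open("ml-20m_ratings", "w") as dst:
    reader = csv.reader(src)
    next(reader)                       # skip the userId,movieId,rating,timestamp header
    current_user, items = None, []
    for user_id, movie_id, _rating, _timestamp in reader:
        if user_id != current_user:
            if current_user is not None:
                dst.write(f"{current_user}\t{':'.join(items)}\n")
            current_user, items = user_id, []
        items.append(movie_id)
    if current_user is not None:       # flush the final user's line
        dst.write(f"{current_user}\t{':'.join(items)}\n")
```

Each output line is effectively a sparse input vector: the user ID, then only the item IDs that user actually rated, which is exactly what the NetCDF conversion tool wants to consume next.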
But remember, there's a lot of complexity being hidden from you here, involving sparse autoencoders, a tuned loss function, and a denoising feature, as well as all the complexity that comes with sparse data and with distributing the training across a neural network that's not running on our CPU, but on the GPU, and possibly even multiple GPUs. It's pretty amazing what's going on here. Let's kick it off with parameters that indicate a batch size of 256 and 10 epochs. That was actually amazingly quick for 20 million ratings. If you had more than one GPU, it would be even faster. In that case, there is an mpirun command you would use instead of the train command. The output of this is the gl.nc file, which contains the trained model itself. So, now that we have our trained model, let's use it to create top 10 recommendations for every user. Oops, got a typo there. Forgot a dash. The parameters there are all pretty self-explanatory. The -k 10 means we want top 10 recommendations, and we are passing in our model that lives in the gl.nc file. One thing worth mentioning, however, is that the predict command will automatically do filtering as well, which is why we're passing in the ml-20m_ratings file again. This allows it to filter out items the user has already rated. That didn't take very long. So, the results are in the recs file. Let's take a peek at it. Each line is a user ID followed by a list of their recommended items and scores. Unfortunately, it's not very human-readable, but let's spot-check a few. I've opened up the movies.csv file in Excel so we can look up a few of these recommended item IDs and see what movies they correspond to. Let's check out the first few movies for user one. The first turns out to be 1196, Star Wars. Which one? The Empire Strikes Back. What's the next one? 4993. Lord of the Rings: The Fellowship of the Ring. That's pretty exciting. It seems to have nailed user one as a science fiction and fantasy fan and recommended some of the greatest science fiction and fantasy movies of all time. Let's reflect again on what just happened here. We trained a feed-forward, three-layer neural network with sparse data from 20 million ratings in just a few minutes by using the power of a GPU to accelerate that neural network. There was no need to deal with all the complexity of RBMs in order to apply neural networks to recommender systems. DSSTNE has made it so we can apply deep learning directly to this problem and think of it as any other classification problem a deep neural network might excel at. It's purpose-built for applying deep learning to recommender systems, and that in itself is pretty darn exciting.
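If you'd rather not spot-check item IDs by hand in Excel, here is a minimal Python sketch that does the same lookup, assuming you have the movies.csv file from the same MovieLens archive in your working directory (with its standard movieId,title,genres columns). The two IDs shown are simply the ones we just checked for user one; this is an illustration, not part of DSSTNE.

```python
# Sketch: map recommended movie IDs back to titles using MovieLens movies.csv.
import csv

titles = {}
with open("movies.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)             # columns: movieId,title,genres
    next(reader)                       # skip the header row
    for movie_id, title, _genres in reader:
        titles[movie_id] = title

# The two recommendations we spot-checked for user one:
for movie_id in ["1196", "4993"]:
    print(movie_id, "->", titles.get(movie_id, "unknown"))
```

You could extend the same idea to walk through the recs file line by line and print titles for every user, once you've confirmed how its item IDs and scores are delimited.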
