Learn how to store all product recommendations in a vector for efficiency.
- [Instructor] In machine learning, we often work with large arrays of data. These arrays are sometimes called vectors for single columns of data and matrices for larger arrays because of the linear algebra roots of machine learning. Let's look at how to work with vectors in code. Let's open up vectors pt1.py. Here we have a simple array or vector representing all the ratings that a single movie received from different users. When we are using machine learning algorithms, we'll often need to apply the same mathematical operation to an entire array.
Let's say we want to convert all these five star ratings to a 10 point scale. In other words, we want to multiply each rating by two. What's the fastest way to do this? In traditional programming, the standard solution is to loop through the array one element at a time using a for loop. Let's run the code and check the output. We'll right-click and choose Run. You can see we made 12 separate updates to the array before we got the final result. This works, but multiplying each element in the array one at a time is inefficient. Modern CPUs have the ability to take a list of numbers and apply the same mathematical operation to many numbers in parallel.
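The loop-based approach described here can be sketched roughly like this. The variable names and sample ratings are assumptions for illustration, not the course's actual code:

```python
# Sample five-star ratings for a single movie (hypothetical values).
movie_ratings = [4, 5, 3, 2, 5, 4, 1, 5, 3, 4, 2, 5]

# Convert to a 10-point scale by updating one element at a time.
for i in range(len(movie_ratings)):
    movie_ratings[i] = movie_ratings[i] * 2

print(movie_ratings)
```

With 12 ratings, the loop body runs 12 separate times, which is the inefficiency the next section addresses.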
This capability is called single instruction, multiple data, or SIMD. Instead of looping through each array element one at a time, the CPU can load chunks of the array into memory and do all of the multiplication operations on the chunk in one step. This makes a huge difference in speed when processing large arrays. Let's open up vectors pt2.py. Instead of using for loops to work with arrays, we can use an array library that knows how to work with data in parallel. The library we'll use is called NumPy. NumPy lets us create arrays in memory in a very efficient way, and it automatically parallelizes common operations on arrays.
So instead of using a for loop, our code will look like this. First, we create the array as a NumPy array instead of as a normal Python array. Then we multiply the entire array by two. NumPy will apply this operation to each element in the array separately. Let's try this out. Right-click, choose Run. We get the exact same answer as before. Now let's look at both versions side-by-side. Right-click, Split Vertically. Not only did we get the same answer as before, but it took even less code. But more importantly, NumPy automatically takes advantage of the CPU's SIMD features to multiply chunks of the array in parallel.
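A minimal sketch of the NumPy version, again with assumed sample values:

```python
import numpy as np

# Same hypothetical ratings as before, stored as a NumPy array.
movie_ratings = np.array([4, 5, 3, 2, 5, 4, 1, 5, 3, 4, 2, 5])

# Multiply the entire array by two in a single vectorized step.
# NumPy applies the operation to every element, using SIMD where the
# hardware supports it.
scaled_ratings = movie_ratings * 2

print(scaled_ratings)
```

Note there's no explicit loop anywhere: `movie_ratings * 2` expresses the whole operation at once.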
We get the same result as using a for loop, and we don't have to go through nearly as many steps. Most operations you'll need to do on an array can be done in parallel. That includes simple math like addition, subtraction, multiplication, and division, and even more complex operations like sines and cosines. This is called vectorizing our code. We're replacing iterative loops with vector operations that can be executed in parallel. This is an important point. If you find yourself writing a for loop over an array, you're probably doing the wrong thing. Instead, you should be using NumPy to do the operation on the whole array in one step.
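To illustrate the range of operations that vectorize the same way, here is a short sketch (the array values are made up for the example):

```python
import numpy as np

ratings = np.array([4.0, 5.0, 3.0, 2.0])

# All of these apply element-wise to the whole array, with no loops:
print(ratings + 1)      # addition
print(ratings - 1)      # subtraction
print(ratings / 5)      # division
print(np.sin(ratings))  # even trig functions like sine work element-wise
```

Functions like `np.sin` are NumPy "universal functions" (ufuncs), which is what makes this element-wise behavior work across the whole array.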
Recommendation systems are a key part of almost every modern consumer website. These systems help drive customer interaction and sales by helping customers discover products and services they might never find on their own. The course uses the free, open source tools Python 3.5, pandas, and NumPy. By the end of the course, you'll be equipped to use machine learning yourself to solve recommendation problems. What you learn can then be directly applied to your own projects.
- Building a machine learning system
- Training a machine learning system
- Refining the accuracy of the machine learning system
- Evaluating the recommendations received