Learn why we use vector math instead of for loops and why it's more efficient.
- [Instructor] In machine learning, we often work with large arrays of data. Because of the linear algebra roots of machine learning, these arrays are sometimes called vectors when they hold a single column of data, and matrices when they are two-dimensional. Let's take a look at how to work with vectors in code. Let's open up vectors pt1.py. Here we have a simple array, or vector, representing how many square feet are in some of the different houses in our training dataset. When we train machine learning algorithms, we'll often need to apply the same mathematical operation across every row in our training dataset.
For example, let's say we want to multiply each of these square foot measurements by a weight of 0.3. What's the most efficient way to do this? In traditional programming, the standard solution is to loop through the array one row at a time with a for loop, like this. Let's run the code and check the output. To run the code, we'll right-click and choose Run. Here in the console, we can see that it made 13 separate updates to the array before it got the final result. This works, but multiplying the elements one at a time is actually really inefficient.
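The course file vectors pt1.py isn't reproduced here, but a minimal sketch of the for-loop approach might look like this. The variable names and the specific square footage values are assumptions for illustration; the only detail taken from the video is that the array has 13 elements:

```python
# Square footage of 13 houses in the training data set (values assumed)
sqft = [1100, 1400, 1700, 1875, 1100, 1550,
        2350, 2450, 1425, 1700, 1200, 1300, 1600]

weight = 0.3

# Loop through the array one row at a time, scaling each element.
# Printing inside the loop shows the 13 separate updates.
for i in range(len(sqft)):
    sqft[i] = sqft[i] * weight
    print(sqft)
```

Each pass through the loop touches exactly one element, which is why the console shows one partially updated array per element.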
Modern CPUs have the ability to take a list of numbers and apply the same operation to many numbers in parallel. This capability is called Single Instruction, Multiple Data, or SIMD. Instead of looping through each array element one at a time, the CPU can load a chunk of the array and do all the multiplication operations on that chunk in one step. This makes a huge difference in speed when processing large arrays. Let's take a look at vectors pt2. Instead of using for loops to work with the arrays, we can use an array library that knows how to work with data in parallel.
The library we'll use is called NumPy. NumPy lets us create arrays in memory in a very efficient way, and it automatically parallelizes common mathematical operations on arrays. So instead of using the for loop, our code will look like this. First we'll create the array as a NumPy array instead of as a normal Python array. Then we'll multiply the entire array by 0.3. When we tell NumPy to multiply an array by a single number, NumPy will apply this operation to each element in the array separately. Let's run the code. Right-click, Run.
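Again, the course file isn't shown, so here's a minimal sketch of what vectors pt2 likely looks like, with the same assumed values as before:

```python
import numpy as np

# Create the array as a NumPy array instead of a normal Python list
sqft = np.array([1100, 1400, 1700, 1875, 1100, 1550,
                 2350, 2450, 1425, 1700, 1200, 1300, 1600])

# Multiply the entire array by 0.3 in one step.
# NumPy applies the scalar operation to every element separately.
scaled = sqft * 0.3
print(scaled)
```

Note that there's no explicit loop anywhere: the multiplication is expressed on the whole array at once, and NumPy handles the element-by-element work internally.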
Here we can see we got the same answer as before. Let's look at both versions side-by-side. I'll split the screen vertically. So not only did we get the same answer as before, but it took even less code. But more importantly, NumPy automatically takes advantage of the CPU's SIMD features to multiply chunks of the array in parallel. We get the same result as using a for loop, but we don't have to go through nearly as many steps. Most operations you'll need to do on an array can be done in parallel. These include simple operations like addition, subtraction, multiplication, and division, and even more complex operations like sines and cosines.
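This isn't from the course files, but a quick sketch shows how those other operations look in NumPy, all without a for loop:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# Arithmetic between two arrays is applied element by element
print(a + b)   # [11. 22. 33.]
print(b - a)
print(a * b)
print(b / a)

# More complex functions like sine and cosine work the same way,
# applied to every element of the array at once
print(np.sin(a))
print(np.cos(a))
```

NumPy calls these "universal functions" (ufuncs): each one operates elementwise over whole arrays.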
This is called vectorizing our code. We are replacing iterative loops with vector operations that can be executed in parallel. This is a really important point. If you find yourself writing a for loop over an array, you're probably doing the wrong thing. Instead, you should be using NumPy to do the operation on the whole array in one step.
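To make the "don't loop over arrays" advice concrete, here's a small timing comparison the course doesn't include. The array size and the exact speedup will vary by machine, so treat the numbers as illustrative, not definitive:

```python
import time
import numpy as np

n = 1_000_000
data = list(range(n))
arr = np.arange(n)

# Loop-based version: one Python-level operation per element
start = time.perf_counter()
result_loop = [x * 0.3 for x in data]
loop_time = time.perf_counter() - start

# Vectorized version: one array-level operation
start = time.perf_counter()
result_vec = arr * 0.3
vec_time = time.perf_counter() - start

print(f"for loop: {loop_time:.4f}s, NumPy: {vec_time:.4f}s")
```

On typical hardware the vectorized version is dramatically faster for arrays this size, and both produce the same values.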