Normalizing numeric data can improve the quality of machine learning models. In this video, learn how to use the MinMaxScaler class to normalize numeric data.
- [Narrator] Normalizing is the process of mapping numeric data from its original range into a range from zero to one. Now, this is important, because you may have multiple attributes with different ranges. For example, you can have salaries, which might have ranges in the tens and hundreds of thousands. Then, you might have another column, for example, miles commuted to work, which might be on the order of tens of miles. The reason we want to normalize those attributes into a zero to one range is so that algorithms that use distance as a measure don't weight some attributes, like salary, orders of magnitude more heavily than others, like miles commuted to work.
So, let's take a look at how normalizing works. We have salaries from 60,000 to 105,000 here, and we can map that into a range of zero to one, where 60,000 is mapped to zero, and 105,000 is mapped to one. Salaries in between are also mapped between zero and one, and they maintain their relative distances from each other. All right, let's work on normalizing some data. Now, if you haven't installed Anaconda Python, you should execute the command pip install numpy before starting pyspark.
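The mapping described above can be sketched in a few lines of plain Python (a hypothetical helper, not part of Spark, just to show the min-max formula):

```python
# Min-max normalization maps each value x in [min, max] to [0, 1]:
#     scaled = (x - min) / (max - min)
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

salaries = [60000, 75000, 90000, 105000]
print(min_max_scale(salaries))
# -> [0.0, 0.3333333333333333, 0.6666666666666666, 1.0]
```

Notice that 60,000 lands on zero, 105,000 lands on one, and the values in between keep their relative spacing.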
So, let's see where I am. I'm in the Spark bin directory, so I can start pyspark. Now, normalizing data is the process of mapping a set of numeric values to a new set of values in the range from zero to one. We do this so that differences in the scale of different features do not adversely affect our models. So for example, a salaries attribute may have a range from the tens of thousands to the hundreds of thousands, while miles commuted to work might be in the range of one to 30 miles.
Okay, our pyspark session has started, so I'm going to type Control-L to clear the screen. Let's import some packages here. So, from pyspark.ml.feature, we're going to import the MinMaxScaler, and then from pyspark.ml.linalg, the linear algebra module, we're going to import Vectors.
So, now we have imported what we need, so I'm going to clear that screen again so I can have a fresh screen, and now I'm going to create a simple data frame. Now, each row of the data frame will include an identifier and a list of numeric values. So, I'm just going to call this features data frame, or features_df, and I'm going to reference the Spark session and call createDataFrame, and I'm going to create a list which contains three records.
The first record will have an ID of one, and then it will have a set of features which we create as a dense vector, and that vector will include the number 10, the number 10,000, and the number one. The second record will have an ID of two, and its vectors will have values of 20, 30,000, and two.
Record three will have values 30, 40,000, and three. And we'll close off that list, and we'll specify the columns for the data frame.
So, if we look at features_df take one, the first value we have is our row, which has an ID of one, and then the dense vector with 10, 10,000, and one. So, this is as we expect. Now, the next thing we want to do is create a scaler object. So, I'm going to clear the screen, and now I'm going to create a scaler object called feature_scaler, and we're going to call the MinMaxScaler function, and we're going to tell it that we want to transform the input column, which is named features, and we want the scaled version of that input column to go to a new output column, which is called sfeatures, which is short for scaled features.
Now, this object will transform the contents of feature vectors into a scaled version, and save it into the sfeatures column. So next, we'll fit the model to the data using the fit function. To do that, we'll create an object called smodel, and that'll be set equal to the feature scaler, and we'll apply the fit function, and the data we're going to fit is what's loaded into our features data frame.
Great, so now we have fit the model to our data. The next thing we want to do is call the transform function, and what this will do is it will apply the transformation and actually create the scaled data set. So to do this, I'm going to create a new data frame, called sfeatures_df, df for data frame, and this is going to be built using the smodel we just defined, and we're going to transform the features data frame.
Okay, so what we've done is we've created a MinMaxScaler, we fit our data to it, and then we used the transform to create a new scaled feature set. So, I'm just going to clear the screen here. And now, let's take a look at the first row of the scaled features set data frame. So, that's sfeatures_df, and we're just going to take a look at the first row, and what you'll notice here is, in addition to the ID and features that we had in our original data frame, we now have a new column, called sfeatures, which has a dense vector, which is scaled, and you can see it's in the zero to one range.
So, let's actually look at and compare the original data with the scaled version. To do that, I'm going to use the sfeatures_df data frame, and instead of looking at just one row, I'm going to select all of the features and all of the sfeatures, and I'm going to show them in a well structured output format. As you can see, the scaled data is in the range from zero to one, and the larger the original value, the larger the scaled value.
The smallest value in each column of the feature vector is mapped to zero, and the largest value is mapped to one. Values in between the minimum and maximum are scaled proportionally between zero and one.
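To see exactly what those per-column minimums and maximums produce for our three rows, here is a small pure-Python check of the same formula (a sketch that mirrors, but does not use, Spark's MinMaxScaler):

```python
rows = [[10.0, 10000.0, 1.0],
        [20.0, 30000.0, 2.0],
        [30.0, 40000.0, 3.0]]

def scale_column(col):
    # Map the column's min to 0, its max to 1, everything else in between.
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

# Scale each column independently, then reassemble the rows.
columns = list(zip(*rows))
scaled = [list(r) for r in zip(*[scale_column(c) for c in columns])]
print(scaled)
# -> [[0.0, 0.0, 0.0], [0.5, 0.6666666666666666, 0.5], [1.0, 1.0, 1.0]]
```

The middle row scales to 0.5 in the first and third columns, but to two thirds in the second column, because 30,000 sits two thirds of the way between 10,000 and 40,000.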