Learn about transforming dataset distributions.
- [Instructor] The last thing I want to discuss in the math and statistics portion of this course is scaling and transforming variables. You scale variables so that differences in magnitude don't produce erroneous and misleading results. For example, imagine you're in charge of sales and marketing for Zack's Department Store. To measure the success of a recent holiday campaign, you decide to compare daily sales revenues from a dataset from 1990 with one from 2016. That's all you could get, so you measure the average sales revenue increase between November 15th and December 15th back in 1990.
There was an average increase of $20 per checkout in that time period, but in 2016 the average increase was $200 per checkout. Is that net gain of $180 per checkout due to your marketing savvy? No, it's due to other factors like monetary inflation and an increase in brand trust since 1990. You're trying to compare apples and oranges here because you forgot to scale your variables. Before comparing seasonal sales revenue changes, make sure you scale your variables to the same range of values.
There are two ways you can scale your data. One is normalization and the other is standardization. I'm going to show you how to do both of these in our coding demonstration. But to explain, normalization is putting each observation on a relative scale between zero and one. So basically you take the value of your observation and divide it by the sum of all observations in a variable, kind of like back in high school when you calculated your grade on a test based on how many questions you missed out of the total number of questions on the exam. Standardization is rescaling data so it has a zero mean and unit variance.
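Before we get to scikit-learn, the two rescalings just described can be sketched directly in NumPy. This is a minimal illustration with made-up values, not the course's mtcars data:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Normalization as described above: each value as a fraction of the total
normalized = x / x.sum()

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()
```

After normalizing, the values sum to one; after standardizing, the variable has zero mean and unit variance.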
If this doesn't make much sense to you now, don't worry too much, because I'm going to show you what this means in the coding demonstration. Before going into the coding demonstration though, I want to discuss scikit-learn and its preprocessing tools. Scikit-learn is the machine learning library for Python, and it's got a whole set of preprocessing tools. There are tools for scaling your data, centering it, normalizing it, binning it, and imputing it. In this section, I'm going to show you the scaling and normalizing functions.
Let's look at how to transform dataset distributions. In this demo, we're going to use NumPy, pandas, and SciPy, so we'll import those. We're also going to use the standard libraries for data visualization you saw back in chapter two, so we'll import those. And we're also going to use scikit-learn, which we just talked about. Let's import that by saying import sklearn.
Then we'll say from sklearn import preprocessing, and from sklearn.preprocessing we'll import the scale function. Execute that code and you've got the libraries you'll need. Let's add our data visualization parameters like we have been throughout the course. And again, this example is going to use the mtcars dataset.
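Gathered in one place, the imports described above might look like the following. Matplotlib stands in here for the chapter-two visualization setup; any other plotting library you've been using would be imported the same way:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import scale
```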
So we'll load that like we have been. Let's isolate and plot the mpg variable; I want to show you the scale function. As you saw in chapter two, we do that by selecting the mpg variable from the cars data frame and then plotting it: we call plt.plot on the mpg variable, and we get a line plot. As you can see, the max and min values range approximately between 10 and 35.
Also, the average of the variable is about 20. Another easy way to get a handle on this information about a variable is the describe method. Let's try that instead. So let's select our variable and call the describe method off of it. We see here that the mean is 20.09, the min value is 10.4, and the max value is 33.9. Okay, so now I want to show you how to transform this variable. The first thing we need to do is place our values in a one-column matrix.
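A quick sketch of the describe step. To keep this self-contained, it uses a handful of mpg values from mtcars in place of loading the full file (the exact load path isn't shown here):

```python
import pandas as pd

# A few mpg values from mtcars, standing in for the full dataset
mpg = pd.Series([21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 33.9, 10.4])

summary = mpg.describe()  # count, mean, std, min, quartiles, max
print(summary)
```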
We can do that by writing mpg.values.reshape, passing in the shape of negative one and one, and we'll call this mpg_matrix for an object name. Now let's instantiate a MinMaxScaler object. MinMaxScaler is the transformer we use to scale our variable to a defined range. By default, that range is between zero and one.
So we'll call the MinMaxScaler function, but we need to access it from the preprocessing module, so we'll write preprocessing dot MinMaxScaler, and we'll call the output scaled. Next we call the fit_transform method, to fit the MinMaxScaler transformer to our data and return a transformed version of that data. I'll show you. So we take our MinMaxScaler object, which we called scaled.
And we call the fit_transform method off of it, passing in our matrix, and we'll call this whole thing scaled_mpg. Lastly, let's plot it out and see what our axes look like, to see the difference between the non-scaled variable and the scaled variable. Pass in our output object. Oh, it looks like I spelled scaler wrong, so we just replace the a with an e.
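Putting the reshape and fit_transform steps together, here's a minimal sketch, again with stand-in mpg values rather than the full mtcars file:

```python
import numpy as np
from sklearn import preprocessing

mpg = np.array([21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 33.9, 10.4])
mpg_matrix = mpg.reshape(-1, 1)            # one-column matrix

scaled = preprocessing.MinMaxScaler()      # default range is [0, 1]
scaled_mpg = scaled.fit_transform(mpg_matrix)  # fit and transform in one step
```

After the transform, the minimum value maps to 0 and the maximum to 1.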
So we plot that out, and we can see that now the max value of the mpg variable has been scaled down to one, and the min value to zero. Very cool. Just in case you need to scale the variable to a different range though, you can always pass in the feature_range argument to define the exact range you want your variables to be scaled to. So for our example, I'm just going to borrow the code from the transformation we just did, but this time I'm going to pass in the argument feature_range, set equal to zero and 10.
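The same transformation with a custom range, sketched with the same stand-in values:

```python
import numpy as np
from sklearn import preprocessing

mpg_matrix = np.array([21.0, 22.8, 18.7, 33.9, 10.4]).reshape(-1, 1)

# feature_range stretches the output to [0, 10] instead of the default [0, 1]
scaled = preprocessing.MinMaxScaler(feature_range=(0, 10))
scaled_mpg = scaled.fit_transform(mpg_matrix)
```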
And as you can see here, the values along the y axis have been scaled between zero and 10. Now I'm going to show you how to standardize a variable. To standardize, you just use the scale function. By default, this function will center the data and scale it to unit variance. Let's see what happens when you pass in the arguments with_mean=False and with_std=False. The first thing we need to do is call our function, then pass in the name of our variable, mpg, and we'll say axis=0, with_mean=False, and with_std=False.
We'll call this whole thing standardized_mpg, and then we'll call the plot function on it. So plt.plot, pass in our object, and print it out. As you can see from the plot, we just get our original variable back, unscaled and untransformed. But when you call the scale function on a variable without passing in any arguments for with_mean and with_std, the function carries out its default transformation.
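A sketch of the no-op call just described, with both flags switched off (stand-in values again):

```python
import numpy as np
from sklearn.preprocessing import scale

mpg = np.array([21.0, 22.8, 18.7, 33.9, 10.4])

# With centering and scaling both disabled, scale() returns the data unchanged
standardized_mpg = scale(mpg, axis=0, with_mean=False, with_std=False)
```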
Let me show you. So we'll say scale, pass in the mpg variable, and then again we'll call this standardized mpg. Call the plot function on it, and print it out. You can see from this simple line chart, that the mean of the mpg variable has been centered to zero and the distribution now has unit variance. In other words, the variable now has a standard normal distribution, it's been standardized.
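And the default call, which standardizes. A minimal sketch, checking that the result is centered at zero with unit variance:

```python
import numpy as np
from sklearn.preprocessing import scale

mpg = np.array([21.0, 22.8, 18.7, 33.9, 10.4])

# Defaults are with_mean=True and with_std=True: center, then scale to unit variance
standardized_mpg = scale(mpg)
```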
In case you've not taken a lot of statistics, and you're curious about what unit variance is, I recommend that you look at the information at the link shown here.