Learn how to standardize, or normalize, your data so one larger or smaller set of values won’t throw off your result.
- [Instructor] Machine learning data sets will occasionally contain a column of values that are much larger or much smaller than the others. The classic example is house prices, where the value of the house is much larger than the number of square feet, bedrooms or bathrooms. In this movie, I will show you how to standardize, also called normalize, your data so one larger or smaller set of values won't throw off your result. My sample file is the standardize notebook, and you can find it in the Chapter 2 folder of your exercise files collection.
I just opened this notebook, so I need to evaluate it to assign the values to my variable A list. So I will open the Evaluation menu and click Evaluate Notebook. Great. Now I can perform some calculations on the data as I have it. So let's say I have to find the mean, or the arithmetic average, of the values in A list. I can do that by typing Mean followed by a left square bracket, the variable name A list, a right square bracket, and Shift Enter.
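As a minimal sketch of that step, assuming the spoken "A list" corresponds to a variable named aList. Both the name and the sample values below are hypothetical, because the notebook's actual data does not appear in this transcript:

```wolfram
(* Hypothetical sample data standing in for the notebook's list *)
aList = {2, 4, 4, 4, 5, 5, 7, 9};

(* Arithmetic average: the sum of the values divided by their count *)
Mean[aList]   (* 5 *)
```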
And I get an average of 2051. I can also check the variance, which measures how far the values spread out from the mean. In other words, we take however much each individual value differs from the mean, 2051, square that difference, add up the squares, and divide by one less than the number of values. I can calculate that using the Variance keyword: left square bracket, A list again, right square bracket, and Shift Enter. I get the expression 1,422,494 divided by 9.
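To make the variance calculation concrete, here is a small sketch on hypothetical data (the list name aList and its values are assumptions, not the notebook's contents). It compares the built-in Variance with the sample-variance formula written out by hand; with exact integer input, the result stays an exact fraction, much like the 1,422,494 divided by 9 above:

```wolfram
(* Hypothetical sample data standing in for the notebook's list *)
aList = {2, 4, 4, 4, 5, 5, 7, 9};

(* Built-in sample variance *)
Variance[aList]   (* 32/7 *)

(* The same quantity by hand: squared deviations from the mean,
   summed, then divided by one less than the number of values *)
Total[(aList - Mean[aList])^2]/(Length[aList] - 1)   (* 32/7 *)
```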
If I wanted to see that as a numerical value, I could type N followed by a left square bracket, then Variance, A list in square brackets, and a right square bracket to close out N. Shift Enter, and there's the value. If I want to standardize the list, I am rescaling the list so that it has a mean of zero and a standard deviation of one. That means subtracting the mean from every value and then dividing by the standard deviation.
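A short sketch of the exact-versus-numeric step, again on a hypothetical list named aList (an assumed name and assumed values):

```wolfram
(* Hypothetical sample data standing in for the notebook's list *)
aList = {2, 4, 4, 4, 5, 5, 7, 9};

(* Exact result, returned as a fraction *)
Variance[aList]      (* 32/7 *)

(* N converts the exact fraction to a decimal approximation *)
N[Variance[aList]]   (* 4.57143 *)
```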
To do that I will type Standardize. Actually, I need to assign this to a variable. So I will backspace over Standardize and I will type new list, an equals sign, and then the Standardize keyword, followed by a left square bracket, then A list, which is my original data, followed by a right square bracket and Shift Enter.
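The assignment might look like the sketch below. The variable names aList and newList are inferred from the spoken names, and the data is hypothetical. The last line is an optional check, under the assumption that Standardize on a plain list matches subtracting the mean and dividing by the sample standard deviation:

```wolfram
(* Hypothetical sample data standing in for the notebook's list *)
aList = {2, 4, 4, 4, 5, 5, 7, 9};

(* Store the standardized values in a new variable *)
newList = Standardize[aList]

(* Manual version of the same rescaling, for comparison *)
manual = (aList - Mean[aList])/StandardDeviation[aList];

(* If the two agree, every difference simplifies to zero *)
Simplify[newList - manual]
```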
You'll note that each original value has had the mean subtracted and has then been divided by the standard deviation, which is the square root of the variance. Any value that is less than the average, 2051, is negative, and any value that's greater is positive. Let me show you what new list looks like as a number series. So I'll type N, with new list in square brackets. Shift Enter, and there are the values. If I want to see the mean of new list, I can use the Mean keyword again, with new list in square brackets, Shift Enter, and I get a mean of zero.
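These checks can be sketched as follows, using hypothetical data and the assumed variable names aList and newList:

```wolfram
(* Hypothetical sample data standing in for the notebook's list *)
aList = {2, 4, 4, 4, 5, 5, 7, 9};
newList = Standardize[aList];

(* Decimal view of the standardized values *)
N[newList]

(* Values below the original mean come out negative, values above it
   positive, so the signs match the signs of the raw deviations *)
Sign[newList] == Sign[aList - Mean[aList]]

(* Mean of the standardized list; the transcript reports zero for its data *)
Mean[newList]
```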
And the standard deviation, which is the square root of the variance, we can calculate using StandardDeviation, then a left square bracket, new list, a right square bracket, and Shift Enter, and we get a standard deviation of one. So if you ever have a data set where you're concerned that one of the columns might be throwing off your results because of the magnitude of the values in that column, you can use Standardize to bring the values more in line.
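As a closing sketch, here is the same idea applied to a small table in the spirit of the house-price example. The houses data is entirely made up, and the code relies on the documented behavior that Standardize, given a matrix, standardizes each column separately; treat it as an illustration rather than the course's own notebook:

```wolfram
(* Hypothetical rows of {price, square feet, bedrooms} *)
houses = {
  {250000, 1400, 3},
  {410000, 2100, 4},
  {325000, 1750, 3},
  {560000, 2900, 5}
};

(* Each column is rescaled on its own, so price no longer dwarfs the rest *)
scaled = Standardize[houses];
N[scaled]

(* Per-column checks: means should be zero and standard deviations one *)
Chop[N[Mean[scaled]]]
N[StandardDeviation[scaled]]
```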