Join Curt Frye for an in-depth discussion in this video, Rescaling data to minimize bias, part of Mathematica 10 Advanced Analysis.
- Many machine learning applications use a variety of data inputs to generate results. The classic teaching example is to calculate home value based on comparable sales prices and the square footage of the home that was sold. Square footage is often between 1,000 and 3,000 square feet in the U.S., while sale prices can be $200,000 to $300,000 or more. The difference in scale between 200,000 and 1,000 can throw off your results in some cases. So in this movie I will show you two ways to rescale your data to minimize bias.
I'm working in a blank Mathematica notebook, and the first thing that I need to do is to enter in a series of values, and I'll call those dValues. So lowercase dValues = and I'll make this a list, so I'll start that list with a left curly bracket, and I'll type 1800 comma 2500 comma 3400 comma and 1347, followed by a right curly bracket to close the list.
These values are the square footage of homes that have been sold. So I'll type shift + enter just to make sure that I entered everything correctly. If I want to rescale those values using the Rescale keyword, I can type in a new variable, and I'll call that rescaledValues, and equal, then I'll type in "N" which is used to generate a numeric response as opposed to a symbolic response, then a left square bracket and the Rescale keyword. "R-E-S-C-A-L-E" then a left square bracket and I'll use dValues followed by a right square bracket to close the Rescale argument list and another to close the N or numeric argument list, and when I press shift + enter, I get rescaled values.
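Typed out, the steps so far look something like the following. This is a minimal sketch of what was just entered, using the same names, dValues and rescaledValues; the decimals in the comment are rounded.

```mathematica
(* Square footage of homes that have been sold *)
dValues = {1800, 2500, 3400, 1347};

(* Rescale maps the smallest value to 0 and the largest to 1;
   N forces a decimal answer instead of exact fractions *)
rescaledValues = N[Rescale[dValues]]
(* roughly {0.2207, 0.5616, 1., 0.} *)
```

Without the N wrapper, Rescale applied to a list of integers returns exact fractions rather than decimals.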
The largest value, which is 3400, is listed as one, in other words, that's the maximum, and the smallest value, in this case 1347, the fourth one, is zero. And the other two values, 1800 and 2500, show where they lie on the continuum defined by a minimum of 1347 and a maximum of 3400. So this means that all of your values will be between zero and one. Another way to rescale your data, and one that is often used in machine learning, is to rescale values as a proportion of the data set's average.
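To see where those in-between numbers come from, the same result can be computed by hand: subtract the minimum from each value, then divide by the distance between the minimum and the maximum. The sketch below assumes the same dValues list; the names lo, hi, and manual are made up for illustration and are not part of the walkthrough.

```mathematica
dValues = {1800, 2500, 3400, 1347};

lo = Min[dValues];  (* 1347 *)
hi = Max[dValues];  (* 3400 *)

(* (value - minimum) / (maximum - minimum), applied to every element *)
manual = (dValues - lo)/(hi - lo);

(* Compare against the built-in; both are exact fractions here *)
manual == Rescale[dValues]
(* True *)

(* Rescale also accepts an explicit range as a second argument.
   The range {1000, 4000} is an assumed example, not from the video. *)
N[Rescale[dValues, {1000, 4000}]]
(* roughly {0.2667, 0.5, 0.8, 0.1157} *)
```

Supplying a fixed range can be useful when new data arrives later, since the minimum and maximum no longer depend on whichever values happen to be in the list.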
So let's find the average of the values in dValues. I'll assign it to a variable, avg =, and it would be the Mean keyword, "M-E-A-N," that's another word for average. Left square bracket, then dValues, right square bracket and shift + enter and I get 9047 divided by four, and it's a symbolic answer but it doesn't matter because I'm going to be defining a numeric value in a moment. Now if I want to calculate each of the values in the dValues list as a proportion of the mean of those values, then I can create a new variable called newValues equals N, again forcing a numeric answer, left square bracket, then the Divide keyword, "D-I-V-I-D-E," left square bracket, then dValues comma "A-V-G." So again what I'm doing is dividing each value in dValues by the average that I calculated and I am requiring a numeric answer as opposed to this type of symbolic answer.
So I'll type two right square brackets to close out my argument lists and shift + enter, and there I see my values. I see that 1800 is about 80% of the average, 2500 is about 1.1 times the average, 3400 is about 1.5 times, and 1347 is about 6/10 of the average value. I encourage you to use both of these methods on your data and find the one that works best for your particular application.
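Written out, the second method looks something like this. It is a minimal sketch using the names from the walkthrough, avg and newValues; the decimals in the comments are rounded.

```mathematica
dValues = {1800, 2500, 3400, 1347};

(* Mean returns an exact fraction for integer input *)
avg = Mean[dValues]
(* 9047/4 *)

(* Divide works element by element on the list; N forces decimals *)
newValues = N[Divide[dValues, avg]]
(* roughly {0.796, 1.105, 1.503, 0.596} *)

(* After this rescaling, the values average out to approximately 1 *)
Mean[newValues]
```

Writing N[dValues/avg] should give the same result, since the slash is shorthand for Divide. Unlike Rescale, this method does not confine the values to the range zero to one; it centers them around one instead.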
- Generating maps
- Displaying statistical data about countries and cities
- Displaying changes in stock prices
- Calculating exchange rates
- Counting links in a social network
- Calculating reciprocity
- Applying machine-learning algorithms
- Analyzing text data