# Transforming outliers

## Video: Transforming outliers

In the last video, we talked about a few relatively simple ways of dealing with outliers: leaving them in, if that can be justified; rolling them into other categories, at the risk of creating a heterogeneous group; or deleting them, or temporarily selecting them out of the analyses. Now, while these approaches may make sense if you don't have too many outliers, say no more than 2% or 3% as a rough estimate, they also do some damage to the data and can cause you to lose cases, and you may have worked very hard to get those data.

So another alternative, if you have a scale variable, is to perform a mathematical transformation on the data. What this does is modify all the scores in the variable, generally creating a new variable in the process, using a set formula. Now people are very familiar with transformations such as multiplying, adding, or subtracting a certain amount, and that's taken as common practice. What we're going to do in this case is the most common approach for distributions that have a few extremely high scores, like the market capitalization one that we looked at in the last video: take the logarithm of the scores.

Now you may remember logarithms from junior high. These have the effect of bringing in extremely high scores. So for instance, the base 10 logarithm of 10 is 1, the logarithm of 100 is 2, the logarithm of 1,000 is 3, and it brings in the scores in a predictable way. And this is a legitimate way of dealing with outliers, as long as you always specify that you are dealing with the logarithms from this point on. On the other hand, if you have unusual scores at the low end of the distribution, you might want to try squaring the scores, because that pushes all the scores above 1 up but pushes the higher ones even further.
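To make the arithmetic concrete, here is a minimal Python sketch of both transformations. The function names and the sample scores are made up for illustration; they are not part of the course or of SPSS.

```python
import math

def log10_transform(scores):
    """Return the base 10 logarithm of each score (scores must be positive)."""
    return [math.log10(x) for x in scores]

def square_transform(scores):
    """Return the square of each score."""
    return [x ** 2 for x in scores]

# Hypothetical scores with a few extremely high values.
high_skewed = [10, 100, 1_000, 1_000_000]
logged = log10_transform(high_skewed)

# Hypothetical scores with one unusually low value.
low_skewed = [1, 8, 9, 10]
squared = square_transform(low_skewed)

print(logged)   # the gaps between scores shrink to steps of 1, 1, and 3
print(squared)  # the gap between the top scores grows faster than at the bottom
```

The logarithm turns each tenfold jump into a step of 1, so the extreme score is pulled in. Squaring does the opposite: it stretches the upper part of the scale more than the lower part.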

Now in both situations this assumes that you do not have zeros or negative scores, that is, that all of your scores are positive. There are other ways of dealing with those. You can add a constant to them, but we don't need to deal with that right now. What I'm going to do is look at the market capitalization data that we had in our last data set. Now I had filtered out cases of under \$100 million market capitalization. I'm going to undo that filter right now. I'm going to Data, to Select Cases, and saying please use all of them.
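The video does not demonstrate the "add a constant" approach, so the following is only a rough sketch of one way it could look. The function name `shifted_log10` and the default choice of constant (shift the smallest score to 1) are assumptions for illustration, not something the course prescribes.

```python
import math

def shifted_log10(scores, constant=None):
    """Add a constant so every score is positive, then take log10.

    If no constant is given and the data contain zeros or negatives,
    shift so that the smallest score becomes 1 (an assumed default).
    """
    if constant is None:
        smallest = min(scores)
        constant = 1 - smallest if smallest <= 0 else 0
    shifted = [x + constant for x in scores]
    if min(shifted) <= 0:
        raise ValueError("constant too small: all shifted scores must be positive")
    return [math.log10(v) for v in shifted]

# Hypothetical scores that include a zero and a negative value.
with_zero = shifted_log10([0, 9, 99])     # shifted to 1, 10, 100
with_negative = shifted_log10([-5, 4])    # shifted to 1, 10
```

Whatever constant you choose, report it along with the transformation, since it changes the values that later analyses are based on.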

And so now it just tells me that the filter is off, and you can see that none of the cases are selected out anymore. I'm going to come back here and take another quick look at the box plot for market capitalization that we did before. We have an extremely skewed distribution. Now let's see whether taking a logarithm could help make this a little less skewed. What we do is come to Transform, to Compute Variable, and I'm going to create a new variable called LogMarketCap, and that's pretty easy.

It is going to be the logarithm of the market capitalization. Now we have two choices for the logarithm. Log10 is what's called the base 10 logarithm. It takes the number 10 and raises it to a particular exponent to get a number, and that exponent is the logarithm. There's also the natural logarithm, which is on the base e, 2.71828 and so on, an irrational number. And while there are very pleasing aesthetic aspects of the natural logarithm, the base 10 logarithm is easier to interpret, so that's the one we usually use.
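As a quick illustration of how the two logarithms relate, the sketch below uses Python's standard `math` module. The helper name `log10_from_ln` and the sample value are made up for the example.

```python
import math

def log10_from_ln(x):
    """Base 10 logarithm computed from the natural log: log10(x) = ln(x) / ln(10)."""
    return math.log(x) / math.log(10)

value = 1_000_000              # hypothetical score
base10 = math.log10(value)     # exponent you raise 10 to in order to get value
natural = math.log(value)      # exponent you raise e (about 2.71828) to

print(base10)    # reads directly as "10 to the 6th"
print(natural)   # same information on a different scale
```

The two versions differ only by a constant factor, so either one pulls in high scores in the same way. The base 10 version is simply easier to read back as powers of ten.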

So what I do is double-click on that and it comes up in the numeric expression. I just double-click on MarketCap and it fills it in, so it reads as the Log10 of MarketCap. Press OK and it tells me that it's created a new variable. If I go to the data set, I can see it right here at the end. You see the numbers are much smaller, at most double digits, but that's because we're dealing with very large numbers over here, and the logarithm has more to do with the number of zeros in the number. Now what I'm going to do is go back and create another box plot, but instead of doing market capitalization this time, I'll do the log of the market capitalization.
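Outside SPSS, the same step of computing a new variable from an existing one can be sketched in a few lines of Python. The company names and dollar values below are invented for illustration and are not the course's data; the function name `add_log_variable` is likewise an assumption.

```python
import math

# Made-up market capitalizations in dollars (not the course data set).
companies = [
    {"name": "Company A", "MarketCap": 50_000_000},
    {"name": "Company B", "MarketCap": 1_000_000_000},
    {"name": "Company C", "MarketCap": 300_000_000_000},
]

def add_log_variable(rows, source="MarketCap", target="LogMarketCap"):
    """Return copies of the rows with a new base 10 log variable added.

    The original variable is left untouched, as when a transformation
    creates a new variable rather than overwriting the old one.
    """
    return [{**row, target: math.log10(row[source])} for row in rows]

with_logs = add_log_variable(companies)
for row in with_logs:
    print(row["name"], row["MarketCap"], round(row["LogMarketCap"], 2))
```

A value with 8 digits lands between 7 and 8, one with 10 digits lands at 9 or just above, and so on, which is why the new variable tracks the number of zeros more than the raw size.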

Just drag that in and leave everything else the same. What's interesting in this case is that we still have outliers, but this time they are symmetrically distributed: we have outliers on the high end, but we also have outliers on the low end. In fact, the distribution is remarkably symmetrical. It looks like it's spread out almost exactly the same amount in each direction. And you can also see that Apple is still an outlier, but look how close it is to Google, for instance, whereas in the earlier box plot, here's Apple and here's Google down here.
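A box plot typically flags a case as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule before and after a log transformation. It assumes Python's `statistics.quantiles` for the quartiles, which may not match the exact method SPSS uses, and the scores are invented.

```python
import math
import statistics

def tukey_outliers(scores, k=1.5):
    """Return the scores lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in scores if x < low or x > high]

# Hypothetical, strongly right-skewed scores.
raw = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]
logged = [math.log10(x) for x in raw]

raw_flags = tukey_outliers(raw)      # the largest score is flagged
log_flags = tukey_outliers(logged)   # nothing is flagged after the transform
```

In this toy sample the transformation removes the flag entirely. In the video's data some outliers remain after the transformation, but they appear on both ends and sit much closer to the rest of the cases.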

So what we've done is take an extremely asymmetrical, skewed distribution and, by taking the logarithm, pull it in and make it symmetrical. Now there are still outliers, but they are on both sides and they're not terribly far away like they were before. And so we've taken a variable that we really might not have been able to deal with before, or where we would have had to cut an awful lot of the scores to make it work, and now we can actually leave all of the scores in, use the entire data set, and still come pretty close to meeting the assumptions of most statistical procedures.
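One way to put a number on "skewed" versus "symmetrical" is the skewness statistic. The sketch below uses the simple moment-based formula (third central moment divided by the 1.5 power of the second), which is not necessarily the bias-adjusted version a statistics package reports, and the scores are invented for illustration.

```python
import math

def skewness(scores):
    """Moment-based skewness: about 0 for symmetric data, positive for a long right tail."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5

# Hypothetical scores with a long right tail.
raw = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]
logged = [math.log10(x) for x in raw]

raw_skew = skewness(raw)      # clearly positive
log_skew = skewness(logged)   # essentially zero
```

Real data will rarely come out this cleanly, but a drop in skewness toward zero after the transformation is the pattern to look for.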

And so a logarithmic transformation in this case was a huge help in making our data meet the assumptions that we need to make it more manageable for analysis.

#### This video is part of

SPSS Statistics Essential Training (2011)

