Join Barton Poulson for an in-depth discussion in this video Examining outliers, part of R Statistics Essential Training.
One of the biggest challenges you can have in data analysis is dealing with unusual or extreme values, such as outliers. In this movie, we're going to look at a few ways of describing what we mean by an outlier, and ways that you can handle them. First, I'm going to start with categorical outliers. Now, it's not usual to think of categorical variables as having outliers, because outlier implies distance or position, and categories don't have distance or position. But what you can have is unusual values. And normally what that means is a category that constitutes less than 10% of the total sample.
The reason for that is that most categorical analyses use a normal approximation to their original distributions, and if a category is less than 10%, then the normal curve no longer serves as a good approximation. The data set that we're going to use for this one actually comes from Wikipedia, and it's worldwide shipments of smartphone operating systems (OS), in millions, for the first quarter of 2013. I've put that into a CSV file, which is available in the exercise files.
And I've currently saved it into my R folder on the desktop, which I find to be a convenient place to keep things. It's a small file, and we're going to use read.csv to bring it into R and put it into the workspace. Now, if you want to see what's in there, we just do View(OS), and that'll bring up a window with a little spreadsheet-like area. And we'll see that Android apparently is dominating the world market, accounting for 75% of all smartphone operating systems, while iOS is accounting for just over 17%.
And then we have Windows Phone, BlackBerry, Linux, Symbian, and other accounting for diminishing proportions. Apparently the worldwide market is very different from the US market. But those are the actual worldwide data. Now, if you want to see it right here in the console beneath it, you can just type in OS, and there you get the same data right there. So, an outlier in this case is anything that accounts for less than 10% of the total data set. Well, we have several. These ones are all less than 10%: Windows Phone, BlackBerry OS, Linux, Symbian, and other.
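As a minimal sketch of that 10% check, the code below rebuilds a small stand-in for the smartphone table and flags the rare categories. The exercise CSV is not reproduced here, so the shipment figures and the column names `OS`, `Units`, `Proportion`, and `Rare` are illustrative assumptions, not the contents of the course file.

```r
# Stand-in for the exercise CSV. The figures (millions of units) and the
# column names are assumptions chosen to resemble the shares described.
OS <- data.frame(
  OS    = c("Android", "iOS", "Windows Phone", "BlackBerry OS",
            "Linux", "Symbian", "Other"),
  Units = c(162.1, 37.4, 7.0, 6.3, 2.1, 1.2, 0.1),
  stringsAsFactors = FALSE
)

# Each category's share of the total shipments
OS$Proportion <- OS$Units / sum(OS$Units)

# Flag categories that make up less than 10% of the total
OS$Rare <- OS$Proportion < 0.10

# Print the table in the console, then list the flagged categories
print(OS)
rare_names <- OS$OS[OS$Rare]
print(rare_names)
```

With the real data you would call read.csv() on your own copy of the exercise file instead of typing the values in by hand.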
Each of those constitutes less than 10% of the global market for smartphone operating systems. What we can do, then, is choose between two different options. One is to combine all of them into an "other" category, so we could take Windows Phone, and BlackBerry, and Linux, and Symbian, and we can put them all together with other, and create sort of a miscellaneous pile. Now, there are two problems with that. Number one is that in this particular case, those still aren't going to add up to 10% as a group.
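To see why pooling does not help here, a rough sketch can add up the small categories and compare the pooled share with the 10% cutoff. As before, the shipment figures and the column names are illustrative assumptions rather than the exercise file, and the label "Other (pooled)" is made up for this example.

```r
# Stand-in data; values and column names are assumptions for illustration.
OS <- data.frame(
  OS    = c("Android", "iOS", "Windows Phone", "BlackBerry OS",
            "Linux", "Symbian", "Other"),
  Units = c(162.1, 37.4, 7.0, 6.3, 2.1, 1.2, 0.1),
  stringsAsFactors = FALSE
)
OS$Proportion <- OS$Units / sum(OS$Units)

# Categories under the 10% cutoff
small <- OS$Proportion < 0.10

# Share of the total once all small categories are pooled together
pooled_share <- sum(OS$Proportion[small])
print(pooled_share)

# Build the pooled table: large categories plus one miscellaneous row
combined <- rbind(
  OS[!small, c("OS", "Proportion")],
  data.frame(OS = "Other (pooled)", Proportion = pooled_share,
             stringsAsFactors = FALSE)
)
print(combined)

# TRUE means the pooled category is still too small to keep
print(pooled_share < 0.10)
```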
The second is that there are times when combining things into an "other" category really does a disservice to the data. This is easiest to see if you think about doing a survey where you're dealing with racial and ethnic categories. Combining people from various ethnic groups or cultural groups into one homogeneous category could be doing damage to the very important differences between these groups. In that case, you just wouldn't want to use the variable at all. You don't want to delete the people necessarily, but you don't want to use that variable because it's not a reliable thing.
In this case, instead of combining them into other, because it's still not going to be big enough, the easier solution is to just not include them. So what I'm going to be doing is creating a new data set called OS.hi for anything that's high enough. And I'm using the function subset, and what that means is I'm going to get stuff out of OS. So I have subset as my function, then I'm reaching into OS, because that's what I'm getting a subset of, and my criterion here is that I'm looking for anything where Proportion, which is just the name of this variable, this column, is > 0.1. So anything that is more than 10%.
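The subset step just described might look like the sketch below. The data frame is again a stand-in with assumed figures and assumed column names (`OS`, `Units`, `Proportion`); only the name OS.hi and the `> 0.1` criterion come from the narration.

```r
# Stand-in data; values and column names are assumptions for illustration.
OS <- data.frame(
  OS    = c("Android", "iOS", "Windows Phone", "BlackBerry OS",
            "Linux", "Symbian", "Other"),
  Units = c(162.1, 37.4, 7.0, 6.3, 2.1, 1.2, 0.1),
  stringsAsFactors = FALSE
)
OS$Proportion <- OS$Units / sum(OS$Units)

# Keep only the rows whose share is above 10%; OS itself is left untouched
OS.hi <- subset(OS, Proportion > 0.1)

# Inspect the structure and contents of the reduced data set
str(OS.hi)
print(OS.hi)
```

Because subset() returns a new data frame, the original OS is still available if you need the full table later.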
When I do that, I just have two observations of three variables. It's a small data set, if you want to see it, because now I have only Android and iOS. But if I wanted to use a data set that has only categories with a substantial number of observations in each, that's probably what I would do.
For quantitative data, it works a little bit differently. It's easy to work with box plots. Let's use a data set called rivers, which gives the lengths of major North American rivers. And we'll just load it into the workspace.
Let's draw a histogram. And what you see there is that most of the rivers are relatively short, but we've got some really long ones. If we draw a box plot, and I'm going to draw it horizontally so it's in the same orientation as the histogram, you see we have a lot of outliers. In fact, the highest non-outlying value is about 1200 miles. We get the statistics that go with that by using boxplot.stats. Now, we have a list of every one of our outliers. So, one possibility is simply to remove all of the outliers. So, anything above, say, 1210 would be considered an outlier.
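A minimal sketch of those three steps, using the built-in rivers data, could look like this. The variable name `stats_full` is made up for the example; the functions hist(), boxplot(), and boxplot.stats() are all part of base R.

```r
# rivers ships with R in the datasets package
data(rivers)

# Histogram of river lengths (in miles)
hist(rivers)

# Horizontal box plot, matching the orientation of the histogram
boxplot(rivers, horizontal = TRUE)

# Numbers behind the box plot:
#   $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
#   $out   = every value flagged as an outlier
stats_full <- boxplot.stats(rivers)
print(stats_full$stats)
print(stats_full$out)
```

The last element of `$stats` is the highest non-outlying value, which is the number to look at when choosing a cutoff such as 1210.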
We can now just look at a data set that does not have those, so I'm creating rivers with low values; that's what I'm calling it. So I've got that one now, but what's interesting is that if I create a box plot of that one, it now has its own outliers. These were not outliers before. The reason for the change, by the way, is that I've changed the total sample size, which means I've also changed the boundaries of the middle 50%. And it's that box in the middle that is used to determine whether other scores are outliers. So when you change the total sample size, you will often change the boundaries for outliers.
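To make the shifting boundary concrete, the sketch below computes the upper cutoff by hand before and after trimming. It assumes the default box plot rule (upper hinge plus 1.5 times the distance between the hinges), which is what boxplot.stats() uses unless you change its coef argument. The helper `upper_fence` and the object name `rivers.low` are names chosen for this example.

```r
# Upper outlier boundary under the default rule:
# upper hinge + 1.5 * (upper hinge - lower hinge)
upper_fence <- function(x) {
  h <- fivenum(x)  # min, lower hinge, median, upper hinge, max
  h[4] + 1.5 * (h[4] - h[2])
}

# Trim everything at or above the first cutoff
rivers.low <- rivers[rivers < 1210]

# The boundary moves down once the longest rivers are gone
fence_full <- upper_fence(rivers)
fence_low  <- upper_fence(rivers.low)
print(c(full = fence_full, trimmed = fence_low))

# Values that were inside the old boundary but are outside the new one
print(rivers.low[rivers.low > fence_low])
```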
And so I have the option of getting the statistics again, and I've got some more outliers. They start at 1171. I can just remove them again. And I find that if I draw a box plot of that one, I still have an outlier, and you can continue with this process ad nauseam. I'm not going to go any further, but this is just to show you again that one way of dealing with quantitative outliers, if you think it makes sense, is to simply trim them off and make it clear that you're now dealing with a reduced data set. That's an easy process to do, either by using subset or, in this case, just by using a selector, and this one means select all the cases where the value is < 1055.
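The two rounds of trimming can be sketched as follows. The cutoffs 1210 and 1055 are the ones mentioned above; the object names `rivers.low`, `rivers.low2`, `out1`, `out2`, and `via_subset` are assumptions made for this example.

```r
# Round 1: drop the original outliers and keep the rest in a new object
rivers.low <- rivers[rivers < 1210]
out1 <- boxplot.stats(rivers.low)$out   # outliers of the trimmed data
print(out1)

# Round 2: trim again with a selector, still starting from the original data
rivers.low2 <- rivers[rivers < 1055]
out2 <- boxplot.stats(rivers.low2)$out  # whatever is still flagged
print(out2)
boxplot(rivers.low2, horizontal = TRUE)

# subset() gives the same result as the bracket selector
via_subset <- subset(rivers, rivers < 1055)
print(identical(via_subset, rivers.low2))

# Sample size at each stage, worth reporting alongside any analysis
print(c(original = length(rivers),
        round1   = length(rivers.low),
        round2   = length(rivers.low2)))
```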
It's an easy way to create a subset of the data. Please note, I am creating a new data set as I do this because I don't want to lose the original one. And then, when I do my analyses, I just have to make clear that I made that selection, and why, and then I can get along with data that better meet some of the assumptions of the common statistical procedures that I'd be using.
- Installing R on your computer
- Using the built-in datasets
- Importing data
- Creating bar and pie charts for categorical variables
- Creating histograms and box plots for quantitative variables
- Calculating frequencies and descriptives
- Transforming variables
- Coding missing data
- Analyzing by subgroups
- Creating charts for associations
- Calculating correlations
- Creating charts and statistics for three or more variables
- Creating crosstabs for categorical variables