# Combining or excluding outliers

When you start looking at your data, one of the problems you might have to deal with is outliers. These are extreme scores, like somebody who is 7 feet tall or somebody who has 26 children, or unusual categories, like being Nepali or a Latin Poetry major. Sometimes these unusual scores or categories are inherently interesting, as with world records or gifted and talented programs in schools. In other situations, however, they can wreak havoc with statistical procedures that are designed to look at general patterns or overall trends.

In the latter case, where you may be more interested in common scores than in uncommon ones, you have a few choices for dealing responsibly with the outliers. The first question is how to define outliers. We've already looked at one way of getting a graphical definition of outliers on a scale variable, and that's the box plot. I am going to come up to Graphs, to Chart Builder, to Boxplot. I will drag in the 1D Boxplot, and let's look at Market Capitalization.

Also, because we have convenient stock symbols over here, I am going to ask for a Point ID so I know who the outliers are. I will just drag that over here and press OK, and what we see is that the variable for Market Capitalization is extraordinarily skewed; in fact, people often call this pathologically skewed. We have Apple here with over \$300 billion in market capitalization, then Microsoft, Oracle, and Google, and it just goes down. And we have this huge number of companies that are stuck at a tiny level of market capitalization, relatively speaking.
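
For readers following along outside SPSS, here is a minimal sketch of the rule a box plot typically uses to mark outliers: points more than 1.5 interquartile ranges beyond the quartiles. The ticker symbols and values are hypothetical, the function name `boxplot_outliers` is made up for illustration, and the quartile method is an assumption that may differ slightly from the one SPSS uses.

```python
from statistics import quantiles

def boxplot_outliers(values_by_id, k=1.5):
    """Return the IDs whose values fall outside the k * IQR fences."""
    values = sorted(values_by_id.values())
    # Quartiles via linear interpolation; other software may use a
    # slightly different definition (an assumption of this sketch).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [name for name, v in values_by_id.items() if v < lower or v > upper]

# Hypothetical market capitalizations, in millions of dollars.
caps = {
    "HYP1": 20, "HYP2": 35, "HYP3": 40, "HYP4": 55, "HYP5": 60,
    "HYP6": 75, "HYP7": 90, "HYP8": 5_000, "HYP9": 300_000,
}
outliers = boxplot_outliers(caps)
print(outliers)
```

Keeping the IDs alongside the values plays the same role as the Point ID in the chart: you can see who the outliers are, not just how many there are.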

In fact, we have no idea what the median or the mean is, because there are 2800 companies in the NASDAQ listing and these extreme outliers squish all the others together so much that it is not possible to really see what's going on. So we know that we have outliers here on a scale variable. On a categorical variable, like ethnicity, one definition of a categorical outlier is any group that makes up, for instance, less than 10% of the overall sample.

In that situation you have the choice of combining them with other categories and creating a sort of Other category, accepting that it will be a very heterogeneous group. Or you simply don't analyze by that variable in the future. But let's talk about what to do with a scale variable. If you don't have very many outliers, or if they're not very far away, you can leave them in. You can take them as legitimate values and proceed with that understanding, as long as you communicate it adequately to others.
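
The idea of folding small categories into an Other group can be sketched as follows. This is only an illustration under stated assumptions: the category labels are placeholders, the helper name `combine_rare` is hypothetical, and the 10% cutoff is the example threshold mentioned above rather than a fixed rule.

```python
from collections import Counter

def combine_rare(labels, min_share=0.10, other="Other"):
    """Relabel every category whose share of the sample is below min_share."""
    counts = Counter(labels)
    total = len(labels)
    rare = {label for label, n in counts.items() if n / total < min_share}
    return [other if label in rare else label for label in labels]

# Hypothetical sample: two common groups and two small ones.
labels = ["A"] * 50 + ["B"] * 40 + ["C"] * 6 + ["D"] * 4
combined = combine_rare(labels)
print(Counter(combined))
```

The original labels are left untouched in `labels`, so you can always go back and report which groups went into the combined category.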

On the other hand, another choice is to exclude them. I don't necessarily mean delete them permanently from the data set; instead, you can create a selector. We've done this before. I should just mention that right here is \$100 billion, and we still have a huge number of companies down there. I am going to select a much smaller number: I am going to go to \$100 million in capitalization. So I am going to go to Data, to Select Cases, choose to select cases if market capitalization is less than 100 million, and press Continue.

Now I have the option of just filtering them out, which creates a new variable that temporarily excludes them, or of deleting them permanently, and I don't want to do that. I am just going to filter them out right now. So I am going to press OK, and it tells me that it has done that selection. And in fact, if I go back to the data set, I will see that some cases, for instance Apple, have been selected out. There is a variable here at the end now. It's a filter variable, and if I click on the value labels, I can see which cases are selected or not selected. And now I am going to go back, and I am going to do my box plot all over again.
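
The difference between filtering and deleting can be sketched like this: every case stays in the data, and a 0/1 flag records whether it is currently selected. The symbols, values, and the field name `selected` are hypothetical; this is an analogue of the idea, not a reproduction of what SPSS does internally.

```python
def flag_selected(rows, threshold):
    """Return copies of the rows with a 0/1 'selected' flag; nothing is deleted."""
    return [{**row, "selected": int(row["market_cap"] < threshold)} for row in rows]

# Hypothetical companies with market capitalization in dollars.
companies = [
    {"symbol": "HYP1", "market_cap": 35_000_000},
    {"symbol": "HYP2", "market_cap": 80_000_000},
    {"symbol": "HYP3", "market_cap": 2_500_000_000},
    {"symbol": "HYP4", "market_cap": 310_000_000_000},
]
flagged = flag_selected(companies, threshold=100_000_000)

# Later analyses use only the selected cases, but all rows are still there.
active = [row["symbol"] for row in flagged if row["selected"]]
print(active)
```

Because the exclusion lives in a flag rather than in a deletion, changing your mind later is just a matter of ignoring or recomputing the flag.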

All I have to do is press OK, and this time I don't have any outliers. In fact, this is a pretty normal-looking box plot. I can see that, among the NASDAQ companies that remain after filtering, the median level of market capitalization is around \$40 million. The first quartile is about \$20 million, so the lowest 25% have \$20 million or less, whereas the third quartile is about \$60 million, so 75% have \$60 million or less. There are of course hundreds of companies above these that were filtered out, but these numbers give a nice picture of what you'd call the small capitalization market.
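
To make the reading of the quartiles concrete, here is a small sketch that computes them for a handful of hypothetical small-cap values (in millions of dollars). The numbers are invented so that the result mirrors the rough 20/40/60 pattern described above, and the interpolation method is an assumption.

```python
from statistics import quantiles

# Hypothetical market capitalizations below the cutoff, in millions of dollars.
small_caps = [5, 12, 20, 33, 40, 48, 60, 77, 95]

# n=4 yields the three cut points: first quartile, median, third quartile.
q1, median, q3 = quantiles(small_caps, n=4, method="inclusive")
print(f"25% of cases are at or below {q1}")
print(f"50% of cases are at or below {median}")
print(f"75% of cases are at or below {q3}")
```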

Anyhow, the ability to either combine groups or to temporarily exclude outliers is one good way of dealing with them, as long as you can justify your choices. Again, that gets back to a general statistical principle that you can do whatever you feel is most appropriate and that serves your purposes in telling an analytical narrative. You're telling a story about your data, and if temporarily excluding cases or combining them with other groups serves your purposes best, then go ahead and do that, as long as you can justify your decision to others.

Now, in the next video I will look at another way that does not exclude the cases. It leaves them all in, but changes them by doing what's called a transformation, to let you use all of your data and see if you can still find a way of telling a coherent narrative that way.

