Join Barton Poulson for an in-depth discussion in this video Anomaly detection data, part of Data Science Foundations: Data Mining.
- [Teacher] Let's begin by looking at univariate outliers, where you're looking at one variable at a time. One easy way to do this is to use a variance- or standard deviation-based measure, where you're looking for cases that are several standard deviations away from the mean. For instance, you might compute what are called z-scores. That's one way to do it, but the problem, of course, is that the outliers themselves inflate the variance or the standard deviation, and so they make it less likely that they will be flagged as outliers, so that's basically a problematic approach.
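As a rough sketch of the z-score approach, and of the inflation problem just described, here's a small Python example (the toy data and the cutoffs are my own illustrative choices, not from the course):

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]   # 95 is a planted outlier

# The planted outlier inflates the standard deviation so much that its own
# z-score comes out around 2.7, under a conventional cutoff of 3:
print(zscore_outliers(data, threshold=3.0))   # → []
print(zscore_outliers(data, threshold=2.5))   # → [95]
```

The obviously extreme value escapes the stricter rule precisely because it drags the standard deviation upward, which is the weakness described above.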
A more common approach is to use quartile- or percentile-based measures, where the distance of each case from the rest is based on the interquartile range, the middle 50% of the scores. This is probably the most common approach, and it's the one that I generally use. Now, one that's not really statistical per se is to use experience. Maybe you're in a field where there are common standards for unusual scores. So, for instance, in medicine they might say, "If your white cell count is above this level, you probably have an infection." I know that in psychology they give tests and say that if you have a score this high, that's what they call outpatient level, and another score is inpatient level.
So you don't actually have to calculate the variance for those; you get to use these established standards because they're based on consistent measurement. And I will say, also, that if you're familiar with a field, if you've been working in it a long time, you may have a very intuitive understanding, based on your personal experience, of what constitutes a normal or an abnormal score, so always be willing to rely on that. And, truthfully, if you use Bayesian methodology, then you can incorporate that.
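The interquartile-range rule can be sketched in a few lines of Python; note that it flags a point that would slip past a 3-standard-deviation z-score rule on the same data (the toy data are an illustrative assumption; the 1.5 multiplier is the conventional one):

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Flag points beyond k * IQR outside the quartile box (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr          # the "fences"
    return [x for x in data if x < lo or x > hi]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(data))   # → [95]
```

Because the quartiles ignore the extreme value entirely, the fences stay tight and the outlier stands out, which is why this measure is more robust than the standard deviation.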
Let me show you an example of a basic box plot for a univariate distribution. This is just artificial data that I created from a chi-square distribution; the orange dots are the individual data points, and there are a bunch of them, and the box plot is laid on top of them. The big rectangle on the left is the range of the middle 50% of the scores, and the thick black line in the middle is the median. The standard practice is to take the width of that big box, the interquartile range, multiply it by one and a half, tack that on to each end, and anything that goes beyond those fences is considered an outlier.
So, for instance, we don't have any outliers on the low end because the distribution is squished in so much, but on the upper end you see what's called a whisker. That's the dotted line; it goes out to the upper fence, the vertical line, which is one and a half interquartile ranges from the top of the box, and then you mark the outliers separately with circles. The orange dots are the jittered data points, jittered just so they wouldn't sit on top of each other, and the circles mark those data points as outliers, so we've got a bunch of outliers in this artificial data.
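The pieces of a box plot described above can be computed directly; here's a sketch (the skewed toy data, loosely mimicking a chi-square shape, are an illustrative assumption):

```python
import statistics

def boxplot_stats(data, k=1.5):
    """Return (lower whisker, Q1, median, Q3, upper whisker, outliers).

    Whiskers reach the most extreme points still inside the k * IQR fences;
    anything beyond the fences is marked as an outlier, as in a box plot.
    """
    q1, med, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if lo_fence <= x <= hi_fence]
    outliers = [x for x in data if x < lo_fence or x > hi_fence]
    return min(inside), q1, med, q3, max(inside), outliers

# Skewed toy data: no outliers on the squished low end, one on the high end
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 15]
print(boxplot_stats(data))   # → (1, 2.5, 4.0, 7.0, 12, [15])
```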
You can also have bivariate outliers, where you're looking at two variables at a time. Now, one choice here is to use distance measures, where you calculate each case's distance from the center. There are a lot of choices for that, and I'll talk about those with multivariate outliers, but they ignore the possibility of two-dimensional visualization, which is one of the neat things about working with bivariate relationships. One option is to show a bivariate normal distribution, which is just an ellipse over a scatterplot, and look for cases that fall outside of that ellipse.
And then a more sophisticated approach is density plots, or more accurately, kernel density estimates. These are like topographical maps that follow the density of the data, and they can take irregular shapes; again, you look for cases that fall outside of them. But I'm gonna show you what's probably the most common approach for bivariate data, and that is the bivariate normal distribution. So what I have here is a scatterplot, which you'll see in another video, of searches for data science across the bottom and searches for cluster analysis on the side, on a state-by-state basis.
This data comes from Google Correlate, and most of the states are in this ellipse, the little football here that's in the middle, but we have six outliers. We have Delaware and Maryland and Massachusetts and New York and Washington and California that have unusual combinations, and you'll see they're different. For instance, Delaware is near the middle on searches for data science, but they're very high on searches for cluster analysis. That makes them an unusual combination. California, on the other hand, while it's high on data science, it's below average on cluster analysis.
Again, that makes it an unusual combination, and then Massachusetts is really high on both. So these are different ways to be a bivariate outlier. And then we can talk about multivariate outliers. Now, I'm gonna put these into just two general categories. There are distance measures, which generally measure the Euclidean distance, a straight-line distance, from the center of the data set, the centroid. The most common version of this is the Mahalanobis distance, which is really just a straight vector measurement of how far a case is from the standardized centroid of the data.
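A minimal NumPy sketch of Mahalanobis-distance screening looks like this. The random toy data, the planted point, and the cutoff of 5.991 (the 95th percentile of a chi-square distribution with 2 degrees of freedom, the usual ellipse boundary in two dimensions) are illustrative assumptions, not the Google Correlate data from the video:

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the centroid."""
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Row-wise quadratic form: centered[i] @ cov_inv @ centered[i]
    return np.einsum('ij,jk,ik->i', centered, cov_inv, centered)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[0] = [6.0, -6.0]                       # plant an unusual combination
d2 = mahalanobis_sq(X)
print(np.flatnonzero(d2 > 5.991))        # cases outside the 95% ellipse
```

Because the distance is standardized by the covariance matrix, a case can be flagged for an unusual combination of values even when neither variable alone is extreme, which is exactly the Delaware and California pattern described above.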
That's very common. There are, however, a lot of robust measures of distance. Same idea, but they're not as sensitive to variations in the standard deviation or the variance of the scales. Then there are density measures, where you look at the local density of data in a multidimensional space. Multivariate kernel density estimation is the most common approach, and these are more flexible and more robust. They can also take irregular shapes, and that sounds like a good thing, but they tend to be really hard to describe and hard to generalize from one situation to another, so that's a trade-off.
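The density-based idea can be sketched with a hand-rolled Gaussian kernel density estimate (the bandwidth, the random toy data, and the "flag the lowest-density 5% of cases" rule are all illustrative assumptions):

```python
import numpy as np

def kde_density(X, bandwidth=0.5):
    """Gaussian kernel density estimate at each row of X (self included)."""
    diffs = X[:, None, :] - X[None, :, :]           # pairwise differences
    sq = (diffs ** 2).sum(axis=-1)                  # pairwise squared distances
    d = X.shape[1]
    norm = (2 * np.pi * bandwidth ** 2) ** (d / 2)  # Gaussian normalizer
    return np.exp(-sq / (2 * bandwidth ** 2)).sum(axis=1) / (len(X) * norm)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[0] = [5.0, 5.0]                    # plant a point in a sparse region
dens = kde_density(X)
cutoff = np.quantile(dens, 0.05)     # flag the lowest-density 5% of cases
print(np.flatnonzero(dens <= cutoff))
```

Unlike a distance-from-centroid measure, this flags points in any low-density region, so it can follow irregular shapes, with the interpretability cost noted above.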
It's not insurmountable, but the Mahalanobis distance is really easy to show and describe, and, in fact, let me show you what we have right here. This is a ranking of states on a dozen or two different Google search terms, so we're not looking at the results of any one search term but all of them together. On the x-axis across the bottom, the states are ordered by their Mahalanobis distance, and you see a vertical line there on the right side. There's only one outlier by that criterion, and that's Utah.
It looks like it says "out," but that's Utah, an outlier based on the collection of variables that I included in that data set. On the other hand, when we go up the y-axis and use a robust measure of distance, you see that the criterion is a lot lower and that we have a lot more outliers. That's probably a more accurate reflection of reality, because with the Mahalanobis distance, all those outliers inflate the estimates of the variation, which is why we get only one outlier, whereas we get many when we use a more robust measure.
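That inflation effect, sometimes called masking, is easy to demonstrate even in one dimension: a clump of planted outliers drags the mean and standard deviation up so far that none of them get flagged, while a robust score built from the median and the MAD catches all three. The data and cutoffs here are illustrative assumptions (1.4826 is the usual consistency constant that puts the MAD on the standard-deviation scale):

```python
import statistics

data = [10, 11, 12, 11, 10, 12, 11, 13, 80, 85, 90]  # three planted outliers

# Classical z-scores: the mean and sd are dragged up by the outliers themselves
mean, sd = statistics.mean(data), statistics.stdev(data)
z_flagged = [x for x in data if abs(x - mean) / sd > 2.0]

# Robust scores: the median and MAD barely move when extremes are added
med = statistics.median(data)
mad = statistics.median([abs(x - med) for x in data])
robust_flagged = [x for x in data if abs(x - med) / (1.4826 * mad) > 2.0]

print(z_flagged, robust_flagged)   # → [] [80, 85, 90]
```

The classical rule misses every planted outlier; the robust rule flags them all, mirroring the one-outlier-versus-many pattern in the Utah chart.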
To sum up: first, there are methods for both visual and numerical analysis for identifying outliers. Second, there are means-based methods, like the Mahalanobis distance, and there are more robust methods, say, for instance, the IQR for univariate data and kernel density estimators for multivariate data. It's nice to have measures that are robust and not so sensitive, but those are often harder to interpret and harder to generalize, and so it becomes a trade-off of what's important for your particular purposes with your particular data set.
Barton Poulson covers data sources and types, the languages and software used in data mining (including R and Python), and specific task-based lessons that help you practice the most common data-mining techniques: text mining, data clustering, association analysis, and more. This course is an absolute necessity for those interested in joining the data science workforce, and for those who need to obtain more experience in data mining.
- Prerequisites for data mining
- Data mining using R, Python, Orange, and RapidMiner
- Data reduction
- Data clustering
- Anomaly detection
- Association analysis
- Regression analysis
- Sequence mining
- Text mining