Learn about the univariate method for detecting outliers.
- [Instructor] Now let's talk about outliers. Outlier detection is useful as a preprocessing task for analysis or machine learning, or as an analytical method of its own merit. There are three main types of outliers. There's the point outlier, the contextual outlier, and the collective outlier. Point outliers are observations that are anomalous with respect to the majority of observations in a feature. Contextual outliers are observations that are considered anomalous given a specific context. For example, you can think of location and temperature.
An 82 degree day in January in Southern California would not be considered too unusual. But an 82 degree day in January in Moscow, Russia would be considered highly unusual. In fact it wouldn't happen. That's an example of a contextual outlier. Lastly, there are collective outliers. These are a collection of observations that are anomalous but appear close to one another because they all have similar anomalous values. I'll show you how to find these using the DBSCAN method later in this chapter.
In the coding demonstration for this section I'm going to show you how to use the Tukey methods. Later in this chapter I am going to show you how to use multivariate methods to find outliers and also how to use machine learning methods like DBSCAN to find outliers in multivariate data. Remember how I said that you could use outlier detection as an analytical method of its own merit? Well you can. You can use it for things like equipment failure notification, fraud detection, and cybersecurity event notification.
Basically, you use outlier detection to uncover anomalies in data. In this section we're going to talk about univariate methods. Tukey methods are useful for identifying a variable's outliers. You can detect unusually high or low data points in a variable by applying the Tukey methods for outlier detection. Data points identified using the Tukey method should be treated as potential outliers to be investigated further. This is a Tukey boxplot. And I wanted to point it out to show you how you can use a boxplot to detect outliers.
Boxplot whiskers extend up to 1.5 times the interquartile range beyond the quartiles. The interquartile range is really just the distance between the lower quartile and the upper quartile. The upper quartile is where 25% of data points are greater than the value. And the lower quartile is where 25% of data points are less than the particular value. Any points more than 1.5 times the interquartile range below the lower quartile or above the upper quartile are considered outliers. They'll show up in a boxplot visually as the dots that extend past the whiskers of the boxplot.
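To make the 1.5 times interquartile range rule concrete, here is a minimal sketch using NumPy. The helper name `tukey_fences` and the sample values are made up for illustration and are not from the course files.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for a 1-D collection of numbers."""
    q1, q3 = np.percentile(values, [25, 75])  # lower and upper quartiles
    iqr = q3 - q1                             # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: most values sit near 3, with one low and one high value.
sample = np.array([2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 3.4, 4.4])

lower, upper = tukey_fences(sample)

# Anything outside the fences is a potential outlier to investigate further.
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers)
```

Points flagged this way are the same ones a Tukey boxplot would draw as dots past the whiskers.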
The other way to use Tukey methods to find outliers is the Tukey Outlier Labeling method. And this is essentially calculating the Tukey outlier mathematically instead of using the boxplot. Let's look at Tukey methods in action. In this demonstration we're going to use NumPy and Pandas. And we're also going to use the Matplotlib library. So we'll import all of those. And let's also set the visualization parameters for this Jupyter notebook. Okay, in this example we're going to use data from the download that you got with this course.
And we're going to use the iris.data CSV file. We pass in sep equal to comma to indicate that it's a comma-delimited file. Now let's set names for each of our columns in the iris data set. And we'll create our variables. We'll create an X variable, and use the special indexer to select the first four columns of the iris data, with dot values to access the values. And then let's create a y variable to represent the target.
These are going to be our species names. So we'll use our special indexer, but we're going to now just grab the last column in the data set. And let's print out the first five rows. So this is a representation of what's in the iris data set. Now I want to show you how to identify outliers from Tukey boxplots. You saw how to create boxplots in the last chapter, but let me show you how to find outliers. You would just write the name of our object, and then call the boxplot method off of it.
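The loading steps above might look roughly like the following sketch. To keep it self-contained it reads a few iris-style rows from an in-memory string instead of the course's iris.data file. The column names, the `header=None` argument, and the use of `.iloc` as the indexer are assumptions, not taken from the course notebook.

```python
import io
import pandas as pd

# Stand-in for the iris.data CSV file: a few rows in the same comma-delimited layout.
csv_text = """5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# sep=',' marks the file as comma delimited; header=None because the rows carry no header.
df = pd.read_csv(io.StringIO(csv_text), header=None, sep=',')

# Assumed column names for illustration.
df.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']

X = df.iloc[:, 0:4].values  # first four columns as a NumPy array
y = df.iloc[:, 4].values    # last column: the species names (the target)

print(df.head())            # first five rows (here, all four)
```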
And here I'm going to pass in return_type equal to 'dict'. And the reason I'm doing this is that the default return type is going to change in a future release, and in order to avoid an error we just type this in there, and it makes Python happy. And to print out we say plt.plot and run. So now we have a boxplot. See these points here that lie beyond the whiskers? Those are our potential outliers. So what I did was I just took a quick note of where those outliers were found.
That's the Sepal Width column, and it's values that are greater than four or less than 2.05 approximately. Let's look a little closer at these values. I'm going to use filtering and comparison operators to isolate these values from the rest of the DataFrame. To do that, I first need to isolate the sepal width variable. So we'll call that Sepal_Width. And we're going to select the second column from our DataFrame. And then we want to use the sepal width variable with a comparison operator as a filter.
So we'll say Sepal_Width greater than four. And we'll call the output of this expression iris_outliers. This is really going to return true false values, indicating whether the sepal width is greater than four. We'll use iris_outliers as a filter: for each record where the sepal width is greater than four, that record will be returned from the DataFrame. So we'll just write the name of our DataFrame, and then we'll say iris_outliers.
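As a rough sketch of the boxplot step, the snippet below draws a boxplot with return_type set to 'dict' and reads the outlier points back from the returned dictionary. The data is a made-up single column, and the non-interactive Agg backend is an assumption so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the Sepal Width column.
df = pd.DataFrame(
    {"Sepal Width": [2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 3.4, 4.4]}
)

# return_type='dict' hands back the matplotlib lines that make up the plot.
box = df.boxplot(return_type="dict")

# 'fliers' holds one line per column; its y-data are the points past the whiskers.
flier_values = sorted(box["fliers"][0].get_ydata())
print(flier_values)

plt.close("all")
```

Reading the fliers this way is just a convenience; taking a visual note of where the dots fall, as in the demonstration, works too.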
And this returns every record from our iris DataFrame where sepal width is greater than four. Let's also do this for sepal width values that are less than 2.05. Moving the results over here, I just wanted to point out that we now have the row index values for each of the records that are coming in looking suspicious as outliers. Now I'm going to show you how to do Tukey outlier labeling.
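The filtering steps can be sketched like this. The small DataFrame is hypothetical, and the column is selected by name rather than by position to keep the example short.

```python
import pandas as pd

# Hypothetical records standing in for the iris DataFrame.
df = pd.DataFrame({
    "Sepal Length": [5.1, 5.7, 5.0, 6.3, 5.2],
    "Sepal Width":  [3.5, 4.4, 2.0, 3.0, 4.1],
})

# Isolate the sepal width variable.
sepal_width = df["Sepal Width"]

# The comparison returns True/False values, one per record.
iris_outliers = sepal_width > 4

# Using the True/False values as a filter returns only the matching records.
high_records = df[iris_outliers]

# Same idea for the low end.
low_records = df[sepal_width < 2.05]

print(high_records)
print(low_records)
```

The index values on the left of each printed result are the row numbers of the suspicious records.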
This is basically just a manual process of finding outliers if you don't use the boxplot data visualization. First thing we want to do is we want to set our display settings so that we don't get more than one decimal place. And then we'll create a new DataFrame called X_df. Call the DataFrame constructor, and pass in our X variable. And then we'll just generate a description.
And then we've got some summary statistics for each of the variables in our iris DataFrame. Let's see how we can use them to find potential outliers. The interquartile range is the distance between the third quartile and the first quartile. 75% is our third quartile. So we'll say 3.3 minus 2.8. That's our first quartile. The difference between them is 0.5. So we multiply the interquartile range times 1.5 and we get a value of 0.75.
To find outliers from the first quartile, we'll look at the value of the first quartile, which here it's 2.8. And we'll subtract out this 0.75. This gives us a value of 2.05. We see that our min value is even less than that. Which makes it suspicious as being an outlier. Finding an outlier from the third quartile uses the same approach. In this case you would take the value at the third quartile which is 3.3, and you'd add 0.75. That gives us 4.05.
Since the max value in the sepal width column is greater than 4.05, we know that sepal width is suspect for having outliers. That's it for univariate methods to find outliers. Next I'm going to show you some multivariate methods.
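Here is a minimal sketch of Tukey outlier labeling in code. The first part repeats the arithmetic from the demonstration using the quartiles quoted above (2.8 and 3.3). The second part applies the same steps to the output of describe on a made-up column. The helper name `tukey_label` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Show one decimal place in printed output, as in the demonstration.
pd.options.display.float_format = "{:.1f}".format

# Part 1: the arithmetic from the demonstration.
q1, q3 = 2.8, 3.3             # first and third quartiles of sepal width
iqr = q3 - q1                 # 0.5
lower_fence = q1 - 1.5 * iqr  # about 2.05
upper_fence = q3 + 1.5 * iqr  # about 4.05

# Part 2: the same steps driven by describe() on hypothetical data.
def tukey_label(series, k=1.5):
    """Flag whether a column's min or max falls outside the Tukey fences."""
    summary = series.describe()
    first, third = summary["25%"], summary["75%"]
    spread = third - first
    lower, upper = first - k * spread, third + k * spread
    return {
        "lower": lower,
        "upper": upper,
        "suspect_low": summary["min"] < lower,
        "suspect_high": summary["max"] > upper,
    }

X_df = pd.DataFrame(
    {"Sepal Width": [2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 3.4, 4.4]}
)
print(X_df.describe())
result = tukey_label(X_df["Sepal Width"])
print(result)
```

A min below the lower fence or a max above the upper fence only says the column is suspect; the individual records still need to be looked at.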
- Getting started with Jupyter Notebooks
- Visualizing data: basic charts, time series, and statistical plots
- Preparing for analysis: treating missing values and data transformation
- Data analysis basics: arithmetic, summary statistics, and correlation analysis
- Outlier analysis: univariate, multivariate, and linear projection methods
- Introduction to machine learning
- Basic machine learning methods: linear and logistic regression, Naïve Bayes
- Reducing dataset dimensionality with PCA
- Clustering and classification: k-means, hierarchical, and k-NN
- Simulating a social network with NetworkX
- Creating Plot.ly charts
- Scraping the web with Beautiful Soup