Learn how to use DBSCAN, a density-based clustering method, to detect outliers in multivariate data.
- [Narrator] DBSCAN is an unsupervised machine-learning method that clusters core samples from dense areas of a dataset and labels non-core samples from sparse areas of that dataset. As an example of where you might use DBSCAN, imagine you're working on a computer vision project for the advancement of self-driving cars. You've got some line data that's supposed to represent lanes, and you need to be able to predict lanes from those lines. You can use DBSCAN to predict the lanes based on the density, or non-density, of the lines.
Dense areas are clustered into core samples, which will be considered lanes, while non-dense, or sparse, areas will be considered non-core samples: non-drivable areas. This way the car knows where to go. You can use DBSCAN to identify collective outliers. Just make sure that the number of outliers you choose is less than 5% of the total number of observations in your dataset. You do that by adjusting your model parameters accordingly. The two important model parameters for DBSCAN are eps and min_samples.
Eps sets the maximum distance between two samples for them to be clustered in the same neighborhood. You want to start with an eps value of 0.1. As for min_samples, this is the minimum number of samples in a neighborhood for a data point to qualify as a core point. Again, here, you want to start with a very low value. You then adjust your parameters until you've got just under 5% of your total dataset size labeled as outliers. Here we're going to use scikit-learn to apply DBSCAN in order to identify collective outliers.
We're going to need the pandas library, as usual, so we'll import that. We're also going to use Matplotlib and Seaborn, so I'll copy and paste those over. Then I want to say import sklearn to bring in the scikit-learn library, and from its cluster module, .cluster, we're going to import DBSCAN. Let's also bring in Counter from the collections module. We run that, and then we have our libraries.
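Written out, the imports described here might look like this (a sketch of what's shown on screen; any plot-styling lines you use may differ):

```python
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.cluster import DBSCAN
```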
Then we'll set our standard data visualization parameters for Jupyter Notebooks. For this example, we're going to use the same data that we used for the rest of this chapter, so we'll copy and paste in that code. This is the iris dataset, and we're going to call the DataFrame df. Let's print out the first five records. Here we go. We have that. The next thing we need to do is instantiate a DBSCAN object and call the fit method on our data in order to find core samples of high density and expand clusters from those.
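The data-loading code is reused from earlier in the chapter and isn't shown in full here. A comparable way to get the iris measurements into a DataFrame called df, using scikit-learn's bundled copy of the dataset (an assumption on my part; the video loads it from a file), would be:

```python
import pandas as pd
from sklearn import datasets

# Pull the iris measurements from scikit-learn's bundled dataset;
# the video loads the same data from a CSV, so column names may differ.
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.head())  # the first five records
```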
When creating this object, we'll tell Python that we want a maximum distance of 0.8 between two samples in order for them to still be considered part of the same neighborhood. We'll also say that each point must have a minimum of 19 samples in its neighborhood to be considered a core point. To do that, let's call the DBSCAN function and pass in eps equal to 0.8 and min_samples equal to 19.
Then we'll call the fit method off of this and pass in our dataset. We'll call this whole thing model, because it's our model, and print it out. What you see here is all of the parameter settings for our model. As you can see, we've now got eps set at 0.8 and min_samples set at 19. The rest of these parameters are all the defaults for this model. Now that we have our model, let's see which points are coming in as outliers.
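Putting those steps together, the model fit might look like this (a sketch; note that, depending on your scikit-learn version, printing the model may show only the non-default parameters rather than all of them):

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import DBSCAN

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# eps=0.8: max distance for two samples to share a neighborhood;
# min_samples=19: neighborhood size needed to qualify as a core point.
model = DBSCAN(eps=0.8, min_samples=19).fit(df)
print(model)  # shows the model's parameter settings
```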
Outliers are the records that were returned with a label of negative one. So we'll create a DataFrame called outliers_df, an outliers DataFrame, by calling the DataFrame constructor and passing in our data. Remember that we don't want any more than 5% of the data points to be labeled as outliers. To check that proportion, we can call the Counter function on our model's labels in order to see how many data points are being assigned to each label. We'll do that by saying print, Counter, and then passing in model.labels_.
Let's also print out the records that have been labeled as outliers, and then we can look at our results. To do that, we just say print, then the name of our outliers DataFrame, outliers_df, and then we select the rows where our model labels are equal to negative one. Here are our results. Next, I'm going to create a quick visualization of our clustering result.
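In code, tallying the labels and pulling out the rows labeled negative one might look like this (a sketch; df and the fitted model are as described in the preceding steps):

```python
from collections import Counter

import pandas as pd
from sklearn import datasets
from sklearn.cluster import DBSCAN

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
model = DBSCAN(eps=0.8, min_samples=19).fit(df)

# Tally the cluster labels; -1 marks outliers.
print(Counter(model.labels_))

# Wrap the data in a DataFrame and select the rows labeled -1,
# which keeps their original row index values visible.
outliers_df = pd.DataFrame(df)
print(outliers_df[model.labels_ == -1])
```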
So let's make a scatter plot of our results. We're going to create a blank figure and add an axes to it, and then we're going to say that we want the colors in our scatter plot to be assigned according to the model labels. Let's create a scatter plot by calling the scatter method off of our ax object. For the x values, we'll pick the third column of our data, so we'll use the indexer and choose position two. And then for the y values, from our data, we're going to use the indexer and pick position one.
For colors, the parameter is c, so we say c is equal to colors, which we just set. And we'll give the points a size of 120. Let's also set the x label: we call set_xlabel off of our ax object and say Petal Length. For our y label, we're going to have Sepal Width, so we call set_ylabel and put Sepal Width.
Let's throw a title on there. We'll say plt.title, and the title of this plot is going to be DBSCAN for Outlier Detection. Okay, looks good, so we run this, and here is a visual display of our results. Now, what I want to point out here is that our counter returns how many records have been assigned a label of one, how many have been assigned a label of zero, and how many have been assigned the label of negative one.
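The plotting steps just described could be sketched as follows (column positions two and one correspond to petal length and sepal width in the iris data; df and the fitted model are as in the preceding steps):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.cluster import DBSCAN

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
model = DBSCAN(eps=0.8, min_samples=19).fit(df)

# Blank figure with one axes added to it.
fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

# Color each point by its cluster label (-1, 0, 1, ...).
colors = model.labels_

# x: third column (petal length), y: second column (sepal width).
ax.scatter(df.iloc[:, 2], df.iloc[:, 1], c=colors, s=120)
ax.set_xlabel("Petal Length")
ax.set_ylabel("Sepal Width")
plt.title("DBSCAN for Outlier Detection")
```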
The records with the label of negative one are considered outliers. Here that's only 4% of the dataset, which is just about perfect. Just remember that you always want to have less than 5% of your original dataset size marked as outliers. It's also important to know which records they are, so I created a DataFrame that returns the row index values for each of those outlier records. That's what you see here. And then there's our data visualization.
DBSCAN has identified collective outliers; that's why they're all appearing together, in an anomalous portion of the plot. The light gray and dark gray areas are what DBSCAN has considered core samples: those are generated from the dense areas of the dataset. What DBSCAN is saying here is that this point is a non-core sample. It's not part of a dense area; it's from the sparse area of the dataset. And then there's this one straggler out here. But what's circled in red is what we call collective outliers.
Hold on, because in the next section, I'm going to show you more machine-learning methods of the clustering variety.