Learn about hierarchical methods.
- [Narrator] Hierarchical clustering is an unsupervised machine learning method that you can use to predict subgroups based on the difference between data points and their nearest neighbors. Each data point is linked to its neighbor that is most nearby according to the distance metric that you choose. I know this sounds a bit confusing, but think of this analogy. Imagine you're a geneticist and you want to identify the functional groups of genes by analyzing gene expression profiles, or profiles that identify all the genes active within a cell.
You decide to apply hierarchical clustering to identify these groups based on the distance between data points and their nearest neighbors in the gene expression profile. Data points that are most similar are clustered into the same genetic functional group. Hierarchical clustering predicts subgroups within data by finding the distance between each data point and its nearest neighbor and also linking up the most nearby neighbors. You can find the number of subgroups that are appropriate for a hierarchical clustering model by looking at a dendrogram.
A dendrogram is a tree graph that's useful for visually displaying taxonomies, lineages, and relatedness. Popular use cases for hierarchical clustering include hospital resource management, business process management, customer segmentation analysis, and social network analysis. You're going to see how to do social network analysis later on in this course. For the hierarchical clustering model, you need to tell the model how many clusters to use. In order to find that, you can look at the dendrogram. Let me show you how to use a dendrogram.
Imagine that you want to cap the distance between a point and its nearest neighbor, and you want that maximum distance to be 150. You would plot out your dendrogram, and then you would set a line where y=150. Then along that line, you count each point where your dendrogram intersects the y=150 line, so you have one, two, three, four, five. If you wanted to have a maximum distance of 150 between each point and its nearest neighbor, you would have five clusters.
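Counting intersections along a horizontal line can also be done in code. The following is a minimal sketch, not the course's notebook: it assumes a made-up one-dimensional data set with three tight groups and uses SciPy's `fcluster` with `criterion="distance"` to cut the tree at a chosen height. The thresholds 10 and 400 are chosen for this toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three tight groups on a line, far apart from each other.
X = np.array([[0.0], [1.0], [2.0],
              [100.0], [101.0], [102.0],
              [500.0], [501.0], [502.0]])

# Build the merge tree with Ward linkage.
Z = linkage(X, method="ward")

# Cutting at a height is the same as drawing a horizontal line on the
# dendrogram and counting the branches it crosses.
labels_low = fcluster(Z, t=10, criterion="distance")
labels_high = fcluster(Z, t=400, criterion="distance")

n_low = len(np.unique(labels_low))
n_high = len(np.unique(labels_high))
print(n_low, n_high)  # 3 2
```

A lower cut height leaves more, smaller clusters; a higher one leaves fewer, larger clusters.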
You could also set the maximum distance between points to be 500, and in that case you would have two clusters. That's what we're going to do in our coding demonstration. Before moving into that though, I want to tell you about important model parameters. You're going to have to set parameters for distance metrics and linkage parameters. Distance metrics can be set as either Euclidean, Manhattan, or Cosine, and the linkage parameters are Ward, Complete, and Average. Basically, with hierarchical clustering, you just try every possible combination of parameter settings, and the model that returns the most accurate results is the one you want to go with.
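To make the three distance metrics concrete, here is a small sketch using SciPy's `pdist` on two hand-picked points (the points are an assumption for illustration). Note that SciPy calls the Manhattan metric `"cityblock"`, and that cosine distance is one minus the cosine similarity.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two illustrative points lying on perpendicular axes.
points = np.array([[3.0, 0.0],
                   [0.0, 4.0]])

# pdist returns one value per pair of rows; there is only one pair here.
euclidean = pdist(points, metric="euclidean")[0]   # straight-line distance
manhattan = pdist(points, metric="cityblock")[0]   # sum of absolute differences
cosine = pdist(points, metric="cosine")[0]         # 1 - cosine similarity

print(euclidean, manhattan, cosine)  # 5.0 7.0 1.0
```

One constraint worth knowing when mixing and matching settings: in scikit-learn, Ward linkage works only with the Euclidean metric.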
Let's see hierarchical clustering in practice. We're going to need the NumPy library and the pandas library, so we'll import those. To build our dendrogram, we're going to use SciPy. We'll say, "import scipy." Then from SciPy, we will want to import the cluster module and the hierarchy tool. We'll say, "from scipy.cluster.hierarchy import dendrogram, linkage." We also need to import fcluster from that same module.
We'll put "from scipy.cluster.hierarchy import fcluster," and we'll do another line for importing cophenet. Lastly, from SciPy, we want to import the spatial module and its distance tools. We'll say, "from scipy.spatial.distance import pdist." As far as data visualization in this demonstration, we need Matplotlib and seaborn, so we'll import those.
In this demonstration, we're also going to use Scikit Learn to carry out hierarchical clustering. We need to say "import sklearn" and then "from sklearn.cluster." That's the cluster module. "import AgglomerativeClustering." That's the same thing as hierarchical clustering. To evaluate our model, we're going to use Scikit Learn's metrics module. "import sklearn.metrics as sm." We run that.
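Written out as a single cell, the imports described so far might look like the sketch below. The `np`, `plt`, and `sm` aliases appear later in the narration; the `pd` alias is an assumption, and seaborn is left out because the transcript does not state how it is imported.

```python
# Numerical and tabular data handling
import numpy as np
import pandas as pd  # alias assumed; not stated in the narration

# SciPy tools for building and cutting the dendrogram
import scipy
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# Plotting
import matplotlib.pyplot as plt

# scikit-learn's hierarchical clustering model and scoring tools
import sklearn
from sklearn.cluster import AgglomerativeClustering
import sklearn.metrics as sm
```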
Let's start off by setting our print options. We don't want to get too many digits of precision for our float values, so we'll say "np.set_printoptions," and we'll say we want a precision of four, and suppress equal to True. Then we're going to just set the plotting parameters the same as we have been throughout the course. We set our parameters for the Jupyter Notebook and run the cell.
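As a quick illustration of what those two print options do, here is a sketch with made-up values: `precision=4` rounds the displayed floats to four decimal places, and `suppress=True` stops NumPy from switching to scientific notation for very small numbers.

```python
import numpy as np

# Show at most four decimal places and avoid scientific notation.
np.set_printoptions(precision=4, suppress=True)

# Illustrative values: one with many digits, one very small.
values = np.array([0.123456789, 1e-10])
text = str(values)
print(text)
```

These options affect only how arrays are displayed, not the values stored in them.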
In this demonstration, we're going to use our mtcars data set again. We just need to load that, and let's create a subset called X to use those features in our machine learning model. We'll include variables mpg, displacement, hp, and weight. We'll say, "cars.ix" and then we simply select our columns that we want. In this case, that's one, three, four, and six, and we want the values from those columns, so ".values." Our target variable here will be the AM variable.
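The `.ix` indexer used in the demonstration was removed in pandas 1.0, so on a current install the same selection is usually written with `.iloc`. The sketch below assumes a three-row stand-in for mtcars whose first column holds the car names, with only the columns up to am included; the real file's layout may differ, so check the column positions before relying on them.

```python
import pandas as pd

# Three-row stand-in for the mtcars data; the column order is an assumption.
cars = pd.DataFrame(
    [["Mazda RX4", 21.0, 6, 160.0, 110, 3.90, 2.620, 16.46, 0, 1],
     ["Datsun 710", 22.8, 4, 108.0, 93, 3.85, 2.320, 18.61, 1, 1],
     ["Hornet 4 Drive", 21.4, 6, 258.0, 110, 3.08, 3.215, 19.44, 1, 0]],
    columns=["car_names", "mpg", "cyl", "disp", "hp",
             "drat", "wt", "qsec", "vs", "am"],
)

# Positional selection: columns 1, 3, 4, 6 are mpg, disp, hp, and wt here.
X = cars.iloc[:, [1, 3, 4, 6]].values

# Column 9 is am (transmission type) under this layout.
y = cars.iloc[:, 9].values

print(X.shape)  # (3, 4)
```

Selecting by name, for example `cars[["mpg", "disp", "hp", "wt"]]`, is a reasonable alternative that does not depend on column order.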
This variable describes whether a car has a manual or automatic transmission. We'll call it y. "y = cars.ix," and then we just select the column of the AM variable, which is nine, and add ".values." Next I'll call the linkage function on data set X. This function carries out hierarchical clustering on our data. I'll pass in the ward argument, so that the function deploys the Ward linkage method. We'll call the output of this clustering function Z.
"Z = linkage." We pass in our X variable, and then we'll say "ward." Okay, I had an extra A. Take that out. Z is the clustering result that has been generated from the SciPy hierarchical clustering algorithm. Now I'll generate a dendrogram by calling the dendrogram function on the Z object. We want to format our dendrogram so it's easy to read. To do that, we're going to pass in "truncate_mode='lastp', p=12, leaf_rotation=45, leaf_font_size=15," and "show_contracted=True." This is just some housekeeping stuff.
Let's add a title to the plot by saying, "plt.title" and then we'll pass in a title. We'll call it Truncated Hierarchical Clustering Dendrogram. Create a string, and let's make an xlabel, so we call "plt.xlabel," and on the X axis, we're going to be looking at cluster size. For ylabel, let's say, "plt.ylabel." The y axis represents distance between points.
We'll call that "Distance." Let's set a line on our y axis so that we can count out how many clusters to use in our model. We do that by saying, "plt.axhline" and then passing in the value for the line we want added. We'll say we want a line at y=500, and then we'll create a second line at y=150. Having the line there helps you to make sure you get an accurate count of the predicted number of clusters.
We'll print this all out by calling, "plt.show." Now we have our dendrogram. Let's use what we know about our data set in order to pick an appropriate number of subgroups. We know we're working with a cars data set and that there's the AM variable. AM assumes one of two positions, either zero or one, for automatic or manual transmission, and based on that and the results I'm seeing here in the dendrogram, I'm going to pick two as the number of clusters to use in our model.
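Put together, the plotting steps might look like the sketch below. It runs on randomly generated two-group data rather than mtcars, so the heights differ from the ones in the video; for that reason the horizontal line is placed at half the height of the final merge (a choice made for this sketch) instead of at 500 or 150.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(50, 1, (20, 2))])

# Ward linkage; each row of Z records one merge and the height it happened at.
Z = linkage(X, "ward")

# Show only the last 12 merged clusters so the tree stays readable.
info = dendrogram(Z, truncate_mode="lastp", p=12, leaf_rotation=45.,
                  leaf_font_size=15., show_contracted=True)

plt.title("Truncated Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")

# Horizontal guide line for counting clusters (height chosen for this toy data).
cut_height = Z[-1, 2] / 2
plt.axhline(y=cut_height)

plt.show()
```

With `truncate_mode="lastp"`, a leaf label in parentheses gives the number of original points collapsed into that leaf, which is why the x axis is labeled as cluster size.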
You can see from our dendrogram that if we're using two clusters in our model, we're really saying that the max distance between data points and their nearest neighbors is greater than 400, because that's the height at which there are actually two clusters in the model. Based on what you know about the data, two seems like a reasonable number of clusters, especially depending on how much the transmission type variable, AM, affects things. As you can recall from other videos, it assumes one of two possible values, zero or one.
We're going to say that we want a max distance between data points of 500. With 500 set as our maximum distance between nearest neighbors, based on the dendrogram, we have two clusters. Let's create a variable to represent the number of clusters in our model. We'll call that K, and we'll say "K=2." The next thing we need to do is to instantiate an AgglomerativeClustering object. We'll call it "Hclustering" and we'll set "Hclustering = AgglomerativeClustering." That's the AgglomerativeClustering class from Scikit Learn, and then for the n_clusters parameter, this represents the number of clusters in our model.
We're going to set that equal to K, and we'll also pass in a parameter for affinity. The affinity parameter represents the distance metric as a measure of similarity. For this example, we'll set the affinity='euclidean' but then later we're going to mix and match parameter settings to see which provides the best results. Lastly, we'll pass in a linkage parameter, and we'll say "linkage='ward.'" Now that we've built our model, let's call the fit method off of it and pass in our data set X.
This will fit the hierarchical clustering on our data. We'll say "Hclustering.fit" and pass in X. We're going to use Scikit Learn's accuracy score function to score our model. We'll pass in our target variable y and the predicted values that have been generated from our hierarchical clustering model and find out what the score is. We'll call "sm.accuracy_score" and then we'll pass in y. These are our true values for our labels, and Hclustering.
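Here is a minimal sketch of the fit-and-score step on made-up data (two well-separated groups with known labels), not the mtcars result from the video. It leaves the distance metric at its Euclidean default, because the `affinity` keyword used in the demonstration was renamed `metric` in scikit-learn 1.2 and later removed. It also checks the flipped labeling: cluster numbers are arbitrary, so cluster 0 may correspond to class 1 and a raw accuracy score can understate how well the groups were recovered.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import sklearn.metrics as sm

# Hypothetical data: two well-separated groups with known labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(10, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

k = 2
# Ward linkage with the default (Euclidean) distance metric.
Hclustering = AgglomerativeClustering(n_clusters=k, linkage="ward")
Hclustering.fit(X)

# labels_ holds the cluster assigned to each row of X.
raw = sm.accuracy_score(y, Hclustering.labels_)

# With two clusters, also score the flipped labeling and keep the better one.
flipped = sm.accuracy_score(y, 1 - Hclustering.labels_)
best = max(raw, flipped)
print(best)  # 1.0
```

For more than two clusters, a permutation-invariant measure such as `sm.adjusted_rand_score` avoids the label-matching problem altogether.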
Now we'll run this to see how well our model performed. It looks like I mistyped here, three lines up from the bottom. I need to change it to "Hclustering" and change the x to a t. Fit. All right, our model is scoring out with a .78. Now we're going to deploy each and every different combination of parameters that are possible with this particular data set, and we're going to see which produces the best model results.
I'll paste in our model, and then let me change the linkage parameter from "ward" to "complete," and run it. We get a score of .43. Let's do it again, but we'll change the linkage now to "average." Rerun it, and we'll do one more. In this case, we'll have a linkage of "average," but we'll also change our affinity to "Manhattan," and we'll run that.
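Rather than pasting the model four times, the same comparison can be sketched as a loop. This again uses made-up two-group data, so the scores are not the ones from the video. Because the distance keyword is `affinity` in older scikit-learn releases and `metric` in newer ones, the sketch inspects the constructor to pick whichever name is available; that lookup is a workaround added here, not part of the demonstration.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
import sklearn.metrics as sm

# Hypothetical data: two well-separated groups with known labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(20, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Use "metric" where it exists, otherwise fall back to the older "affinity".
init_params = inspect.signature(AgglomerativeClustering).parameters
metric_kw = "metric" if "metric" in init_params else "affinity"

# The four combinations tried in the demonstration.
settings = [("euclidean", "ward"),
            ("euclidean", "complete"),
            ("euclidean", "average"),
            ("manhattan", "average")]

scores = {}
for metric, link in settings:
    model = AgglomerativeClustering(n_clusters=2, linkage=link,
                                    **{metric_kw: metric})
    model.fit(X)
    acc = sm.accuracy_score(y, model.labels_)
    # Cluster numbers are arbitrary, so keep the better of the two labelings.
    scores[(metric, link)] = max(acc, 1 - acc)

for setting, score in scores.items():
    print(setting, score)
```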
Now what we do is we look at the accuracy scores of our model to see which one performed the best. It looks like the highest score is .78125, and that's with an affinity of Euclidean and a linkage of Average, or an affinity of Euclidean and a linkage of Ward. Each of these two parameter settings looks like it's pretty good for this data set, and that's how you perform hierarchical clustering. Next I'm going to take you through instance-based learning. We're going to look at the k-nearest neighbor classification method.
- Getting started with Jupyter Notebooks
- Visualizing data: basic charts, time series, and statistical plots
- Preparing for analysis: treating missing values and data transformation
- Data analysis basics: arithmetic, summary statistics, and correlation analysis
- Outlier analysis: univariate, multivariate, and linear projection methods
- Introduction to machine learning
- Basic machine learning methods: linear and logistic regression, Naïve Bayes
- Reducing dataset dimensionality with PCA
- Clustering and classification: k-means, hierarchical, and k-NN
- Simulating a social network with NetworkX
- Creating Plot.ly charts
- Scraping the web with Beautiful Soup