Join Barton Poulson for an in-depth discussion in this video Clustering data, part of Data Science Foundations: Data Mining.
- [Instructor] The algorithm defines what it means to be similar, how to measure distance, and what it means to be a cluster. There are many, many choices available, but they fall into a few general categories. For instance, there are methods that measure the distance between points, like Euclidean distance. There are others that measure distance from a centroid, others that look at the density of the data in the space, and others that use distributional models. Now, if you have graph data, that is, network data, there are other models too, but I'm going to focus here on the attribute models.
Let's start by taking a quick look at distance algorithms. Imagine you have a collection of points in two-dimensional space: one cluster in orange and another cluster in blue. What you're going to do is measure the distance from every point to every other point. Now, I'm not drawing all of them, because I did this by hand and that would be too many, but what you're measuring here is the Euclidean distance; think of it as the direct, straight-line distance between each pair of points. These are also known as connectivity models, and one of the neat things you get from them is a hierarchical diagram, which I can demonstrate when we're running through the software.
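To make that concrete, here's a minimal sketch of the pairwise Euclidean distances a connectivity model starts from. The points here are made up purely for illustration:

```python
# Sketch: the "every point to every other point" distance table that
# connectivity models are built on. Example coordinates are invented.
from math import dist  # Euclidean distance, available since Python 3.8

points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0)]

# One distance per unordered pair of points.
distances = {
    (i, j): dist(points[i], points[j])
    for i in range(len(points))
    for j in range(i + 1, len(points))
}
```

With n points this table holds n(n-1)/2 entries, which is exactly why these methods get expensive as the data grows.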
One of the interesting choices here is whether you want to take a joining or a splitting approach; those are technically known as agglomerative and divisive. You either start with all of the data in one cluster and then take it apart, or you start with every observation separate and then join them together sequentially. One of the tricks here is that these distance models really only find convex clusters. That is, in a two- or three-dimensional space, they'll find clusters shaped like apples or maybe watermelons, but not like bananas, because bananas curve back in, and that's a limitation of a lot of the methods we'll see here.
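The joining (agglomerative) approach can be sketched in a few lines. This toy version assumes single linkage, meaning the distance between two clusters is the distance between their closest members; real implementations offer several linkage choices:

```python
# Toy agglomerative clustering: every observation starts alone, and the
# two closest clusters merge at each step. Single linkage is assumed.
from math import dist

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]  # each observation is its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist(p, q)
                               for p in clusters[ab[0]]
                               for q in clusters[ab[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters
```

Recording the order of those merges is what produces the hierarchical diagram mentioned above.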
The other major limitation is that while the distance models are easy to describe, they're really slow for big data, because you have to compute a gargantuan number of distances and then work with all of them. That gets us to some other choices. After distance models, probably the most common are the centroid-based models. Now we have our cluster of points, and the diamond marks a point that defines the mean in multi-dimensional space.
Now, I just eyeballed this one, but what the algorithm does is look at the distance from every point in a group to that centroid. The center point, the centroid, is defined by a mean vector: it's the mean for that group on every variable you have in the dimensional space. The most common version of this is what's called k-means, where you simply define how many centroids you want to have, and it figures out how far each point is from each of those centroids.
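The assign-then-update loop at the heart of k-means can be sketched as below. Note one simplification: real k-means seeds its starting centroids randomly (and often restarts several times), while this sketch takes them as an argument:

```python
# Bare-bones k-means: assign each point to its nearest centroid, then
# recompute each centroid as the mean vector of its group, and repeat
# until the centroids stop moving. Starting centroids are supplied.
from math import dist

def k_means(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: centroid = mean on every variable (the mean vector).
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, groups
```

The number of centroids you pass in is the k you have to choose up front, which is the limitation discussed next.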
Like the distance models, the centroid models find convex clusters, but with an extra limitation: the clusters are generally of similar size. The other major problem is that you have to choose the value for k, the number of clusters you want to have. There are some interesting adaptations to work around that, but generally you just pick a number, give it to the computer, and see how well it works. Next, there are density algorithms, where you look in the multi-dimensional space, compute the density of points, and draw a border around the dense regions. So, for instance, here I'm drawing a border around this one group of points, and then here's a larger, different kind of group with a very different kind of border around it.
So what you're doing here is connecting dense regions in k-dimensional space, where k is the number of variables or attributes that you have. Now, what's neat about these approaches is that they can model nonconvex clusters; you see, for instance, how the orange one bends in. They can also model clusters of different sizes: the blue one's smaller, the orange one's bigger. So density methods overcome some of the limitations of the other approaches. One interesting thing is that it's possible to simply ignore outliers or unusual cases, because they don't fall in dense regions.
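Here's a toy sketch of the density idea: two points belong to the same cluster if you can hop between them in steps of at most `eps`. The `eps` and `min_size` values are assumptions you'd tune; real density algorithms like DBSCAN add a minimum-neighbors rule on top of this:

```python
# Toy density clustering: chain together points that are within eps of
# each other. Chains can bend, so clusters need not be convex, and tiny
# leftover groups are treated as outliers and dropped.
from math import dist

def density_clusters(points, eps, min_size=2):
    unvisited, clusters = set(range(len(points))), []
    while unvisited:
        frontier = [unvisited.pop()]
        cluster = []
        while frontier:  # grow the dense region outward
            i = frontier.pop()
            cluster.append(i)
            near = {j for j in unvisited if dist(points[i], points[j]) <= eps}
            unvisited -= near
            frontier.extend(near)
        if len(cluster) >= min_size:  # small leftovers = ignored outliers
            clusters.append(sorted(points[i] for i in cluster))
    return clusters
```

Notice there's no centroid and no distribution here, just connected regions, which is also why the result is hard to describe parametrically.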
The major problem, however, is that it's hard to describe exactly how the algorithm arrived at its clusters. You don't have a parametric distribution that lets you say this cluster is defined by high scores on this variable and low scores on that one, because the cluster might be an unusual shape that wraps around. The fourth collection of algorithms I want to mention are distribution or distributional models. Now, imagine we have our points and we draw a sort of normal shape around them; in a bivariate model, that's going to look like an ellipse, so here I've drawn an approximate ellipse around these points. What we're doing is taking the clusters and modeling them as statistical distributions.
Now, you have a lot of choices, but by far the most common is a multivariate normal distribution, and I'm showing that with these multivariate normal ellipses here. There's one major problem, in that the distributional approach is prone to overfitting. That is, if you give the computer free rein, it'll just add more and more distributions until you end up with models that really only apply to the exact data that you currently have. Now, you can overcome that by telling it how many distributions to work with, sort of like you have to do with k-means when you tell it how many clusters.
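In one dimension, the distributional idea reduces to something very simple: model each cluster as a normal distribution and assign a point to whichever distribution makes it most likely. The cluster names, means, and standard deviations below are invented for illustration; a real model would also fit those parameters from the data:

```python
# Sketch: assign a value to whichever (assumed, pre-fitted) normal
# distribution gives it the higher likelihood. Parameters are made up.
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    return exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * sqrt(2 * pi))

clusters = {"low": (10.0, 2.0), "high": (30.0, 5.0)}  # (mean, sd) per cluster

def assign(x):
    return max(clusters, key=lambda name: normal_pdf(x, *clusters[name]))
```

Because each cluster is a full distribution with its own mean and spread, this is also the approach that captures correlations between attributes, via the covariances of a multivariate normal.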
One advantage, though, is that it's good at capturing the correlations or interrelations between the attributes in your data, which none of the other approaches do specifically. On the other hand, it's also limited to convex clusters, because that's the shape of the distributions. So there are multiple definitions of clusters: definitions of what it means to be a cluster, of how to define distance, and of how to define shape, and while there are several categories of algorithms, each of them has different strengths and limitations.
So, for instance, some only find convex shapes, some assume equal-size clusters, some are difficult for large data sets, and some are hard to describe parametrically. When it comes down to actually conducting your cluster analysis, you're going to need to be mindful of these options and try to choose an algorithm that fits both your data and the purposes or pragmatic applications of your analysis.
Barton Poulson covers data sources and types, the languages and software used in data mining (including R and Python), and specific task-based lessons that help you practice the most common data-mining techniques: text mining, data clustering, association analysis, and more. This course is an absolute necessity for those interested in joining the data science workforce, and for those who need to obtain more experience in data mining.
- Prerequisites for data mining
- Data mining using R, Python, Orange, and RapidMiner
- Data reduction
- Data clustering
- Anomaly detection
- Association analysis
- Regression analysis
- Sequence mining
- Text mining