Join Conrad Carlberg for an in-depth discussion in this video, Using R for cluster analysis, part of Data Reduction Techniques Using Excel and R: Business Analytics Deep Dive.
- [Instructor] You can get a sense of how to do k-means cluster analysis in R by analyzing what's called the iris data set. This data set comes with the basic R installation and so does the kmeans function. So as long as you have R already installed on your computer, you don't need to download anything else to demonstrate k-means cluster analysis for yourself. You do have to bring the iris data set into R's workspace, so we might as well start by using the library command on the datasets package: library(datasets).
Now you have access to all of the data sets that come with R's datasets package. One of those data sets is named iris, lower case throughout, and it has data on 150 iris plants. The data include the species of iris represented by each plant as well as the length and width of a sepal and a petal from that plant. If you want to get a glance at what the data look like, use the head function on the iris data set. That's all the preparation that's necessary.
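The preparation described above can be sketched as follows. This is a minimal example; the dim and table calls are my additions for a quick sanity check and are not part of the video.

```r
# The datasets package ships with R; attaching it explicitly is harmless
# even when it is already on the search path.
library(datasets)

# First six rows: four numeric measurements plus the Species factor
head(iris)

# Optional checks (not in the video): overall shape and records per species
dim(iris)
table(iris$Species)
```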
The next command actually carries out the k-means analysis. We use the kmeans function to create a new object, and here I've named it kmeansClusters. The first argument to kmeans specifies the source of the data, and that's the iris data set. You can tell from the head function that the first through the fourth columns in the iris data set contain the sepal length through the petal width. So we specify the numbers of the columns that we want to use to establish the clusters.
We specify 1:4 within square brackets after the name of the data set. The next argument, three, specifies the number of clusters that we want to establish. I choose three because I know that the data set has plants from three species. The kmeans function needs cluster centers to start with at the outset of this analysis, and so it creates them. Those starting centers are normally chosen randomly. You can use the nstart argument, which I don't use here, to specify how many random sets of starting centers kmeans should try, keeping the best result. Then press Enter to run the k-means analysis.
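A minimal sketch of the call described above might look like this. The object name kmeansClusters is my rendering of the name spoken in the video, and the comma in iris[, 1:4] is a stylistic assumption; iris[1:4] selects the same four columns.

```r
library(datasets)

# Columns 1 to 4 hold the four measurements. Column 5 (Species) is left out
# so the clustering never sees the known labels.
kmeansClusters <- kmeans(iris[, 1:4], centers = 3)

# Typing the object's name prints the cluster sizes, the cluster means,
# the clustering vector, and the sums of squares.
kmeansClusters
```

Because the starting centers are chosen at random, the sizes printed on your machine may differ from those quoted in the video.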
R puts the results in the new object called kmeansClusters. To see the results of the analysis, just type kmeansClusters at the command prompt and press Enter. Two consecutive runs of the k-means analysis can produce two different results simply because the clusters are constituted randomly at the outset, so I'm going to run this again to show you the difference between the two runs. In this case, we got clusters of size 96, 33, and 21, which you can tell from the top of the R console window.
So let's run it again and see what we get. Notice that this time we've got clusters of size 38, 62, and 50. I tell you this so that you won't be surprised if you try the same analysis on the same data twice in a row and get two different results; it's due solely to the fact that the clusters are constituted randomly at the outset. In this case, one cluster has 38 members, another has 62 members, and the third has 50 members. If you look at the full data set, you'll see that there are, in fact, 50 records for each of the three species.
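The run-to-run variation can be explored, and controlled, with R's random seed. The video does not use set.seed, so treat the following as an illustrative sketch; the seed values and the run names are arbitrary choices of mine.

```r
library(datasets)
x <- iris[, 1:4]

# Different seeds give different random starting centers,
# so the two partitions may or may not agree.
set.seed(1)
runA <- kmeans(x, centers = 3)
set.seed(2)
runB <- kmeans(x, centers = 3)
sort(runA$size)
sort(runB$size)

# Resetting the same seed before each call reproduces the same result.
set.seed(42)
runC <- kmeans(x, centers = 3)
set.seed(42)
runD <- kmeans(x, centers = 3)
identical(runC$cluster, runD$cluster)

# nstart asks kmeans to try several random starts and keep the solution
# with the smallest total within-cluster sum of squares.
set.seed(42)
best <- kmeans(x, centers = 3, nstart = 25)
sort(best$size)
```

Setting a seed makes a result repeatable; raising nstart makes it less dependent on any single unlucky start.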
So we have an overcount of 12 records in the 62-member cluster and an undercount of 12 records in the 38-member cluster. The designations of the clusters are arbitrary, and it takes a little more work to find out exactly which cluster is which. Keep in mind that because of the random initialization, you might not get the same results from two different k-means analyses, even if both your arguments and your data set are identical across the two instances. Next in the k-means results comes a table that shows the mean value of each variable in each cluster, so you get a report that shows the mean sepal length and width and the mean petal length and width for the records in cluster one, cluster two, and cluster three.
These are, in fact, the centroids. Then the kmeans function details the cluster that each of the records belongs to. In this case, we get 37 records in the first row, each of which belongs to cluster three. In the second row, we have another 13 records belonging to cluster three. The 37 and the 13 account for the 50 that are in cluster three. Look at the results at the top of the window to verify that figure. You can do the same sort of counting for clusters one and two.
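Rather than counting by eye, the same pieces of the output can be pulled from the result object by name. The sketch below assumes the object is called kmeansClusters and adds a seed only so that it is repeatable; manualCenters is a hypothetical name used to recompute the centroids as a check.

```r
library(datasets)
set.seed(42)  # assumption: not in the video, added for repeatability
kmeansClusters <- kmeans(iris[, 1:4], centers = 3)

kmeansClusters$centers  # centroids: one row per cluster, one column per variable
kmeansClusters$size     # number of records in each cluster
kmeansClusters$cluster  # cluster number assigned to each of the 150 records

# Count the records per cluster directly from the assignments
table(kmeansClusters$cluster)

# Recompute the centroids as per-cluster means of the four measurements
manualCenters <- aggregate(iris[, 1:4],
                           by = list(cluster = kmeansClusters$cluster),
                           FUN = mean)
manualCenters
```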
I might mention at this point that if you use the print function on the cluster component of the results, kmeansClusters$cluster, you get just the cluster numbers, one per record, without the rest of the report. Finally, kmeans gives you the sum of squares by cluster. This is the sum of the squared deviations of each record's values from the mean values for its cluster, also termed the sum of squares within. You can derive the sum of squares between from the table of cluster means a little higher up in the results.
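The sums of squares just described can be read from the result object and, as a check, recomputed by hand. This is a sketch under the assumption that the object is named kmeansClusters; manualWithin, manualBetween, and grandMean are hypothetical helper names, and the seed is added only for repeatability.

```r
library(datasets)
set.seed(42)  # assumption: not in the video
x <- iris[, 1:4]
kmeansClusters <- kmeans(x, centers = 3)

kmeansClusters$withinss      # within sum of squares, one value per cluster
kmeansClusters$tot.withinss  # sum of the values above
kmeansClusters$betweenss     # between sum of squares
kmeansClusters$totss         # total sum of squares

# Within SS by hand: squared deviations of each record from its own centroid
manualWithin <- sapply(1:3, function(k) {
  members <- as.matrix(x[kmeansClusters$cluster == k, , drop = FALSE])
  sum(sweep(members, 2, kmeansClusters$centers[k, ])^2)
})

# Between SS from the table of cluster means: squared distance of each
# centroid from the grand mean, weighted by cluster size
grandMean <- colMeans(x)
manualBetween <- sum(kmeansClusters$size *
                     rowSums(sweep(kmeansClusters$centers, 2, grandMean)^2))

all.equal(manualWithin, kmeansClusters$withinss, check.attributes = FALSE)
all.equal(manualBetween, kmeansClusters$betweenss)
```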
This gives you a way of determining what proportion of the total sum of squares is accounted for by the sum of squares between. You're, in fact, getting close to an F-test with this analysis, but it's only close because you're not correcting for the degrees of freedom. It takes a little more work to compare the cluster that kmeans has assigned each record to with that record's actual species. But you want to do that in order to evaluate the accuracy of the cluster analysis.
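Both ideas in the preceding paragraph can be sketched in a few lines. The names propBetween, pseudoF, and confusion are hypothetical, and the seed is my addition. The F-like ratio is shown only to illustrate the degrees-of-freedom correction; it is not a valid significance test, because the clusters were formed specifically to separate the records.

```r
library(datasets)
set.seed(42)  # assumption: not in the video
kmeansClusters <- kmeans(iris[, 1:4], centers = 3)

# Proportion of the total sum of squares accounted for by the clusters
propBetween <- kmeansClusters$betweenss / kmeansClusters$totss
propBetween

# The same comparison with each sum of squares divided by its degrees of
# freedom: k - 1 between clusters, n - k within clusters
n <- nrow(iris)
k <- 3
pseudoF <- (kmeansClusters$betweenss / (k - 1)) /
           (kmeansClusters$tot.withinss / (n - k))
pseudoF

# Cross-tabulate actual species against assigned cluster
confusion <- table(Species = iris$Species, Cluster = kmeansClusters$cluster)
confusion
```

Each row of the table sums to 50, one row per species. Because the cluster numbers are arbitrary, the column in which a species concentrates tells you which cluster corresponds to it.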
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert