Join Conrad Carlberg for an in-depth discussion in this video, Multivariate nature of clustering, part of Data Reduction Techniques Using Excel and R: Business Analytics Deep Dive.
- [Instructor] In previous lessons, you've had a very quick look at a pair of statistical techniques called the analysis of variance, or ANOVA, and multivariate analysis of variance, or MANOVA. With ANOVA, we use one outcome variable, and with MANOVA, we use multiple outcome variables. You saw that one way to decide whether two or more groups are reliably different from one another is to calculate the mean value for each group on some outcome variable and see whether the variability within the groups is greater or less than the variability between the groups.
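As a rough illustration of that within-versus-between comparison, here's a minimal sketch of the one-way ANOVA F ratio in Python. The function name `f_ratio` and the two groups of scores are made up for this example; they are not from the course files.

```python
from statistics import mean

def f_ratio(groups):
    """Return the one-way ANOVA F ratio for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of individuals around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical scores for two groups (made-up numbers, for illustration only).
group_a = [4.0, 5.0, 6.0]
group_b = [7.0, 8.0, 9.0]
print(f_ratio([group_a, group_b]))
```

An F ratio well above 1 means the group means differ by more than the spread inside the groups would lead you to expect. Judging whether that difference is reliable still requires comparing the ratio with the F distribution, which this sketch leaves out.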
If the differences between groups are greater than the differences within groups, you might conclude that the differences between groups are reliable and that if you ran the same experiment again, you would get a similar result. We've also seen that even if a reliable difference exists between the groups, that doesn't necessarily mean that the groups, such as sex or political party, are the cause of the difference in the outcome variable. Without a solid experimental design, all we know is that there is an apparently reliable difference between the groups.
Without that experimental design, the group differences might be due to doughnuts. Cluster analysis does the reverse of what you do in ANOVA and MANOVA. With ANOVA or MANOVA, you have existing groups. They might be existing intact groups, such as Democrats and Republicans in a small town, or they might be groups that you created by random assignment. On the other hand, with cluster analysis, you try out different ways of populating the groups. You put people in groups called clusters, and then you repetitively pull the people out and reassign them.
Each time, you calculate the distances between the clusters as measured by the outcome variables. You continue until you find the ideal composition of the groups. That's the one that maximizes the distances between the clusters while minimizing the distances within the clusters. Now, if all you have is one variable, such as income, to distinguish between the groups, that's a simple procedure. If you want to create just two groups, then the most straightforward way of going about it would be to put the 50% of the people with the lowest incomes in one group and the 50% with the highest incomes in the other group.
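The assign, measure, and reassign cycle described above can be sketched on a single variable. This is a k-means-style loop, which may differ from the procedure used later in the course; the function name `reassign_until_stable`, the starting centers, and the incomes are all assumptions made for this example.

```python
from statistics import mean

def reassign_until_stable(values, starts, max_rounds=100):
    """k-means-style loop on one variable: assign, recompute centers, repeat."""
    centers = list(starts)
    labels = None
    for _ in range(max_rounds):
        # Put each value in the cluster whose center is nearest.
        new_labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        if new_labels == labels:
            break  # nobody moved, so the composition is stable
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = mean(members)
    return labels, centers

# Hypothetical annual incomes in thousands (made-up numbers).
incomes = [62, 28, 75, 31, 90, 35, 70, 30]
labels, centers = reassign_until_stable(incomes, starts=[min(incomes), max(incomes)])
print(labels)
print(centers)
```

With these made-up incomes, the loop settles on the same composition as a half-and-half split: the four lowest incomes in one cluster and the four highest in the other. With lopsided data, the two approaches need not agree.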
But often we have more than just one variable to distinguish the groups: for example, income, age, and frequency of prior purchases. In that case, we would want to populate the groups or clusters according to not only the differences in people's incomes, but also the differences in their ages and purchase history. Instead of having just two group means to work with, we might have to distinguish the groups by average income, average age, and average prior purchases. Each group has a mean on each variable.
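To make "a mean on each variable" concrete, here's a small sketch that computes each group's set of means, often called its centroid. The function name `group_centroids`, the customer records, and the group labels are hypothetical.

```python
from statistics import mean

def group_centroids(rows, labels):
    """Return {label: tuple of per-variable means} for the given assignment."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: tuple(mean(column) for column in zip(*members))
        for label, members in grouped.items()
    }

# Hypothetical customers: (income in thousands, age, prior purchases).
customers = [(30, 25, 2), (34, 27, 4), (86, 52, 12), (90, 54, 14)]
labels = ["low", "low", "high", "high"]
for label, centroid in group_centroids(customers, labels).items():
    print(label, centroid)
```

Each group now has three means instead of one, and those means are in different units: thousands of dollars, years, and counts.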
How do you combine them? The combination of principal components analysis and cluster analysis will combine the variables for you. You may recall an earlier lesson in this course discussed the principal components analysis of crime rates in the 50 states. The analysis derived two principal components: crimes against people, and crimes against property. If you take the factor scores reported by the principal components analysis, whether that's done by R or by Excel or some other application, you can subject those principal components to a cluster analysis.
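Here's a rough sketch of where component scores come from, assuming standardized variables and using power iteration to estimate only the first, unrotated component. That is a simplification: it is not necessarily how R or Excel extracts components, and it skips the rotation step. The data are made-up rates for six regions, not the 50-state crime data, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def standardize(rows):
    """Convert each column to z-scores (assumes no column is constant)."""
    columns = list(zip(*rows))
    centers = [mean(c) for c in columns]
    spreads = [pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, centers, spreads))
        for row in rows
    ]

def first_component(z_rows, rounds=200):
    """Estimate the first principal component's weights by power iteration."""
    n, width = len(z_rows), len(z_rows[0])
    # Correlation matrix of the standardized columns.
    corr = [
        [sum(r[i] * r[j] for r in z_rows) / n for j in range(width)]
        for i in range(width)
    ]
    weights = [1.0] * width
    for _ in range(rounds):
        # Multiply by the correlation matrix, then rescale to unit length.
        product = [
            sum(corr[i][j] * weights[j] for j in range(width))
            for i in range(width)
        ]
        norm = sqrt(sum(v * v for v in product))
        weights = [v / norm for v in product]
    return weights

def component_score(z_row, weights):
    """Weighted sum of a row's standardized values."""
    return sum(v * w for v, w in zip(z_row, weights))

# Hypothetical rates for six regions on three measures (made-up numbers).
rates = [
    (2.0, 3.0, 30.0), (3.0, 4.0, 28.0), (2.5, 3.5, 35.0),
    (8.0, 9.0, 60.0), (9.0, 10.0, 65.0), (8.5, 9.5, 58.0),
]
z_rows = standardize(rates)
weights = first_component(z_rows)
component_scores = [component_score(r, weights) for r in z_rows]
print(weights)
print(component_scores)
```

Each region ends up with a single score that combines all three measures, and those scores are what you would hand to the cluster analysis in place of the original variables.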
Once you have the clusters, you can analyze the factors as a function of the clusters. In that way, you can analyze types of crime by cluster, different measures of sales efficiency, such as total revenue and number of units, by product line, and any other principal component by any other type of cluster. I'll walk you through an example of deriving principal components and clusters and analyzing principal components by clusters in chapter four. In the meantime, the chart of principal components by clusters for that crime rate data shows clearly how rotating the components distinguishes the four clusters that the states are divided into.
This chart also suggests the subjectivity that a cluster analysis can involve. You probably need to know quite a bit about the state demographics to understand why states as apparently different as California and New Mexico, New York and Missouri, or Illinois and Florida belong to the same cluster.
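Analyzing components as a function of clusters, as described above, can be as simple as averaging each component score within each cluster. Here's a minimal sketch; the function name `mean_scores_by_cluster`, the cluster labels, and the scores are hypothetical and are not the course's crime rate results.

```python
from collections import defaultdict
from statistics import mean

def mean_scores_by_cluster(records):
    """Average each component score within each cluster.

    records: iterable of (cluster_label, component_1_score, component_2_score).
    """
    grouped = defaultdict(list)
    for label, *scores in records:
        grouped[label].append(scores)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in sorted(grouped.items())
    }

# Hypothetical cluster labels and component scores (made-up numbers).
records = [
    ("A", 1.0, -0.5),
    ("A", 2.0, -1.5),
    ("B", -1.0, 0.5),
    ("B", -2.0, 1.5),
]
summary = mean_scores_by_cluster(records)
for label, averages in summary.items():
    print(label, averages)
```

A table of averages like this is the numeric counterpart of a chart of components by clusters. Interpreting why particular members share a cluster still takes knowledge of the subject matter.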
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert