See how cluster analysis minimizes Wilks' Lambda through iterative reassignment of subjects to clusters.
- [Instructor] The prior lesson mentioned how the analysis of variance, or ANOVA, enables you to judge whether two or more groups are reliably different on some quantitative measure such as revenue, cholesterol level, or miles per gallon. But there's nothing about ANOVA, or about its multivariate counterpart, MANOVA, that implies causation. That's a matter for the design of an experiment, not for the statistical test. Suppose that you were studying the effect of exercise versus diet on cholesterol levels. You're working with two hospitals that each have 30 patients who want to lower their LDL levels (LDL is one type of cholesterol).
You have the patients at one hospital go on a vegetarian diet for two months, while the patients at the other hospital participate in a program of cardiovascular exercise for the same two months. At the end of the experiment, you find that the average LDL level in the exercise group is 20 points higher than in the diet group. An analysis of variance indicates that a difference that large would come about through random chance only one percent of the time. That is, if you ran the experiment 100 times and there were no difference between the populations of people who dieted and those who exercised, you would get a 20-point difference by chance in only one replication of your experiment.
It's more rational to decide that there's a real difference between the groups than that you encountered a one percent accident. But does that mean the 20-point difference in cholesterol levels is due to the difference in treatments, diet versus exercise? No, not as I've described the experiment. I had you work with two intact groups, 30 patients from one hospital and 30 from another. What if a donut shop opened right next door to the hospital where your patients were on an exercise regimen? Those patients might have gone for donuts after their daily exercise period, and that could account for the cholesterol difference all by itself.
Your groups are not equivalent on everything except their treatments, so you can't attribute the post-experiment difference in cholesterol levels to the difference in treatments. Nevertheless, this much is still true: the probability of getting a 20-point difference in the samples when the difference in the populations is zero is so small that you should conclude that, whatever the reason, the populations from which you took your samples differ as to mean cholesterol level. The population of exercisers who patronize donut shops has a higher average LDL level than the population of vegetarians.
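The chance calculation behind the example above can be sketched as a one-way ANOVA. The LDL numbers here are invented for illustration, not the course's data; the point is just how the F statistic compares between-group to within-group variation.

```python
# A minimal one-way ANOVA sketch with made-up LDL values (mg/dL).
from statistics import mean

diet = [102, 95, 110, 88, 105, 99, 112, 97, 104, 101]        # hypothetical
exercise = [125, 118, 130, 109, 127, 121, 133, 116, 124, 122] # hypothetical

groups = [diet, exercise]
grand = mean(x for g in groups for x in g)

# Between-groups and within-groups sums of squares
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1                      # 1
df_within = sum(len(g) for g in groups) - len(groups)  # 18

# F is the ratio of the two mean squares; a large F means the group
# difference is unlikely to be a chance result.
f_stat = (ss_between / df_between) / (ss_within / df_within)
```

Comparing `f_stat` with the critical F value for the chosen significance level (from a table, Excel's F.INV.RT, or R's `qf`) gives the same kind of one-percent judgment described above.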
But if you have tens of thousands of customers instead of 60 patients, you have just the opposite problem: too many cases rather than too few. You'd like a way to classify them into a manageable number of groups. It may be that you can determine that groups A and B turn out to be likely to buy your product, while group C is relatively unlikely to do so, and groups D and E are hard to pin down as to their buying habits. Then you can aim your sales and marketing efforts at a relatively small number of groups, group A and group B in this example, instead of an undifferentiated mass of individual prospects.
How do you go about dividing all those tens of thousands of people into just a few groups? It's simple if you have access to only one variable that describes each person. Suppose that variable is annual income. You just take the 50% of people who are below the median income and assign them to one cluster, and put the remaining 50% of people into a different cluster. Or, if you want three groups, you distinguish the lowest, the middle, and the highest 33% of the people. It's more complicated when you have more than one variable.
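The one-variable case just described can be sketched in a few lines: sort by income and cut at rough tertile boundaries. The incomes below are invented for illustration.

```python
# Sketch of single-variable clustering: split prospects into low, middle,
# and high income clusters at approximate tertile cutoffs. Invented data.
incomes = [28_000, 95_000, 41_000, 62_000, 150_000, 33_000, 78_000, 55_000, 120_000]

ranked = sorted(incomes)
n = len(ranked)
cutoffs = [ranked[n // 3], ranked[2 * n // 3]]  # rough tertile boundaries

def cluster(income):
    """Assign an income to the low, middle, or high tertile cluster."""
    if income < cutoffs[0]:
        return "low"
    elif income < cutoffs[1]:
        return "middle"
    return "high"

labels = [cluster(x) for x in incomes]
```

With these nine values each cluster ends up with three members, which is exactly the "lowest, middle, highest 33%" split described above.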
With, say, 10 or 20 variables, you have to judge how much weight to assign to each variable, how best to combine them, and so on, as in this example, where the higher the annual income, the fewer the prior purchases the prospect has made. It's unclear how best to use that information in the formation of clusters, but cluster analysis does it for you.
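One common way cluster analysis combines multiple variables is k-means: standardize each variable so no single scale dominates, then iteratively reassign each case to its nearest cluster center. This is a minimal two-variable sketch with invented (income, prior purchases) pairs, not the course's data or method.

```python
# Minimal k-means sketch on two standardized variables. Invented data.
from statistics import mean, pstdev

# (annual income, prior purchases) for each prospect -- hypothetical
data = [(30_000, 9), (35_000, 8), (40_000, 7),
        (90_000, 2), (95_000, 1), (100_000, 0)]

# Standardize each variable to mean 0, sd 1, so income's large scale
# doesn't swamp the purchase counts.
cols = list(zip(*data))
means = [mean(c) for c in cols]
sds = [pstdev(c) for c in cols]
z = [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in data]

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# k = 2 clusters, seeded with the first and last points; repeat the
# assign/update steps until no case changes clusters.
centers = [z[0], z[-1]]
labels = [0] * len(z)
while True:
    new_labels = [min(range(2), key=lambda k: dist2(p, centers[k])) for p in z]
    if new_labels == labels:
        break
    labels = new_labels
    for k in range(2):
        members = [p for p, lab in zip(z, labels) if lab == k]
        centers[k] = tuple(mean(c) for c in zip(*members))
```

The reassignment loop is the "repetitive reassignment of subjects to clusters" the course blurb refers to; no manual weighting of the two variables was needed.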
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
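The opening line refers to Wilks' Lambda, the ratio of within-cluster to total variation; smaller values mean better-separated clusters, which is why iterative reassignment aims to drive it down. A one-variable sketch with invented numbers:

```python
# Wilks' Lambda for a single variable: within-cluster sum of squares
# divided by total sum of squares. A better assignment gives a smaller
# Lambda. The values and labels are invented for illustration.
from statistics import mean

values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]

def wilks_lambda(values, labels):
    """Return SS_within / SS_total for the given cluster labels."""
    grand = mean(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_within = 0.0
    for k in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == k]
        m = mean(members)
        ss_within += sum((v - m) ** 2 for v in members)
    return ss_within / ss_total

bad = wilks_lambda(values, [0, 0, 1, 1, 0, 1])   # scrambled assignment
good = wilks_lambda(values, [0, 0, 0, 1, 1, 1])  # natural grouping
```

Reassigning subjects from the scrambled labeling to the natural one lowers Lambda, which is the behavior the course demonstrates at full scale.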
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert