See how natural groupings, such as clinic versus retail, create problems due to self-selection.
- One of the reasons that so-called big data can make analysis difficult is that it's so easy to collect observations on tens or hundreds of thousands of people. For example, suppose that all those people are customers. It's often useful to categorize those people into different classes, groups, or clusters, and you may have some variables available that enable you to group them. You may know their state of residence, their ZIP code, their general income level, or whether they have purchased goods from you before.
But it's entirely possible that none of those variables is useful to you as a means of classifying the customers. Just knowing their states of residence doesn't necessarily put you in a position to sell them more. Often, categories that you did not expect are hidden in the mass of data, especially when online retailing is starting to swamp bricks and mortar. A relatively new method of categorizing people, other living beings, or even objects is called cluster analysis. The initial writing on cluster analysis dates to the 1960s, so although it is a 50-year-old technique, it's still younger than most statistical techniques in use today.
The idea behind cluster analysis is to use the data itself, such as state of residence, age, income level, and so on, to establish clusters to which your customers belong. This is done on the basis of how close different people are to one another when all of the variables that define distance are taken into account. The distance between people or between clusters is often measured using Euclidean distance, although more complicated methods are also popular. Euclidean distance is the distance defined by the Pythagorean theorem: the square of the hypotenuse is the sum of the squares of the other two sides.
Here, we chart the number of math and business books owned by two people. We can drop a vertical line down from the upper charted point, and extend a horizontal line across from the lower charted point. That makes it easy to visualize the vertical and horizontal distances between the points. Finally, the hypotenuse of the triangle is measured using Pythagoras as a guide: square the vertical and horizontal distances, sum the squares, and take the square root of the result.
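The calculation just described can be sketched in a few lines of Python. This is only an illustration: the book counts are made-up values (the charted numbers are not given here), and the function name euclidean_distance is my own label, not something from the course.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square each coordinate difference, sum the squares, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical counts of (math books, business books) for two people.
person_a = (2, 3)
person_b = (5, 7)

distance = euclidean_distance(person_a, person_b)
print(distance)  # horizontal gap of 3, vertical gap of 4, so the hypotenuse is 5.0
```

The same function works unchanged with more than two variables, which is how the idea extends from a two-dimensional chart to many customer attributes at once.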
Two broad methods of cluster analysis exist. One is called the linkage method, which is actually a set of several similar linkage methods. This course will not cover them. The main reason is that linkage results are shown in a kind of diagram called a dendrogram, and when you have a great many cases, as you generally do with big data, it becomes very difficult to interpret the diagram. Here is an example. The dendrogram shows the distances at which two people become a cluster or where another person joins an existing cluster.
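To make the idea of merge distances concrete, here is a rough Python sketch of one linkage variant, single linkage, where the distance between two clusters is taken as the distance between their closest members. The four people and their coordinates are invented for illustration, and the function name single_linkage_merges is an assumption of mine. The list it prints is the information a dendrogram draws: which groups joined, and at what distance.

```python
import math
from itertools import combinations


def single_linkage_merges(points):
    """Return the merges in order, each as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in points]  # everyone starts as a cluster of one
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single linkage: cluster distance is the closest pair of members.
            d = min(math.dist(points[x], points[y]) for x in a for y in b)
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Hypothetical people, each described by two numeric variables.
people = {"Ann": (1, 1), "Bob": (1, 2), "Cy": (5, 5), "Dee": (5, 7)}

merges = single_linkage_merges(people)
for a, b, d in merges:
    print(a, "joins", b, "at distance", round(d, 2))
```

With n people there are n - 1 merges to draw, which hints at why the diagram becomes hard to read for large data sets.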
The dendrogram is complex enough with only a handful of people. It gets very difficult with thousands of people. Nevertheless, the linkage methods have a good deal to recommend them, and if you anticipate doing much cluster analysis, it's a good idea to become familiar with them.
The other broad class of cluster analysis methods is usually termed k-means. I will discuss more of how the k-means method works later in this course. Briefly, it's a trial-and-error method under which membership in different clusters is tested to see whether the distance between people in the clusters is relatively large or relatively small compared to the distance between clusters.
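As a preview, the trial-and-error loop of k-means can be sketched as follows. This is a bare-bones version of the standard procedure (often called Lloyd's algorithm), not the Excel or R implementation used later in the course. The data points are made up, the function name k_means is my own, and seeding the centroids with the first k points is a simplifying assumption; real implementations usually pick starting points more carefully.

```python
import math


def k_means(points, k, max_iter=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = [points[i] for i in range(k)]  # assumption: seed with the first k points
    assignments = None
    for _ in range(max_iter):
        # Trial step: put each point in the cluster whose centroid is closest.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no one changed cluster, so the solution has settled
        assignments = new_assignments
        # Update step: move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centroids


# Hypothetical customers measured on two variables, forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

assignments, centroids = k_means(points, 2)
print(assignments)  # cluster index for each customer
print(centroids)    # the center of each cluster
```

Note that the second argument, the number of clusters, has to be supplied by the analyst before the procedure can run.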
One drawback, shared by the linkage and the k-means methods, is that you need to specify how many clusters the analysis should establish. With k-means, you must decide on that number before you run the analysis. This is not always a drawback, because the logic of the situation frequently tells you how many clusters are needed.
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert