Recognize that even when the number of variables is manageable, the number of values often isn't.
- Another problem with big data is the number of variables that you have to analyze. It can be very difficult to choose a manageable number of variables for an analysis when you have 10, 20, 50, or even more variables available. Even when you have only one or two variables available, the sheer number of values of one of those variables could be overwhelming. For example, suppose that your sales database has only two variables, product and price. Analysis of the number of units sold for each product seems a reasonable place to start, but what if you have 60 products? In that case, a breakdown of the number sold of each product on a percentage basis might not tell you very much.
You might even find that every product on your list is responsible for one or two percent of your total sales. Combine that sort of difficulty, too many variables or too many values, with a large number of sales records, and any analysis that uses the individual customer or the individual sale as its unit of analysis is bound to be awfully cumbersome. But suppose that you have some way to combine variables. There might be some underlying structure that you cannot observe directly that causes some groups of products to sell well and others to sell poorly, or some underlying structure that causes certain groups of customers to buy certain groups of products.
This is the sort of problem that principal components analysis and factor analysis are intended to address, and this is a fairly simple example on your screen. We have seven variables in rows 7 through 13. Each variable measures the per capita rate of one type of crime in each of the 50 states, according to a federal report from the 1970s. Running that data through a principal components package resulted in the factor loadings shown in cells B7 through C13.
The loadings are similar to correlation coefficients. The larger the loading, the stronger the relationship between the factor and the variable. Going on that basis alone, it appears as though there are just two factors: one that represents murder (notice the powerful loading on this factor over here in cell C7) and one that represents all of the types of crime. Each of these types of crime loads at least .72 on factor one, going all the way up to .85.
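To make the idea of a loading concrete, here is a minimal Python sketch. It assumes a small, made-up correlation matrix for four hypothetical variables (not the crime data on the screen), and it finds the two largest principal components by power iteration with deflation, a simple stand-in for whatever a principal components package does internally. Each loading is an eigenvector element scaled by the square root of its eigenvalue, which is why it can be read much like a correlation between the variable and the component.

```python
import math

# Hypothetical correlation matrix for four variables (illustrative values,
# not the crime data discussed in the video).
R = [
    [1.0, 0.8, 0.3, 0.3],
    [0.8, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.8],
    [0.3, 0.3, 0.8, 1.0],
]


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_component(m, iterations=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix,
    found by power iteration."""
    v = [float(i + 1) for i in range(len(m))]  # deliberately asymmetric start
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v


def principal_loadings(corr, n_components):
    """Return (eigenvalues, loadings); loadings has one row per variable
    and one column per component."""
    n = len(corr)
    m = [row[:] for row in corr]
    eigenvalues, columns = [], []
    for _ in range(n_components):
        lam, v = top_component(m)
        eigenvalues.append(lam)
        # Loading = eigenvector element * sqrt(eigenvalue).
        columns.append([x * math.sqrt(lam) for x in v])
        # Deflate: subtract the component just found so that the next
        # pass converges on the next-largest one.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    loadings = [list(row) for row in zip(*columns)]
    return eigenvalues, loadings


eigenvalues, loadings = principal_loadings(R, 2)
for name, row in zip(["var1", "var2", "var3", "var4"], loadings):
    print(name, [round(x, 2) for x in row])
```

With this made-up matrix, all four variables load about 0.77 on the first component, while the second component splits them into two pairs with loadings of about 0.55 in opposite directions. The sign of a whole column is arbitrary.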
But another step that we often take in this sort of analysis is to rotate the factors, and you'll learn more about that later in this course. After rotation, we have murder, rape, and assault loading on factor two at .95, .62, and .90, and robbery, burglary, larceny, and auto theft loading fairly highly on factor one. The two factors appear to represent crimes against property and crimes against people. So here we have reduced a set of seven variables to a set of two principal components.
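Rotation can be sketched in a few lines when there are only two factors. The Python example below assumes a varimax rotation (a common choice; the video does not say which rotation produced the results on screen) and uses made-up unrotated loadings for four hypothetical variables. With exactly two factors, the angle that maximizes the raw varimax criterion has a closed form, so no iterative solver is needed. The function names are illustrative, not taken from any package.

```python
import math

# Hypothetical unrotated loadings: one row per variable, columns are
# factor 1 and factor 2. Illustrative values, not the crime data.
unrotated = [
    [0.77, 0.55],
    [0.77, 0.55],
    [0.77, -0.55],
    [0.77, -0.55],
]


def varimax_angle(loadings):
    """Rotation angle (radians) that maximizes the raw varimax criterion
    for a two-factor loading matrix."""
    p = len(loadings)
    u = [x * x - y * y for x, y in loadings]
    v = [2.0 * x * y for x, y in loadings]
    a, b = sum(u), sum(v)
    c = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    d = 2.0 * sum(ui * vi for ui, vi in zip(u, v))
    return 0.25 * math.atan2(d - 2.0 * a * b / p, c - (a * a - b * b) / p)


def rotate(loadings, angle):
    """Rotate the two factor axes by the given angle (orthogonal rotation)."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [[x * cos_a + y * sin_a, -x * sin_a + y * cos_a] for x, y in loadings]


def reflect(loadings):
    """Flip the sign of any factor whose loadings sum to a negative number.
    The sign of a factor is arbitrary, so this only aids readability."""
    signs = [1.0 if sum(col) >= 0 else -1.0 for col in zip(*loadings)]
    return [[s * x for s, x in zip(signs, row)] for row in loadings]


rotated = reflect(rotate(unrotated, varimax_angle(unrotated)))
for name, row in zip(["var1", "var2", "var3", "var4"], rotated):
    print(name, [round(x, 2) for x in row])
```

Before rotation, every variable loads on both factors. After it, each variable loads highly on one factor and weakly on the other (about 0.93 versus 0.16 with these made-up numbers), which is the kind of clean split that makes factors easier to name.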
By analyzing the pattern of correlations between the different variables, it may be possible to identify and describe that underlying structure. You might be able to discover latent variables, variables that you cannot observe directly, but that cause the variables you can observe directly to be strongly correlated. Using principal components or factor analysis can often enable you to reduce the number of variables that you are analyzing from an unmanageable 30 or 60 to a much more manageable three, four, or five.
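One way to judge whether a handful of components is an adequate substitute for the original variables is to check how much of the total variance they retain. The short Python sketch below assumes standardized variables (an analysis based on the correlation matrix) and made-up unrotated loadings for four hypothetical variables. Under that assumption, the sum of squared loadings in a column is the variance that component accounts for, and the sum of squared loadings in a row is the share of that variable's variance the retained components capture.

```python
# Hypothetical unrotated loadings: one row per variable, one column per
# retained component. Illustrative values, not the crime data.
loadings = [
    [0.77, 0.55],
    [0.77, 0.55],
    [0.77, -0.55],
    [0.77, -0.55],
]

n_variables = len(loadings)

# Variance accounted for by each component: column sums of squared loadings.
variance_by_component = [sum(x * x for x in col) for col in zip(*loadings)]

# With standardized variables, total variance equals the number of variables.
proportion_by_component = [v / n_variables for v in variance_by_component]

# Share of each variable's variance captured by the retained components.
communalities = [sum(x * x for x in row) for row in loadings]

print("variance by component:  ", [round(v, 3) for v in variance_by_component])
print("proportion by component:", [round(p, 3) for p in proportion_by_component])
print("total proportion kept:  ", round(sum(proportion_by_component), 3))
print("communalities:          ", [round(c, 3) for c in communalities])
```

In this made-up case, two components retain about 90 percent of the variance in four variables. The same bookkeeping applies when going from 30 or 60 variables down to a few components.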
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert