Principal components are combinations of observed variables such as revenue from prescription drugs or from OTC pain medications.
- [Instructor] The information I've presented so far is all fairly abstract, so let's take a closer look at actually running a principal components analysis and examining its results. I'm going to open the Excel file named Factor.xlsm, and then I'm going to open the Excel file named Product Sales.xlsx. It's important to open the files in that specific order. The Excel file named Factor.xlsm contains code that runs the principal components analysis. Let's take a look at how it works.
I click the Principal Components item on the Add-ins tab, and in the Input Range edit box, I enter the range address of the product sales including the header row with the variable names in it. Because I have variable labels in the first row, I fill the appropriate checkbox. In many cases it's useful to include Record IDs, but at this exploratory stage it's not. I'm using raw data rather than a correlation matrix as the form of the input, so I click the Raw Data option button.
Now I click the Rotation tab on the dialog box, and select the Varimax method. Because I'm familiar with this dataset, I know that I want to retain three factors, so I enter three as the number of factors to retain. Finally, I click OK. It can take a little while, perhaps as long as a minute or two, for the Excel workbook to complete the process. You can keep track of what the code is doing by watching the status bar at the bottom of the Excel window.
When it is finished, select the Principal Components worksheet. There is a lot of information on that worksheet, such as the correlation or R matrix, its inverse, and some statistical tests. The eigenvalues are particularly important. The number of factors to retain is an important decision in principal components analysis. The whole idea is to reduce the number of variables down to a more manageable number of factors.
At the same time, you don't want to lose important information by discarding important factors. One good way to make this decision is to examine the eigenvalues, which are shown on the Principal Components worksheet. The eigenvalues, and there is one for each factor, total up to the number of variables in the input data. So, with 21 input data variables, the eigenvalues would total to 21 units of variance. Principal components analysis seeks to explain individual variables by means of the strength of their relationships to the underlying factors.
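To make the point about the eigenvalues totaling the number of variables concrete, here is a minimal Python sketch. It is not the code inside Factor.xlsm, which the narration does not show. The four revenue columns are made-up numbers, and the helper names `correlation_matrix` and `jacobi_eigenvalues` are hypothetical. Each standardized variable contributes one unit of variance, so the diagonal of the correlation matrix is all ones and the eigenvalues have to add up to the number of variables.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    devs = [[x - m for x in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(d * d for d in dev)) for dev in devs]
    p = len(columns)
    return [[sum(a * b for a, b in zip(devs[i], devs[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) off-diagonal entry.
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Made-up revenue figures for four product lines (illustration only).
columns = [
    [120, 135, 150, 160, 172, 190],  # prescription drugs
    [80, 82, 95, 90, 101, 110],      # OTC pain medications
    [40, 38, 45, 50, 47, 42],        # vitamins
    [22, 30, 25, 28, 35, 31],        # first aid
]

R = correlation_matrix(columns)
eigenvalues = jacobi_eigenvalues(R)
print("eigenvalues:", [round(e, 4) for e in eigenvalues])
print("total:", round(sum(eigenvalues), 6), "variables:", len(columns))
```

The Jacobi method is used here only because it needs nothing beyond the standard library. In practice a linear algebra library would do the same job.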
We're looking for factors that account for, or underlie, more than just one variable. So one good criterion for determining how many factors to retain is whether a factor's eigenvalue is greater than one. If it is, then it's accounting for more than just one variable, and that's how I decided to retain three factors. There are actually six factors in this data set with eigenvalues greater than one, but eigenvalues four, five, and six are each just barely greater than one, so I decided to retain just three factors for the purposes of rotation.
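The eigenvalue-greater-than-one rule can be sketched in a few lines of Python. The 21 eigenvalues below are invented for illustration and are not the values from the Product Sales workbook. They are chosen only so that six exceed one and the fourth through sixth do so narrowly. The `margin` cutoff of 1.1 is likewise an assumption, standing in for the subjective judgment that "barely greater than one" is not enough.

```python
# Illustrative eigenvalues for 21 variables (made up; they sum to 21).
eigenvalues = [5.2, 3.1, 2.4, 1.08, 1.05, 1.02, 0.95, 0.9, 0.85, 0.8, 0.7,
               0.6, 0.5, 0.45, 0.4, 0.3, 0.25, 0.2, 0.15, 0.07, 0.03]

# Objective part: keep factors whose eigenvalue exceeds one, that is,
# factors that account for more variance than a single variable.
above_one = [e for e in eigenvalues if e > 1.0]

# Subjective part (an assumed cutoff): drop factors that clear the bar
# only marginally.
margin = 1.1
retained = [e for e in eigenvalues if e > margin]

print("eigenvalues above one:", len(above_one))        # 6
print("retained with margin", margin, ":", len(retained))  # 3
print("share of variance retained:",
      round(sum(retained) / len(eigenvalues), 3))
```

Dividing by `len(eigenvalues)` works because the eigenvalues total the number of variables, so the result is the proportion of total variance carried by the retained factors.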
There are other ways to make this decision, but all are a mix of objective and subjective criteria. There are plenty of other reasons that the eigenvalues and eigenvectors that accompany them are important. Among those reasons is that the factors are extracted from the data sequentially. The factor that accounts for most of the variance is extracted first. Then the factor that accounts for the second greatest degree of variance is extracted, and so on. Furthermore, the factors are all orthogonal to one another. That simply means that the factors have zero correlation with each other, and in turn, that means that each factor measures something unique.
There is no ambiguity about the way that the variance is assigned to each factor. In an orthogonal solution, the sort we're working with here, any variance that's associated with one factor cannot be associated with any other factor. That's due to the way that the factors are extracted from the original set of variables.
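One textbook way to see both ideas, sequential extraction and orthogonality, is power iteration with deflation. The sketch below is not necessarily how Factor.xlsm works, and the 3-by-3 correlation matrix and the function names are hypothetical. It finds the direction with the largest eigenvalue, removes the variance that direction accounts for, and repeats. Because that variance has been removed, the next direction cannot reuse it, and the resulting eigenvectors come out perpendicular to one another, which is what makes the component scores uncorrelated.

```python
import math

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dominant_eigenpair(m, iterations=500):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    v = [float(i + 1) for i in range(len(m))]  # arbitrary starting vector
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return dot(v, mat_vec(m, v)), v

def extract_components(r, count):
    """Extract components one at a time, deflating after each."""
    n = len(r)
    m = [row[:] for row in r]
    pairs = []
    for _ in range(count):
        lam, v = dominant_eigenpair(m)
        pairs.append((lam, v))
        # Remove the variance this component accounts for.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

# Hypothetical correlation matrix for three variables.
R = [[1.0, 0.6, 0.3],
     [0.6, 1.0, 0.2],
     [0.3, 0.2, 1.0]]

components = extract_components(R, 3)
for k, (lam, vec) in enumerate(components, start=1):
    print(f"component {k}: eigenvalue {lam:.4f}")

# Every pair of eigenvectors should have a dot product of about zero.
for i in range(3):
    for j in range(i + 1, 3):
        print(i + 1, j + 1, round(dot(components[i][1], components[j][1]), 8))
```

Power iteration is chosen here because each pass visibly extracts the largest remaining source of variance. It converges slowly when two eigenvalues are close together, so it is a teaching device rather than a production method.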
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert