Show how component loading helps define components.
- [Instructor] Let's have a look at how you can do principal components analysis using the freeware application R instead of Excel. R has several ways of doing principal components analysis. The one that I'm partial to, and that I will demo here, is called principal, and it's found in the psych package. That's psych, all lowercase. You will need to have installed that package. The course called R for Excel Users describes how to install packages in R. After you have opened R, enter the command library(psych).
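As a minimal sketch of that first step (the commented-out install line is an assumption about a setup where the package is not yet present; the video refers to another course for installation):

```r
# One-time setup, only if psych is not installed yet (assumption, not shown in the video):
# install.packages("psych")

# Attach the package; on success R typically just returns to the prompt.
library(psych)
```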
After you have entered that statement, R doesn't normally do anything except provide another command prompt for you. The library statement just makes the psych package available to you. You first need to read the data into the R workspace. I have saved the data in a comma-separated values file in R's working directory. We can read the data into an R data frame named PCAdata, for principal components analysis data, using the read.csv command, just like this.
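The command itself appears on screen rather than in the transcript, so what follows is only a sketch of what such a call might look like. The file name, the three column names, and the tiny stand-in file written to a temporary folder are all assumptions, included so the example runs on its own:

```r
# Stand-in CSV so the sketch is runnable; the course's actual file name is not stated.
csv_path <- file.path(tempdir(), "PCAdata_demo.csv")
writeLines(c("Var1,Var2,Var3", "1,2,3", "4,5,6"), csv_path)

# header = TRUE (the video writes the shorthand T) tells R that the
# first row of the file holds variable names rather than data.
PCAdata <- read.csv(csv_path, header = TRUE)
```

With a file saved in R's working directory, as in the video, the path would simply be the file's name.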
The header = T argument to the read.csv function, with an uppercase T, is short for header = TRUE. It simply informs R that the first row, the header row in the file, contains variable names. You could, instead of using the read.csv function, use the XLGetRange function, which is part of the DescTools package. I have found, though, that with many records, and this file has over 20,000 records, reading a CSV file takes much less time than importing the data directly from an Excel worksheet using the XLGetRange function.
Therefore, here I'm using the read.csv function. It reads the data from the CSV file into an R data frame. If you'd like to get a quick look at some of the data that you have imported, you can use the head function, like this. Now, let's run the principal components analysis. We'll use the principal function. We have access to that function because we loaded the psych package, using the library function, earlier. That command runs the principal function on the data frame named PCAdata.
It calls for three factors. That's the nfactors argument. And it calls for the varimax factor rotation. There are several approaches to factor rotation, and of the orthogonal rotation methods, varimax is likely the most popular. Finally, the statement calls for each record's factor scores. It does so by setting the scores argument to the uppercase letter T, which is just short for TRUE. Because we have stored the results in an object called PCAModel, we won't see them until we use the print command on that object, like this.
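The steps just described, from previewing the data through printing the model, might look roughly like the following. The course data are not available here, so this sketch assumes a small synthetic data frame (9 variables built from 3 latent factors) in place of the 21-variable file; the argument names nfactors, rotate, and scores are those of the psych package's principal function.

```r
library(psych)

# Synthetic stand-in data (an assumption): 9 variables driven by 3 latent factors.
set.seed(1)
n <- 200
latent <- matrix(rnorm(n * 3), nrow = n)
PCAdata <- as.data.frame(latent[, rep(1:3, each = 3)] +
                           matrix(rnorm(n * 9, sd = 0.5), nrow = n))
names(PCAdata) <- paste0("V", 1:9)

# Quick look at the first few records.
head(PCAdata)

# Three components, varimax rotation, and factor scores for every record.
# The video writes scores = T, which is shorthand for scores = TRUE.
PCAModel <- principal(PCAdata, nfactors = 3, rotate = "varimax", scores = TRUE)

# Nothing is displayed until the stored object is printed.
print(PCAModel)
```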
In response, R shows the factor loadings on the three factors for each of the 21 variables. It also shows a column labeled h2, which is more usually shown as h squared. It's the symbol for the communality of each variable. The communality of a variable is the sum of its squared loadings across the retained factors. The next column, labeled u2, is the other side of the coin of the variable's communality. It shows the uniqueness, or the unique variance, in each of the variables.
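To make the h2 and u2 columns concrete, here is a small sketch using a made-up loadings matrix (the variable names, component labels, and values are hypothetical, not the course's results). It assumes standardized variables, so that each variable's total variance is 1 and uniqueness is 1 minus communality.

```r
# Hypothetical loadings: 3 variables (rows) on 3 retained factors (columns).
loading_matrix <- matrix(c(0.80, 0.10, 0.05,
                           0.15, 0.75, 0.20,
                           0.10, 0.05, 0.70),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("VarA", "VarB", "VarC"),
                                         c("RC1", "RC2", "RC3")))

# Communality (h2): sum of squared loadings across factors, one value per variable.
h2 <- rowSums(loading_matrix^2)

# Uniqueness (u2): the share of each variable's variance not captured by the factors.
u2 <- 1 - h2

cbind(h2, u2)
```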
The final column, labeled com, is Hoffman's index of complexity. I won't get into that in this course. Following the loadings matrix, we get the eigenvalues. R labels the eigenvalues as SS loadings, or the sum of the squared loadings for each factor. Following those three eigenvalues, in this case, 3.04, 2.29, and 2.23, we get the variance explained by each of the retained factors, the cumulative variance explained, and the proportion and cumulative proportions of variance explained by each factor.
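As a rough illustration of how those rows relate to one another, the sketch below starts from the three SS loadings quoted above and the count of 21 variables. It assumes the analysis is based on standardized variables, so total variance equals the number of variables; the RC labels and the object names are illustrative.

```r
# SS loadings as quoted in the transcript; the RC1..RC3 labels are an assumption.
ss_loadings <- c(RC1 = 3.04, RC2 = 2.29, RC3 = 2.23)
n_vars <- 21  # number of variables in the analysis

# Share of the total variance (21 standardized variables) held by each factor.
proportion_var <- ss_loadings / n_vars
cumulative_var <- cumsum(proportion_var)

# Share of the variance captured by the three retained factors combined.
proportion_explained <- ss_loadings / sum(ss_loadings)
cumulative_proportion <- cumsum(proportion_explained)

round(rbind(ss_loadings, proportion_var, cumulative_var,
            proportion_explained, cumulative_proportion), 2)
```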
By the way, the eigenvalues, or SS loadings, that R provides are the rotated eigenvalues, which is a reason that they differ from the unrotated eigenvalues discussed in this chapter's first video. With over 22,000 records, it's not feasible to show the factor scores in the R console. If you want the scores, as frequently you will, it makes more sense to write them to an output file. The following command writes the scores in the PCAModel object to a CSV file called PCAscores.csv.
Notice the dollar sign between the object produced by the principal components analysis, which is PCAModel, and the item within the model that we want to write to this CSV file. And that's the scores item.
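A sketch of that export step follows. Because the course data are not available, it again assumes a small synthetic data frame, and it writes to a temporary folder rather than the working directory so that it is safe to run anywhere; out_path is a name introduced only for this example.

```r
library(psych)

# Synthetic stand-in data (an assumption): 9 variables driven by 3 latent factors.
set.seed(1)
n <- 200
latent <- matrix(rnorm(n * 3), nrow = n)
PCAdata <- as.data.frame(latent[, rep(1:3, each = 3)] +
                           matrix(rnorm(n * 9, sd = 0.5), nrow = n))

PCAModel <- principal(PCAdata, nfactors = 3, rotate = "varimax", scores = TRUE)

# The dollar sign extracts the scores item (one row per record,
# one column per factor) from the fitted model object.
out_path <- file.path(tempdir(), "PCAscores.csv")
write.csv(PCAModel$scores, out_path)
```

Passing just "PCAscores.csv" as the file argument would place the file in R's working directory instead.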
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert