Run R on the data set to derive principal components
- [Instructor] In this chapter, we'll put the principal components analysis together with the cluster analysis to see what the product sales data has to tell us about focusing our efforts. We start by extracting the factors. Open the Factor.xlsm workbook, then open the workbook that contains the variables and the observations that you want to analyze. In this case, that workbook is named Product Sales.xlsx. Click the Add-ins tab and then click Principal Components in the menu commands group.
In the dialog box enter the worksheet address of the data range including the first row, starting with cell B1. Fill the Variable Labels in First Row checkbox. Enter the worksheet address of the Record IDs, starting with cell A2 and ending in cell A21:398. Although you have included the variable names in the first row starting with column B, do not include cell A1 as part of the range of Record IDs.
Select the Raw Data option button, and at the top of the dialog box, click the Rotation tab, choose the Varimax rotation method, and enter three as the number of factors to retain, then press OK. When the analysis is completed the worksheet named Rotated Factor Scores will be active. Notice that it contains, in column A, the record id that identifies each original observation and the factor scores for each observation in columns B, C and D.
Enter the header "Record IDs" in cell A1, save the active workbook that contains the rotated factor scores for the three factors. With the Rotated Factor Scores worksheet active, save the workbook again as a CSV file. This will save only the active worksheet into the CSV file. We're going to click OK to keep the active sheet, and click Yes to keep using this format.
Close the workbook. Excel will ask you if you want to save changes. This is an artifact of saving a workbook in CSV format. Click Don't Save. Now that we have factor scores for each observation let's derive two clusters based on those factors. Start R, choose change dir, or directory, from R's File menu, and set the new working directory to the location where you stored the CSV file.
Enter this command into R's console. That command reads the CSV file, which contains record ids and factor scores into an R dataframe named PCData, for principal components data. Then, enter these commands in R's console. That command creates two clusters based on the three extracted and rotated principal components. Now, write the record ids and clusters to a CSV file.
Close R. In Excel, open the file named Product Sales.csv. Open the file named Clusters.csv. The record ids will be in column A, and the cluster assigned to each record will be in column B. Copy the data in columns A and B in the CSV file named Clusters, and paste them into Product Sales.csv.
Ensure that the record ids in column A match the record ids that you just pasted from Clusters.CSV into column E. Adjust the location if the record ids do not match, otherwise, you can delete the second set of record ids. Provide a header for column E and save the file.
In this course, Conrad Carlberg explains how to carry out cluster analysis and principal components analysis using Microsoft Excel, which tends to show more clearly what's going on in the analysis. Then he explains how to carry out the same analysis using R, the open-source statistical computing software, which is faster and richer in analysis options than Excel. Plus, he walks through how to merge the results of cluster analysis and factor analysis to help you break down a few underlying factors according to individuals' membership in just a few clusters.
- Reviewing the problems created by an overabundance of data
- Understanding the rationale for clustering and principal components analysis
- Using Excel to extract principal components
- Using R to extract principal components
- Using R for cluster analysis
- Using Excel for cluster analysis
- Setting up confusion tables in Excel
- Using cluster analysis and factor analysis in concert