Learn how to apply factor analysis.
- [Instructor] Moving on to factor analysis. Factor analysis is a regression method you apply to discover root causes, or hidden factors, that are present in the data set but not directly observable. For example, imagine you're a marketing data scientist and you must identify actionable customer segments for use in strategic marketing planning. You've got response data from a customer survey, and you can apply factor analysis as a simple way to group respondents into meaningful customer segments based on similarities in how they answered a specific subset of survey questions.
Factor analysis is a method you use to regress on features in order to discover factors that you can then use as variables to represent the original data set. These factors are essentially synthetic representations of your data set with the extra dimensionality and information redundancy stripped out. Factors are also called latent variables. Latent variables are variables that are meaningful but that are inferred, not directly observed. There are several assumptions you should know about with factor analysis.
The factor analysis model assumes that your features are metric, that they are either continuous or ordinal, that you have a correlation coefficient r greater than 0.3, that you have more than 100 observations, and more than five observations per feature. It also assumes that your sample is homogeneous. For those of you who are new to data analysis and data science, don't worry too much about all these details. You just want to have them in the back of your mind as you continue learning.
And then over time you'll start learning what these mean and how to apply them. Let's talk about factor analysis. You access the factor analysis model from within scikit-learn. Scikit-learn is the machine learning library in Python, and it comes with the Anaconda install. Again, if you want to see how to install Anaconda, check out the "What you should know" video at the start of this course. We're going to use factor analysis to uncover latent variables from scikit-learn's built-in data set called Iris.
Remember, a latent variable is just a hidden variable that impacts how data is behaving. And as for built-in data sets, those are basically just toy data sets that you can use to practice machine learning. Instead of going through every step and checking to see that all of the model assumptions are met, I'm just going to tell you: they are. We're fine doing a factor analysis on the Iris data set. More about the Iris data set: it contains four numeric variables that describe three different species of iris flowers.
The four numeric attributes are sepal length, sepal width, petal length, and petal width. In this demonstration, you're going to see how to fit the factor analysis model to the Iris data set in order to reduce the data set's dimensionality by uncovering the combination of features that contains the most information, in other words, the most variance in the data set. These will be our factors, or latent variables. So let's get started. The first thing we need to do, as usual, is import the libraries we'll be using.
So, in this demonstration we're going to be using NumPy and Pandas, just like we have been, so I'll just copy and paste those in. And then we need to import scikit-learn. We can do that by typing import sklearn. You also need to import the factor analysis model from the decomposition module of sklearn. You do that by writing from sklearn.decomposition import FactorAnalysis.
Also, for this demonstration we need to load the Iris data set. We do that by writing from sklearn import datasets. This imports all of the built-in data sets for sklearn; later we'll load the Iris data set from those. So let's execute this; now we have the libraries we need. Let me show you how to do factor analysis. Let's start off by loading the Iris data set for analysis. To load a built-in data set from scikit-learn, call the load_iris function and save the result in a variable; we'll call this variable iris.
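Putting the imports together, the setup described above might look like this as a minimal sketch (the NumPy and Pandas imports are carried over from the earlier videos in the course):

```python
# Standard analysis libraries used throughout the course
import numpy as np
import pandas as pd

# scikit-learn, plus the factor analysis model and the built-in data sets
import sklearn
from sklearn.decomposition import FactorAnalysis
from sklearn import datasets
```

With these executed, everything needed for the rest of the demonstration is loaded.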
So we say iris = datasets.load_iris(). This is its own function, so don't forget the parentheses. Now let's create an X variable and set it equal to iris.data. These are the four numerical variables that make up the Iris data set. We also want to pull out the variable names; we'll use these for putting column headers on our data frame table later in the demonstration, and they're also included in the built-in data set.
So in order to access them we just write iris.feature_names, and these will be the column headers. Now let's print out this X data set. We'll print out the first ten records and all columns. What you see here is a small preview of the Iris data we're working with. The next thing we need to do is instantiate a factor analysis object and find the latent variables by calling the fit method on our data set. We do this by writing factor, that's going to be our object, equals FactorAnalysis().fit(X).
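The loading, preview, and fitting steps just described can be sketched as follows (this assumes scikit-learn's default FactorAnalysis settings, which keep as many components as there are features):

```python
from sklearn import datasets
from sklearn.decomposition import FactorAnalysis

# Load the built-in Iris data set
iris = datasets.load_iris()
X = iris.data                        # 150 observations x 4 numeric features
variable_names = iris.feature_names  # column headers for the data frame later

# Preview the first ten records, all columns
print(X[0:10, :])

# Instantiate a factor analysis object and find the latent variables
factor = FactorAnalysis().fit(X)
```

The fitted object now holds the discovered factors, which we'll inspect next.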
We also need to make a data frame so that we can look at the latent variables, or factors, that were found. To do this, we call the DataFrame constructor on the components_ attribute of our factor object. The components_ attribute holds the components, or in other words the factors, with maximum variance. Let's also pass in the argument columns = variable_names so that we can get a good understanding of what we're looking at in our output data frame. Here we go, I'll just type this up really quick.
This is our DataFrame constructor. We access factor.components_, that's the components_ attribute. Then we'll put columns = variable_names. All right, here we've got a data frame back. It's got the column headings, but what does it mean?
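As a self-contained sketch, the data frame step above might look like this (each row of components_ is one factor; each column shows how strongly that original feature loads on it):

```python
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import FactorAnalysis

# Re-fit the model as in the previous step
iris = datasets.load_iris()
factor = FactorAnalysis().fit(iris.data)

# Wrap the factor loadings in a labeled data frame for inspection
loadings = pd.DataFrame(factor.components_, columns=iris.feature_names)
print(loadings)
```

Reading across a row, larger-magnitude values indicate which original features contribute most to that factor.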