Learn how to apply principal component analysis.
- [Instructor] Singular Value Decomposition is a linear algebra method that you use to decompose a matrix into three resultant matrices. You do this in order to reduce information redundancy and noise. SVD is most commonly used for principal component analysis, and that's the machine learning method we're going to discuss in this section. But first, if you've taken linear algebra, let me give you a brief refresher on how SVD works. You can see here we've got our original matrix. This is our original data set, it's called A, and we decompose it into three resultant matrices: U, S, and V.
U is the left orthogonal matrix, and it holds all of the important, non-redundant information about the observations in the original data set. V is the right orthogonal matrix, and it holds all of the important, non-redundant information about the features in the original data set. S is the diagonal matrix, and it holds the singular values, which tell you how much of the original data set's information each component carries. Let's look at principal component analysis.
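The decomposition just described can be sketched in a few lines of NumPy. The matrix A here is a small, made-up example, not data from the course:

```python
import numpy as np

# A hypothetical 4x3 data matrix (4 observations, 3 features)
A = np.array([[4., 2., 0.],
              [1., 5., 6.],
              [0., 3., 3.],
              [7., 1., 2.]])

# U: left orthogonal matrix (observations), S: singular values,
# Vt: right orthogonal matrix, transposed (features)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three resultant matrices back together recovers A
A_reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(A, A_reconstructed))  # True
```

Note that NumPy returns S as a 1-D array of singular values rather than a full diagonal matrix, which is why `np.diag(S)` appears in the reconstruction.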
Like I said, this is the most common application of SVD. Principal component analysis is an unsupervised machine learning algorithm that discovers relationships between variables and reduces variables down to a set of uncorrelated synthetic representations called principal components. For an example, imagine you work for a major grocery store chain and you've got some data that was generated by customers making purchases with their rewards card. The data set describes customers and the products they purchase.
You need to identify what key factors, perhaps age or income, most affect the customers' purchasing behavior. You can use PCA to decompose your customer purchasing data into one vector that describes the factors that influence the customers' purchasing behavior and another vector that describes the probabilities that products will be purchased based on those key influencing factors. Principal components are synthetic representations of a data set. Principal components contain all of the data set's important information but do not include the noise, information redundancy and outliers that were present in the original data set.
As far as what you can do with PCA, you can use PCA for fraud detection, spam detection, image recognition, speech recognition, or also for outlier removal, if you are using PCA for data pre-processing. You may be wondering how you can use factors and components. Well, both factors and components represent what is left of the data set after information redundancy and noise have been stripped out. You use them as input variables for machine learning algorithms to generate predictions from these compressed representations of your data.
Let me show you how PCA works in action. The first thing we need to do is import our library, so we're going to import numpy as np, and we'll import pandas as pd. We've been doing this all throughout the course. I'm going to copy and paste in the data visualization libraries, 'cause it's a lot to type out but just make sure you import matplotlib, pylab and seaborn. And then lastly I'm going to show you, you want to make sure to import scikit learn, so that's import sklearn.
Then you'll also want to import the decomposition module from sklearn, so you're going to write from sklearn import decomposition, and then from sklearn.decomposition, you want to import PCA. We're also going to be using a scikit-learn data set, so we'll say from sklearn import datasets.
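Putting the imports described above into one block, it looks roughly like this:

```python
import numpy as np
import pandas as pd

# Data visualization libraries used throughout the course
import matplotlib.pyplot as plt
import pylab
import seaborn as sb

# scikit-learn, its decomposition module, and its bundled data sets
import sklearn
from sklearn import decomposition
from sklearn.decomposition import PCA
from sklearn import datasets
```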
And then we just execute that. Now I'm going to copy and paste in the data visualization parameters that we've been using throughout chapter two, so that our data visualizations print out nice. Let's get into the PCA part of this demo. First we'll load our data set, and this is actually the exact data set we used in the vector analysis section of the course. To load the iris data set, we just call the datasets.load_iris function, and we'll call our data set iris.
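Loading the iris data set from scikit-learn, as described above, is a one-liner:

```python
from sklearn import datasets

# The iris data set: 150 flower observations, 4 numeric features each
iris = datasets.load_iris()
print(iris.data.shape)  # (150, 4)
```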
And we want to create a variable that represents the data, the numerical variables in the iris data set. We'll call it X. And we'll want to isolate the feature names, so we can use them for column headers in our data frame later in the analysis, so we'll do that here, and then let's just print it out. Okay, so this is our data set; you get an idea of what it looks like on the inside. The next thing we need to do is to instantiate a PCA object, call the fit method in order to find the principal components, and then apply the dimensionality reduction on X, our data set, by calling the transform method.
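Those two variables look like this (the name variable_names matches how we'll use it for column headers later):

```python
from sklearn import datasets

iris = datasets.load_iris()

# X holds the numeric data; variable_names holds the feature names
X = iris.data
variable_names = iris.feature_names

print(variable_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```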
We do this by writing pca = decomposition.PCA() so that pca is going to be our PCA object, and then we're going to say pca.fit_transform to fit and transform. And we're going to call the output iris_pca. Lastly, let's print the explained variance ratio attribute, in order to see how much variance is explained by the components that were found.
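The fit-and-transform step just described, sketched end to end:

```python
from sklearn import datasets, decomposition

iris = datasets.load_iris()
X = iris.data

# Instantiate the PCA object, then fit and transform in one step
pca = decomposition.PCA()
iris_pca = pca.fit_transform(X)

# One explained-variance ratio per component (four here, one per feature)
print(pca.explained_variance_ratio_)
```

Note that fit_transform is a convenience method that combines fit and transform into a single call.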
I'm going to take you over to another screen and explain this variance ratio to you a little more here in just a second. So here is our explained variance. We also want to calculate the cumulative variance. We do that by accessing the explained variance ratio attribute of our pca object. We'll write pca.explained_variance_ratio_ and then take the sum value.
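Summing the explained variance ratios gives the cumulative variance, as described:

```python
from sklearn import datasets, decomposition

iris = datasets.load_iris()

pca = decomposition.PCA()
pca.fit_transform(iris.data)

# The ratios across all components sum to the cumulative variance
cumulative_variance = pca.explained_variance_ratio_.sum()
print(cumulative_variance)  # approximately 1.0
```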
Great, we got a one. Let me show you what that means on another screen over here. First you need to understand what explained variance ratio is. This ratio tells you how much information is compressed into the first few components. You use the explained variance ratio to calculate a cumulative variance. Then, with this cumulative variance, you can figure out how many components to keep. You just need to make sure that you keep at least 70% of the data set's original information. So turning now back to the results from our analysis, when we sum up the variance that is explained by all of the components, it adds up to one.
This is our cumulative variance. This means that 100% of the data set's information is captured in the four components that were returned. That's great, but we don't want 100% of the information back, remember? Some of that information is tied up with noise, information redundancy, or it represents outliers. Our goal with PCA is to remove all that junk from the data, and keep only the fundamental or principal components that matter. Look at the explained variance ratio.
We see that the first component explains 92.4% of the data set's variation. That means it holds 92.4% of the data's information in one principal component. Pretty cool, right? And by taking the first two components, we only lose 2.3% of the data set's information. That's the junk we want to get rid of anyway, so let's do that. We will take only the first two components and feel satisfied knowing that they contain 97.7% of the iris data set's original information.
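One way to keep just the first two components, not shown on screen but supported by scikit-learn, is to pass n_components when instantiating PCA:

```python
from sklearn import datasets, decomposition

iris = datasets.load_iris()

# Re-run PCA, keeping only the first two principal components
pca = decomposition.PCA(n_components=2)
iris_pca = pca.fit_transform(iris.data)

print(iris_pca.shape)                       # (150, 2)
print(pca.explained_variance_ratio_.sum())  # roughly 0.977
```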
Now let's make a data frame so we can look at our principal components. We call the data frame constructor on the components attribute of our pca object here. The components attribute represents the components with maximum variance. Let's also pop in an argument columns equal to variable names, so we can get a good understanding of what we're looking at in our output data frame. We'll do all of this by first calling the data frame constructor, and then passing in pca.components_ and then also columns=variable_names.
We'll call this whole thing comps, these are our components, and print it out. Let's also look at a heat map, a correlation heat map here, to see how the data set's variables correlate with the principal components. So we'll use seaborn's heatmap function, sb.heatmap, pass in the comps object. Let that print out, and there we have it. Now let me take you back over to the other screen and we will discuss these results.
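Putting the comps data frame and the heat map together, the code looks roughly like this (in a Jupyter notebook the heat map renders inline):

```python
import pandas as pd
import seaborn as sb
from sklearn import datasets, decomposition

iris = datasets.load_iris()
variable_names = iris.feature_names

pca = decomposition.PCA()
pca.fit_transform(iris.data)

# Each row is a principal component; each column is an original variable
comps = pd.DataFrame(pca.components_, columns=variable_names)
print(comps)

# Heat map of how the original variables load onto each component
sb.heatmap(comps)
```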
So the explained variance that we looked at earlier told us that the first two principal components contained over 97.7% of the data set's total information. Based on that information, we decided to keep only those two components. The results from this correlation heat map show that, one, principal component one is strongly positively correlated with petal length and moderately positively correlated with sepal length and petal width. Component one is slightly negatively correlated with sepal width, and two, principal component number two is strongly negatively correlated with sepal length and sepal width, and slightly negatively correlated with petal length and petal width.
You may be wondering how you can use these components once you've isolated them. Well, you can use them as input variables for machine learning algorithms. So in the case of the iris data set, you could use the two components we've generated as input for a logistic regression model to predict species labels for new incoming data points. This will become more clear after you've learned how to do logistic regression. You'll see that in chapter eight, so hold on.
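To preview the idea, here is a hedged sketch of feeding the two components into scikit-learn's LogisticRegression (covered properly later in the course); the max_iter value is just a convergence safeguard:

```python
from sklearn import datasets, decomposition
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

# Compress the four features down to two principal components
X_pca = decomposition.PCA(n_components=2).fit_transform(iris.data)

# Fit a classifier on the compressed representation
model = LogisticRegression(max_iter=200)
model.fit(X_pca, iris.target)

# Training accuracy on the compressed data
print(model.score(X_pca, iris.target))
```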