Learn about parametric methods.
- [Instructor] Let's talk about parametric correlation analysis. Parametric correlation analysis is a method you can use to find correlation between linearly related, continuous numeric variables. Don't worry if you don't exactly understand what that means; I'm going to show you how to figure that out in a minute. First, I want to explain one important point about correlation: correlation does not imply causation. Let me explain. Imagine you're an epidemiologist studying regional obesity trends.
You have two data sets: one on grocery store size reported by zip code, and one on national obesity prevalence, also broken down by zip code. In the course of your investigation you apply the Pearson correlation method, that's the method I'm about to show you, and you find that there is a very strong positive correlation between grocery store size and obesity. The bigger the grocery stores, the more obesity there tends to be. Of course, the size of the store doesn't cause the obesity, but they are correlated.
And that correlation is quantifiable through the Pearson method. Like I said, in this demonstration I'm going to show you how to calculate the Pearson correlation coefficient. That's represented by the variable R. An R value close to one or negative one indicates a strong linear relationship between the two variables, but an R value close to zero indicates that the variables are not linearly correlated. It's really important that you understand the Pearson correlation assumptions.
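To make those R values concrete, here's a minimal sketch using made-up NumPy arrays (not the course's mtcars data): a perfectly linear increasing relationship gives R of one, and a perfectly linear decreasing one gives R of negative one.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.arange(10, dtype=float)      # 0, 1, ..., 9

# Perfectly linear, increasing relationship -> R = 1.0
r_pos, _ = pearsonr(x, 2 * x + 5)

# Perfectly linear, decreasing relationship -> R = -1.0
r_neg, _ = pearsonr(x, -3 * x + 1)

print(round(r_pos, 3))   # 1.0
print(round(r_neg, 3))   # -1.0
```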
These assumptions are: your data is normally distributed, you have continuous numeric variables, and your variables are linearly related. In the demonstration I'm going to show you how to check your variables to see if they qualify for the Pearson correlation analysis. One thing about the Pearson correlation analysis is that its assumptions are kind of strict, so what a lot of people do is use the Pearson correlation to find linear correlation between variables, but if they get an R value that's close to zero, they don't rule out the possibility of other relationships existing between the variables, like non-linear ones.
Now it's time to look at how you can calculate the Pearson correlation coefficient using pandas and SciPy. The first thing we need to do is import our libraries, so we'll import pandas and NumPy, like we've been doing throughout this course. Then we're going to import our standard data visualization libraries that we covered in chapter two. I also want to mention that in a portion of this demo we use the SciPy library, so we need to import the Pearson correlation method from that.
So we'll say import scipy, and then from the scipy.stats module we'll import pearsonr; that's the name of the function we're going to be using. Next, let's set our data visualization settings like we have been throughout chapter two, and again, we're going to use the mtcars data set. So we'll load that, and the first thing I want to show you is how a scatter plot matrix really comes in handy if you want to review whether the method's assumptions are met.
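The import and loading steps look roughly like this. The course loads the full mtcars data set from a CSV (something like `cars = pd.read_csv('mtcars.csv')`, path assumed); so the sketch runs on its own, it builds a stand-in DataFrame from the first five rows of the standard mtcars data instead.

```python
import numpy as np
import pandas as pd
import scipy
from scipy.stats import pearsonr

# Standard data visualization libraries from chapter two
import matplotlib.pyplot as plt
import seaborn as sb

# The course loads the full mtcars CSV; as a runnable stand-in,
# here are the first five rows of the standard mtcars data set
cars = pd.DataFrame({
    'mpg':  [21.0, 21.0, 22.8, 21.4, 18.7],
    'hp':   [110, 110, 93, 110, 175],
    'qsec': [16.46, 17.02, 18.61, 19.44, 17.02],
    'wt':   [2.620, 2.875, 2.320, 3.215, 3.440],
})
print(cars.head())
```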
As you saw in chapter two, you can easily generate a scatter plot matrix using Seaborn's pairplot function. So let's call pairplot on our cars dataframe. As you can see when we generate a scatter plot matrix of all 11 numerical variables in the cars data set, it takes up a lot of space. I went ahead and selected some variables for our analysis and I'll generate a scatter plot matrix of those in order to show you what about them is desirable for the Pearson correlation. Then I'm going to take you into another screen to explain, but real quick, before I do that, let's just make this second scatter plot.
And, in this, we're going to include the mpg variable, horsepower, qsec, and weight, and then let's call the pairplot function on this dataframe. So I need to replace the period with a comma and remove this period. So here is our smaller pairplot, and now let me take you over to the other screen to explain what all this means.
So let's consider the model assumptions for the Pearson correlation analysis. Pearson correlation assumes that your data is normally distributed, that your variables are linearly related, and that the variables are continuous, numeric variables. Let's look here at the normally distributed requirement. A normally distributed variable is going to give a shape like a bell curve in a histogram. I wouldn't say that all of these variables are exactly normally distributed, but they could possibly be close enough to generate some sort of correlation using the Pearson correlation method.
So, I'm going to go with these. Now let's look at the requirement for a linear relationship. Do these variables have a linear relationship between them? In other words, does one variable increase or decrease at a steady rate as the other changes? Based on the shape of the distribution of points between the variables, it looks like most of these have a distribution that could be at least close to linear. So, I'm going to test them out with the Pearson correlation method. The last requirement is that the variables be continuous, numeric variables.
The best way for me to show you why I think these are all continuous, numeric variables is to show you what a variable looks like when it's not one. If you look at the scatter plot on the right, these variables are not continuous, numeric variables. These are categorical variables, because they can only assume a fixed number of values, like we just discussed in the last segment. So this variable can assume one of two values, zero or one.
That makes it a binomial variable. And the gear variable it can assume three values, three, four or five. That makes it a multinomial variable. These are not continuous, numeric variables. When you see continuous, numeric variables the scatter plot of the variables is much more random and evenly distributed. The end conclusion here is that the variables that are shown on the right would not qualify well for the Pearson correlation analysis.
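The video identifies these variables by eye; a quick programmatic hint, assuming the mtcars column names `am` and `gear`, is to count each column's distinct values with pandas' `nunique`. A column with only a handful of distinct values is likely categorical rather than continuous. The sample rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows with the two columns discussed, am and gear
cars = pd.DataFrame({
    'am':   [1, 1, 1, 0, 0, 0],          # binary: 0 or 1
    'gear': [4, 4, 4, 3, 3, 5],          # multinomial: 3, 4, or 5
    'mpg':  [21.0, 21.0, 22.8, 21.4, 18.7, 18.1],  # continuous numeric
})

# Few distinct values is a hint that a variable is categorical
for col in cars.columns:
    print(col, cars[col].nunique())
```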
Two last things I want to say about this page. First, you see these points in the histograms that kind of stick out; those could possibly be outliers. Second, we didn't actually do any quantitative test to determine whether or not these variables met the assumptions of the Pearson model. We just kind of eyeballed it, and so, like I said before, when we get the results from our analysis, we don't want to disqualify any variable pairs as not being correlated based on a low R score.
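As an aside, since the video only eyeballs the assumptions: if you did want a quantitative check of the normality assumption, one option (not used in the video) is the Shapiro-Wilk test from SciPy. The data here is randomly generated for illustration.

```python
import numpy as np
from scipy.stats import shapiro

# Made-up, roughly normal sample standing in for a real variable
rng = np.random.default_rng(0)
sample = rng.normal(loc=20, scale=5, size=100)

# Shapiro-Wilk test: a p-value above ~0.05 means we fail to
# reject the hypothesis that the data is normally distributed
stat, p_value = shapiro(sample)
print(round(stat, 3), round(p_value, 3))
```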
But what we will do is, if we find an R value that is close to one or negative one, we will assume that those variables are linearly correlated. Now, going back to our coding demonstration, I want to show you first how to use SciPy to calculate the Pearson correlation coefficient. The first thing we're going to do is isolate our variables. So we'll take the mpg variable. I'm just going to copy and paste these in. Let's isolate the mpg variable, the hp variable, qsec and weight.
Now to calculate the Pearson R correlation coefficient you just call the pearsonr function on the variable pair. We'll do that for each of the variable pairs. Let's call the pearsonr function and we'll pass in mpg, and hp. Now we're going to be calculating the Pearson correlation for the mpg hp variable pair, and we want to get the pearsonr_coefficient and the p_value.
This is the p value of the test. And the last thing we're going to do is print out a label for our Pearson correlation. I've set it to print out with three decimal places, but, of course, you could change that to however many decimal places you want. Okay, so we see our R value is -.776. It's a negative value, which means that the variables are negatively correlated. Let's calculate the Pearson R for the rest of the variable pairs and then we'll discuss the results.
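The pearsonr call just described looks roughly like this. Note the video's R of -.776 comes from the full 32-row mtcars data set; the five-row inline sample used here so the snippet runs on its own will give a different number.

```python
import pandas as pd
from scipy.stats import pearsonr

# Inline sample of mtcars as a stand-in for the full CSV
cars = pd.DataFrame({
    'mpg': [21.0, 21.0, 22.8, 21.4, 18.7],
    'hp':  [110, 110, 93, 110, 175],
})

# Isolate the variable pair
mpg = cars['mpg']
hp = cars['hp']

# pearsonr returns the R coefficient and the p value of the test
pearsonr_coefficient, p_value = pearsonr(mpg, hp)
print('PearsonR Correlation Coefficient %0.3f' % (pearsonr_coefficient))
```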
So we call our pearsonr function here, and we pass in the mpg and qsec variable this time so we're looking for the Pearson R of this variable pair and we'll print out that R value. Okay, and then again for the mpg weight variable, pass in mpg and weight. This is the mpg weight variable pair, and then print out the R value.
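The remaining pairs can be computed the same way; here's a compact sketch that loops over them, again on the inline five-row sample rather than the full mtcars data.

```python
import pandas as pd
from scipy.stats import pearsonr

cars = pd.DataFrame({
    'mpg':  [21.0, 21.0, 22.8, 21.4, 18.7],
    'hp':   [110, 110, 93, 110, 175],
    'qsec': [16.46, 17.02, 18.61, 19.44, 17.02],
    'wt':   [2.620, 2.875, 2.320, 3.215, 3.440],
})

# Pearson R for each mpg variable pair
results = {}
for col in ['hp', 'qsec', 'wt']:
    r, p = pearsonr(cars['mpg'], cars[col])
    results[col] = r
    print('PearsonR for mpg vs %s: %0.3f' % (col, r))
```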
Now, what do these mean? Well, based on the Pearson correlation coefficients of these three variable pairs, the mpg weight variable pair appears to have the strongest linear correlation. Mpg and qsec have a moderate degree of linear correlation. You may be wondering, "Well, what do you do with this information once you have it?" When you're doing machine learning or other advanced statistical methods, those models often assume either that the features are independent of one another or that they exhibit a certain degree of correlation.
And you're going to see that later on in this course. So you can use the Pearson R correlation coefficient to establish whether or not your variable pairs meet the requirements of more advanced models. Now that you've seen the long form way of calculating the Pearson R value, let me show you some shortcuts. If you call the .corr method off of a dataframe it will automatically return a Pearson R value for each variable pair in that dataframe. Let's take our x dataframe and call the .corr method off of it.
Call this corr and then print it out. As you can see from the correlation matrix that was produced the corr method produces the same results as the pearsonr function I just showed you. Let's look here and just double check that. Let's look at the weight mpg variable pair. We have a value of - .86, and here we go, - .86 for mpg and weight. That was a lot easier, huh? Another quick way to see the degree of linear correlation between variables is to generate a Seaborn heat map from them.
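The `.corr` shortcut can be sketched like this; pandas uses Pearson by default, so the matrix matches the pairwise pearsonr results. The inline sample stands in for the full mtcars data, so the actual numbers will differ from the video's.

```python
import pandas as pd

cars = pd.DataFrame({
    'mpg':  [21.0, 21.0, 22.8, 21.4, 18.7],
    'hp':   [110, 110, 93, 110, 175],
    'qsec': [16.46, 17.02, 18.61, 19.44, 17.02],
    'wt':   [2.620, 2.875, 2.320, 3.215, 3.440],
})
X = cars[['mpg', 'hp', 'qsec', 'wt']]

# Pearson R for every variable pair at once ('pearson' is the default)
corr = X.corr()
print(corr)
```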
This will give you a quick at a glance understanding of a variable's correlation. As you can recall from chapter two we create a heat map by calling the heatmap function on the object that we would like to visualize, so we'll say sb.heatmap and then pass in the corr object and we're going to add x tick mark labels that represent the column values and y tick mark labels to also represent the column values.
I'm just going to paste that in cause it's kind of long, and print that out. Well that's nice but what does it mean? Well, the darker shades of red indicate a strong degree of positive correlation as you can see from the legend. Based on what we see, the hp weight variable pair has the highest degree of positive linear correlation. Judging by the darker hues of blue in the grid the mpg weight variable pair appears to have the strongest degree of negative linear correlation.
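The heatmap call just described looks roughly like this. The exact colors depend on your colormap settings; the inline sample again stands in for the full mtcars data.

```python
import pandas as pd
import seaborn as sb

cars = pd.DataFrame({
    'mpg':  [21.0, 21.0, 22.8, 21.4, 18.7],
    'hp':   [110, 110, 93, 110, 175],
    'qsec': [16.46, 17.02, 18.61, 19.44, 17.02],
    'wt':   [2.620, 2.875, 2.320, 3.215, 3.440],
})
corr = cars.corr()

# Heat map of the correlation matrix, with column names as tick labels
ax = sb.heatmap(corr,
                xticklabels=corr.columns.values,
                yticklabels=corr.columns.values)
```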
You'll, of course, see here that when mpg is plotted against itself, it has a correlation of exactly one. It correlates 100% with itself. That's why these are solid red colors. And then these light shades here indicate the weight qsec variable pair is not linearly correlated. Keep in mind, that doesn't mean there's no correlation between them whatsoever. In the next video, however, I'm going to show you some methods you can use to establish correlation between non-linearly related variables.