Learn about nonparametric methods.
- [Narrator] Let's talk about nonparametric correlation analysis. You can use nonparametric correlation analysis to find correlation between categorical, non-linearly related, or non-normally distributed variables. For an example of where nonparametric correlation analysis could be useful, imagine you're a social scientist who studies smoking habits. You could use a nonparametric correlation analysis method like Spearman's rank to test a population for a correlation between income bracket and cigarette consumption among smokers.
You find that higher income individuals are much more likely to smoke more cigarettes than lower income people. I'm about to show you how to use Spearman's rank correlation and chi-square tables to establish correlation between categorical variables. The Spearman's rank correlation method works on ordinal variables. In case you don't know what that is, an ordinal variable is a numeric variable whose values can be ranked into ordered categories. The Spearman's rank method converts each variable's values into ranks and then calculates an r correlation coefficient from those ranks, which tells you the extent of the correlation between a variable pair.
If this doesn't make too much sense to you now, don't worry at all, because I'm going to show you what this means in the coding demonstration to come. Similar to the Pearson's r method we discussed in the last segment, Spearman's rank comes up with an r-value to indicate the degree to which variables are correlated. If the r-value is close to one or negative one, then you know there's a strong degree of correlation between the variables in the pair. If the r-value is close to zero, however, then you know the variables in the pair are not correlated.
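As a quick illustration of that point, here's a small sketch using SciPy's spearmanr on a made-up pair of variables; the toy data is mine, not from the course data set. Because the relationship is monotonic, even though it's non-linear, the ranks line up perfectly and the r-value lands at one.

```python
from scipy.stats import spearmanr

# Toy data (an assumption for illustration): y = x**2 is non-linear
# but strictly increasing, so the ranks of x and y match perfectly
# and Spearman's r comes out at 1
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r, p = spearmanr(x, y)
print(r)
```

This is also a handy reminder of what separates Spearman's rank from Pearson's r: Pearson would report something less than one here, because the relationship isn't linear.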
And what about the assumptions of the Spearman rank correlation analysis? Spearman's rank assumes that your variables are ordinal. Like we discussed earlier, those are numeric variables whose values can be ranked into ordered categories. The model also assumes that your variables are related non-linearly and that your data is non-normally distributed. Don't worry about these too much now because in the coding demonstration, I'm going to show you how to examine your variables and find out whether they meet these assumptions. Before getting into the coding demonstration though, I want to introduce chi-square tables.
You use chi-square tables to test for independence between variables. The null hypothesis of this test is that the variables are independent of one another. If you get a p-value less than .05, then you reject that null hypothesis and conclude your variables are correlated. If you get a p-value greater than .05, you fail to reject the null hypothesis and treat your variables as independent of one another. As far as the assumptions of the chi-square test, you just want to make sure your variables are categorical or numeric.
If you have numeric variables, then you're going to need to make sure you have binned them. And in case you don't know what binning is, now is a great time to get familiar with that term. As an example, imagine that you had a variable with values between zero and 100. That's a numeric variable. To bin it, you could break that variable up into 10 separate groups of 10, and then within those 10 groups, you would put your data into different categories according to its numeric values, like this.
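That binning step can be sketched with pandas' cut function; the numeric values below are made up to mirror the zero-to-100 scenario just described.

```python
import pandas as pd

# A made-up numeric variable with values between 0 and 100
values = pd.Series([3, 12, 27, 35, 48, 52, 67, 71, 88, 95])

# Break it into 10 equal-width groups of 10: [0, 10), [10, 20), ...
binned = pd.cut(values, bins=range(0, 101, 10), right=False)
print(binned.value_counts().sort_index())
```

Each original value now belongs to one of 10 ordered categories, which is exactly the kind of variable the chi-square test expects.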
Now that you know what binning is, let's move on to the coding demonstration portion of this section. I'm going to show you how to carry out nonparametric methods using Pandas and SciPy. The first thing we need to do is to import our libraries. So, we'll import NumPy and Pandas like we have been and then we're going to import the standard data visualization libraries that you saw throughout chapter two. And then also we need to import SciPy and the spearmanr function from SciPy.
So, to do that, we say import scipy and then from scipy stats module we import spearmanr. Okay, execute that code and we've imported our libraries. Let's also pass in the standard settings that we use for our data visualizations in this course. And for this analysis we're going to use the mtcars data set. So, let's load that in the same way we have been.
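Pulled together, the setup described so far looks roughly like this; the seaborn alias, the plot settings, and the CSV path are assumptions on my part, so adjust them to match your environment.

```python
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sb  # alias is an assumption; sns is also common

import scipy
from scipy.stats import spearmanr

# Standard plot settings for the course; these values are placeholders
plt.rcParams['figure.figsize'] = (8, 4)
sb.set_style('whitegrid')

# Load the mtcars data set; the file path is an assumption,
# so point this at your own copy before uncommenting
# cars = pd.read_csv('mtcars.csv')
```

With the libraries in place and the data frame loaded, everything that follows is just function calls on `cars`.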
Here we go, we've got the first few records. This is what the data set looks like. Now let's make a scatterplot like we did in the last section and see how our variables compare to the assumptions of the Spearman's rank test. In order to generate the pair plot, we just say sb.pairplot and we pass in the name of our data frame, cars. Okay, great. So, we've got our scatterplot matrix, but as you can see, since there are so many variables in this data set, it's pretty hard to visually see what's going on.
I went ahead and selected some variables for our test. Let me make a scatterplot matrix of these so I can show you why I chose them. We'll call this data set x. And for our test we're going to use the cylinder variable, the vs variable, the am variable, and the gear variable. Now, we call the pair plot function on that, and we get a scatterplot matrix we can discuss. All right, so there you have it. Now let me explain to you why I chose these variables.
The first thing I looked at is: are these ordinal variables? Well, all of these variables are numeric, and they each assume only a set number of possible values that can be ranked, so yes, these variables are ordinal. Are these variables related non-linearly? Well, based on this quick glimpse, I don't see any linear relationships between the variables, so yes, hopefully.
Lastly, is the data distribution of each variable non-normal? Judging from the histograms here, I'd say yes. Based on this reasoning, I decided to test the variables cylinders, vs, am, and gear. Now, let's isolate those variables so we can apply the Spearman rank test on them. We'll select the cylinder variable, the vs variable, the am variable, and one more, the gear variable.
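The isolation step looks like this in pandas. Since I can't bundle the full mtcars file here, the frame below holds just the first few mtcars rows as a stand-in, and cyl, vs, am, and gear are the usual mtcars column names; check them against your own copy of the data.

```python
import pandas as pd

# Stand-in for the full mtcars data set: a few rows, keeping only
# the columns this demo needs (use your full data set in practice)
cars = pd.DataFrame({
    'cyl':  [6, 6, 4, 6, 8, 6, 8, 4],
    'vs':   [0, 0, 1, 1, 0, 1, 0, 1],
    'am':   [1, 1, 1, 0, 0, 0, 0, 0],
    'gear': [4, 4, 4, 3, 3, 3, 3, 4],
})

# Isolate the ordinal variables for the Spearman rank test
x = cars[['cyl', 'vs', 'am', 'gear']]
print(x.head())
```

Selecting with a list of column names returns a new data frame containing just those four variables.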
Now, to calculate the Spearman's rank correlation coefficient, you just call the spearmanr function on the variables. We'll do that for each of our variable pairs. So, we'll write spearmanr, that's our function, and we'll pass in the first variable pair we want to test, which is cylinders and vs. We want to pull out the coefficient that is calculated by the Spearman rank test and also the p-value of the test. Then, for the r-value, let's make a label here, Spearman rank correlation coefficient, format it to three decimal places, and print out the r-value.
Okay, so, according to the Spearman's rank correlation coefficient, we get an r-value of .814. Let's do that for the other variable pairs. I'm just going to copy and paste in the function and then we will pass in the variables. So, for the second test, let's pass in the cylinders and am variables and print that out. It's returning an r-value of negative .522. Lastly, let's do a test on cylinders and gears.
Cylinder and gear. Run that test, and we get an r-value of negative .564. So, based on the Spearman's rank correlation coefficients of these three variable pairs, the cylinders and vs pair appears to have the strongest correlation. The other variable pairs do show some correlation, but only a moderate amount. That was pretty easy. Now, let's talk about the chi-square test for independence. For the chi-square test, you need your variables to be in a crosstab table.
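Here's a sketch of the three Spearman tests in one loop. The frame again holds only stand-in mtcars rows, so the coefficients it prints won't match the .814, negative .522, and negative .564 quoted above; run it against the full data set to reproduce those.

```python
import pandas as pd
from scipy.stats import spearmanr

# Stand-in rows of mtcars; use the full data set to reproduce
# the r-values quoted in the narration
cars = pd.DataFrame({
    'cyl':  [6, 6, 4, 6, 8, 6, 8, 4],
    'vs':   [0, 0, 1, 1, 0, 1, 0, 1],
    'am':   [1, 1, 1, 0, 0, 0, 0, 0],
    'gear': [4, 4, 4, 3, 3, 3, 3, 4],
})

# Pair cylinders with each of the other variables and print the
# coefficient to three decimal places, as in the narration
results = {}
for other in ['vs', 'am', 'gear']:
    r, p = spearmanr(cars['cyl'], cars[other])
    results[other] = r
    print(f'cyl vs {other}: Spearman rank correlation coefficient {r:.3f}')
```

spearmanr returns both the coefficient and a p-value, so you can unpack the pair even when you only plan to report r.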
We'll create a crosstab by calling the pd.crosstab function on the variables we're interested in. So, in this case, it'll be cylinders and am. We'll call this whole thing table. Then, to calculate a chi-square value, we just call the chi2_contingency function on the values in our crosstab. But we need to make sure we've imported this function, and since we didn't do it earlier in the demo, let's do that now. From scipy.stats, we import chi2_contingency.
Then let's call that chi2_contingency function and pass in our table. We want it to calculate the chi-square test based on the values in the table, so we'll access those by saying .values. The function returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies, and then we'll print out our chi-square statistic and our p-value for the test. I'm looking mostly here at the p-value, and I'm interested to see that it's .013.
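The crosstab and chi-square steps can be sketched like this. As before, the frame holds stand-in mtcars rows rather than the full data set, so the p-value it prints won't match the .013 from the narration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Stand-in rows of mtcars (use the full data set to reproduce .013)
cars = pd.DataFrame({
    'cyl': [6, 6, 4, 6, 8, 6, 8, 4],
    'am':  [1, 1, 1, 0, 0, 0, 0, 0],
})

# Cross-tabulate the two categorical variables into a contingency table
table = pd.crosstab(cars['cyl'], cars['am'])

# chi2_contingency returns the test statistic, the p-value, the degrees
# of freedom, and the expected frequencies under independence
chi2, p, dof, expected = chi2_contingency(table.values)
print(f'chi-square statistic: {chi2:.3f}, p-value: {p:.3f}')
```

Passing `table.values` hands the function the raw count matrix, which is all the test needs; the expected frequencies it returns are what the counts would look like if the variables really were independent.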
So, let's do this entire process all over again for the cylinders and vs variables. Now, let's discuss the results. Remember that the null hypothesis of the chi-square test is that the variables are independent, and we'd need a p-value greater than .05 to keep treating them that way. Based on what I see here, none of the p-values are greater than .05, so we reject the null hypothesis and conclude the variables are correlated. That's how you use the chi-square test.
Now that you know how to establish correlation between categorical variables, let me show you how to transform data set distributions before moving into the machine learning section of the course.