Learn how to apply the logistic regression model.
- [Narrator] Logistic regression is a simple machine learning method that you can use to predict an observation's category based on the relationship between a categorical target variable and the independent predictive features in the data set. For example, imagine you're a marketing data scientist for a major telecom service provider. You've got a customer data set that describes each customer with variables like age, income, average call duration, interaction history with customer support, leftover minutes per month, and customer status.
Customer status is a variable that describes whether a customer is active or has canceled services. Based on the predictive features in this data set, and their relationship with the customer status variable, you could build a logistic regression model that predicts whether a customer is likely to cancel services in the near future. This is called a customer churn model. Logistic regression differs from linear regression in that, with logistic regression, you're predicting categories for a categorical (binary or ordinal) variable, while with linear regression you're predicting values for numeric, continuous variables.
Examples of where logistic regression comes in handy include purchase propensity versus ad spend analysis, customer churn prediction, employee attrition modeling, and hazardous event prediction. Logistic regression makes fewer assumptions than linear regression, but there are some. First is that the data is free of missing values. Second, the predictant (target) variable is binary, in other words it only accepts two values, or it could be ordinal, a categorical variable with ordered values.
Third, all predictors are independent of each other. And fourth, that there are at least 50 observations per predictor variable to ensure reliable results. In the demonstration I'm going to show you how to test your data to see if it meets the assumptions of this model. So, let's get started with the logistic regression demonstration. In this demonstration you're going to need NumPy and Pandas as usual, so we'll import those. We're also going to be using SciPy for the Spearman rank test.
You saw that in chapter three. And we'll be using matplotlib and seaborn. We'll import these. And then we need to use scikit-learn for the logistic regression model. So, we'll say import sklearn, and then there's a number of different tools and packages we need to import from sklearn, so I'll just copy this down a few times. From sklearn.preprocessing we want to import the scale function, and then from the linear_model module we're going to import LogisticRegression.
Then from the cross-validation module (called model_selection in current scikit-learn versions) we need to import train_test_split. We're also going to import some metrics to use for evaluating our model. And we need to make sure that we import the preprocessing tools. So, okay. I've got an extra one here, I'll delete that, run this, and we have the libraries we need. And then I'm going to set the data visualization parameters for our Jupyter Notebook like we've been doing throughout the course.
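The import block described above might look like the following sketch in current library versions. Note that train_test_split has moved from the old sklearn.cross_validation module to sklearn.model_selection:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr   # Spearman rank test, used later

import matplotlib
matplotlib.use('Agg')               # non-interactive backend for scripting
import matplotlib.pyplot as plt
import seaborn as sb

import sklearn
from sklearn.preprocessing import scale          # standardizes features
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation
from sklearn import metrics
from sklearn import preprocessing
```

In a Jupyter Notebook you would omit the Agg backend line and use %matplotlib inline instead.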
In this demonstration we're going to use the mtcars data set, and I'll print out the first few records just to see what it looks like. Now, I want to use the variables drat and carb as predictive features to predict labels for the am variable. Drat describes the rear axle ratio and carb describes the number of carburetors a car has. The am variable describes whether a car has an automatic or a manual transmission. Before using these variables with logistic regression model though, I need to check whether they meet the model's assumptions.
First I'm going to create a subset of these variables and then check the model assumptions. Let's call the subset cars_data. And we'll use the positional indexer (.ix in older pandas; .iloc in current versions) to select the columns with index values five and 11, followed by .values, and let's also create a list with the names for those columns. So, we'll call that cars_data_names and then just name these drat and carb so we can keep track of all this.
Let's also isolate the target variable for our analysis. We'll call it y, and it's just going to be cars indexed at the column with index value nine, that's the am variable. And then we'll say .values to access the values in that column. Then let's run this and start checking our assumptions. The first thing we're going to check is for independence between features. Are our predictor variables ordinal? Remember that an ordinal variable is a numeric variable that can be grouped into only a limited number of subcategories.
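The subsetting step might look like this sketch. Since .ix has been removed from pandas, it selects by column label instead; the four-row frame below is a made-up stand-in for the mtcars data (in the course's CSV, drat, am, and carb sit at positional indices 5, 9, and 11):

```python
import pandas as pd

# Toy stand-in for the mtcars data frame (values are illustrative only)
cars = pd.DataFrame({
    'mpg':  [21.0, 22.8, 21.4, 18.7],
    'drat': [3.90, 3.85, 3.08, 3.15],
    'am':   [1, 1, 0, 0],
    'carb': [4, 1, 1, 2],
})

# Predictive features: label-based selection replaces the old .ix calls
cars_data = cars[['drat', 'carb']].values
cars_data_names = ['drat', 'carb']

# Target variable: am (automatic vs. manual transmission)
y = cars['am'].values
```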
Unlike a continuous variable, it cannot assume an infinite number of values. This is really just a shortcut around some of the more advanced assumptions that the logistic regression model makes. And I want to pick ordinal variables for this analysis. In order to do that, I'm just going to use a scatter plot. So, we'll use seaborn and we'll say sb.regplot, and we'll set x equal to our drat variable, and y equal to our carb variable.
Our data's going to be equal to cars, and then we'll pass in scatter equals true, saying that we want a scatter chart. And then print this out. So, here's a scatter plot of our two variables. And we can see that these are ordinal values. Neither of these variables takes on an infinite number of positions, they only take on set positions. Next, let's see if these features are independent of each other. First, let's isolate the variables into a variable called drat and a variable called carb, and we'll say drat is equal to cars and then just select the drat column.
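The scatter plot step can be sketched as follows; the small frame is again a made-up stand-in for the mtcars variables:

```python
import matplotlib
matplotlib.use('Agg')   # file-based backend; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

# Toy stand-in for the drat and carb columns of mtcars
cars = pd.DataFrame({
    'drat': [3.90, 3.85, 3.08, 3.15, 2.76, 3.21],
    'carb': [4, 1, 1, 2, 1, 4],
})

# Scatter plot to eyeball whether the predictors take on only a limited
# set of positions (i.e. look ordinal rather than truly continuous)
ax = sb.regplot(x='drat', y='carb', data=cars, scatter=True)
plt.savefig('drat_vs_carb.png')
```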
And same with the carb variable, we'll select the carb column in the cars data frame. And then let's apply Spearman rank since these are ordinal variables. We'll call the spearmanr function and pass in our two variables, drat and carb, and let's set the output for this. So, we want the Spearman correlation coefficient and the p-value of the test. Then we're going to print out a label.
And it's going to tell us what our r-value is according to the Spearman rank test. Okay, so, this variable pair is demonstrating almost no correlation. So, that's a good thing, we can use this in the logistic regression model. Next, we need to check the assumption that there are no missing values in the data set. It's a really easy thing to check. We just use the isnull method. We'll say cars.isnull and then we'll call sum off of that.
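The Spearman rank test and the missing-value check might look like this, with the toy values standing in for the real mtcars columns:

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy stand-in for the drat and carb columns of mtcars
cars = pd.DataFrame({
    'drat': [3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69],
    'carb': [4, 1, 1, 2, 1, 4, 2],
})

drat = cars['drat']
carb = cars['carb']

# Spearman rank correlation: appropriate for ordinal variables;
# returns the correlation coefficient and the test's p-value
spearmanr_coefficient, p_value = spearmanr(drat, carb)
print('Spearman Rank Correlation Coefficient %0.3f' % spearmanr_coefficient)

# Count missing values per column (should be all zeros)
print(cars.isnull().sum())
```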
This is going to return a sum of how many missing values there are in each of the columns in the original data frame called cars. And run that, and as you can see, we have zero missing values. So, that's good. Now, we need to check that our target variable is binary or ordinal. And so, let's use the seaborn count plot function to do that, so, we'll say sb.countplot. This function uses bars to show the counts of observations for each category in a variable.
So, we're testing our target variable. So, we're going to say x is equal to am, and then our data is equal to cars, and we'll set a palette of hls. We can see here that our am variable is binary, it only assumes two values, zero or one. So, it satisfies that assumption of the model. Next, we need to check that the size of our data set is sufficient. Remember that you need to have 50 observations for each predictor.
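The count plot check can be sketched like so; the am values below are a made-up sample:

```python
import matplotlib
matplotlib.use('Agg')   # file-based backend; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

# Toy stand-in for the am column of mtcars (0 = automatic, 1 = manual)
cars = pd.DataFrame({'am': [1, 1, 0, 0, 0, 1, 0, 0]})

# Bar chart of observation counts per category: a quick visual check
# that the target takes on only two values (binary)
ax = sb.countplot(x='am', data=cars, palette='hls')
plt.savefig('am_counts.png')
```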
So, in this case, we're using two predictors in our model, so we should have 100 observations. Let's see what we've really got, though. We'll call the info method off of our cars data frame. It returns information about that data frame, and you can see here that the range index runs from zero to 31, so there are only 32 records, and that could be a potential problem because this data set's pretty small. So, you want to keep that in mind when you're considering the results, but for the purpose of this demonstration we're going to continue on.
Now that we've decided that these variables will work with the logistic regression model, let's scale our data. We're going to call the scaled data set x and then we're going to call scale and pass in cars_data. The next thing we need to do is instantiate a logistic regression object. We'll call it LogReg and we'll just set it equal to the logistic regression function. Next, we call the fit method off of the model and pass in our predictor variables as well as our predictant.
This method fits the logistic regression. So, we'll say LogReg.fit and then we pass in x and our target, y. Then let's print out our model's score. To do that, we say print LogReg.score, passing in x and y; for a classifier, score returns the mean accuracy, and we see that we have a score of .81, which isn't too bad. If we had a value of 1.0 that would mean that it was a perfect fit.
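The scale-instantiate-fit-score sequence might look like this sketch; the random predictor matrix and derived target below are made-up stand-ins for the drat/carb features and the am variable:

```python
import numpy as np
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression

# Made-up stand-in for the two mtcars predictors and the binary am target
rng = np.random.default_rng(0)
cars_data = rng.normal(size=(32, 2))
y = (cars_data[:, 0] + 0.5 * cars_data[:, 1] > 0).astype(int)

X = scale(cars_data)           # standardize each column to mean 0, std 1
LogReg = LogisticRegression()  # instantiate with default settings
LogReg.fit(X, y)               # fit the model to predictors and target

# For classifiers, .score returns mean accuracy, not r-squared
print(LogReg.score(X, y))
```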
If we had a value closer to zero, that means that the model doesn't fit at all. I also want to use scikit-learn's metrics, specifically the classification report, to evaluate our model based on precision and recall. To do that, let's generate some predicted values from our model. We'll call those y_pred, and then we'll take our model, LogReg, call the predict method off of that, and pass in our x data set.
We imported the scikit-learn metrics module already, but let's just import the classification report from it. So, we'll say from sklearn.metrics import classification_report, and then we'll print out our results by saying print, passing in the classification_report function, and calling it on our y variable, the target, and our predicted labels for that target, y_pred.
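The prediction and evaluation steps can be sketched as follows, again on made-up stand-in data rather than the real mtcars frame:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Made-up stand-in for the scaled predictors and binary target
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 2))
y = (X[:, 0] > 0).astype(int)

LogReg = LogisticRegression().fit(X, y)

# Generate predicted labels from the fitted model
y_pred = LogReg.predict(X)

# Precision, recall, f1-score, and support per class
report = classification_report(y, y_pred)
print(report)
```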
And we see here that our average precision for the model is .82, and our recall is .81, so we know our model is adequate. And there you have it, you have walked through a basic logistic regression using Python. Next, I'm going to show you how to do Naive Bayes.