Join Lillian Pierson, P.E. for an in-depth discussion in this video, Multiple linear regression, part of Python for Data Science Essential Training Part 2.
- [Instructor] All right, so let's do some multiple linear regression. Your Jupyter Notebook comes preloaded with the libraries you'll need: NumPy, Pandas, and Matplotlib, as well as scikit-learn. I've also already gone ahead and set the plotting parameters for Matplotlib. The only thing I wanted to add here is that we're going to use Seaborn, so we'll import seaborn as sb, and we also need to set the style for Seaborn: we'll say sb.set_style and ask for a white grid. Lastly, we need to import our Counter from collections, so we'll say from collections import Counter, okay? And I'll run this.

Now, your Jupyter Notebook comes preloaded with the data; however, you will need to change this file path and point it to your own CSV file for the enrollment forecast. Let's create a DataFrame for our enrollment data. We'll call it enroll and use the pd.read_csv function, passing in our address. Then we need to name the columns for the enrollment data, so we'll say enroll.columns and set that equal to a list of column names: year, roll, unem, hgrad and inc. Now let's take a quick look at this data with enroll.head. Just so you know, this data is taken from New Mexico and it starts at year 1961, so year one is actually 1961. Roll indicates the enrollment numbers, unem is the local unemployment in that year, hgrad is the graduation rate, and inc is the local income in the region during that year.

Now, according to the assumptions of the linear model, our variables all need to be continuous numeric variables. We also need to make sure that there is a linear relationship between the predictors and the predictant. So let's check for correlation. To do that, we'll use Seaborn: we'll say sb.pairplot, pass in our enroll data, and run it. Okay, cool, so here's our scatterplot matrix. Let's look at the relationship between some of our variables here. We'll start with unemployment and enrollment. So here's unemployment and here's enrollment. You could say there may be a linear relationship of some sort here, but it could be a lot stronger. We're just going to call this good enough and see how unemployment does as a predictor anyway. Now let's look at the hgrad and enrollment pair. Here's hgrad and here's enrollment, and actually, that looks pretty good: there definitely is a linear relationship. We also need to make sure that these variables are continuous numeric variables, and just looking at the distributions, you can tell that they are.

We also need to check correlation between the predictors, and to do that, we're going to draw on what I taught in the first part of this course about Pearson correlation. We'll just call the corr method off of our enroll DataFrame and print it out. Okay, run that. We just want to make sure that our predictors are not completely dependent on one another; that would definitely not be good for a linear regression. So let's look at the hgrad and unemployment correlation. We have unemployment on this line and hgrad here. Okay, wow, so the hgrad variable and unemployment are definitely not showing linear correlation. That's good news. So let's just try those two out and see how they do as predictors with a linear regression model.
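Here's a minimal sketch of the setup and exploration steps described above. The file name enrollment_forecast.csv is a placeholder for your own copy of the data, and selecting columns with bracket indexing is just one way to do what the video describes:

```python
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt

from collections import Counter  # imported in the video, though not used in this snippet

sb.set_style('whitegrid')

# Load the enrollment data; swap in the path to your own CSV file.
enroll = pd.read_csv('enrollment_forecast.csv')
enroll.columns = ['year', 'roll', 'unem', 'hgrad', 'inc']
print(enroll.head())

# Scatterplot matrix to eyeball linear relationships between the variables.
sb.pairplot(enroll)
plt.show()

# Pearson correlation matrix, to confirm the predictors aren't strongly correlated with each other.
print(enroll.corr())
```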
The first thing we need to do is just create a subset, so we'll call that enroll_data and set it equal to a selection from our enroll DataFrame, but we only want the unem and hgrad variables. And of these, we only want the values, so we'll call .values off of that. Then let's also set a target here: we'll call that enroll_target and set it equal to our enroll DataFrame, but we'll select only the roll variable here and say .values. Then let's set some names for our data. We'll create a variable called enroll_data_names and set that equal to unem and hgrad, okay? And of course, before using our variables as predictors in a linear model, we should scale them, so let's go ahead and do that now. Our predictors are going to be the X data and our target is going to be the y data, and we want to call the scale function on our enroll_data and on our enroll_target. Okay, and run this.

The next thing we need to do, per the model assumptions, is to check for missing values. Let's just use some filtering to do that; we'll filter them out and see if anything comes up. We'll create a variable here called missing_values and set it equal to where X is equal to not a number, so we'll say X == np.nan; that's "not a number" from the NumPy library. Then let's filter these out: we'll take our X dataset and print out where we have missing values, so we'll print X where missing_values comes up as equal to True. We'll run this and we get back an empty array, which means we have no missing values. That's perfect.

The next thing we need to do is instantiate a linear regression object. We'll call it LinReg, and we'll say LinReg is equal to LinearRegression with normalize equal to True. This tells the LinearRegression model to normalize our variables before regression. Then let's just fit this model to our data. To do that, we'll call the fit method off of it: we'll say LinReg.fit and pass in our data, X and y. Then let's print out a score for how well our model performs, so we'll call the print function, access our LinearRegression model, ask for a score, pass in our variables, X and y, and run this. And we're getting back an 84%; actually, it's going to be 85%.

Now, the score that's printed out here is the R-squared of the prediction. It's a measure of how well the regression line that was predicted by the model actually matches the real values for college enrollment. Basically, it's telling us how well the model performs in predicting college enrollment. A maximum good score would be .99 and a minimum score would be .01, something like that; if you see a value of one or zero, you know something is wrong with your model. So this model has an R-squared value of .84, which isn't too bad. Of course, this could be better, but this is a fast demonstration of linear regression. So next, we're going to move into logistic regression.
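Putting those modeling steps together, here's a minimal sketch, again assuming the placeholder enrollment_forecast.csv from the setup above. Note that the normalize=True argument mentioned in the video was removed from scikit-learn's LinearRegression in version 1.2, so this sketch relies on the explicit scaling instead:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import scale

# Reload the enrollment data (placeholder file name, as above).
enroll = pd.read_csv('enrollment_forecast.csv')
enroll.columns = ['year', 'roll', 'unem', 'hgrad', 'inc']

# Predictors (unemployment and high school graduates) and target (enrollment).
enroll_data = enroll[['unem', 'hgrad']].values
enroll_target = enroll[['roll']].values
enroll_data_names = ['unem', 'hgrad']  # kept for reference, as in the video

# Scale the predictors and the target before fitting.
X, y = scale(enroll_data), scale(enroll_target)

# Check for missing values the way the video describes; note that NaN never compares
# equal to anything, so np.isnan(X) is the more reliable check in practice.
missing_values = X == np.nan
print(X[missing_values == True])  # empty array means no missing values were flagged

# Instantiate and fit the model, then print its R-squared score.
# The video passes normalize=True here; that argument is gone in scikit-learn >= 1.2,
# and the scale() call above serves the same purpose.
LinReg = LinearRegression()
LinReg.fit(X, y)
print(LinReg.score(X, y))
```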
- Why use Python for data science
- Machine learning 101
- Linear regression
- Logistic regression
- Clustering models: K-means and hierarchical models
- Dimension reduction methods
- Association rules
- Ensemble methods
- Introduction to neural networks
- Decision tree models