Challenges and assumptions of multiple regression

- [Instructor] Okay, now we've arrived at a terribly important topic. It can come off as a bit technical, but we really wanna do this thoroughly, so we're about to talk about multiple linear regression. Quite simply, that's when you have one dependent variable but multiple independent variables. Now, there's more than one kind of regression, but overwhelmingly the most common is called ordinary least squares regression. Maybe you haven't even heard that phrase, but when someone says "I built a regression model," they almost certainly mean ordinary least squares. Again, this is the most common type, and it's the type of regression we're gonna discuss in this course.

Before we walk through the assumptions, I wanna let you know that it's possible to really go down the rabbit hole with these assumptions. So I'm gonna give you a high-level overview, but if you wanna dig deeper, there are two resources I'm gonna recommend. One is Multiple Regression: A Primer by Paul Allison, and if you really wanna go into it, a textbook that is very frequently used is Applied Multiple Regression by Jacob Cohen, Patricia Cohen, and two additional authors, Aiken and West. That second book is big and very formal and academic in style, but it really gets into the details of the entire process, including the assumptions here.

Okay, let's begin by talking about what's usually referred to as the specification errors of regression. Basically, it's the notion that regression takes on a very particular form. The predicted value of your dependent variable, which is gonna be the Y in this formula, is built up from three things: beta zero, which is the Y intercept, sometimes called the constant; then a whole series of pairs, a beta coefficient and its variable, and since we're doing multiple regression, you're also gonna have a beta two and an X two, a beta three and an X three, and so on, depending on how many variables you have; and finally an error term. So again, this always takes on a particular form: a constant, a series of beta-X pairs, and an error term.

So where can you go wrong? Well, first, you've gotta make sure to put all the relevant variables in the model. Why? It probably sounds obvious, but it's kinda subtle: the error is supposed to be random. If you don't put all the relevant variables in your model, there's going to be systematic error, and you want the error to be just random error. So do your best to find all the variables that can help predict that dependent. Then you wanna make sure you don't have any variables in the model that aren't relevant, because those just create noise. This is really about getting a good signal-to-noise ratio, if that metaphor is helpful to you. Finally, the relationships have to be linear. We're certainly gonna practice looking for linear relationships, and for departures from linearity, when we look at our data visually. When we encounter non-linear relationships in regression, there actually are techniques to address those, which we'll be discussing in the course.
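For reference, here is the model form just described, written out as an equation (the k here simply stands for however many independent variables you have):

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

Beta zero is the constant (the Y intercept), each beta is the coefficient on its X variable, and epsilon is the error term, which is supposed to be purely random.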
Okay, let's pull the lens back, so to speak, and talk about these assumptions more broadly. This is gonna be a kind of to-do list to use when you're exploring your data. First, when you run your regression model, your residuals should have a mean of zero, and that's gonna be visible in the SPSS output. Frankly, this one isn't gonna be that common of a problem.

Normality of errors. A lot of folks, when they learn about multiple regression, think that what they've learned is that the independent variables have to be normally distributed. Technically that isn't true; we have to have normality of the errors. However, until you've built the model, you haven't generated an error term yet, so in practice we do check the normality of the independent variables, and that's something we're gonna be practicing.

Residuals are not autocorrelated. This is another one of the rules. It too is something you may not encounter, but there is a formal test called the Durbin-Watson test that can be run in SPSS, and we're gonna see it. Normally autocorrelation is associated with time series data, like stock prices or economic data and so on. So it can sometimes be a problem, and it can be checked for with the Durbin-Watson test.

Again, you need linear relationships, which we're gonna be checking for when we run our scatter plots.

And finally, and when you reflect on this one it might seem strange, but people do make this mistake sometimes: you need more observations than you have variables. Let's say you were doing case studies of very big companies. You might be looking only at Fortune 500 chemical companies, and maybe you only identify about a dozen. If you have 20 variables that describe those 12 chemical companies, you now suddenly have more variables than you have data. Again, the whole notion might seem strange to you if your data sets are large, but people actually do make this mistake sometimes.

A lot of times one more assumption will be listed: you cannot have multicollinearity. But frankly, in the real world you probably will have some. The problem is whether or not the multicollinearity becomes severe. So what is multicollinearity? We'll have a whole discussion of it; it's a big topic and terribly important to your understanding of regression.

Let's pull back a little bit further. What problems might you encounter when you start to do multiple regression that you didn't face when you were only doing simple linear regression with a single independent variable? Well, one challenge is that visual examination becomes more difficult. You can look at each independent variable against the dependent one at a time, that's not a problem, but you can't really look at all of your variables in a single scatter plot. So it's very difficult to see how the variables are bouncing off each other, as it were.

Simple linear regression never produces multicollinearity, because multicollinearity is what happens when your independent variables are correlated with each other, so this becomes a new problem that we have to discuss in this environment. When you have a single independent variable, you don't have to worry about your independent variables interacting with each other, but now that you're doing multiple regression with more than one independent variable, you have a whole new set of problems to worry about.

One of the most important challenges, one that we're gonna invest a lot of time and thought into, is how to attribute importance to each of your independent variables. Now that you have a collection of them, you might be trying to prove the importance of one, for instance trying to prove the importance of an experimental effect as opposed to a placebo effect. This too is a big new set of challenges that we'll be discussing at some length.
Finally, and I think this will seem straightforward to you, you're juggling multiple problems at once. So it's not just that you have an outlier on one variable; you have outliers on other variables too, and you're trying to deal with all of this. Sometimes it'll be hard to know how to address one problem without causing another problem to pop up somewhere else in your analysis.
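The checks described above, residuals with a mean of zero, normality of the errors, no autocorrelation (the Durbin-Watson test), keeping an eye on multicollinearity, and visually examining the variables, are all gonna be run in SPSS later in the course. If you also work in Python, here is a minimal sketch of that same checklist using the statsmodels library; the data file and the column names (sales, ad_spend, price, region_size) are hypothetical placeholders, not part of the course.

```python
# A minimal sketch of the assumption checks described above, using Python's
# statsmodels library instead of SPSS. File name and column names are
# hypothetical placeholders -- substitute your own data.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("my_data.csv")                    # hypothetical data set
predictors = ["ad_spend", "price", "region_size"]  # hypothetical independent variables
y = df["sales"]                                    # hypothetical dependent variable
X = sm.add_constant(df[predictors])                # adds the constant (Y intercept) term

model = sm.OLS(y, X).fit()                         # ordinary least squares

# 1. Residuals should have a mean of (approximately) zero.
print("Residual mean:", model.resid.mean())

# 2. Normality of errors: inspect the residual distribution with a Q-Q plot.
sm.qqplot(model.resid, line="45")
plt.show()

# 3. Residuals should not be autocorrelated; a Durbin-Watson statistic near 2 is good.
print("Durbin-Watson:", durbin_watson(model.resid))

# 4. Multicollinearity: variance inflation factors for each predictor
#    (large values mean a predictor is well explained by the other predictors).
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, "VIF:", variance_inflation_factor(X.values, i))

# 5. Visual examination: pairwise scatter plots, since you can't see all of the
#    variables at once in a single two-variable scatter plot.
pd.plotting.scatter_matrix(df[["sales"] + predictors])
plt.show()
```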
