Join Monika Wahi for an in-depth discussion in this video, Plots for checking assumptions in linear regression, part of Healthcare Analytics: Regression in R.
- [Instructor] Welcome to chapter two where we begin our linear regression analysis by making plots to check the assumptions behind linear regression. This movie will first review the assumptions that have to be met by your data in order for you to do linear regression. Next, we'll run two plots that can help us decide if our data meet the assumptions. Let's review the assumptions behind linear regression. Sorry to give you flashbacks, but welcome back to statistics 101. The first is that there will be a normal distribution of the dependent variable or outcome, which is sleep duration in our case, and the independent variable or exposure which is alcohol consumption.
You will see that now that we are doing regression, I use the terms exposure and independent variable interchangeably, and I use the terms outcome and dependent variable interchangeably, to match how we talk about regression. So how do we check the assumption of normality? With a normal probability plot. The normal probability plot will have a diagonal line on it and there will be a bunch of circles for our data. If those circles are on the line, our data do not violate the assumption.
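As a minimal sketch of a normal probability plot in base R: the course's analytic data set is not shown in this transcript, so the values below are simulated, and the variable name `sleep_hours` is a hypothetical stand-in for the sleep duration outcome.

```r
# Simulated stand-in data (an assumption; not the course's analytic data set).
set.seed(42)
sleep_hours <- rnorm(200, mean = 7, sd = 1.2)  # hypothetical outcome variable

# Normal probability (Q-Q) plot: each circle is one observation.
qq <- qqnorm(sleep_hours, main = "Normal probability plot: sleep duration")

# Reference line through the first and third quartiles; circles that stay
# close to it suggest the data do not violate the normality assumption.
qqline(sleep_hours)
```

`qqnorm()` and `qqline()` are base R functions, so nothing needs to be installed to try this.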
Then we have the second assumption, which is statistical independence. This means the dependent variable can't be calculated directly from the independent variable. Imagine, for example, that our dependent variable was body mass index. We definitely could not use height or weight as the independent variable because then we'd violate statistical independence. That is because BMI is calculated using height and weight, but we're not doing that here, so that's an easy one to evaluate. Next, there must be a linear and additive relationship between the dependent and independent variables.
There also has to be homoscedasticity. I'll explain both of these things. First, we are making a linear equation to predict sleep duration from alcohol drinking, but this equation will not be perfect. So let's say you do not drink. Then your estimate of sleep duration will be the Y intercept from this equation, and we know it's unlikely to be exactly accurate. It's an estimated Y, or a Y hat, but we know the real Y from the data. So if we take Y minus Y hat, we get the residual or, as I say, the residue left between the predicted value and the real observed value.
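The Y minus Y hat arithmetic can be sketched in R as follows. The data are simulated and the names `drinks_per_week` and `sleep_hours` are hypothetical, since the course data set and its variable names are not shown in this transcript.

```r
# Simulated stand-in data (an assumption; not the course's analytic data set).
set.seed(42)
n <- 200
drinks_per_week <- rpois(n, lambda = 4)                         # hypothetical exposure
sleep_hours <- 7.5 - 0.1 * drinks_per_week + rnorm(n, sd = 1)  # hypothetical outcome

# Fit the linear equation that predicts sleep duration from drinking.
model <- lm(sleep_hours ~ drinks_per_week)

# Y hat: the predicted sleep duration for each person in the data.
y_hat <- fitted(model)

# Residual = observed Y minus predicted Y hat.
resid_manual <- sleep_hours - y_hat

# The hand-computed residuals should match what R stores in the model.
same_residuals <- isTRUE(all.equal(unname(resid_manual), unname(residuals(model))))

# For a non-drinker (exposure = 0), the prediction is the Y intercept.
intercept <- coef(model)[["(Intercept)"]]
pred_zero <- unname(predict(model, newdata = data.frame(drinks_per_week = 0)))
```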
The deal is it's not a problem to have residuals, although you'd like them to be not that big. It's just that the residuals should be equally big or equally small across all values of the dependent variable. That would meet the last two assumptions. So the residuals for sleeping only six hours a night should be just as big as the ones for people who sleep 10 hours per night. How you can show that is by plotting the residuals versus predicted values. So the take home message is that to evaluate our linear regression assumptions, we are going to do two diagnostic plots, a normal probability plot and a plot of residuals versus predicted values.
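A rough sketch of a residuals versus predicted values plot, again on simulated data with hypothetical variable names (the course data set is not shown here). The comparison of spread at the end is only an informal illustration, not a formal test of homoscedasticity.

```r
# Simulated stand-in data (an assumption; not the course's analytic data set).
set.seed(42)
n <- 200
drinks_per_week <- rpois(n, lambda = 4)                         # hypothetical exposure
sleep_hours <- 7.5 - 0.1 * drinks_per_week + rnorm(n, sd = 1)  # hypothetical outcome
model <- lm(sleep_hours ~ drinks_per_week)

# Residuals on the vertical axis, predicted (fitted) values on the horizontal.
plot(fitted(model), residuals(model),
     xlab = "Predicted sleep duration (hours)",
     ylab = "Residual (hours)",
     main = "Residuals versus predicted values")
abline(h = 0, lty = 2)  # dashed reference line at a residual of zero

# Informal check: is the residual spread similar for lower and higher
# predicted values? Roughly equal standard deviations are what we hope for.
upper_half <- fitted(model) > median(fitted(model))
spread <- tapply(residuals(model), upper_half, sd)
print(spread)
```

An even band of points around the dashed line, with no funnel shape or curve, is the pattern that would be consistent with the last two assumptions.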
I made this code called 245_Diagnostic Plots just for that purpose. If you just completed the previous course and are still in the same session in R, you do not have to read in our analytic data set. However, if you are starting new for the day, you have to run this read code. I'm starting new for the day so let me run that read code now. Highlight, Control + R. Now to make the diagnostic plots, we'll first do this layout command which makes the plots come up four to a page.
You'll see we use the matrix command and we indicate we are going to have one, two, three, four plots. Then we put this 2,2 to mean that they will be printed out two rows by two columns. Next, we'll plot the regression object and we'll get the four plots out that we can look at, but we really only need two of them, the two I talked about. I also put in the main argument so the plot will have a title. Let's highlight all this code and run it with Control + R and there it is.
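The on-screen code is not part of this transcript, so the following is a reconstruction from the narration rather than the instructor's actual script. The simulated data, the object name `model`, and the title text are all assumptions.

```r
# Simulated stand-in data (an assumption; not the course's analytic data set).
set.seed(42)
n <- 200
drinks_per_week <- rpois(n, lambda = 4)                         # hypothetical exposure
sleep_hours <- 7.5 - 0.1 * drinks_per_week + rnorm(n, sd = 1)  # hypothetical outcome
model <- lm(sleep_hours ~ drinks_per_week)                      # the regression object

# Four plots per page: a 2-row by 2-column grid numbered 1 to 4.
# matrix() fills down the columns by default, so plot 1 is upper left,
# plot 2 is lower left, plot 3 is upper right, and plot 4 is lower right.
plot_order <- matrix(c(1, 2, 3, 4), 2, 2)
layout(plot_order)

# Plotting an lm object draws four diagnostic plots by default:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
plot(model, main = "Diagnostic plots")  # title text is hypothetical

layout(1)  # return to one plot per page
```

Only the first two of the four plots, Residuals vs Fitted and Normal Q-Q, correspond to the two diagnostics described in the narration.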
Okay, remember the plots we are looking for? The first one was the normal probability plot. The other plot we wanted to see was the residuals versus fitted plot, which is in the upper left. Okay, we got our plots out, and it's a good idea to save them. I'll help you interpret them in the next section, but to save them, we can click on this matrix of four plots and then choose Save As JPEG 100%.
We can put this in our other files folder and we can call it dxplots for diagnostic plots. So in this movie, we started by reviewing the assumptions behind linear regression and how to check for these assumptions in our data. Mainly, it came down to running a normal probability plot and also a plot of residuals versus predicted values. In the next section, we'll go over these plots to see if our data meet the assumptions behind linear regression.
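The movie saves the plots through the menu. As an alternative sketch, the same kind of file can be written from code with R's `jpeg()` graphics device. The data are simulated, the image size is arbitrary, and `tempdir()` stands in for the other files folder, whose path is not given in the transcript.

```r
# Simulated stand-in data (an assumption; not the course's analytic data set).
set.seed(42)
n <- 200
drinks_per_week <- rpois(n, lambda = 4)                         # hypothetical exposure
sleep_hours <- 7.5 - 0.1 * drinks_per_week + rnorm(n, sd = 1)  # hypothetical outcome
model <- lm(sleep_hours ~ drinks_per_week)

# Hypothetical output location; swap tempdir() for your own folder.
out_file <- file.path(tempdir(), "dxplots.jpg")

# Only attempt this if the R build supports the JPEG device.
if (capabilities("jpeg")) {
  jpeg(out_file, width = 800, height = 800, quality = 100)  # open the file device
  layout(matrix(c(1, 2, 3, 4), 2, 2))                       # four plots per page
  plot(model, main = "Diagnostic plots")                    # draw into the file
  dev.off()                                                 # close and write the file
}
```

The `quality = 100` setting is meant to mirror the "JPEG 100%" menu choice; whether the two produce identical files is not something this sketch verifies.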