Relying on a binomial outcome variable in a regression model produces residuals that cannot be normally distributed, because each prediction allows only two possible residual values.
- [Instructor] In the previous lesson, we took a look at a linear regression analysis that predicted weight from height and we saw that the continuous outcome variable, being weight, conformed to the normality assumption such that the residuals for weight were normally distributed at each different level of the predictor variable height. Now, here we're predicting a probability of buying a house in columns A and B, given different amounts of annual income.
So in column A, I've just indicated whether the person winds up buying a house or not buying a house and of course that translates to a probability of purchase of 100% or 0% depending on that person's actual behavior. We want to forecast that probability of purchase on the basis of the person's household income so the person in row two is showing $97,000 worth of income and the person in row three is showing $68,000 in annual income and so forth.
Now, given those two pieces of information, we can derive a regression equation that predicts a probability of purchase on the basis of income. In Excel, you would use the TREND function to do that, and the TREND function takes as its arguments the predicted variable here in B2 to B31 and the predictor variable here in C2 to C31, so predicting the probability of purchase on the basis of income.
And the TREND function begins by calculating exactly what the regression equation looks like, the intercept and the coefficient. And it returns, for each record involved in rows two through 31, the result of the regression equation: a predicted probability of purchase. We can subtract that from the actual value of purchase and we then get the residual, so 100% is the actual probability of purchase, and 100% in this case is also the predicted probability of purchase, for a residual of zero.
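As a rough sketch of what TREND is doing here, the following Python fits an ordinary least-squares line to a 0/1 purchase outcome and returns the fitted (predicted) probability and the residual for each record. The income and purchase values are made up for illustration; they are not the figures from the worksheet.

```python
# Minimal sketch of Excel's TREND on a 0/1 outcome: fit a least-squares
# line predicting purchase (1 = bought, 0 = did not) from income, then
# compute each record's predicted probability and residual.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical records: annual income in thousands, and the 0/1 outcome.
income   = [97, 68, 45, 120, 80, 55, 110, 62]
purchase = [1, 0, 0, 1, 1, 0, 1, 0]

a, b = ols_fit(income, purchase)
predicted = [a + b * xi for xi in income]                    # the TREND values
residuals = [yi - pi for yi, pi in zip(purchase, predicted)]  # actual - predicted
```

With an intercept in the model, the residuals sum to zero by construction, just as they would for the worksheet's own regression.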
In other words, we predicted this person's behavior perfectly. On the other hand, for the person in row three, who does not buy the house, the regression equation tells us that he has a 39% probability of purchasing the house with an income of $68,000, and so if we subtract the predicted probability from the actual probability, we wind up with a residual of minus 39%. Now, given those residuals, I have charted them here on a graph that has income, the predictor variable, on its horizontal axis, and the residuals of the outcome on the vertical axis.
And you'll notice that it looks very different from the charts that we looked at when we were predicting weight from height and charting the residual weights against height. What we've got here is two parallel lines. In this case, they're declining diagonally, and this is exactly the kind of pattern that you will always get with a binary variable as the outcome variable. What we were doing in calculating the residual is taking the forecast value, called y hat here, y with a caret over it, and that's the predicted value, and we're subtracting it either from zero, if the person's probability of purchase was actually 0% because he did not buy, or from one, giving one minus the predicted probability, and that would be 100% minus the prediction, for a residual of 62%.
Given that we are subtracting from one and from zero, we are going to wind up with these two diagonal lines. One line represents the people who actually wound up buying the house and have a one, or 100%, as their actual purchase probability. The other represents the people who wound up not buying the house; for them, we're subtracting the predicted probability from zero. Now, we do not have a normal distribution here. For a value of, let's say, approximately $65,000 in annual income, what we've got is just two points.
This point here and this point here. And when a residual can take on only two values, it cannot be normally distributed. So the presence of a binomial outcome variable gets us into trouble with the normality assumption. We have a couple more assumptions to take a look at, and we will do that in the next two lessons. The first one is the linearity assumption, and we'll take a look at that next.
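The arithmetic behind those two parallel lines can be sketched in a few lines of Python. The 39% fitted probability below echoes the row-three example; everything else follows directly from residual = actual minus predicted.

```python
# Why the residual plot shows exactly two diagonal lines: at any income
# level the fitted value y_hat is a single number, but the actual outcome
# is only ever 0 or 1, so the residual can only be -y_hat (non-buyers)
# or 1 - y_hat (buyers).

y_hat = 0.39  # e.g. the fitted purchase probability at $68,000 of income

residual_if_not_bought = 0 - y_hat   # the lower diagonal line: -39%
residual_if_bought     = 1 - y_hat   # the upper diagonal line: +61%

# Only these two residual values are possible at this income level, so the
# residuals cannot be normally distributed around the fitted line.
possible_residuals = {residual_if_not_bought, residual_if_bought}
```

As income changes, y_hat slides along the regression line, so both possible residuals slide with it, tracing the two declining diagonals in the chart.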
Learn how to use R and Excel to analyze data in this course with Conrad Carlberg. He takes you through advanced logistic regression, starting with odds and logarithms and then moving on into binomial distribution and converting predicted odds back to probabilities. After this foundation is established, he shifts the focus to inferential statistics, likelihood ratios, and multinomial regression. Conrad's comprehensive coverage of how to perform logistic regression includes tackling common problems, explaining relationships, reviewing outcomes, and interpreting results.
- Recognizing the problems with ordinary regression on a binary outcome
- Quantifying errors in forecasts
- Managing different slopes
- Forecasting odds instead of probabilities
- Limiting probabilities on the upside and downside
- Working with exponents and bases
- Predicting the logit
- Working with original data and coefficients
- Establishing the Log Likelihood
- Interpreting -2LL or deviance
- Establishing a data frame with XLGetRange
- Using the R functions mlogit and glm
- Understanding long versus wide shapes in data sets