Learn about linear models for categorical variables, logistic transformation, logistic regression, likelihood, and confidence intervals for model parameters.
- [Instructor] How can we perform model fitting when the response variable is categorical? To answer this question, I'll follow the discussion in Daniel Kaplan's "Statistical Modeling" and experiment with the smoking outcomes data set that we used in chapter three. So, I load packages and the data. I have already written a function to plot the data and the fit predictions nicely. And to improve the strength of our conclusions, I have removed cases with age greater than 65, none of whom is alive after 20 years.
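The setup could look like the following sketch. The actual data file and the plotting helper from the video are not shown here, so the small DataFrame below is a synthetic stand-in for the smoking data set, with assumed column names outcome, smoker, and age.

```python
import pandas as pd

# Synthetic stand-in for the course's smoking outcomes data
# (the real file and its loading step are not shown in the video).
smoking = pd.DataFrame({
    "outcome": ["Alive", "Dead", "Alive", "Dead", "Alive", "Alive"],
    "smoker":  ["Yes",   "Yes",  "No",    "No",   "Yes",   "No"],
    "age":     [34,      70,     45,      81,     55,      62],
})

# Keep only cases aged 65 or younger: in the real data, nobody
# older than 65 is still alive after the 20-year follow-up.
smoking = smoking[smoking["age"] <= 65].reset_index(drop=True)
print(len(smoking))  # → 4
```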
If I convert the smoking outcome to a binary number, it becomes possible to do ordinary least squares. We'll try that first, although we'll see that there is a much better way. Here I can use a Python trick: multiplying a Boolean by an integer returns an integer. Let's fit a model that includes smoking status and age as main terms.
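The Boolean-to-integer trick can be sketched like this, assuming the outcome column holds the strings "Alive" and "Dead":

```python
import pandas as pd

outcome = pd.Series(["Alive", "Dead", "Alive"])

# Comparing the Series to "Alive" yields Booleans; multiplying by 1
# converts True/False to the integers 1/0.
alive = (outcome == "Alive") * 1
print(alive.tolist())  # → [1, 0, 1]
```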
So: statsmodels' ordinary least squares, with the formula outcome ~ smoker + age, and the data set to the smoking data frame; we can fit and assign the result to a variable. Let's see what our plot does. The data is plotted as circles, with orange for smokers and light blue for non-smokers. I have added some jitter; that is, I have moved the points randomly up and down so they don't all lie on top of each other.
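A minimal sketch of the OLS fit, using a randomly generated stand-in for the smoking data (the plotting helper from the video is omitted, and the fitted numbers will not match the video's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in: 0/1 outcome (1 = alive), smoking status, and age.
rng = np.random.default_rng(42)
n = 300
age = rng.integers(20, 66, size=n)
smoker = rng.choice(["Yes", "No"], size=n)
p_alive = 0.95 - 0.008 * (age - 20) - 0.05 * (smoker == "Yes")
smoking = pd.DataFrame({
    "outcome": (rng.random(n) < p_alive) * 1,
    "smoker": smoker,
    "age": age,
})

# Ordinary least squares with smoking status and age as main terms.
ols_fit = smf.ols("outcome ~ smoker + age", data=smoking).fit()
print(ols_fit.params)  # Intercept, smoker[T.Yes], age
```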
The fit is represented by the diamonds at the top. We do see there is an association between smoking and negative outcomes: the smoker term is negative. Let's see the ANOVA table. The association is not especially strong, with an F-statistic of eight. We have confirmation of this in the confidence intervals that statsmodels provides for the parameters.
Those rely on specific mathematical assumptions about the data, so they should be taken with a grain of salt. The plot, however, shows a mathematical problem: some predicted outcomes are larger than one. How can we interpret that? What we need is a way to limit the output of the model to the values zero and one, or perhaps even better, to values between zero and one that can be understood as the probability of one of the two outcomes.
This is done by constructing models in the usual way and then applying a non-linear function to the output. One especially useful non-linear function is the logistic transformation, exp(x) / (1 + exp(x)). The process of fitting such a model is called logistic regression. statsmodels implements it as logit.
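The logistic transformation itself is a one-liner; a quick sketch:

```python
import numpy as np

def logistic(x):
    """Logistic transformation: exp(x) / (1 + exp(x))."""
    return np.exp(x) / (1.0 + np.exp(x))

# The output always lies between 0 and 1, with 0.5 at x = 0.
print(logistic(0.0))                    # → 0.5
print(logistic(np.array([-5.0, 5.0])))  # close to 0 and to 1
```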
I write the same formula for the same data, but fit with logit instead of ols. Let's see a plot. We see now that the model is bounded between zero and one, and it displays non-linear behavior, even though all our main terms are linear. The criterion for logistic regression is not minimizing the mean-square error of the residuals, but rather interpreting the model response as a probability function and maximizing the resulting probability of the observed data.
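The logistic fit reuses the OLS formula, swapping ols for logit; a sketch with a synthetic stand-in for the smoking data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the smoking data.
rng = np.random.default_rng(42)
n = 300
age = rng.integers(20, 66, size=n)
smoker = rng.choice(["Yes", "No"], size=n)
p_alive = 0.95 - 0.008 * (age - 20) - 0.05 * (smoker == "Yes")
smoking = pd.DataFrame({
    "outcome": (rng.random(n) < p_alive) * 1,
    "smoker": smoker,
    "age": age,
})

# Same formula, but fit with logit instead of ols.
logit_fit = smf.logit("outcome ~ smoker + age", data=smoking).fit()
print(logit_fit.params)

# Unlike the OLS fit, all predictions lie strictly between 0 and 1.
predictions = logit_fit.predict(smoking)
print(predictions.min(), predictions.max())
```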
This is a form of maximum-likelihood estimation. So instead of the mean-square error, the simplest way to characterize goodness of fit is the value of the likelihood; statsmodels gives us its logarithm. Because of the logistic transformation, the model parameters are not directly comparable with the least-squares parameters. Smoking reduces the probability of being alive, but not uniformly.
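As a sketch (with a synthetic stand-in for the smoking data), we can also check that the log-likelihood statsmodels reports equals the summed log probability the model assigns to each observed outcome:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the smoking data.
rng = np.random.default_rng(42)
n = 300
age = rng.integers(20, 66, size=n)
smoker = rng.choice(["Yes", "No"], size=n)
p_alive = 0.95 - 0.008 * (age - 20) - 0.05 * (smoker == "Yes")
smoking = pd.DataFrame({
    "outcome": (rng.random(n) < p_alive) * 1,
    "smoker": smoker,
    "age": age,
})

logit_fit = smf.logit("outcome ~ smoker + age", data=smoking).fit()

# statsmodels reports the logarithm of the maximized likelihood...
print(logit_fit.llf)

# ...which equals the summed log probability of each observed outcome.
p = logit_fit.predict(smoking)
loglike = np.sum(np.where(smoking["outcome"] == 1, np.log(p), np.log(1 - p)))
print(loglike)
```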
It does so by about 0.1 at the upper end of the ages, and by less for younger subjects. From the confidence intervals we see that the association with smoking remains weak. statsmodels can tell us a lot more about this logistic regression fit, but understanding those numbers requires some mathematical development.
Note that the logistic model probabilities are conditional probabilities: they depend on the values of the explanatory variables, and they refer directly to the cases in the data set, but not necessarily to the general population, unless we can determine that the sample is truly representative of the population. The techniques to extrapolate results from sample to population are beyond the scope of this course, but I encourage you to learn more about them.