Learn about the least squares criterion, fitting linear models, tilde formula notation, intercept, main terms, interaction terms, and grand and group means.
- [Instructor] To demonstrate how we fit a model to data in Python, we go back to the Gapminder data set. You first load packages. (computer keys clicking) Disregard this warning, which is basically meant for package maintainers. I have already prepared code to select Gapminder data from 1985, the year of Live Aid, and to make a scatter plot of child survival rates versus babies per woman, using colors to denote regions and dot sizes to express populations.
Let's try it out. (computer keys clicking) We fit models using the very powerful Python package, Statsmodels. We will only scratch the surface of what Statsmodels can do, and we will default to its OLS method. OLS stands for ordinary least squares. Least squares means that models are fit by minimizing the sum of squared differences between model predictions and observations. Furthermore, ordinary here means that the model coefficients appear linearly in the model formulas.
That is, they multiply explanatory variables or functions of explanatory variables. But, do not worry too much about these technical details. What we will do will be very intuitive. Statsmodels lets us specify models by way of the tilde formula notation, which is used also in the statistical language, R. The formulas go like response variable, tilde, model terms. For instance: babies_per_woman ~ age5_surviving.
(computer keys clicking) Statsmodels usage goes as follows: we define the model and assign the data. I have imported the Statsmodels formula interface under the alias smf, so the model is smf.ols, and the formula for my first example is just going to be: babies_per_woman ~ 1, representing a constant. I also assign the data set, gdata, to the model.
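As a sketch of what this step looks like in code (the data frame below is a made-up stand-in for the 1985 Gapminder subset; only the column and variable names follow the narration, and the values are illustrative only):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for the 1985 Gapminder subset used in the video
gdata = pd.DataFrame({
    'babies_per_woman': [5.2, 1.8, 6.4, 2.1],
    'age5_surviving':   [85.0, 99.0, 80.0, 98.5],
    'region':           ['Africa', 'Europe', 'Africa', 'Europe'],
})

# 'babies_per_woman ~ 1' fits only a constant (the intercept)
model = smf.ols(formula='babies_per_woman ~ 1', data=gdata)
```

Defining the model does not yet compute anything; the coefficients are only estimated once we call its fit method.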
(computer keys clicking) Next, we need to fit the model. I'm calling this fit grand mean; we'll see in a moment why. (computer keys clicking) Finally, we interrogate the results object. (computer keys clicking) The first thing we can do with it is use its predict method, which lets us reproduce the model values. We plot them by reproducing colors, but not sizes. Since we'll do this a lot, let me make a function, (computer keys clicking) which will plot the data first, and then do a scatter plot of one explanatory variable against the model prediction.
Let me also set colors, sizes, and a few aesthetic details. (computer keys clicking) One detail: I want the function to be generic, so it actually plots the prediction from a dummy variable fit. (computer keys clicking) What else have I missed? The property name is not marked, but marker.
(computer keys clicking) As the grand mean name implies, fitting this model is equivalent to taking the mean of our response variable. We can see this by comparing the fit parameters, which are held in the params attribute of the results object, with a simple mean of the data. (computer keys clicking) The constant term is known as the intercept.
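A minimal sketch of that comparison, again with made-up numbers standing in for the real data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in data
gdata = pd.DataFrame({'babies_per_woman': [5.2, 1.8, 6.4, 2.1]})

# Constant-only model: the fit is named grandmean as in the video
grandmean = smf.ols(formula='babies_per_woman ~ 1', data=gdata).fit()

# The single fitted coefficient (the intercept) equals the plain mean
print(grandmean.params['Intercept'])    # the fitted constant
print(gdata.babies_per_woman.mean())    # the grand mean: same value
```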
If we now introduce the region as a model term, we get a model equivalent to taking means by group. To add a model term, we include it on the right-hand side of the formula with a plus. Let me grab code from above. (computer keys clicking) Add the region, fit directly, and assign the results to a new name. (computer keys clicking) We look at the parameters.
What we get here is a common constant term, and then offsets for all groups minus one of them. To treat all groups in the same way, we remove the constant by adding a minus one to the formula. (computer keys clicking) And again, we can compare with the grouped means. (computer keys clicking) Now we try something that we cannot get out of simple means.
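A sketch of both variants with made-up data; note how the minus one changes the meaning of the coefficients:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in data with a region column
gdata = pd.DataFrame({
    'babies_per_woman': [5.2, 1.8, 6.4, 2.1, 3.0, 2.5],
    'region': ['Africa', 'Europe', 'Africa', 'Europe', 'Asia', 'Asia'],
})

# Implicit intercept: one constant plus offsets for all regions but one
groupmean = smf.ols(formula='babies_per_woman ~ region', data=gdata).fit()

# With -1 the constant is removed, so every region gets its own coefficient
groupmean_direct = smf.ols(formula='babies_per_woman ~ -1 + region',
                           data=gdata).fit()

# These coefficients equal the per-region means
print(groupmean_direct.params)
print(gdata.groupby('region').babies_per_woman.mean())
```

Both models make identical predictions; only the parameterization differs.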
We add a quantitative variable, child survival, as a main term in the model. We'll call this fit surviving. (computer keys clicking) At least visually, the fit is improving, and the fit parameters are interesting. (computer keys clicking) The interesting one in particular is age5_surviving, which is a slope, or derivative.
It tells us that for every additional percentage point of child survival to age five, the number of babies per woman decreases by 0.14. The constant group terms are large because they theoretically represent the number of children at a survival rate of 0%. If we wish to have a different slope for every region, we can throw in an interaction term, as opposed to a main term; an interaction term involves two explanatory variables.
We'll call this surviving by region. The interaction term is written with a colon. (computer keys clicking) Now we see that we have four different slopes. (computer keys clicking) India and China, represented by the large circles, seem to be outliers with respect to the fit. Perhaps we can account for them by including another main term, this time proportional to population.
Let's change the name of the fit again and add the main term. (computer keys clicking) Now we see that the diamonds no longer lie along straight lines, because of the effect of the population explanatory variable. Well done, we have learned quite a bit. In the next video, we'll see how to tell how good any given model actually is.
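A sketch of this final model with a population main term added (the variable names follow the narration; the data is, again, made up):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in data, now including a population column
gdata = pd.DataFrame({
    'babies_per_woman': [5.2, 6.4, 5.9, 1.8, 2.1, 2.4],
    'age5_surviving':   [85.0, 80.0, 83.0, 99.0, 98.5, 97.0],
    'population':       [3.0e7, 1.2e8, 9.0e8, 8.0e6, 6.0e7, 1.3e9],
    'region':           ['Africa', 'Africa', 'Africa',
                         'Europe', 'Europe', 'Europe'],
})

# Per-region intercepts and slopes, plus a shared population main term
surviving_byregion_population = smf.ols(
    formula='babies_per_woman ~ -1 + region'
            ' + age5_surviving:region + population',
    data=gdata).fit()

print(surviving_byregion_population.params)
```

Because population enters as a main term, the predictions for a region are no longer a single straight line in child survival; they shift up or down with each country's population.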