Learn about counts and proportions; tables; stratified tables; unstacking DataFrames; dropping columns; and Simpson's paradox.
- [Instructor] Moving on to categorical variables. How do we describe variation in those? Using tables, of course. So, leaving gapminder aside for a moment, we will use the Whickham data set, as discussed by David Kaplan in his excellent textbook, Statistical Modeling. We import our packages and then the data set.
The table records interviews with women in Whickham, England, in 1973 who were asked if they were smokers. The interviews were followed up 20 years later, when it was recorded whether each woman was still alive. The categorical variables in this case, smoker and outcome, are both binary: yes or no. We can tally up the explanatory variable, smoker, and the response variable, outcome, separately.
We use the method value_counts, and enclosing the results in a DataFrame creates a prettier output. Doing so doesn't tell us much, other than that both pairs of groups are fairly well represented in the table: smokers and non-smokers, women who survived for 20 years and those who didn't. If we want to see the values as fractions of the total number of records, we add normalize=True.
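The two tallies can be sketched as follows. This is a minimal example using a hypothetical toy stand-in for the Whickham data (the real data set has over a thousand records); the column names smoker and outcome follow the narration.

```python
import pandas as pd

# Hypothetical stand-in for the Whickham data: two binary categorical columns.
whickham = pd.DataFrame({
    "smoker":  ["Yes", "No", "No", "Yes", "No", "No"],
    "outcome": ["Alive", "Alive", "Dead", "Dead", "Alive", "Alive"],
})

# Tally each variable separately; wrapping the result in a DataFrame
# produces a prettier tabular display.
print(pd.DataFrame(whickham["smoker"].value_counts()))

# normalize=True reports the counts as fractions of the total.
print(pd.DataFrame(whickham["outcome"].value_counts(normalize=True)))
```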
This is useful because we know that the fractions sum to one, so later we can drop the Dead column. These fractions are also known in statistics as proportions. We're looking for an association; that is, we wish to support or refute a claim that two groups are different. There will be some randomness in the results due to the small number of cases, but we'll worry about that in the next chapter. For the moment, we break down the proportion of outcomes by smoker group.
We can do this with groupby. The resulting pandas Series is more complicated than those we've seen so far, since its index has two levels, so we move one of the index levels to columns using unstack. Very well, with this prettier table we can contemplate the results. They are somewhat surprising.
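The groupby-then-unstack step can be sketched like this, again on a hypothetical miniature of the data rather than the full Whickham records:

```python
import pandas as pd

# Hypothetical stand-in for the Whickham data.
whickham = pd.DataFrame({
    "smoker":  ["Yes", "Yes", "No", "No", "No", "Yes"],
    "outcome": ["Alive", "Dead", "Alive", "Alive", "Dead", "Alive"],
})

# Proportion of each outcome within each smoker group: a Series whose
# index has two levels, smoker and outcome.
props = whickham.groupby("smoker")["outcome"].value_counts(normalize=True)

# Move the inner index level (outcome) to columns for a tidier table.
table = props.unstack()
print(table)
```

Each row of the resulting table sums to one, since Alive and Dead are the only outcomes within each smoker group.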
It seems that smoking actually improves the outcomes. The problem is that we're not controlling for other variables, such as age. For instance, if the smokers are younger on average at the beginning of the study, then it stands to reason that more of them would be alive after 20 years. To cast light on this puzzling behavior, we use the simple method of stratification. We divide cases into age groups using pandas' cut, which generates categorical levels based on a set of bins, and we'll make an entirely new column for that.
We'll create bins between 0 and 30, 30 and 40, 40 and 53, and 53 and 64, and choose appropriate labels. Let's see what happened. We have some NaN (not-a-number) values for women older than 64.
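The binning step can be sketched as below; the bin edges and labels are assumptions based on the narration, and the ages are made up for illustration:

```python
import pandas as pd

ages = pd.Series([21, 35, 45, 60, 70])  # hypothetical ages

# Cut the ages into categorical levels; any age outside the bins
# (here, 70) becomes NaN.
age_group = pd.cut(ages, bins=[0, 30, 40, 53, 64],
                   labels=["0-30", "30-40", "40-53", "53-64"])
print(age_group)
```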
That's okay; we'll just exclude them from this consideration. So let's stratify the proportions: group by age group and then smoker status, and use value_counts to get proportions. Then we make a nicer display by unstacking the series and dropping the Dead column, so along axis=1.
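Putting the stratification together might look like this. The data here are a hypothetical miniature with the age_group column already added; since Alive and Dead proportions sum to one within each cell, the Dead column is redundant and can be dropped:

```python
import pandas as pd

# Hypothetical miniature of the Whickham data after binning ages.
whickham = pd.DataFrame({
    "age_group": ["0-30", "0-30", "0-30", "53-64", "53-64", "53-64"],
    "smoker":    ["Yes",  "No",   "No",   "Yes",   "No",    "No"],
    "outcome":   ["Alive", "Alive", "Alive", "Dead", "Alive", "Dead"],
})

# Proportion of each outcome within every (age_group, smoker) cell.
props = (whickham
         .groupby(["age_group", "smoker"])["outcome"]
         .value_counts(normalize=True))

# Unstack the outcome level to columns and drop the redundant
# Dead column along axis=1.
table = props.unstack().drop("Dead", axis=1)
print(table)
```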
What we see is that within each age group, non-smokers have the better outcomes. This data set represents an example of Simpson's paradox, a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when the groups are combined.
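The reversal can be reproduced with toy numbers (not the real Whickham counts) engineered so that smokers are concentrated in the younger, higher-survival stratum:

```python
import pandas as pd

# Toy counts: most smokers are young (high survival), most
# non-smokers are old (low survival).
rows = []
rows += [("young", "Yes", "Alive")] * 80 + [("young", "Yes", "Dead")] * 20
rows += [("young", "No",  "Alive")] * 9  + [("young", "No",  "Dead")] * 1
rows += [("old",   "Yes", "Alive")] * 2  + [("old",   "Yes", "Dead")] * 8
rows += [("old",   "No",  "Alive")] * 30 + [("old",   "No",  "Dead")] * 70
df = pd.DataFrame(rows, columns=["age_group", "smoker", "outcome"])

alive = df["outcome"].eq("Alive")

# Aggregated over all ages, smokers appear to survive more often...
overall = alive.groupby(df["smoker"]).mean()
print(overall)

# ...but within each age stratum the comparison reverses.
by_age = alive.groupby([df["age_group"], df["smoker"]]).mean().unstack()
print(by_age)
```

Here the aggregated survival rate is higher for smokers only because the age groups are unbalanced; stratifying by age reverses the trend in every stratum.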