Learn about bar plots, pie plots, and grouped and stacked bar plots.
- Let's make some plots for the Whickham data set, which, as we have seen, displays Simpson's paradox. We load packages, recover the data set, and reproduce the grouping, or stratification, by age and the two grouped sets of proportions: by smoker, which was puzzling, and by age. We'll start by plotting counts separately for our response and explanatory variables.
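The setup described here can be sketched as follows. This is only an illustration: the rows are made up rather than taken from the Whickham survey, and the column names (`outcome`, `smoker`, `age`), the age bins, and the variable names `bysmoker` and `byage` are assumptions.

```python
import pandas as pd

# Made-up rows standing in for the Whickham data; the column names
# and the age bins below are assumptions, not the real data set.
smoking = pd.DataFrame({
    "outcome": ["Alive", "Dead", "Alive", "Alive", "Dead", "Alive", "Dead", "Alive"],
    "smoker":  ["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "age":     [23, 71, 35, 44, 66, 29, 80, 52],
})

# Stratify by age; plain string labels keep the later groupby simple.
smoking["ageGroup"] = pd.cut(
    smoking.age, [0, 40, 60, 100], labels=["0-40", "40-60", "60-100"]
).astype(str)

# Proportion of each outcome within each smoker group ...
bysmoker = smoking.groupby("smoker").outcome.value_counts(normalize=True)
# ... and within each (age group, smoker) combination.
byage = smoking.groupby(["ageGroup", "smoker"]).outcome.value_counts(normalize=True)

print(bysmoker)
print(byage)
```

Both results are Series with a multi-index, which matters later when we plot them.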
So I'll take value counts for outcome and plot them as a bar plot. I will add a similar plot for smoker, but assign it to a different matplotlib subplot. So this will be two of two, and the outcome plot will go into one of two.
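A minimal sketch of the two side-by-side bar plots, using made-up rows in place of the real data. The `pp` alias and the explicit `ax=` arguments are choices made for this example so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as pp
import pandas as pd

# Made-up rows; only the two categorical columns are needed here.
smoking = pd.DataFrame({
    "outcome": ["Alive", "Dead", "Alive", "Alive", "Dead", "Alive", "Dead", "Alive"],
    "smoker":  ["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
})

fig = pp.figure()

# Subplot one of two: counts of the response variable.
ax1 = pp.subplot(1, 2, 1)
smoking.outcome.value_counts().plot(kind="bar", ax=ax1)

# Subplot two of two: counts of the explanatory variable.
ax2 = pp.subplot(1, 2, 2)
smoking.smoker.value_counts().plot(kind="bar", ax=ax2)
```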
A little squished, so I can make the figure larger to accommodate them better. I can also give them different colors and, while I'm at it, titles. Much better. If you wanted, you could also do horizontal bars: just change the kind of plot.
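The adjustments just described might look like the sketch below. The figure size, the color names, and the titles are arbitrary picks for this example, and the rows are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as pp
import pandas as pd

# Made-up rows standing in for the real data.
smoking = pd.DataFrame({
    "outcome": ["Alive", "Dead", "Alive", "Alive", "Dead", "Alive", "Dead", "Alive"],
    "smoker":  ["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
})

# A wider figure leaves room for both subplots.
fig = pp.figure(figsize=(10, 4))

ax1 = pp.subplot(1, 2, 1)
# kind="barh" gives horizontal bars; kind="bar" gives vertical ones.
smoking.outcome.value_counts().plot(kind="barh", color=["C0", "C1"], ax=ax1)
ax1.set_title("outcome")

ax2 = pp.subplot(1, 2, 2)
smoking.smoker.value_counts().plot(kind="barh", color=["C2", "C3"], ax=ax2)
ax2.set_title("smoker")
```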
Or we could do pie charts: the kind would be pie, but color must now be colors. In this case, the interface is inconsistent. Now we'll break up the visualization so that we show outcome by smoker status. The simplest thing we can try is just to call plot on bysmoker.
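For the pie version, a sketch under the same assumptions (made-up rows, arbitrary colors). Note the keyword `colors` rather than `color`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as pp
import pandas as pd

# Made-up rows standing in for the real data.
smoking = pd.DataFrame({
    "outcome": ["Alive", "Dead", "Alive", "Alive", "Dead", "Alive", "Dead", "Alive"],
    "smoker":  ["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
})

fig = pp.figure(figsize=(10, 4))

ax1 = pp.subplot(1, 2, 1)
# Pie plots take the plural keyword "colors", unlike bar plots.
smoking.outcome.value_counts().plot(kind="pie", colors=["C0", "C1"], ax=ax1)
ax1.set_title("outcome")

ax2 = pp.subplot(1, 2, 2)
smoking.smoker.value_counts().plot(kind="pie", colors=["C2", "C3"], ax=ax2)
ax2.set_title("smoker")
```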
It works, but I don't like the way the labels are set up: they reflect the multi-index of the bysmoker object. We get a better result if we unstack it first. And once we have unstacked the data frame, we can actually stack the bars.
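Here is a small sketch of the unstack-then-stack step, again with made-up rows and an assumed variable name `bysmoker`. Unstacking moves the outcome level of the multi-index into columns, so each smoker group becomes one row and each outcome one bar segment.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import pandas as pd

# Made-up rows standing in for the real data.
smoking = pd.DataFrame({
    "outcome": ["Alive", "Dead", "Alive", "Alive", "Dead", "Alive", "Dead", "Alive"],
    "smoker":  ["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
})

# Series indexed by (smoker, outcome).
bysmoker = smoking.groupby("smoker").outcome.value_counts(normalize=True)

# Move the outcome level into the columns: rows are smoker groups.
unstacked = bysmoker.unstack()
print(unstacked)

# One bar per smoker group, with the outcome fractions stacked.
ax = unstacked.plot(kind="bar", stacked=True)
```

Because the fractions in each row sum to one, every stacked bar has the same total height.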
Okay, this plot visualizes the original suspicious finding that smoking improves the outcome. So let's break it up by age group. Here's the first attempt. Again, this is serviceable, but it would be nicer to group the two bars, smoker and nonsmoker, for each age group so that we have a direct visual comparison. To do that, let's look at the underlying data frame.
Perhaps we can sacrifice keeping both the alive and dead fractions, since they always sum to one, and then use the columns for the smoker status. So we first drop the column dead (it's a column, so it's on axis one), and then we unstack again. This is what we want, but the labels of the columns are somewhat messy, so we will restructure them.
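The drop-and-unstack step can be sketched like this. The counts are invented for illustration, and the labels (`Alive`, `Dead`, `No`, `Yes`, and the two age groups) are assumptions about how the data is coded.

```python
import pandas as pd

# Invented counts per (age group, smoker, outcome); not the real survey.
rows = (
    [("0-40", "No", "Alive")] * 9 + [("0-40", "No", "Dead")] * 1
    + [("0-40", "Yes", "Alive")] * 4 + [("0-40", "Yes", "Dead")] * 1
    + [("40+", "No", "Alive")] * 2 + [("40+", "No", "Dead")] * 2
    + [("40+", "Yes", "Alive")] * 1 + [("40+", "Yes", "Dead")] * 3
)
smoking = pd.DataFrame(rows, columns=["ageGroup", "smoker", "outcome"])

byage = smoking.groupby(["ageGroup", "smoker"]).outcome.value_counts(normalize=True)

# First unstack: outcome becomes the columns (Alive, Dead).
# Drop Dead (axis=1 means columns), since Alive + Dead is always one.
# Second unstack: smoker moves from the row index into the columns.
wide = byage.unstack().drop("Dead", axis=1).unstack()
print(wide)
```

The resulting columns form a two-level index, (`Alive`, `No`) and (`Alive`, `Yes`), which is the messy labeling mentioned above.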
I copy this result into a new frame, and I just replace the columns wholesale with no and yes, and also give them a name, smoker. This is much cleaner and ready to plot. Here we see that in every age group, nonsmokers have a slight edge in outcome: Simpson's paradox at work.
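A sketch of the final cleanup and grouped plot, with invented counts chosen so that nonsmokers do slightly better in each age group. The name `byage2` and all labels are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import pandas as pd

# Invented counts per (age group, smoker, outcome); not the real survey.
rows = (
    [("0-40", "No", "Alive")] * 9 + [("0-40", "No", "Dead")] * 1
    + [("0-40", "Yes", "Alive")] * 4 + [("0-40", "Yes", "Dead")] * 1
    + [("40+", "No", "Alive")] * 2 + [("40+", "No", "Dead")] * 2
    + [("40+", "Yes", "Alive")] * 1 + [("40+", "Yes", "Dead")] * 3
)
smoking = pd.DataFrame(rows, columns=["ageGroup", "smoker", "outcome"])
byage = smoking.groupby(["ageGroup", "smoker"]).outcome.value_counts(normalize=True)

# Copy the reshaped result into a new frame.
byage2 = byage.unstack().drop("Dead", axis=1).unstack().copy()

# Check the column order before overwriting the labels wholesale.
assert list(byage2.columns.get_level_values(1)) == ["No", "Yes"]
byage2.columns = ["No", "Yes"]
byage2.columns.name = "smoker"
print(byage2)

# Grouped bars: one pair (No, Yes) per age group.
ax = byage2.plot(kind="bar")
```

With one column per smoker status, pandas draws the two bars for each age group side by side, which gives the direct comparison we were after.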