Learn about box plots, histograms, density plots, and weighted histograms.
- [Instructor] Let's move on to plotting distributions. If you're just starting here, you need to load packages and data sets. A box plot, which we get in pandas with plot and kind set to box, visualizes coverage intervals. The box extends from the 25% to the 75% quantile, with a line, the green line, at the median. By default, the whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range of the box edges.
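The numbers behind a box plot can be computed by hand. The following is a minimal sketch using only the standard library, assuming the common default that whiskers reach the most extreme points within 1.5 times the interquartile range of the box. The function name box_stats and the sample values are made up for illustration.

```python
import statistics


def box_stats(values, whis=1.5):
    """Return the quantities a box plot draws, using the whis * IQR whisker rule."""
    data = sorted(values)
    # Quartiles with linear interpolation between neighboring data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # Whiskers stop at the last data points that are still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    # Everything outside the fences is drawn as an individual flier.
    fliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "fliers": fliers,
    }


incomes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up values, not course data
stats = box_stats(incomes)
print(stats)
```

Here the value 100 falls outside the upper fence, so it is reported as a flier rather than stretching the whisker.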
Points below and above the whiskers are considered fliers, so they're not typical and they may even be outliers. That is, points that are suspicious and may reflect measurement errors. To compare China and the US, we need to make a single data frame so we can plot the boxes together. We'll make a data frame on the fly. And we can call the method boxplot directly.
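Building the combined data frame and calling boxplot might look like the sketch below. The incomes are synthetic log-normal draws standing in for the course data, and the variable and column names (china_income, usa_income, china, usa) are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two income columns (not the course data).
china_income = pd.Series(rng.lognormal(mean=-0.5, sigma=0.5, size=1000))
usa_income = pd.Series(rng.lognormal(mean=3.0, sigma=0.7, size=1000))

# Build the combined data frame on the fly; each column becomes one box.
combined = pd.DataFrame({"china": china_income, "usa": usa_income})

fig, ax = plt.subplots()
combined.boxplot(ax=ax)
ax.set_ylabel("income (synthetic units)")
```

In a notebook the figure displays inline; in a script, fig.savefig can write it to a file.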
The scales are so different that we don't see much, so it's better to make box plots of the logarithm of the income. I'll change the code directly. Log 10 and log 10. This shows clearly the factor of 50 between the two distributions. A much richer visualization of a distribution is a histogram. A histogram divides the data into a set of contiguous bins and then, for each bin, shows a rectangle with height proportional to the number of data points in the bin.
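The binning just described can be sketched in a few lines of plain Python. This is an illustration of the idea, not what pandas does internally; the function name histogram_counts and the data are made up, and the sketch assumes equal-width bins and at least two distinct values.

```python
def histogram_counts(values, bins):
    """Split the data range into equal-width contiguous bins and count points per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for x in values:
        # The last bin includes its right edge so the maximum value is counted.
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return edges, counts


edges, counts = histogram_counts([1, 2, 2, 3, 5, 8, 9, 10], bins=3)
print(edges)   # the bin boundaries
print(counts)  # number of data points in each bin
```

The rectangle heights in the plot are these counts, one per bin.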
Easier to show this than to describe it. So again, we select the income column in the data frame. We use the method plot but now the kind is hist. I'm going to ask for a step-histogram which means that the figure is not filled in by color. And I'm going to ask for 30 bins. Now we'll see where most of the incomes lie in China in 1965, with the most frequent income somewhere around half a dollar.
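A sketch of the step histogram with 30 bins follows. The data frame name china1965 and its income column are assumptions, and the values are synthetic rather than the course data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical data frame with a synthetic income column.
china1965 = pd.DataFrame({"income": rng.lognormal(mean=-0.5, sigma=0.5, size=2000)})

fig, ax = plt.subplots()
# histtype="step" draws only the outline; bins=30 sets the number of bins.
china1965.income.plot(kind="hist", histtype="step", bins=30, ax=ax)
ax.set_xlabel("income (synthetic units)")
```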
We can also plot our descriptive statistics as vertical lines on top of the histogram. Let me change the code directly. In matplotlib we use axvline to plot a single vertical line. So we'll do the mean. The median. The 25% quantile.
And the 75% quantile. Let's do them all in the same color. But I will identify them by changing the line style. Dashed for median and dotted for the quantiles. We see that the mean and the median are close, as is usually the case. A density plot is effectively a smooth histogram which approximates the continuous density of the variable; if you know some calculus, you know what I'm talking about.
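The overlay of mean, median, and quartiles described above can be sketched as follows, assuming synthetic incomes and an arbitrary shared color ("C1"). Only the line styles distinguish the statistics.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
income = pd.Series(rng.lognormal(mean=-0.5, sigma=0.5, size=2000))  # synthetic

fig, ax = plt.subplots()
income.plot(kind="hist", histtype="step", bins=30, ax=ax)

# One vertical line per statistic: same color, different line styles.
ax.axvline(income.mean(), color="C1")                            # solid: mean
ax.axvline(income.median(), color="C1", linestyle="--")          # dashed: median
ax.axvline(income.quantile(0.25), color="C1", linestyle=":")     # dotted: 25% quantile
ax.axvline(income.quantile(0.75), color="C1", linestyle=":")     # dotted: 75% quantile
```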
I will compare it with the histogram for the same data. I will restrict the x-axis to see better what's happening. To compare, I actually need to normalize the histogram so that the area under it is exactly one, as it is for the density plot. I do this by setting density equal to True.
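What the normalization does can be checked numerically. This sketch uses NumPy's histogram function on synthetic data: each count is divided by the total number of points times the bin width, so the rectangles' areas sum to one.

```python
import numpy as np

rng = np.random.default_rng(3)
income = rng.lognormal(mean=-0.5, sigma=0.5, size=2000)  # synthetic data

counts, edges = np.histogram(income, bins=30)
density, _ = np.histogram(income, bins=30, density=True)
widths = np.diff(edges)

# The same normalization done by hand: count / (total points * bin width).
manual = counts / (counts.sum() * widths)

# Area under the normalized histogram: sum of height * width over all bins.
area = float(np.sum(density * widths))
print(area)
```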
It is important to remember that the density plot is just an approximation, since we don't have access to the entire distribution. And the approximation depends on the scale of the smoothing, which is chosen automatically for us but which we can set directly. By setting the bandwidth, bw_method in pandas, I can obtain more detail or more smoothing. Let's compare histograms for China and the US using log income.
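To see why the bandwidth matters, here is a minimal Gaussian kernel density sketch in plain Python. It is an illustration of the idea rather than the code pandas runs: the function name kde_at is made up, and here bandwidth is the absolute standard deviation of each Gaussian bump, whereas a library setting may be a relative scaling factor.

```python
import math


def kde_at(samples, x, bandwidth):
    """Density estimate at x: the average of one Gaussian bump per sample."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return norm * total / len(samples)


samples = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]  # made-up data with two clusters

# Small bandwidth: more detail, the gap between the clusters shows up as a dip.
narrow_shows_dip = kde_at(samples, 0.0, 0.3) < kde_at(samples, 2.0, 0.3)
# Large bandwidth: more smoothing, the two clusters merge into a single bump.
wide_merges = kde_at(samples, 0.0, 3.0) > kde_at(samples, 2.0, 3.0)
print(narrow_shows_dip, wide_merges)
```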
Same code, just a different data set. In 1965, there's basically no overlap. So the poorest Americans are richer than the richest Chinese. To understand this better, I'll show you the x-axis ticks in dollars. I'll show you levels of a quarter of a dollar, one, two, and up in multiples of two. And then I call matplotlib xticks with the location of the levels and the labels.
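Relabeling a log10 axis in dollars can be sketched like this. The exact list of levels is an assumption (doubling from a quarter of a dollar), and the data are synthetic. Ticks are placed at the log10 of each dollar level and labeled with the dollar amount itself.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)
log10_income = np.log10(rng.lognormal(mean=0.0, sigma=1.0, size=2000))  # synthetic

fig, ax = plt.subplots()
ax.hist(log10_income, bins=30, histtype="step")

# Assumed dollar levels, doubling at each step.
levels = [0.25, 0.5, 1, 2, 4, 8, 16, 32, 64]
# Tick positions are in log10 units; tick labels show the dollar amounts.
plt.xticks(np.log10(levels), [str(level) for level in levels])
ax.set_xlabel("income in dollars (log scale)")
```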
Let's see how things are in 2015. Very different. Both the Chinese and the Americans are richer, but there's also significant overlap. In fact, let's see if we can rescale the histograms to show the relative sizes of the populations. I will get population data from our gapminder.csv data set.
And I will use the pandas query method to select what we need. Specifically, population. I need a single number, not a pandas Series, so I will cast it using float. The same thing for the US.
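A sketch of that selection follows, using a tiny made-up table in place of gapminder.csv. The column names (country, year, population) are assumptions and the numbers are rough placeholders, with the 2015 values rounded to the figures quoted in this lesson. Recent pandas versions warn when float is applied to a one-element Series, so the sketch extracts the element first.

```python
import pandas as pd

# Made-up stand-in for the gapminder table; values are rough placeholders.
gapminder = pd.DataFrame({
    "country": ["China", "China", "United States", "United States"],
    "year": [1965, 2015, 1965, 2015],
    "population": [7.2e8, 1.4e9, 1.9e8, 3.2e8],
})

# query returns the matching rows; .population is then a one-element Series.
selected = gapminder.query('country == "China" and year == 2015').population
# Pull out the single element and cast it to a plain Python float.
china_pop2015 = float(selected.iloc[0])

usa_pop2015 = float(
    gapminder.query('country == "United States" and year == 2015').population.iloc[0]
)
print(china_pop2015, usa_pop2015)
```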
So we have 1.4 billion and 320 million, respectively. For the weighted histogram, I'm going to create a new weight column for the two data sets. The weight will be the population divided by the number of records. I have done the cast already, actually, so I don't need float. And same for the US.
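The weight column can be sketched as below, with synthetic log incomes and the rounded population figure from this lesson. The names china2015, log10_income, and weight are assumptions. Because every record carries population divided by the number of records, the weighted bin heights add up to the whole population.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Synthetic stand-in for the 2015 income sample (not the course data).
china2015 = pd.DataFrame({"log10_income": rng.normal(1.0, 0.4, size=1000)})
china_pop2015 = 1.4e9  # rounded figure quoted in the lesson

# Each record stands for an equal share of the population.
china2015["weight"] = china_pop2015 / len(china2015)

# A weighted histogram sums the weights in each bin instead of counting records.
heights, edges = np.histogram(
    china2015.log10_income, bins=30, weights=china2015.weight
)
total = float(heights.sum())
print(total)
```

The same weights can then be passed to the plotting call; matplotlib's hist accepts a weights argument, so the bars show people rather than records.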
I copy the code for the histogram and add the weights. Rosling points out, quite correctly, that there is a lot of purchasing power in the richer end of the Chinese population where it overlaps with the US. So corporations would do well to tap that market.