Learn about the role of visualization, scatter plots, and Jupyter notebook interactive widgets.
- [Instructor] We have worked through the hard job of cleaning data and learning the basics of pandas, or at least refreshing them. Now we can have some fun. Statistics is the science of learning from data, and of reducing complex structures and trends in the world to succinct numerical descriptions and to powerful visualizations. And nobody was more adept at identifying and explaining global trends in data than the late statistician and public health expert Hans Rosling.
His book, Factfulness, and his website, gapminder.org, are a must-read and a must-see for anybody who wants to understand our complex world as it is, and certainly for anybody learning statistics. Throughout this chapter we will use data curated by Rosling's organization. Here I want to give you a preview of the powerful visualizations that we can achieve very simply with Python, using one of the Gapminder datasets.
As usual, we import some modules, and we load the dataset with read_csv. Let's have a look. Here I'm selecting every 20th row, up to row 200. For all the countries in the world, and for years starting in 1800, this data frame shows us basic facts about life in those countries: the population; the life expectancy, that is, the average age at death of everyone born; the percentage of children surviving to age five; the average number of babies per woman; the gross national product divided by population; and the income available, on average, to each citizen each day.
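The loading and preview step can be sketched as follows. This is a minimal illustration, not the course's actual notebook: the real Gapminder file is not available here, so the sketch builds a small synthetic CSV in memory, and the column names and values are assumptions modeled on the description above.

```python
import io

import pandas as pd

# Synthetic stand-in for the Gapminder CSV file (hypothetical columns and values).
rows = range(250)
fake = pd.DataFrame({
    "country": [f"Country {i // 50}" for i in rows],
    "year": [1800 + i % 50 for i in rows],
    "population": [1_000_000 + 10_000 * i for i in rows],
    "babies_per_woman": [round(7 - 0.02 * i, 2) for i in rows],
})
csv_text = fake.to_csv(index=False)

# read_csv accepts a file path or any file-like object.
gapminder = pd.read_csv(io.StringIO(csv_text))

# Label-based slicing with .loc includes the end label,
# so this keeps rows 0, 20, 40, ..., 200.
preview = gapminder.loc[0:200:20]
print(preview)
```

With a real file, the StringIO object would simply be replaced by the path to the CSV.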
The last two columns, the per-capita product and the daily income, are given in 2011-equivalent dollars. One of the points that Rosling makes powerfully in his book is that the number of babies per woman depends strongly on child mortality: women have more children when it is harder for those children to survive. To see this very simply, we plot the number of babies per woman on the X axis and the percentage of children surviving to age five on the Y axis. First, we down-select the data to the year 1965.
We do this with NumPy-style indexing, placing a Boolean expression within the brackets that index the rows. Then we select the data frame's plotting interface, and specifically scatter. Now we can just tell scatter which columns we care about. This plot shows very simply that when children have a hard time surviving, women have more babies.
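A minimal sketch of the Boolean selection and the basic scatter plot is shown here. The tiny table and its column names are made up for illustration and are not the Gapminder data.

```python
import pandas as pd

# Made-up rows; column names are assumptions based on the narration.
gapminder = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "year": [1965, 2015, 1965, 2015, 1965, 2015],
    "babies_per_woman": [6.8, 4.1, 2.6, 1.7, 5.9, 2.3],
    "age5_surviving": [75.0, 93.0, 97.0, 99.5, 80.0, 98.0],
})

# The comparison yields a Boolean Series; indexing with it keeps the True rows.
is_1965 = gapminder.year == 1965
data1965 = gapminder[is_1965]

# scatter takes the names of the columns for the x and y axes.
ax = data1965.plot.scatter("babies_per_woman", "age5_surviving")
```

In a notebook the figure appears inline; in a script, matplotlib.pyplot.show() or ax.figure.savefig(...) would be needed to see it.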
But we can do much better. Let me show you how we can put together an interactive plot, similar to those on Rosling's website. I will go through this quickly now, and then I'll explain the details of this plot throughout the rest of this chapter. We'll write a function here that creates a scatter plot for these two variables. So we start by down-selecting the data to a specific year.
And then we repeat our plotting instructions. Now let's make this more interesting. For instance, we can make bigger dots based on the number of people in each country. To do that, I extract the population column and multiply it by a small constant that I have worked out by trial and error. I also need to tell scatter to use that.
I set the size to area. Also, let's use different colors for different regions of the world. I select the region column from the table, and I map each value to a different color. Africa will be blue, Europe will be gold, America will be green, and Asia will be coral.
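The size and color preparation can be sketched on its own, before any plotting. The scaling constant, the column names, and the population figures below are illustrative assumptions; the color names follow the narration.

```python
import pandas as pd

# Made-up rows, one per region (hypothetical values).
data = pd.DataFrame({
    "region": ["Africa", "Europe", "America", "Asia"],
    "population": [30_000_000, 60_000_000, 200_000_000, 700_000_000],
})

# Marker area proportional to population; the constant is arbitrary here
# and would normally be tuned by eye.
area = 5e-6 * data.population

# Series.map with a dict translates each region name into a color name.
colors = data.region.map({
    "Africa": "blue",
    "Europe": "gold",
    "America": "green",
    "Asia": "coral",
})
print(pd.DataFrame({"area": area, "color": colors}))
```

Both results line up row by row with the data, so they can be handed to scatter as its marker size and color arguments.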
Again, I need to tell scatter about this. Let's see what we have so far. This is already very nice. I can do just a little bit better by adding an edge to each dot, and making the figure bigger. The edge will be black and thin.
And the figure will be pretty large. Finally, I will set the range of the axes and give them descriptive labels. Let's try it again. We can go even further, adding interactivity.
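Putting the steps so far together, the plotting function might look like the sketch below. The function name plotyear, the sample rows, the scaling constant, the axis limits, and the label text are all assumptions chosen for illustration; only the overall recipe (filter by year, size by population, color by region, thin black edges, large figure, fixed axes) comes from the narration.

```python
import pandas as pd

# Made-up sample rows (hypothetical values, not Gapminder data).
gapminder = pd.DataFrame({
    "country": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "region": ["Africa", "Europe", "America", "Asia"] * 2,
    "year": [1965] * 4 + [2015] * 4,
    "population": [30e6, 60e6, 200e6, 700e6, 90e6, 65e6, 320e6, 1300e6],
    "babies_per_woman": [6.8, 2.6, 3.0, 6.1, 4.5, 1.6, 1.9, 2.2],
    "age5_surviving": [75.0, 97.0, 96.5, 80.0, 92.0, 99.6, 99.3, 97.5],
})

REGION_COLORS = {"Africa": "blue", "Europe": "gold",
                 "America": "green", "Asia": "coral"}


def plotyear(year):
    """Scatter babies per woman against child survival for one year."""
    data = gapminder[gapminder.year == year]
    area = 5e-6 * data.population            # arbitrary scaling constant
    colors = data.region.map(REGION_COLORS)  # one color name per row

    ax = data.plot.scatter(
        "babies_per_woman", "age5_surviving",
        s=area, c=colors,
        linewidths=1, edgecolors="k",  # thin black edge around each dot
        figsize=(12, 9),               # a fairly large figure
    )
    # Fixed axis ranges keep plots for different years comparable.
    ax.set_xlim(0, 8)
    ax.set_ylim(50, 105)
    ax.set_xlabel("babies per woman")
    ax.set_ylabel("% children alive at 5")
    return ax


ax = plotyear(1965)
```

Fixing the axis limits matters once the year can change: without them, each plot would rescale to its own data and the movement of countries over time would be hard to read.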
For interactivity, we will use the Jupyter notebook's ipywidgets, and specifically the function interact. I will create a slider widget that lets us select the year. I can give the widget a range, a step, and an initial value. I will hide some interface elements so we can see the entire figure.
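The slider can be sketched as follows, assuming ipywidgets is installed and the code runs in a Jupyter notebook. The helper names select_year, plotyear, and explore are hypothetical, the table is made up, and the slider's range and step are chosen to match that toy table rather than the real dataset.

```python
import pandas as pd

# Made-up rows for two years only (hypothetical values).
gapminder = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1965, 1965, 1965, 2015, 2015, 2015],
    "babies_per_woman": [6.8, 2.6, 5.9, 4.1, 1.7, 2.3],
    "age5_surviving": [75.0, 97.0, 80.0, 93.0, 99.5, 98.0],
})


def select_year(year):
    """Return only the rows for the requested year."""
    return gapminder[gapminder.year == year]


def plotyear(year):
    """Draw the scatter plot for one year; returns nothing so that
    interact shows only the figure."""
    select_year(year).plot.scatter("babies_per_woman", "age5_surviving")


def explore():
    """Build the year slider; meant to be called from a notebook cell."""
    # Imported here so the rest of the sketch runs without ipywidgets.
    from ipywidgets import IntSlider, interact

    # interact calls plotyear again each time the slider value changes.
    interact(plotyear, year=IntSlider(min=1965, max=2015, step=50, value=1965))
```

Calling explore() in a notebook cell displays the slider with the plot beneath it. With the full dataset, a step of 1 and a wider range would let the slider sweep through every year.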
This plot, for 1965, shows the world divided between the developed world, with very low child mortality and few children, and the developing world, with high birth rates and high mortality. In his book, Rosling argues that this distinction largely disappears as we move into the present, and we can see that by moving the slider. In 2015, most of the world has caught up with the developed countries, so to speak.
That's a tremendous achievement for humanity, and it's beautiful to see it from just a few numbers from a simple table.
- Installing and setting up Python
- Importing and cleaning data
- Visualizing data
- Describing distributions and categorical variables
- Using basic statistical inference and modeling techniques
- Bayesian inference