Learn about plotting more than two variables; sorting rows; matplotlib colormaps; Jupyter interactivity; DataFrame groupby; and matrix scatter plots.
- [Instructor] In this video, we will look at ways to encode more than two variables into plots. We start by creating an interactive plot showing data for all countries together, similar to the plots found on the gapminder.org website and to the child mortality versus babies per woman plot that we made in the first video of this chapter. We'll pick up where we left things in the last video, plotting income per person per day versus life expectancy. So we import our modules and load our data set.
We'll make a function and keep changing it to add features. The simplest version of our plotting function selects data for a year. And then creates a scatterplot. Let me try it out immediately.
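A minimal sketch of this first version could look like the following. The rows are made up for illustration, and the column names (`year`, `gdp_per_day`, `life_expectancy`), the function name `plot_year`, and the logarithmic income axis are assumptions rather than details given in the narration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the real data set (column names are assumptions).
gapminder = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1965, 1965, 1965, 2015, 2015, 2015],
    "gdp_per_day": [2.0, 10.0, 40.0, 8.0, 30.0, 90.0],
    "life_expectancy": [45.0, 60.0, 70.0, 65.0, 75.0, 82.0],
})

def plot_year(year):
    """Select the rows for one year and draw income versus life expectancy."""
    data = gapminder[gapminder.year == year]
    fig, ax = plt.subplots()
    ax.scatter(data["gdp_per_day"], data["life_expectancy"])
    ax.set_xscale("log")  # assumption: income is shown on a log scale
    ax.set_xlabel("gdp_per_day")
    ax.set_ylabel("life_expectancy")
    return ax

ax = plot_year(1965)
```

Keeping the plot in one function makes it easy to add one feature at a time and replot for any year.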
I get a sense of the correlation between the two variables, but I don't really know which country is which. I can use the size of the points as an additional dimension to encode population. So I will create an array, area, and pass it to scatter. It's so blue. What happens here is that the population numbers in the data are so big that they give us very large dots.
We need to slim them down, maybe by a factor of a million. Better. Perhaps we can afford a little larger. I think we begin to see China and India. To make the plot more intelligible, we can add borders, thin and black.
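As a rough sketch of the sizing and border step: matplotlib's `scatter` takes the marker area through `s` and the border style through `edgecolors` and `linewidths`. The scale factor `5e-6`, the column names, and the rows below are illustrative assumptions, not the values used in the video.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; population is in people, so raw values are far too large for marker areas.
data = pd.DataFrame({
    "gdp_per_day": [8.0, 30.0, 90.0],
    "life_expectancy": [65.0, 75.0, 82.0],
    "population": [5e7, 1.2e9, 3e8],
})

# Scale population down to a usable marker area (assumed factor).
area = 5e-6 * data["population"]

fig, ax = plt.subplots()
points = ax.scatter(
    data["gdp_per_day"], data["life_expectancy"],
    s=area,              # marker area encodes population
    edgecolors="black",  # thin black borders make overlapping dots readable
    linewidths=1,
)
ax.set_xscale("log")
```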
We can also sort data points by population, so that the larger dots sit in the back and don't hide too many others. I will sort the DataFrame directly, using population and going down in size. Now we can see a few points that were previously obscured. We can also use color as an additional dimension.
For instance, to encode child mortality. I will first grab the data, pass it to color, and I also need to select a color map from matplotlib.cm. What I can do here is reverse the scale so that dark will be worse, and fix the range of the color scale with vmin and vmax.
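The sorting and color steps might be sketched as below. The column `age5_surviving` (percent of children surviving to age five), the colormap `Purples_r`, and the range 55 to 100 are assumptions chosen to show a reversed scale where darker means worse. The rows are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

gapminder = pd.DataFrame({
    "country": ["A", "B", "C"],
    "year": [2015, 2015, 2015],
    "population": [5e7, 1.2e9, 3e8],
    "gdp_per_day": [8.0, 30.0, 90.0],
    "life_expectancy": [65.0, 75.0, 82.0],
    "age5_surviving": [90.0, 97.0, 99.5],  # hypothetical child-survival column
})

# Largest populations first, so big dots are drawn first and end up in the back.
data = gapminder[gapminder.year == 2015].sort_values("population", ascending=False)

fig, ax = plt.subplots()
points = ax.scatter(
    data["gdp_per_day"], data["life_expectancy"],
    s=5e-6 * data["population"],
    c=data["age5_surviving"],  # color encodes the fourth variable
    cmap="Purples_r",          # "_r" reverses the map: low survival is dark
    vmin=55, vmax=100,         # fixed range keeps colors comparable across years
    edgecolors="black", linewidths=1,
)
ax.set_xscale("log")
fig.colorbar(points, ax=ax, label="age5_surviving")
```

Fixing `vmin` and `vmax` matters once the plot is redrawn for different years, since otherwise each year would rescale its own colors.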
I also need a little workaround, which I found on Stack Overflow, so that the x label is not hidden. At this point, we're visualizing four variables in this plot. Last, we can color the borders of the dots to represent one last variable, this time categorical: the region. So we turn the region into named colors using the pandas method map, which takes a dictionary.
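One way to sketch the region-to-color step is with `Series.map` and a plain dictionary. The region names, the color choices, and the rows here are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "region": ["Asia", "Europe", "Africa"],
    "gdp_per_day": [30.0, 90.0, 8.0],
    "life_expectancy": [75.0, 82.0, 65.0],
})

# Hypothetical mapping from region name to a matplotlib named color.
region_colors = {
    "Africa": "skyblue",
    "Europe": "gold",
    "America": "palegreen",
    "Asia": "coral",
}

# Series.map looks each region up in the dictionary, giving one color per row.
edgecolor = data["region"].map(region_colors)

fig, ax = plt.subplots()
points = ax.scatter(
    data["gdp_per_day"], data["life_expectancy"],
    s=300, c="lightgray",
    edgecolors=edgecolor.tolist(),  # one border color per point
    linewidths=3,
)
ax.set_xscale("log")
```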
I will also set the axis ranges. And make the figure larger. And now let's animate. Instead of a slider as in the first video of this chapter, we'll use a simpler selector.
As I move through the years, the progress of all countries is evident in income, life expectancy, and child mortality, but especially so for Asian countries, which now host more than half of the world's population. How do we know? Well, pandas can tell us. I sum up the total population, and I divide up the results using the DataFrame method groupby.
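The groupby step can be sketched as follows. The numbers are deliberately small made-up values, not real population figures, and the column names are assumptions.

```python
import pandas as pd

# Made-up populations (arbitrary units), only to show the mechanics.
data = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "region": ["Asia", "Asia", "Europe", "Africa"],
    "population": [100, 200, 50, 30],
})

# Split the rows by region, then sum the population within each group.
totals = data.groupby("region")["population"].sum()

# Each region's share of the overall total.
share = totals / totals.sum()
print(totals)
print(share)
```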
Asia has more than 4.3 billion people of the 7.3 billion total. We can do one more thing that would please Rosling and show his income bands. I grab my code and start plotting vertical lines at 4, 16, and 64 dollars. We make them dotted and black.
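A short sketch of the income-band lines, using matplotlib's `axvline`. The axis limits and the logarithmic scale are assumptions for the sake of a standalone example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xscale("log")
ax.set_xlim(1, 200)  # assumed income range in dollars per day

# One dotted black vertical line at each income-band boundary.
for level in [4, 16, 64]:
    ax.axvline(level, linestyle=":", color="k")
```

On a log axis the boundaries 4, 16, and 64 are evenly spaced, since each is four times the previous one.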
Level one, less than four dollars, is extreme poverty. We see that by 2015, most of the world has risen out of extreme poverty, with a lot of Asia occupying level three, between 16 and 64 dollars, which was more typical of the developed world back in 1965. Statisticians and data analysts have thought of many more ways to plot multiple quantitative variables at once. Some of those are available in the pandas submodule plotting.
For instance, we can plot a scatter matrix. We again select the year 2015, and we select four variables. I need to give the data to scatter_matrix.
What I see here are the one-dimensional and two-dimensional paired distributions of the variables. This will be a little clearer if I use the logarithm of income. Let me just add it to my gapminder DataFrame.
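A sketch of the scatter matrix with a log-income column, using `pandas.plotting.scatter_matrix`. The rows are made up, and the column names, including the new `log10_gdp_per_day`, are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside Jupyter
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up rows for a single year (column names are assumptions).
gapminder = pd.DataFrame({
    "year": [2015] * 5,
    "gdp_per_day": [2.0, 8.0, 20.0, 45.0, 90.0],
    "life_expectancy": [55.0, 65.0, 72.0, 78.0, 82.0],
    "age5_surviving": [88.0, 93.0, 97.0, 99.0, 99.6],
    "babies_per_woman": [5.5, 3.8, 2.4, 1.9, 1.6],
})

# Add the base-10 logarithm of income as a new column.
gapminder["log10_gdp_per_day"] = np.log10(gapminder["gdp_per_day"])

columns = ["log10_gdp_per_day", "life_expectancy", "age5_surviving", "babies_per_woman"]
data = gapminder.loc[gapminder.year == 2015, columns]

# Histograms on the diagonal, pairwise scatter plots elsewhere.
axes = scatter_matrix(data, figsize=(9, 9))
```

The call returns a square array of axes, one per pair of variables, which can be used for further styling.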
Now we can see very clear trends between all variables. And with this, I think we've made enough plots.