Learn about querying data; scatter plots; logarithmic plots; and plot marker color and size.
- [Instructor] Plotting two variables together lets us hypothesize about and inspect possible relations between them. Do they rise and fall together? Is it possible that the change in one is causing the change in the other? Formally, we say that we would like to explain the variation in a response variable as a function of the variation in an explanatory variable. We'll go back to the Gapminder global dataset, and first load packages.
A simple thing to do is to plot a variable using the date as the explanatory variable. This is known as a time series. For instance, I'll look at the population of my country of origin, Italy. I will first down-select the data. In pandas, we plot two variables together with plot.scatter.
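As a rough sketch of this down-selection and scatter plot, assuming the Gapminder data sits in a pandas DataFrame with columns named `country`, `year`, and `population` (the DataFrame name, the column names, and the few rows below are made-up placeholders, not the real dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Placeholder rows for illustration only; these are not real Gapminder values.
gapminder = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "China", "China", "China"],
    "year": [1950, 1960, 1970, 1950, 1960, 1970],
    "population": [1.0, 1.1, 1.2, 5.0, 6.0, 7.0],
})

# Down-select to a single country with DataFrame.query.
italy = gapminder.query('country == "Italy"')

# Plot the time series: year is the explanatory variable, population the response.
ax = italy.plot.scatter("year", "population")
```

Swapping `"Italy"` for `"China"` in the query string is all it takes to look at a different country.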
And we just give it the two variables. We see that in this dataset, points become denser after 1950. The slope of the plot doesn't change very much. Something like this would look very different for China or for India. Let's try.
Going back to Italy, let's look at income per person per day in 2011-equivalent dollars, again as a time series. Let me change the code I have already. As Rosling teaches us, let's look at the logarithm of income. Clearly the last 20 years have been disappointing. We can also plot log income against a variable related to quality of life, such as life expectancy.
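One way to get the logarithmic view of income is the `logy` option of `plot.scatter`. The sketch below assumes an income column called `gdp_per_day`, which is a hypothetical name, and uses invented values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Placeholder rows for illustration only; the column name gdp_per_day is an assumption.
italy = pd.DataFrame({
    "year": [1900, 1950, 2000],
    "gdp_per_day": [5.0, 10.0, 80.0],
})

# logy=True puts the y-axis (income) on a logarithmic scale.
ax = italy.plot.scatter("year", "gdp_per_day", logy=True)
```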
So I will put income on the x-axis and life expectancy on the y-axis. Now I want the log scale for x, not y. Even if income has decreased recently, life expectancy has continued to grow. To provide more context for this plot, we can mark decades by changing the size of the dots.
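With income moved to the x-axis, the log option moves with it. A minimal sketch, again with hypothetical column names (`gdp_per_day`, `life_expectancy`) and invented rows:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Placeholder rows for illustration only; column names are assumptions.
italy = pd.DataFrame({
    "gdp_per_day": [5.0, 10.0, 80.0],
    "life_expectancy": [40.0, 65.0, 80.0],
})

# Income on x with a log scale (logx=True); life expectancy on a linear y-axis.
ax = italy.plot.scatter("gdp_per_day", "life_expectancy", logx=True)
```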
I will create an array, size, by using the function numpy.where. So where the year is an integer multiple of ten, I will use a large dot, and a small dot otherwise. Let's also throw in the US. I will create a down-selected dataset just for this.
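The decade markers can be sketched as follows. The specific dot sizes (30 and 2), the column names, and the rows are assumptions chosen for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Placeholder rows for illustration only; column names are assumptions.
data = pd.DataFrame({
    "year": [1948, 1950, 1955, 1960],
    "gdp_per_day": [8.0, 9.0, 12.0, 16.0],
    "life_expectancy": [63.0, 65.0, 67.0, 69.0],
})

# Large dot (30) where the year is a multiple of ten, small dot (2) otherwise.
size = np.where(data["year"] % 10 == 0, 30, 2)

# The s argument accepts one marker size per row.
ax = data.plot.scatter("gdp_per_day", "life_expectancy", logx=True, s=size)
```

Here `size` comes out as `[2, 30, 2, 30]`, so only 1950 and 1960 get the large marker.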
I need to change the dataset. And I will also use color to distinguish the two countries. In this case, I merely forgot a quote.
I also forgot to apply the color, which I do by setting c equal to the array that I just made. The progress of the two countries is similar, with the US consistently richer, but also a little less healthy. How about China and the US? It takes a very small change to the code. Now, red is probably appropriate.
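A minimal sketch of coloring by country, assuming the same hypothetical column names as before, invented rows, and an arbitrary choice of blue and orange:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Placeholder rows for illustration only; column names are assumptions.
gapminder = pd.DataFrame({
    "country": ["Italy", "United States", "China", "Italy", "United States"],
    "year": [1950, 1950, 1950, 1960, 1960],
    "gdp_per_day": [9.0, 30.0, 2.0, 16.0, 38.0],
    "life_expectancy": [65.0, 68.0, 41.0, 69.0, 70.0],
})

# Down-select to the two countries being compared.
data = gapminder.query('country == "Italy" or country == "United States"')

# One color name per row, chosen by country.
color = np.where(data["country"] == "Italy", "blue", "orange")

# Decade markers as before, plus c=color to apply the per-row colors.
size = np.where(data["year"] % 10 == 0, 30, 2)
ax = data.plot.scatter("gdp_per_day", "life_expectancy", logx=True, s=size, c=color)
```

Comparing China and the US instead only requires changing the query string and the country name (and perhaps the color) inside `np.where`.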
To understand the cluster of points at the bottom left, it's best to connect the points in this kind of plot, which we can do by adding a line plot on top of it. Let's see how to do it. We'll down-select the data to China only. And for once, instead of query, I'm using NumPy-style boolean indexing. And then I'll plot the line.
I also need a little Matplotlib trick to put the two plots together. I need to save the object returned by scatter, which is a Matplotlib Axes object, and pass it to line. So the precipitous drop in life expectancy happens in 1959, with the Great Chinese Famine, when drought and poor agricultural policies led to the deaths of tens of millions of people. It's striking to see it reflected in just a simple plot such as this.
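The overlay can be sketched like this: keep the Axes returned by `plot.scatter` and hand it to `plot.line` through the `ax` argument. Column names and rows are again invented placeholders, not real Gapminder values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Placeholder rows for illustration only; column names are assumptions.
gapminder = pd.DataFrame({
    "country": ["China", "China", "China", "United States", "United States"],
    "year": [1950, 1960, 1970, 1950, 1960],
    "gdp_per_day": [2.0, 2.5, 3.0, 30.0, 38.0],
    "life_expectancy": [41.0, 30.0, 60.0, 68.0, 70.0],
})

# Boolean-mask indexing instead of query: keep only the rows for China.
china = gapminder[gapminder["country"] == "China"]

# Save the Axes object that scatter returns...
ax = gapminder.plot.scatter("gdp_per_day", "life_expectancy", logx=True)

# ...and pass it to line so the line is drawn on top of the same plot.
ax2 = china.plot.line(x="gdp_per_day", y="life_expectancy", logx=True, ax=ax)
```

Because both calls draw on the same Axes, the line connects China's points in row order, which makes the path through the cluster easier to follow.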
- Installing and setting up Python
- Importing and cleaning data
- Visualizing data
- Describing distributions and categorical variables
- Using basic statistical inference and modeling techniques
- Bayesian inference