Learn about descriptive statistics (min, max, mean, median, quantiles) and DataFrame summaries.
- [Instructor] The job of statistics is to describe variation. In this video, we look at ways to describe and visualize the variation in a single quantitative variable, namely the distribution of the variable. We play with a few data sets that describe the distribution of daily income per person in China and in the U.S., in 1965 and in 2015. We'll use 2011 equivalent dollars. I have generated these data sets using a few simple numbers about the Chinese and American economies from Gapminder.org. Thus, they should be understood to be qualitative, and not accurate representations of the truth.
So let me import packages and load the data sets. Let me look at one. There are a thousand entries in each of these files, so each entry represents the average income of a variable number of people, depending on the size of the country's population in that year. For convenience, I have also pre-generated a column containing the base 10 logarithm of the income.
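The course files themselves are not reproduced in this transcript, so the following is only a minimal sketch of the setup just described. The column names `income` and `log10_income` and the values are made up for illustration; they are not the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the course files:
# daily income in dollars (made-up values, for illustration only).
incomes = pd.DataFrame({"income": [0.1, 1.0, 10.0, 100.0]})

# Pre-compute the base-10 logarithm of income as an extra column.
incomes["log10_income"] = np.log10(incomes["income"])

print(incomes)
```

On a log scale, each factor of 10 in income becomes one equal step, which is why the extra column is convenient to have around.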
That's because Hans Rosling argues, and I agree, that the actual difference that money makes in one's quality of life, goes roughly logarithmically with daily income. For instance, if you have 16 dollars a day, you have to go up to 64, rather than 18, before things really change for you. One way to describe the variation of a variable is by quantifying its range, or more precisely, its range of extremes. So I would look at the minimum and maximum of income.
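As a rough sketch of looking at the extremes, assuming a DataFrame with a hypothetical `income` column holding made-up values:

```python
import pandas as pd

# Made-up daily incomes in dollars, for illustration only.
incomes = pd.DataFrame({"income": [0.25, 0.5, 1.0, 2.0, 4.0]})

# The range of extremes: smallest and largest value in the column.
income_min = incomes["income"].min()
income_max = incomes["income"].max()

print(income_min, income_max)
```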
However, focusing on the extremes is usually not very insightful. It is also imprecise if our data set is a limited sample of a population, rather than a complete census. Nevertheless, you get minimum and maximum in pandas with the min and max methods of data frames, as I just did. Both minimum and maximum are statistics, descriptive numbers that we compute from the data, and that summarize the data. Of course, another very important and very famous statistic is the mean, which is computed by summing up all the data points, and dividing by the number of data points.
In symbols, we'd write something like this. In pandas, we'd just write mean. The variance is a measure of variation tied closely to the mathematical concept of the normal distribution. If you don't know about it, don't worry at this time. To compute a variance, we square the differences of the points from the mean and take the average. We can write that as a formula, or compute it in pandas with var. The argument ddof=0 relates to slightly different normalizations of the variance.
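The on-screen formulas are not part of this transcript. The sketch below spells out the mean and variance as described in words and checks them against the pandas methods; the sample values are made up.

```python
import pandas as pd

# Made-up values, for illustration only.
income = pd.Series([1.0, 2.0, 3.0, 4.0])
n = len(income)

# Mean: sum of the data points divided by the number of data points.
mean_manual = income.sum() / n
mean_pandas = income.mean()

# Variance: average of the squared differences from the mean (divide by n).
var_manual = ((income - mean_manual) ** 2).sum() / n
var_pandas = income.var(ddof=0)

# pandas defaults to ddof=1, which divides by n - 1 instead of n.
var_default = income.var()

print(mean_pandas, var_pandas, var_default)
```

With ddof=0 the sum of squared differences is divided by n; with the pandas default, ddof=1, it is divided by n - 1, which gives a slightly larger number.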
Again, do not worry about the normalization at this time. A quantile is a statistic that gives the value below which a certain percentage of the data points lie. We compute it as follows. Actually, let me compute two quantiles, for 25% and 75% of the distribution. In this case, we find that 25% of the China 1965 income data points are smaller than 34 cents, and 75% are smaller than 86 cents, or equivalently, 25% are larger than 86 cents.
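A minimal sketch of computing two quantiles at once with the pandas quantile method, using made-up values rather than the China 1965 data:

```python
import pandas as pd

# Made-up values, for illustration only.
income = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Passing a list of fractions returns a Series indexed by those fractions.
q = income.quantile([0.25, 0.75])

q25 = q.loc[0.25]
q75 = q.loc[0.75]
print(q25, q75)
```

By default pandas interpolates linearly between neighboring data points when the requested fraction falls between them.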
Taken together, the 25% and 75% quantiles specify a coverage interval that includes 50% of the data points. The 50% quantile is a good choice for a typical value of a distribution, since half the samples lie below it and half lie above it. It is also known as the median, and pandas computes it with median. The inverse of the quantile operation consists in finding the percentage of the population at which a given value lies.
To find it, we actually need to go outside pandas and use scipy.stats. The function is called percentileofscore. In this case, we find that 95% of incomes in 1965 lie below 1.5 dollars. Pandas offers a convenient method that returns several summary statistics at once. It's called describe.
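The median and the inverse quantile lookup might be sketched as follows. The values are made up, and `kind="strict"` is a choice made here so that the result counts only values strictly below the score; the video does not say which `kind` it uses.

```python
import pandas as pd
from scipy import stats

# Made-up values, for illustration only.
income = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# The median is the 50% quantile.
median = income.median()
q50 = income.quantile(0.5)

# Inverse of the quantile: what percentage of values lie below a given score?
# kind="strict" counts only values strictly smaller than the score.
pct_below = stats.percentileofscore(income.to_numpy(), 4.0, kind="strict")

print(median, q50, pct_below)
```

Note that percentileofscore returns a percentage between 0 and 100, whereas quantile takes a fraction between 0 and 1.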
We can use it to compare China and the United States in 1965. We see that, on most counts, U.S. incomes were about a factor of 50 larger.
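One way to put two describe summaries side by side is sketched below. The two data sets and the names `country_a` and `country_b` are hypothetical and use made-up values; they are not the China and U.S. data from the video.

```python
import pandas as pd

# Two made-up data sets, for illustration only.
country_a = pd.DataFrame({"income": [1.0, 2.0, 3.0, 4.0]})
country_b = pd.DataFrame({"income": [50.0, 100.0, 150.0, 200.0]})

# describe returns count, mean, std, min, quartiles, and max in one call.
summary = pd.DataFrame({
    "a": country_a["income"].describe(),
    "b": country_b["income"].describe(),
})

# Ratio of each statistic between the two data sets.
summary["ratio"] = summary["b"] / summary["a"]

print(summary)
```

Reading down the ratio column shows at a glance whether the two distributions differ by a roughly constant factor.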