Learn about plotting on images; null and alternative hypotheses; and p values.
- [Instructor] For this video, we leave local covenants and we go back to 1854 in London, a very rough time because of the repeated and deadly epidemics of cholera. Physician John Snow used simple statistics and beautiful plotting to pinpoint the origin of one such outbreak to a contaminated water pump on Broad Street. It's a fascinating story; go look it up. We are going to follow in John Snow's footsteps to learn about testing hypotheses. So here are two datasets.
Let me first import packages. The first dataset contains the positions of eight water pumps in Central London. These are given as latitude/longitude, and also distances from a reference point, pump number zero, in kilometers. The other dataset contains the number of deaths at different locations, as well as the closest pump to that location.
Let's look at one record every 20. We can plot this quickly. We want the square figure, and a scatter plot of the pumps, and the deaths. And we'll make these a little smaller.
In fact, it would be fun to over-plot this on a map of London. I have obtained such a bitmap from Google Maps, using the central coordinates. So I load this with matplotlib. Now I can show the image using imshow, but I also need to know its size.
From Google Maps, I found out that it's about 7.6 kilometers in both height and width. Now I can over-plot my pumps and unlucky addresses. Very nice, this map compares quite well with John Snow's original. In fact, the map seems already rather damning for the pump in the center, which is pump number zero.
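A minimal sketch of the over-plotting step, assuming the map is square and centered on the reference point. The function name `overlay_on_map`, the gray placeholder image, and the point positions are all made up for illustration; they are not the course's files or data. The key idea is the `extent` argument of `imshow`, which maps the image onto the same kilometer coordinates used by the scatter plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def overlay_on_map(image, pumps_xy, deaths_xy, map_size_km):
    """Show a square map image and over-plot points given in km from its center."""
    half = map_size_km / 2
    fig, ax = plt.subplots(figsize=(8, 8))  # square figure
    # extent places the image in data coordinates: (left, right, bottom, top)
    ax.imshow(image, extent=[-half, half, -half, half])
    ax.scatter(deaths_xy[:, 0], deaths_xy[:, 1], s=10, label="deaths")
    ax.scatter(pumps_xy[:, 0], pumps_xy[:, 1], s=40, marker="s", label="pumps")
    ax.legend()
    return ax

# Placeholder inputs (NOT the course data): a blank gray "map" and made-up positions.
image = np.full((200, 200, 3), 0.8)
pumps_xy = np.array([[0.0, 0.0], [0.3, 0.2], [-0.2, -0.3]])
deaths_xy = np.array([[0.05, 0.02], [-0.04, 0.06], [0.1, -0.05], [0.28, 0.18]])

# Map size as quoted in the narration
ax = overlay_on_map(image, pumps_xy, deaths_xy, map_size_km=7.6)
```

With a real bitmap, the placeholder array would be replaced by the result of loading the image file with matplotlib.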
So let's look at this. This is a tally of addresses, but we really need the total number of deaths closest to each pump. So we group the data by the column closest, and then sum up the deaths. Okay, so there's no doubt. If deaths occur randomly in each area, there's no way we could get 340 in area zero and so few in all the others.
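The grouping step can be sketched as follows. The column name `closest` comes from the narration; the column name `deaths` and the few rows of data are assumptions made up for illustration.

```python
import pandas as pd

# Made-up records (NOT the course data): deaths at an address, and its closest pump.
cholera = pd.DataFrame({
    "deaths":  [3, 2, 1, 4, 1],
    "closest": [0, 0, 1, 0, 4],
})

# Counting rows per pump would only tally addresses.
addresses_by_pump = cholera["closest"].value_counts()

# Grouping by pump and summing gives the total deaths closest to each pump.
deaths_by_pump = cholera.groupby("closest")["deaths"].sum()
```

In this toy table pump zero has three addresses but nine deaths, which is the distinction the narration draws between a tally of addresses and a total of deaths.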
So to make the game more interesting statistically, we will assume that the populations of each area are very different, with many more people living in the area closest to pump zero. Thus, we do expect more deaths there. So let's make a simulation. We'll use only areas zero, one, four and five, which have the most cases of cholera, and simulate each death randomly, proportionally to the population of each area. I'll write the function for this.
I enclose the results in a DataFrame and use numpy's random.choice to select a number among zero, one, four, and five, n times, with probabilities proportional to the populations in the areas. So 65 percent of people in area zero, 15 percent in area one, and 10 percent each in areas four and five. So let's try this once for the total number of deaths, which is 489, and count the values of closest.
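A minimal sketch of such a simulation function, using the areas and population shares stated above. The names `simulate`, `AREAS`, and `POPULATION_SHARE` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

AREAS = [0, 1, 4, 5]
POPULATION_SHARE = [0.65, 0.15, 0.10, 0.10]  # shares quoted in the narration

def simulate(n):
    """Assign each of n deaths to an area at random, proportionally to population."""
    return pd.DataFrame(
        {"closest": np.random.choice(AREAS, size=n, p=POPULATION_SHARE)}
    )

# One simulated outbreak with the observed total of 489 deaths
simulated = simulate(489)
counts = simulated["closest"].value_counts()
```

Each call gives a different random result, so `counts` will vary from run to run around the expected shares.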
So we get something close to what we actually observed in the true data. What we need now is the sampling distribution of the number of deaths in area zero. I will extract the count for area zero, repeat the operation 10,000 times, and enclose the result in a DataFrame.
This will take a few seconds. I will look at the histogram. We have generated this distribution under the null hypothesis that the pumps have nothing to do with cholera, and the deaths occur simply proportionally to population. We can now compare this distribution with the observed number of 340 deaths in area zero. More precisely, we evaluate at what quantile we find 340 in this null hypothesis sampling distribution.
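One way to build this sampling distribution is sketched below. The helper name `deaths_in_area_zero` and the column name `counts` are assumptions. For speed, the sketch counts the draws directly instead of building a DataFrame on every repetition.

```python
import numpy as np
import pandas as pd

AREAS = [0, 1, 4, 5]
POPULATION_SHARE = [0.65, 0.15, 0.10, 0.10]  # shares quoted in the narration

def deaths_in_area_zero(n):
    """Simulate n deaths under the null hypothesis and count those in area zero."""
    draws = np.random.choice(AREAS, size=n, p=POPULATION_SHARE)
    return int((draws == 0).sum())

# Repeat the simulation 10,000 times and collect the area-zero counts
sampling = pd.DataFrame(
    {"counts": [deaths_in_area_zero(489) for _ in range(10000)]}
)
```

The histogram can then be drawn with something like `sampling["counts"].hist()`; it should be centered near 0.65 × 489, or about 318 deaths.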
Remember, I used scipy.stats.percentileofscore. So 340 is a very extreme value, which we would not expect from the null scenario. In fact, we'd expect a value this large only 1.86 percent of the time. This is known as the P value. The smaller the P value, the more strongly we can reject the null hypothesis.
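The quantile lookup can be sketched as follows. To stay self-contained, the sketch draws the null distribution as a shortcut with `np.random.binomial`, since the number of deaths falling in area zero out of 489, each with probability 0.65, is binomially distributed. That shortcut is a substitution made here, not necessarily how the video builds the distribution.

```python
import numpy as np
import scipy.stats

# Null sampling distribution of deaths in area zero (binomial shortcut, see above)
null_counts = np.random.binomial(489, 0.65, size=10000)

# Percentage of simulated counts below the observed value of 340
percentile = scipy.stats.percentileofscore(null_counts, 340)

# The P value is the remaining upper tail, expressed here in percent
p_value_percent = 100 - percentile
```

Because the distribution is simulated, the result fluctuates slightly between runs; it should land close to the figure quoted in the narration.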
I've just presented a very simple example of hypothesis testing. We have made an observation: many deaths in area zero. We have made a hypothesis: it's the pump. And we have estimated the distribution of expected deaths under a null hypothesis. Last, we have verified how extreme our observed finding was with respect to the null distribution. Note that the only two permissible conclusions from a formal hypothesis test such as this are "I reject the null hypothesis" or "I fail to reject the null hypothesis."
This is a very formal way of reasoning, but it's the only one we can support firmly with statistics. The scientific community has recently witnessed a harsh debate about the value of hypothesis testing, and especially about what P value should be required to make a conclusion. The problem is that if you select, say, a P value threshold of 0.05, five percent, but you make many tests, eventually you are going to find many where the null hypothesis is wrongly rejected, just by chance.
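The multiple-testing problem can be illustrated with a small sketch in which the null hypothesis is true in every test by construction. It reuses the 489 deaths and the 65 percent share from the example; the number of tests and the use of an exact binomial tail (`scipy.stats.binom.sf`) in place of a simulated distribution are assumptions made for brevity.

```python
import numpy as np
from scipy import stats

n_deaths, share_zero = 489, 0.65
alpha = 0.05      # P value threshold
n_tests = 1000    # hypothetical number of independent tests

# Every "observation" is drawn from the null itself, so any rejection is wrong.
observed = np.random.binomial(n_deaths, share_zero, size=n_tests)

# One-sided P value: probability under the null of a count >= the observed one
p_values = stats.binom.sf(observed - 1, n_deaths, share_zero)

false_rejections = int((p_values < alpha).sum())
print(f"{false_rejections} of {n_tests} true nulls rejected at alpha = {alpha}")
```

Roughly alpha times the number of tests, a few dozen here, come out "significant" even though nothing is going on. Lowering the threshold reduces that number.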
The lesson is that one must be careful about making conclusions, for instance, by requiring a lower P value or an established causal link, and not just apply a formula blindly.
- Installing and setting up Python
- Importing and cleaning data
- Visualizing data
- Describing distributions and categorical variables
- Using basic statistical inference and modeling techniques
- Bayesian inference