Learn about sampling distribution; simulating data; confidence interval; confidence level; and score at percentile.
- [Instructor] In this chapter, you will be a journalist who knows a little about statistics. Imagine a very important election is taking place in your city, with the incumbent mayor, Mr. Brown, running against the local celebrity chef, Mrs. Green. You work for the local newspaper, and you're asked to poll your fellow citizens about their vote. To make things easy for you, we'll assume you can reach every voter by phone and that every polled voter replies truthfully.
Neither assumption is trivial in reality, but in this case, they mean there are no selection effects. Laboriously, you call 1,000 voters and ask for their voting intention. I'm giving you a file with your findings. I load packages as usual and read my file.
As we learned in chapter three, we can count votes using the DataFrame method value_counts. In fact, let's give it the keyword normalize to get the fractions, the proportions for each candidate. The data seem to say that Brown is going to remain mayor. However, you realize that the limited sample means that the proportion depends on the specific people that you happen to draw. This is known as sampling variability.
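As a rough illustration of that counting step, here is a minimal sketch. The poll file itself is not shown in this transcript, so the DataFrame below is a made-up stand-in with an assumed column name `vote` and placeholder counts chosen only to match the .51 proportion discussed later.

```python
import pandas as pd

# Hypothetical stand-in for the poll file: one row per respondent.
# The column name 'vote' and the 510/490 split are assumptions for illustration.
poll = pd.DataFrame({"vote": ["Brown"] * 510 + ["Green"] * 490})

# Raw counts per candidate.
counts = poll["vote"].value_counts()

# normalize=True turns the counts into proportions that sum to 1.
proportions = poll["vote"].value_counts(normalize=True)

print(counts)
print(proportions)
```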
So given this poll, what can you really say about the underlying population of voters? To understand this, we need to study the sampling distribution of the proportion, namely, we wish to understand what range of different samples we may get for the same population, and we'll do this by simulation on a computer. To this end, let me build a simple function that simulates such a sample. The function would take the actual fraction of votes for Mayor Brown and the number of people polled.
So we can use NumPy random rand for this, which returns a vector, let's say of five, random numbers between zero and one. So, if we test whether each random number is less than the true fraction, say .51, we will get True or False for votes for Brown and Green, respectively. We can then apply the NumPy function where to convert these Boolean values into strings.
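The three steps just described could look something like this. It is only a sketch; the variable names are my own and not necessarily those used in the video.

```python
import numpy as np

# Five random numbers drawn uniformly from [0, 1).
draws = np.random.rand(5)

# True where the draw falls below the assumed true Brown fraction of 0.51.
is_brown = draws < 0.51

# Convert the Booleans into candidate names.
votes = np.where(is_brown, "Brown", "Green")

print(draws)
print(is_brown)
print(votes)
```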
And we wrap everything in a DataFrame and make a function out of it. I replace the number of polled voters and the Brown fraction by the arguments of the function. So now, let's say that the true Brown fraction over the entire population is indeed .51. Let's see one possible sample, and the counts for the two candidates.
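A function along these lines might be written as follows. This is a sketch under assumptions: the function name `sample`, the argument names, and the column name `vote` are my own choices.

```python
import numpy as np
import pandas as pd

def sample(brown, n=1000):
    """Simulate one poll of n voters, given the true Brown fraction."""
    # Each uniform draw below `brown` counts as a vote for Brown.
    votes = np.where(np.random.rand(n) < brown, "Brown", "Green")
    return pd.DataFrame({"vote": votes})

# One possible sample for a true Brown fraction of 0.51.
s = sample(0.51, n=1000)
print(s["vote"].value_counts(normalize=True))
```

Running the last two lines several times gives a different proportion each time, which is the sampling variability mentioned earlier.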
In this case, Brown is actually under the level needed to win the election, although his true fraction is .51. So we repeat this many times and collect the results in a DataFrame. 1,000 simulated experiments should be sufficient. So let's look at the histogram.
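Repeating the simulated poll and collecting the proportions could be sketched like this. The helper name `sampling_dist` and the `repeats` argument are hypothetical, and the plotting call is left as a comment because it needs matplotlib.

```python
import numpy as np
import pandas as pd

def sample(brown, n=1000):
    """Simulate one poll of n voters, given the true Brown fraction."""
    votes = np.where(np.random.rand(n) < brown, "Brown", "Green")
    return pd.DataFrame({"vote": votes})

def sampling_dist(brown, n=1000, repeats=1000):
    """Proportions for each candidate over many simulated polls."""
    props = [sample(brown, n)["vote"].value_counts(normalize=True)
             for _ in range(repeats)]
    # One row per simulated poll, one column per candidate.
    return pd.DataFrame(props).reset_index(drop=True)

np.random.seed(0)  # fixed seed only so the sketch is repeatable
dist = sampling_dist(0.51)
print(dist["Brown"].describe())

# To draw the histogram (requires matplotlib):
# dist["Brown"].hist(bins=20)
```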
It turns out that for a true Brown fraction of .51, we may obtain any sample proportion from .48 to .55. The converse must also be true: the .51 that you observe may actually originate from a Green majority. Can we make this more precise and identify a likely range of true fractions? I'm going to introduce here a very important notion in statistics, that of the confidence interval.
The confidence interval describes the uncertainty of inference by giving us a range such that, say, 95% of the time, the range would include the true value. 95% is the confidence level, and we can choose it as we want. "95% of the time" means that if we were to run polls in 100 elections and compute the confidence interval for each election, then for approximately 95 of those 100, the interval would include the true value.
Of course, we wouldn't know which 95. So let me repeat, a confidence interval is formed from the data in such a way that 95% of the time, it would include the true value. How do we do it here? There are analytical techniques that involve assumptions about the underlying distributions. However, in this case and many others, it is much simpler to simulate. We know how to simulate the sampling distribution for any true Brown fraction. So let's make a function for that.
I'll take code from above and replace the Brown fraction. Ah, we need a colon here. Let's say we want to go for the 95% confidence interval. This will lie between the 2.5% quantile and the 97.5% quantile.
So we look for the true fraction for which a measured value of .51 lies at the 2.5% quantile, and the true fraction for which our measured value lies at the 97.5% quantile. It turns out that those two fractions are the edges of the confidence interval. If you think about many different experiments where we repeat this procedure, you can convince yourself that this is indeed the case. So I'll make a function that extracts those quantiles by first calling samplinglist and then calling DataFrame quantile on it.
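A quantile-extracting function of this kind might look like the sketch below. To keep it short and self-contained, it computes each simulated proportion directly as the mean of the thresholded random draws rather than going through a DataFrame of votes. The names `sampling_props` and `quantiles` are assumptions, not necessarily the ones used in the video.

```python
import numpy as np
import pandas as pd

def sampling_props(brown, n=1000, repeats=1000):
    """Brown's sample proportion in each of `repeats` simulated polls."""
    # Each row is one poll; the mean of the Booleans is the Brown proportion.
    return pd.Series((np.random.rand(repeats, n) < brown).mean(axis=1))

def quantiles(brown, n=1000, repeats=1000):
    """2.5% and 97.5% quantiles of the simulated sampling distribution."""
    props = sampling_props(brown, n, repeats)
    return props.quantile(0.025), props.quantile(0.975)

np.random.seed(0)  # fixed seed only so the sketch is repeatable
low, high = quantiles(0.51)
print(low, high)
```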
Let me explore a few values until we find .51 on each end. We do this approximately, but of course, you could build a function that does it more exactly, 0.49. And on the other side, a little more. So, for an observed sample proportion of .51, when the sample size is 1,000, the 95% confidence interval for the true population fraction is .48 to .54.
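One way to automate that manual exploration is to scan a grid of candidate true fractions and pick the ones whose quantiles land closest to the observed .51. This is a sketch of one possible approach, not the procedure used in the video; the grid range, step, and helper names are my own assumptions.

```python
import numpy as np

def quantiles(brown, n=1000, repeats=2000):
    """2.5% and 97.5% quantiles of Brown's simulated sample proportion."""
    props = (np.random.rand(repeats, n) < brown).mean(axis=1)
    low, high = np.quantile(props, [0.025, 0.975])
    return low, high

np.random.seed(0)  # fixed seed only so the sketch is repeatable
observed = 0.51
grid = np.arange(0.45, 0.5701, 0.005)  # candidate true fractions
q = np.array([quantiles(b) for b in grid])

# Lower edge: the true fraction whose 97.5% quantile is closest to the observation.
lower_edge = grid[np.argmin(np.abs(q[:, 1] - observed))]
# Upper edge: the true fraction whose 2.5% quantile is closest to the observation.
upper_edge = grid[np.argmin(np.abs(q[:, 0] - observed))]

print(round(lower_edge, 3), round(upper_edge, 3))
```

A finer grid or more repeats would sharpen the edges, at the cost of a longer run.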
We can also express the same by saying that our point estimate is .51, and that the margin of error is .03 on either side at 95% confidence. Thus, the result of this election lies within the margin of error of the poll. That's not very satisfying for a journalist such as you. So we can do better by increasing the size of the sample. How much bigger would it need to be? Luckily, we have a way to simulate it.
We'll create a sampling distribution for a true fraction of 50%, with a sample size of 10,000. This takes a few seconds. Let's plot the histogram again. You see from this that the margin of error is now more like 1%, which would have been sufficient to claim Brown as the likely winner. Under very general conditions, one can show that the margin of error shrinks in proportion to one over the square root of the sample size.
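To see the square-root behavior numerically, here is a small sketch comparing sample sizes of 1,000 and 10,000. Two things in it are my own additions rather than part of the video: it draws the vote counts with NumPy's binomial generator as a faster shortcut for counting thresholded uniform draws, and it prints the standard normal-approximation formula 1.96 * sqrt(p(1 - p) / n) next to the simulated value for comparison.

```python
import numpy as np

def margin_of_error(brown, n, repeats=5000):
    """Half-width of the central 95% range of simulated sample proportions."""
    # Number of Brown votes in each simulated poll, divided by the poll size.
    props = np.random.binomial(n, brown, size=repeats) / n
    low, high = np.quantile(props, [0.025, 0.975])
    return (high - low) / 2

np.random.seed(0)  # fixed seed only so the sketch is repeatable
for n in (1000, 10000):
    simulated = margin_of_error(0.50, n)
    approx = 1.96 * np.sqrt(0.50 * 0.50 / n)  # normal approximation, for comparison
    print(n, round(simulated, 4), round(approx, 4))
```

Going from 1,000 to 10,000 voters multiplies the sample size by 10, so the margin of error should drop by a factor of about the square root of 10, a bit more than 3.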
But we actually have to collect the sample. You do so by stating it. Luckily, I'm giving you the file. And we find out that the likely winner is, in fact, Mrs. Green. If we were to compute the confidence interval for the Green fraction in this case, as we did above, we'd find that it is between .508 and .528.
It doesn't include the threshold of .50. So now, you can go and write your article.