Learn about bootstrapping, sampling with replacement, and simulating a Gaussian mixture and a truncated distribution.
- [Instructor] A year after Mayor Green is elected, she embarks on a series of very contentious reforms. There are doubts as to the level of support that the citizens have for her. In your job at the newspaper, you're still covering politics, so you prepare to do another poll. This time, you ask the people you interview to give the mayor a grade between zero and ten. You're feeling lazy, so you take only 100 samples. Then we import packages, and load the data.
It seems the citizens gave you the grades with great precision. Nothing wrong with this. So we look at the histogram, and describe the sample with summary statistics. The histogram has no recognizable simple form, but the mean for the sample is 5.5. What can we say about the true mean value? This time we cannot build a confidence interval by simulating the sampling distribution, because we do not know how to describe the population distribution.
And, indeed, given the observed histogram, it is unlikely that it has a simple form such as a normal distribution. However, we can still use computing by adopting a powerful idea in modern statistics, bootstrapping, which was introduced by Efron in 1979. What we'll do is estimate the uncertainty of our statistic, the mean, by generating a large family of samples from the one we have, and then characterizing the distribution of the mean over this family.
Each sample in the family is prepared as follows: we draw grades randomly from our single existing sample, allowing the same grade to be drawn more than once. Technically speaking, we are sampling with replacement. Let's try to do it once. The relevant pandas method is sample: we draw 100 rows from the DataFrame pop, with replacement.
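A single bootstrap draw can be sketched as follows. The course's data file is not shown in this transcript, so the grades here are made-up stand-ins; only the DataFrame name pop and the use of sample with replacement come from the narration, and the column name grade is taken from later in the video.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # pandas falls back to NumPy's global random state

# Stand-in data (assumption): 100 invented grades between 0 and 10.
pop = pd.DataFrame({"grade": np.random.uniform(0, 10, size=100).round(1)})

# One bootstrap sample: 100 rows drawn from pop, with replacement,
# so the same row can appear more than once.
boot = pop.sample(100, replace=True)

sample_mean = pop["grade"].mean()
boot_mean = boot["grade"].mean()
print(f"sample mean: {sample_mean:.2f}, bootstrap sample mean: {boot_mean:.2f}")
```

Because rows repeat, the index of boot contains duplicates, which is a quick way to confirm that the draw really was made with replacement.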
We see that for this bootstrapped sample the mean is a little different. So let's build the bootstrapped distribution of means. We generate a bootstrap sample, take the mean, and repeat this 1000 times using a Python list comprehension. And then we fold this into a DataFrame.
We'll call the variable mean grade. And save everything into dataframe bootstrap. Let's take a histogram. Remember, these are not grades, but they are means of grades. Let me show you the original mean drawn as a line on top of this.
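The list comprehension described here might look like the sketch below. The data are again invented stand-ins, and the column name meangrade is an assumed spelling of the variable named in the narration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Stand-in data (assumption): 100 invented grades between 0 and 10.
pop = pd.DataFrame({"grade": np.random.uniform(0, 10, size=100).round(1)})

# Resample with replacement, take the mean, and repeat 1000 times.
bootstrap = pd.DataFrame({
    "meangrade": [pop["grade"].sample(100, replace=True).mean()
                  for _ in range(1000)]
})

print(bootstrap["meangrade"].describe())
```

With matplotlib available, bootstrap["meangrade"].hist() would draw the histogram of means, and matplotlib.pyplot.axvline(pop["grade"].mean()) would overlay the original sample mean as a vertical line.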
The mean is essentially the same as that of our sample. If you think about it, it has to be. But there is significant spread around it. So let me extract the quantiles. That's it: a bootstrap-approximated 95% confidence interval for the mean grade. It is between 5.1 and 5.9.
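Extracting the interval amounts to taking the 2.5% and 97.5% quantiles of the bootstrapped means. A minimal sketch, using invented stand-in grades, so the numbers will not match the 5.1 and 5.9 quoted in the video:

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Stand-in data (assumption): 100 invented grades between 0 and 10.
pop = pd.DataFrame({"grade": np.random.uniform(0, 10, size=100).round(1)})

bootstrap = pd.DataFrame({
    "meangrade": [pop["grade"].sample(100, replace=True).mean()
                  for _ in range(1000)]
})

# The central 95% of the bootstrapped means gives the approximate interval.
low, high = bootstrap["meangrade"].quantile([0.025, 0.975]).to_numpy()
print(f"approximate 95% confidence interval: [{low:.2f}, {high:.2f}]")
```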
It seems that the mean grade is likely to be a passing one. The bootstrap procedure requires that the sample you have is representative, and the procedure is justified by rather complex mathematics, under rather general assumptions. In this case, I will show you that the guess is acceptable by showing you how I really generated the data set. The distribution that I used was actually a sum of two normal distributions with equal weights.
We can use scipy.stats to handle and play with distributions. If you don't know much about normal distributions, just follow along qualitatively. I will plot this between 0 and 10.
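An equal-weight mixture of two normals can be evaluated on a grid with the pdf method of scipy.stats distribution objects. The transcript does not state the means and standard deviations that were used, so the values below are hypothetical.

```python
import numpy as np
import scipy.stats

# Hypothetical component parameters (not given in the transcript).
n1 = scipy.stats.norm(7.5, 1.0)
n2 = scipy.stats.norm(4.0, 1.0)

# Evaluate the mixture density between 0 and 10 in steps of 0.1.
x = np.linspace(0, 10, 101)
density = 0.5 * n1.pdf(x) + 0.5 * n2.pdf(x)

# Grade at which the density is highest (one of the two peaks).
print(x[np.argmax(density)])
```

Plotting density against x, for example with matplotlib.pyplot.plot(x, density), would show the two peaks.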
For a scipy.stats distribution object, pdf returns the probability density. Here we go, this is a bimodal distribution. I also truncated this distribution, because there can't be grades below 0 or above 10. So I made a function to draw a single grade. For a scipy.stats distribution object, rvs returns a random sample from that distribution.
And I also need to choose between the two, which I do just by drawing a uniformly distributed number between 0 and 1, and comparing it with 0.5. Then I truncate, so I continue drawing until I get a sample that's acceptable. Let's try it once. And now I can make a data set by calling this repeatedly.
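The draw-until-acceptable procedure can be sketched like this. The function names draw and dataset and the component parameters are assumptions made for illustration; the narration only describes the steps.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Hypothetical component parameters (not given in the transcript).
n1 = scipy.stats.norm(7.5, 1.0)
n2 = scipy.stats.norm(4.0, 1.0)

def draw():
    """Draw one grade from the truncated two-component mixture."""
    while True:
        # Pick a component with equal probability, then sample from it.
        v = n1.rvs() if np.random.rand() < 0.5 else n2.rvs()
        # Truncate by rejection: keep drawing until the grade is in range.
        if 0 <= v <= 10:
            return v

def dataset(n=100):
    """Build a DataFrame of n simulated grades."""
    return pd.DataFrame({"grade": [draw() for _ in range(n)]})

sample = dataset()
print(sample["grade"].describe())
```

Rejecting out-of-range draws and trying again is a simple way to sample a truncated distribution; it stays cheap here because very little of the mixture's probability lies outside 0 to 10.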
Let's histogram a few of these samples. We select the column grade and plot its histogram. And as for the sampling distribution of the mean, we can use simulation in a straightforward way.
Let's look at the histogram and compare it with the bootstrapped distribution. We see that the true sampling distribution is displaced from the bootstrap estimate, but the spreads are comparable, which justifies our approximated confidence interval.
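The comparison can be reproduced in outline as below. To keep it fast, this sketch swaps the one-grade-at-a-time loop for a vectorized NumPy rejection sampler, and it uses hypothetical mixture parameters, so it illustrates the idea and does not reproduce the video's exact figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def dataset(n=100):
    """Draw n grades from the truncated mixture (hypothetical parameters)."""
    grades = np.empty(0)
    while grades.size < n:
        # Equal-weight choice between the two normal components.
        first = rng.random(n) < 0.5
        v = np.where(first, rng.normal(7.5, 1.0, n), rng.normal(4.0, 1.0, n))
        # Truncation: keep only draws inside the valid grade range.
        grades = np.concatenate([grades, v[(v >= 0) & (v <= 10)]])
    return grades[:n]

# True sampling distribution of the mean: 1000 independent samples of 100.
true_means = np.array([dataset().mean() for _ in range(1000)])

# Bootstrap distribution of the mean, built from one sample of 100.
sample = dataset()
boot_means = np.array([rng.choice(sample, size=100, replace=True).mean()
                       for _ in range(1000)])

print(f"true:      center {true_means.mean():.2f}, spread {true_means.std():.2f}")
print(f"bootstrap: center {boot_means.mean():.2f}, spread {boot_means.std():.2f}")
```

The bootstrap distribution is centered on the mean of the one sample it was built from, which is why its center can sit away from the true one while the two spreads remain similar.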
- Installing and setting up Python
- Importing and cleaning data
- Visualizing data
- Describing distributions and categorical variables
- Using basic statistical inference and modeling techniques
- Bayesian inference