Join Barton Poulson for an in-depth discussion in this video Confidence, part of Data Science Foundations: Fundamentals.
- [Voiceover] The next approach to inferential statistics that we want to talk about is confidence intervals. These are methods that try to directly answer the question, how big is it? How big is the statistical effect, the difference between the group means, or the association between the variables? Next, you have to pick a level of confidence. 95 percent is the most common. Or, as I show here with the concentric circles, you can go in and make it a little bit narrower, or you can reach out and draw a broader conclusion. And the idea here is that the more confident you want to be, so for instance, going from 95 to 99 percent, the wider your interval will be, or the bigger the circle in this particular example.
There's also a trade-off between two elements. The first one is accuracy. Now, in confidence interval estimation, accuracy means it's on target, or it's centered around the true value. More specifically, the confidence interval is accurate if it contains the true population value. This, of course, leads to a correct inference about the population. Accuracy is contrasted with precision, which means something different. Precision in this case means a narrow interval, a small range of likely values.
And this operates independently of accuracy. And I'll show you how that works. So what I've got here is a hypothetical situation where I'm looking at values between zero and 100. I've got a thick line at 50, and I've got a dotted line at 55, where 55 is, in this case, my made-up true population value. This first distribution shows a range of values, and it is neither accurate nor precise. It's not accurate because it misses the true value; it's actually on the other side of 50 percent.
If this were a political poll, you'd be giving them the wrong answer. Also, it's not precise because it's spread out a huge amount. Now, this next distribution is accurate because it is centered on the true value, of 55 in this case, but it's not precise, 'cause again, it's spread out all over the place, and you'd have about a two-thirds chance of giving the correct answer, or a one-third chance of giving the wrong answer. That, again, assumes you're calling it above or below 50 percent, as you would in a political poll.
In contrast, this distribution is extremely precise. It's really narrow, spanning just about 10 percentage points. But it's not accurate. In fact, it's almost entirely on the wrong end; it misses the goal completely. The ideal is this one. It's both accurate, because it's centered on the true value, and it's precise, 'cause it's very, very narrow. And here you can look at the four versions together, where the ones on the left are not accurate, the ones on the right are accurate, and the bottom-right one is the ideal.
Next you have the issue of interpretation, or explaining your results to your client. The problem with confidence intervals is there's sort of a disconnect sometimes between the actual statistical result and the interpretation. Now, the actual result is easy to get. And it would be, for example, that the 95 percent confidence interval for the mean ranges from 5.8 to 7.2. Those are made-up numbers. The colloquial interpretation for this, the one that people usually use when they're not thinking very carefully, is that there's a 95 percent chance that the population mean is between 5.8 and 7.2.
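As a rough sketch of how such an interval is computed (the sample data below are made up for illustration and are not from the course), here is a normal-approximation 95 percent confidence interval for a mean in Python, using only the standard library:

```python
import math
import statistics

def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation CI for the mean (reasonable for large-ish samples)."""
    n = len(data)
    mean = statistics.fmean(data)
    sem = statistics.stdev(data) / math.sqrt(n)          # standard error of the mean
    z = statistics.NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.96 for 95%
    return mean - z * sem, mean + z * sem

data = [6.1, 7.0, 5.9, 6.8, 6.4, 7.1, 6.2, 6.6, 6.0, 6.9]  # hypothetical measurements
low, high = mean_confidence_interval(data)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

For small samples you'd normally use a t critical value instead of the normal z; this sketch keeps to the standard library for simplicity.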
On the other hand, the standard approach is that population means are fixed. They don't shift, but the colloquial interpretation sort of implies that they do. And so, by the standard approach, the correct interpretation is this: 95 percent of confidence intervals for randomly selected samples will contain the population mean, because what shifts in this case is the sample, not the population. I can show you that graphically with this. This is 20 confidence intervals that I randomly generated from data with a population mean of 55.
And they go from their low value on the bottom up to their high value. And you can see that 19 of them cross over the true population value, so they're accurate confidence intervals. On the other hand, interval number 18, near the far right, is shown in blue because it missed it completely. That'll happen because there is random variation in the samples and in their consequent confidence intervals. Next, a very brief list of factors that can affect the width of a confidence interval.
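That picture can be reproduced in miniature with a quick simulation. The population mean of 55 matches the example; the standard deviation, sample size, and number of trials below are my own illustrative assumptions. The code counts how often a normal-approximation 95 percent interval captures the true mean:

```python
import random
import statistics

random.seed(1)
POP_MEAN, POP_SD = 55, 10   # population mean from the example; SD is assumed
N, TRIALS = 30, 1000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    m = statistics.fmean(sample)
    sem = statistics.stdev(sample) / N ** 0.5
    low, high = m - 1.96 * sem, m + 1.96 * sem   # normal-approx 95% CI
    covered += low <= POP_MEAN <= high

print(covered / TRIALS)   # close to 0.95: most intervals contain the true mean
```

A few percent of intervals miss entirely, just like interval 18 in the figure, purely from random sampling variation.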
First is the confidence level. As you go from 80 to 90 to 95 to 99 percent, your interval's gonna get bigger, and bigger, and bigger. Second is the standard deviation, or the inherent variation in the thing that you're looking at. Some things don't vary much, and their confidence intervals will always be narrow. Others have a huge amount of variation, so their intervals will be big. And then there's sample size; this is the most critical. A small sample usually ends up with a large confidence interval. And with a large enough sample size, any confidence interval can become arbitrarily small.
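The width relationships just described can be sketched numerically. The standard deviation of 10 and the sample sizes below are arbitrary illustration values, not from the course:

```python
import statistics

def ci_width(confidence, n, sd):
    """Width of a normal-approximation CI for the mean: 2 * z * sd / sqrt(n)."""
    z = statistics.NormalDist().inv_cdf((1 + confidence) / 2)
    return 2 * z * sd / n ** 0.5

SD = 10  # assumed population standard deviation

# Higher confidence level -> wider interval (at a fixed sample size of 100):
for conf in (0.80, 0.90, 0.95, 0.99):
    print(f"{conf:.0%} CI, n=100: width = {ci_width(conf, 100, SD):.2f}")

# Larger sample -> narrower interval; quadrupling n halves the width:
for n in (25, 100, 400):
    print(f"95% CI, n={n}: width = {ci_width(0.95, n, SD):.2f}")
```

Because width scales with 1 over the square root of n, you must quadruple the sample size to cut the interval's width in half.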
And so, sample size is a huge factor in getting precise confidence intervals. So what are our conclusions? Number one, confidence intervals focus on parameters and try to estimate them directly. Also, because they give a range of high and low values, the variation in the data is explicitly included. And that makes confidence intervals more informative than the results you get from standard hypothesis tests.