Join Barton Poulson for an in-depth discussion in this video, Bayes probability, part of Data Science Foundations: Fundamentals.

- [Voiceover] Bayes' theorem is an important tool that allows you to look at the other side of the coin when analyzing data. More specifically, it often helps you answer the right question. Most inferential tests give you the probability of the data, the observed effect, assuming a particular cause or hypothesis. But what most people want is the opposite of that. They want the probability of the hypothesis or the cause given the observed data, and those two can have very different answers. And so there's a real disconnect between the question people ask and the answer those tests provide.

Fortunately, Bayes' theorem does this. Now, Thomas Bayes was an 18th-century English minister and statistician. We actually don't know what he looked like, so we get to have our anonymous silhouette here. But Bayes' theorem uses prior probabilities and test information to get what are called posterior probabilities. That has to do with probabilities before and after you gather data. You start with the probability of the data given the hypothesis. That's called the likelihood of the data, or the sensitivity.

And that's what you get normally from a hypothesis test. To that you add the probability of the hypothesis or the cause. That's called the prior probability, and it's like the base rate: how common is this particular situation? To that you add the probability of the data. What's the likelihood of getting this particular kind of result? That's called the marginal. And when you combine them, that gives you the probability of the hypothesis or the cause given the observed data, and that's called the posterior, or after-the-fact, probability.

The actual way you combine them looks like this. The posterior is equal to the likelihood times the prior, divided by the marginal. In symbols, it's written P(H|D) = P(D|H) × P(H) / P(D). Now, it works a little more easily if I put it in terms of graphics, so let's take a look at this. Let's say this square represents an entire population of all people. Now, let's say that there is a disease, and this darker rectangle on the top represents the people with the disease. Now, we can make a test for the disease, and that test is able to identify 90 percent of the people who have the disease.

That means also, by the way, that 10 percent of the people get false negatives. And so the test catches 90 percent of the people who have the disease, and that's a good thing. And that raises a question. If a person tests positive for the disease, then what is the probability that they actually have the disease? And I'll give you a hint, it's not 90 percent. The problem is the 90 percent comes from the people who already have the disease. And we have to consider the fact that the test may have false positives.

So, we look at people without the disease, and there's gonna be a certain number here, with the light blue bar, that test positive even though they don't have it. So 90 percent of the people without the disease test negative, and 10 percent of the people without the disease test positive. Now, in order to figure out the probability of having the disease if you test positive, you need a couple of things. You need to know the number of people with the disease who test positive, and you divide that by all the people who test positive, including the false positives.

So, we're gonna take this number up here, the 29.7 percent, and this number here, the 6.7 percent, and we'll combine them: add, and then divide. We get 81.6 percent. That means if you test positive, there's an 81.6 percent chance that you actually have the disease. Note that's less than the 90 percent that was floating around in our head. It's not a huge amount lower, but you'll see what happens if the numbers change. What if, instead of a disease afflicting 33 percent of the population, which is really common, we take something less common?
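The arithmetic in this example is easy to check with a few lines of code. This sketch uses Python for illustration (the video's own demo is in R); the 33 percent prevalence, 90 percent sensitivity, and 10 percent false positive rate are the numbers from the example, and the variable names are my own:

```python
# Worked example: 33% prevalence, 90% sensitivity, 10% false positive rate.
prevalence = 0.33
sensitivity = 0.90
false_positive_rate = 0.10

# Fraction of the whole population that has the disease AND tests positive.
true_positives = prevalence * sensitivity                  # 29.7%
# Fraction that does NOT have the disease but still tests positive.
false_positives = (1 - prevalence) * false_positive_rate   # 6.7%

# Bayes: true positives divided by ALL positives.
posterior = true_positives / (true_positives + false_positives)
print(round(posterior * 100, 1))  # 81.6
```

Dividing the 29.7 by the total positives (29.7 + 6.7 = 36.4) reproduces the 81.6 percent from the video.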

Maybe only five percent. Now, even with the same sensitivity, the total percentage of people who have the disease and test positive is only 4.5 percent of the population. And the false positives are now 9.5 percent of the total population. So, to find out the probability that you have the disease if you test positive, we plug those numbers in and run them through. We get 32.1 percent. That means there's less than a one-third chance that you have the disease if you test positive.
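The same check works for the rarer disease. Only the prevalence changes (again a Python sketch with my own variable names; the 5 percent prevalence and the unchanged 90/10 test numbers come from the example):

```python
# Same test, rarer disease: 5% prevalence, 90% sensitivity, 10% false positives.
prevalence = 0.05
sensitivity = 0.90
false_positive_rate = 0.10

true_positives = prevalence * sensitivity                  # 4.5% of everyone
false_positives = (1 - prevalence) * false_positive_rate   # 9.5% of everyone

posterior = true_positives / (true_positives + false_positives)
print(round(posterior * 100, 1))  # 32.1
```

With the disease this rare, the false positives (9.5 percent) outnumber the true positives (4.5 percent) two to one, which is exactly why the posterior drops below one third.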

Now, I can demonstrate this a little more easily if I go to an example in R. First, let me do some individual calculations, similar to what I just did. We're gonna create a variable here for the probability of the disease. That's why I'm calling it pd. And we're gonna say it hits one percent of the population. Then we'll have the probability of a positive test given the disease, and I'll put that at .999, so it's a very sensitive test. And I'll put the probability of a positive test given no disease at 10 percent.

That's the false positive rate. And now we can get the probability of the disease, given a positive test result, by using Bayes' theorem. When we run that through, we get about nine percent. Even though the test has 99.9 percent sensitivity, that number doesn't account for the false positives or the base rate of the disease in the population. Those two combine to make it so that, even if you have a positive result, there's still a very low probability that you have the disease. I can show this to you graphically as well, with what I call a probability curve.
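The calculation narrated here can be written out directly. The video does it in R; this is an equivalent Python sketch. The name pd and the values (1 percent prior, .999 sensitivity, 10 percent false positive rate) are from the video; the other two names are my own guesses at the spirit of the R code, not its actual identifiers:

```python
pd  = 0.01    # probability of disease (the prior / base rate)
ppd = 0.999   # P(positive test | disease) -- the sensitivity
ppn = 0.10    # P(positive test | no disease) -- the false positive rate

# Bayes' theorem: P(disease | positive test)
posterior = (ppd * pd) / (ppd * pd + ppn * (1 - pd))
print(round(posterior, 3))  # 0.092, i.e., about nine percent
```

Note how the denominator, the marginal, is built from both branches: positives among the sick plus positives among the healthy.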

What I'm going to do here is make a graph that shows probabilities for the disease, from zero to 100 percent across the bottom. Let me come over here and make that bigger. And so we have the Prior Probability, how common the disease is, from zero on the left to one, or 100 percent, on the right. And then, given a positive test result, what's the probability that you actually have the disease? That's on the Y axis. That's the Posterior Probability. Now, let me put some reference lines in here to make it a little clearer.

Come back here. So, I'll run this line and this line. What we get is a vertical line that represents one percent of the population, let's say, with the disease. That's right near the left. The horizontal line is at about the nine percent risk of having the disease if you test positive, assuming only one percent of the population gets that particular disease. I can give a little more information on this by preparing an entire collection of graphs for different levels of sensitivity and different levels of false positives.
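The curve itself is just the posterior computed at every value of the prior. Rather than reproduce the plot, this Python sketch samples a few points along the x-axis, using the same .999 sensitivity and 10 percent false positive rate as the video's example (the function name and sample points are my own):

```python
def posterior(prior, sensitivity=0.999, false_positive_rate=0.10):
    """P(disease | positive test) via Bayes' theorem."""
    return (sensitivity * prior) / (
        sensitivity * prior + false_positive_rate * (1 - prior)
    )

# Sweep the prior (x-axis of the curve) and report the posterior (y-axis).
for prior in [0.001, 0.01, 0.10, 0.50, 0.99]:
    print(f"prior={prior:.3f} -> posterior={posterior(prior):.3f}")
# At prior=0.01 the posterior is about 0.092, matching the reference lines.
```

To actually draw the curve, you would sweep the prior over a fine grid and hand the pairs to any plotting library.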

I'm gonna create a matrix here, and not necessarily the best code in the world, but it will work for what we're doing. I'll start with this one plot. Similar to what we did. I'll just make that one first. And it's small, it's gonna go up in the corner. And then I'll come in and make a bunch of others. Then, I'll make that bigger, here. And so what you see is, we have tests with 99.9 percent sensitivity across the top, and that drops down to 95, then to 80 percent sensitivity.

And we have a false positive rate of one percent on the left, 10 percent in the middle, and 25 percent on the right. What you can see from this is, in the top left, that if you have a very sensitive test and a very low false positive rate, the curve shoots up quickly, so for all but the rarest diseases a positive result is almost guaranteed to be a true positive. But when you go down to the bottom right, where the test isn't very sensitive, so you have a bunch of false negatives, and you also have a lot of false positives, then if the disease isn't very common, a positive result can be almost meaningless, because of the effect of the base rates.
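The 3×3 grid can be reproduced by looping over those same sensitivities and false positive rates. Instead of drawing nine curves, this Python sketch evaluates each combination at a single fixed prior of 5 percent; that prior is my choice for illustration, not a number from the video:

```python
sensitivities = [0.999, 0.95, 0.80]        # rows of the grid, top to bottom
false_positive_rates = [0.01, 0.10, 0.25]  # columns, left to right
prior = 0.05  # assumed prevalence, chosen for illustration

for sens in sensitivities:
    row = []
    for fpr in false_positive_rates:
        post = (sens * prior) / (sens * prior + fpr * (1 - prior))
        row.append(f"{post:.2f}")
    print(row)
# Top-left combination gives about 0.84; bottom-right drops to about 0.14.
```

The corners make the video's point numerically: the same positive result means an 84 percent chance of disease under the best test, but only a 14 percent chance under the worst.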

And that's the contribution of Bayes' theorem to understanding the true meaning of our results: it turns the probabilities around, from the prior probability to the posterior probability. And so, here are our conclusions. First, you need information about prior probabilities in order to make any of this work. If you don't have that information, you can use a probability curve to get an entire range of likely values. But most importantly, Bayes' theorem and its calculations are more likely to give you the right answer to the question that you or your client wanted to answer in the first place.

###### Released

7/5/2016

*Introduction to Data Science* provides a comprehensive overview of modern data science: the practice of obtaining, exploring, modeling, and interpreting data. While most only think of the "big subject," big data, there are many more fields and concepts to explore. Here Barton Poulson explores disciplines such as programming, statistics, mathematics, machine learning, data analysis, visualization, and (yes) big data. He explains why data scientists are now in such demand, and the skills required to succeed in different jobs. He shows how to obtain data from legitimate open-source repositories via web APIs and page scraping, and introduces specific technologies (R, Python, and SQL) and techniques (support vector machines and random forests) for analysis. By the end of the course, you should better understand data science's role in making meaningful insights from the complex and large sets of data all around us.

- The demand for data science
- Roles and careers
- Ethical issues in data science
- Sourcing data
- Exploring data through graphs and statistics
- Programming with R, Python, and SQL
- Data science in math and statistics
- Data science and machine learning
- Communicating with data
