From the course: Learning Data Science: Understanding the Basics

Understand probability

- Probability is another area of statistics that lets you tell interesting stories. Probability is the likelihood that something will happen. It's a measurement of the possible outcomes. If you flip a coin, probability tells you how likely it is to come up on one side or the other. The statistics side of probability focuses on probability distributions. If you throw a fair six-sided die, there are six equally likely outcomes, so the probability of any given number coming up is one in six. Each time you throw the die, you have roughly a 17% chance of hitting a particular number. Probability can also describe a sequence of events. What if you want to show the likelihood of hitting that particular number twice in a row? Because the two throws don't affect each other, that's 17% of 17%, or roughly 3%. If you're playing a game of dice, that's a pretty low probability.
Your data science team will certainly want to work with probability. It's a key part of predictive analytics. It'll help you figure out the likelihood that your customer will do one thing or another. I once worked with a biotech company that was trying to predict the likelihood that someone would participate in a clinical trial. Getting people to participate in clinical trials is a tricky business. There are a certain number of clinics, and it costs a lot to keep them up and running, even if they're empty. If they don't fill up, then the company loses revenue. The company used data science to ask some interesting questions. What are some things that keep people from participating in clinical trials? It turns out that there are a number of things that might decrease the probability of someone participating. If people can't eat the night before, then they might be 30% less likely to participate. They also might be 20% less likely to participate if the trial involves blood tests and needles. The company had to balance the probability of people participating against the accuracy of the results.
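Stepping back to the dice for a moment, the arithmetic is easy to check for yourself. Here is a minimal Python sketch, assuming a fair die and throws that don't influence each other. The variable names are just illustrative.

```python
from fractions import Fraction

# Assumption: a fair six-sided die, so each face is equally likely.
p_single = Fraction(1, 6)

# Assumption: the two throws are independent, so the probabilities multiply.
p_twice = p_single * p_single

print(f"particular number on one throw: {float(p_single):.1%}")
print(f"same particular number twice in a row: {float(p_twice):.1%}")
```

The exact values are 1/6 (about 16.7%) and 1/36 (about 2.8%), which is where the rounded 17% and 3% come from.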
Let's say that they had a drug trial, and they could check for its effectiveness using either a saliva test or a blood test. The blood test was 10% more likely to be accurate. That seems easy: they should just use the blood test. But hold on. If they run the trial with the blood test, they'll have 20% fewer participants, which would decrease the number of data points for their study. They'd lose the people who decided against the study because they were afraid of needles. If they want 1,000 participants, that means about 200 fewer people. That brought up another interesting question. If the test has 200 fewer people, does that mean that they'll have less accurate results? The data science team created another probability distribution. What if the drug has a chance of causing some type of reaction? You'd have more data points with 1,000 people than with 800. The data science team had to take that into account. Was it better to have more people in the study without needles, even though the test was less accurate? This led to even more interesting questions. Should the team administer the saliva test several times to increase the probability of an accurate result?
In the end, that's what the data science team was helping the company to decide. Maybe it was best to have the greatest number of people participate in the trial to increase the likelihood of catching a reaction, then administer the less accurate test more often to increase the probability of having an accurate result. That way the company could have maximum participation and, at the same time, increase the likelihood that the study's results were accurate. All brought to you through the power of probability. There are a few things to keep in mind when you're working with probability. The first is that probability will lead you to some unexpected places. Who would've thought that a medical practice might get better results by administering a less accurate test? The second is that probability can also be a great vehicle for asking more interesting questions.
Don't be discouraged if your questions just lead to more questions. Remember that data science is applying the scientific method to your data. Sometimes this path will lead you to an unexpected place. The important thing is not to jump off when the path takes a sharp turn. That can happen when you're working with probability. Keep in mind that these sharp turns are often the path to your greatest insights.
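If you'd like to play with the clinical trial trade-off yourself, here is a rough Python sketch of the two questions the team was weighing: how much more likely you are to catch a reaction with 1,000 people than with 800, and how much repeating a less accurate test can help. Every number in it is made up for illustration. The reaction rate and the saliva test accuracy are assumptions, not figures from the study, and the sketch also assumes that participants and repeated tests are independent of one another, with the final result decided by a simple majority of the repeats.

```python
from math import comb


def chance_of_seeing_reaction(reaction_rate, participants):
    """Probability that at least one participant shows the reaction."""
    return 1 - (1 - reaction_rate) ** participants


def majority_accuracy(single_accuracy, repeats):
    """Probability that a majority of independent repeats is correct.

    Use an odd number of repeats so a tie is impossible.
    """
    needed = repeats // 2 + 1
    return sum(
        comb(repeats, k)
        * single_accuracy ** k
        * (1 - single_accuracy) ** (repeats - k)
        for k in range(needed, repeats + 1)
    )


# Hypothetical values, chosen only to make the comparison visible.
REACTION_RATE = 0.002    # assumed chance that one person has a reaction
SALIVA_ACCURACY = 0.80   # assumed accuracy of a single saliva test

for participants in (800, 1000):
    chance = chance_of_seeing_reaction(REACTION_RATE, participants)
    print(f"{participants} participants: {chance:.1%} chance of seeing a reaction")

for repeats in (1, 3, 5):
    accuracy = majority_accuracy(SALIVA_ACCURACY, repeats)
    print(f"{repeats} saliva test(s): {accuracy:.1%} chance the majority is right")
```

With these assumed numbers, the larger group is more likely to surface a reaction, and repeating the saliva test raises the chance of a correct overall result. Real tests often make correlated errors, so the gain from repeating them would likely be smaller than this sketch suggests.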
