- [Voiceover] A topic that can be easily overlooked in data science analyses, unfortunately, is that of validating models. The question here is: are you on target with your analysis? The problem is that many machine learning algorithms fail on implementation, and many scientific studies cannot be replicated. So there's the question of whether your analysis is actually giving you good insight into the general nature of the problem, and not just into the specific data at hand. To put it another way, your model fits the sample data beautifully.
It's tailored, it's great, but will it fit other data? This is the issue of generalizability, or scalability. Now, one way of looking at this is with posterior probabilities, where you take information about your present data and combine it with information about the past to get some impression of the future. Most analyses give you the probability of the data given the hypothesis. Fine, that's the basis of standard hypothesis testing, but there's more interest, and more utility, in the probability of the hypothesis given the data. To flip those two around, you have to use Bayes' theorem, which I've talked about previously.
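To make that flip concrete, here is a minimal sketch of Bayes' theorem in Python. The prior and likelihood values are made-up numbers chosen purely for illustration; they are not from the course.

```python
# Bayes' theorem: P(H | D) = P(D | H) * P(H) / P(D)
# All probabilities below are hypothetical, illustrative values.

p_h = 0.10              # prior: probability the hypothesis is true
p_d_given_h = 0.80      # likelihood: probability of the data if the hypothesis is true
p_d_given_not_h = 0.30  # probability of the data if the hypothesis is false

# Total probability of the data (law of total probability)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Posterior: probability of the hypothesis given the data
p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(hypothesis | data) = {p_h_given_d:.3f}")  # about 0.229
```

Notice how the posterior differs sharply from the likelihood of 0.80: the data given the hypothesis and the hypothesis given the data are very different quantities, which is exactly why the flip matters.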
The simplest method of getting validation is the replication of studies. That is, you want to see whether you can do the study again and get the same results. You can distinguish exact from conceptual replications: an exact replication repeats the study the same way all the way through, whereas a conceptual replication introduces variations. You can also combine the results of several studies, either with meta-analysis or with Bayesian methods.
Either one can be effective, because replication is considered the gold standard in many fields. Next is holdout validation, where you take your data, split it into two parts, build a model on one part, and then test that model on the other part. This is conceptually very simple, but it does require a large sample, one big enough that you can afford to set aside a portion of the data until you're ready to test. This method is frequently used in analytical competitions.
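Here is a minimal holdout-validation sketch, assuming Python with scikit-learn; the synthetic dataset, the linear model, and the 50/50 split are illustrative choices, not the course's.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset
X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=0)

# Split the data: build the model on one part, test it on the other
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# R-squared on the held-out half indicates how well the model generalizes
print(f"Holdout R^2: {model.score(X_test, y_test):.3f}")
```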
Next is cross-validation. This is where the same dataset is used for both training and testing, by cycling through different splits. One common approach is what's called leave-one-out, or LOO, which is easy and fast for linear models. You assess the reliability of the results by removing one observation at a time and fitting the model on all of the others; it's closely related to the jackknife method.
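Here is a minimal leave-one-out cross-validation sketch along the same lines, again assuming scikit-learn and synthetic data; the linear model and mean-squared-error scoring are just one reasonable setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset, since LOO fits the model once per observation
X, y = make_regression(n_samples=50, n_features=3, noise=5, random_state=0)

# Fit the model n times, each time leaving out one observation as the test case
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")

print(f"Mean squared error across {len(scores)} leave-one-out fits: {-scores.mean():.2f}")
```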
There's a variation on this called leave-p-out, where you leave out p observations at a time instead of just one. And there's k-fold, where you split the data into k groups, set one group aside to serve as the test set, build the model on the remaining groups, test it on the held-out group, and then rotate through so that every group gets used as the test set once.
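Here is a minimal k-fold cross-validation sketch that makes the "rotate through" idea explicit, again assuming scikit-learn and using synthetic data; five folds and a linear model are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=4, noise=8, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []

# Each group takes a turn as the test set while the model is built on the rest
for train_idx, test_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"R^2 per fold: {np.round(scores, 3)}")
print(f"Mean R^2 across folds: {np.mean(scores):.3f}")
```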
From this very brief discussion, we can reach a few conclusions. First, you want to make sure that your analysis counts, that it will still inform you once you move beyond your sample data. You want to check the validity of your conclusions, and you want to check the generalizability of your model. This is important because it can build confidence both in your analysis and in the model that you've created.