From the course: Data Science Foundations: Data Assessment for Predictive Modeling

Introducing the critical data understanding phase of CRISP-DM

- [Instructor] CRISP-DM is an acronym for the Cross-Industry Standard Process for Data Mining. This course is about the second phase of CRISP-DM, so we'll review a bit about CRISP-DM, but just enough to put our course goals into context. If you're already familiar with it, there will still be some important context for you.

So we can see that data understanding is the second phase in a six-phase process. This, of course, is the famous circular diagram. But it's critical to note that CRISP-DM is much more than just a circular diagram. In fact, if I zoom out a bit so that you can see the entire page, you'll note that this is page 10 of a 76-page document. And this is the beginning of part two, where the circular diagram makes its first appearance.

So what does the document say about data understanding? How does it describe it? Well, with the following key points. First, although this is the second phase, we're just starting to assemble our data. Then it continues to describe some of the tasks, which could easily be confused with just exploring: getting familiar with the data, identifying data quality problems, getting some of our initial insights, and even possibly detecting interesting subsets of our data in order to form hypotheses about hidden information.

So what is this mysterious hidden information? This is a critical phrase, because we must remember that data understanding comes just before data preparation. So we need to look carefully for information that is hidden not only from us, but also from our algorithms. We must be careful not to engage in magical thinking about our algorithms; they need our help. Perhaps the best one-sentence summary of the entire course is this: during data understanding, we uncover hidden patterns, which we then reveal through data preparation so that our modeling algorithms can detect them.

Let's briefly address another issue. No one really uses the phrase data mining anymore. So what did they mean at the time when they wrote it and chose it as part of their title? Well, these days we'd call it traditional machine learning. Specifically, we're more focused on supervised machine learning. Supervised is more common than unsupervised in general, but also, in CRISP-DM, we're focused on deployed models, and unsupervised models just aren't deployed in the same way as supervised models.

So what else did they mean when they chose the phrase data mining? Well, they didn't mean exploring as an end goal, and data mining didn't have the privacy implications that it has today. It merely meant finding useful patterns in data; there was no assumption about what kind of data it was or where it came from. That negative connotation came later, mostly from journalists adopting the phrase when discussing privacy concerns. And that's an important topic, but data mining was not associated with it when CRISP-DM was originally written. Also, the authors were focused on building predictive models, not merely discovering interesting things, and then using those models to score new cases, to make predictions. So the name may have changed, but those are still our goals today.
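
As a rough illustration of the one-sentence summary in the transcript (a pattern hidden from the algorithm, revealed through data preparation, then used to score new cases), here is a minimal Python sketch. It is not taken from the course or from the CRISP-DM document. The data, the debt-to-income ratio feature, and the helpers `fit_stump` and `score` are all assumptions invented for this example, and the one-threshold rule stands in for a real modeling algorithm.

```python
# Toy illustration only: every number below is invented, and fit_stump / score
# are hypothetical helpers, not part of CRISP-DM or of any library.

# Each row is (debt, income, defaulted). The label depends on debt / income,
# a pattern that neither raw column shows on its own.
ROWS = [
    (2, 20, 0), (8, 20, 1),
    (5, 50, 0), (20, 50, 1),
    (10, 100, 0), (40, 100, 1),
    (20, 200, 0), (80, 200, 1),
]


def fit_stump(values, labels):
    """Fit a one-threshold rule on a single feature.

    Returns (accuracy, threshold, positive_above) for the best rule found.
    """
    cuts = sorted(set(values))
    best = (0.0, None, True)
    # Candidate thresholds sit halfway between neighbouring distinct values.
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        hits = sum((v > t) == (y == 1) for v, y in zip(values, labels))
        # Try both directions: "positive above t" and "positive below t".
        for positive_above, correct in ((True, hits), (False, len(labels) - hits)):
            acc = correct / len(labels)
            if acc > best[0]:
                best = (acc, t, positive_above)
    return best


labels = [y for _, _, y in ROWS]
debt = [d for d, _, _ in ROWS]
income = [i for _, i, _ in ROWS]

# Data preparation step: derive the feature that exposes the hidden pattern.
ratio = [d / i for d, i, _ in ROWS]

acc_debt, _, _ = fit_stump(debt, labels)
acc_income, _, _ = fit_stump(income, labels)
acc_ratio, cut, positive_above = fit_stump(ratio, labels)


def score(new_debt, new_income):
    """Score a new case with the rule fitted on the prepared feature."""
    above = (new_debt / new_income) > cut
    return int(above == positive_above)


print("accuracy on raw debt:  ", acc_debt)
print("accuracy on raw income:", acc_income)
print("accuracy on ratio:     ", acc_ratio)
print("score for debt=30, income=60:", score(30, 60))
```

On this toy data, the simple rule cannot separate the classes using either raw column, but it can once the ratio is supplied. A more flexible algorithm might recover the ratio by itself, so the sketch only shows the principle: the analyst finds the pattern during data understanding and makes it visible during data preparation.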
