Join Keith McCormick for an in-depth discussion in this video Understand integration, part of The Essential Elements of Predictive Analytics and Data Mining.
- [Keith] Something that every project needs and benefits from is extensive data integration. Now, I know that we might think that, through data warehouses, tools that engage in automated data blending, and the like, this problem is somewhat resolved, but in my experience, it really isn't. In order to have a successful project, you want as many different sources of data throughout the organization as possible.
And although the effort to break down so-called data silos has been ongoing for many years, there's still more work to be done in that area. So on a typical real-world project, I'll be combining about six to 20 different sources of data. Something that folks are usually afraid to get into, not through lack of ambition but because they think they won't have enough time, is incorporating external data: weather data, unstructured data, all kinds of different things. This does add to the effort, but it almost always pays off, for one simple, basic reason: the harder it is to integrate this data, the better the project is going to be, because the data that is already easily integrated and automatically blended is already being examined.
It's working its way into business intelligence reports, it's being reviewed on a daily basis. So insights that involve sources of data that have already been successfully combined are baked into the cake, so to speak. It's really by breaking down these silos and combining data that's never been successfully combined before that you can get those surprising insights that can make or break a project. Let me give you a quick example. I was working on a cell phone churn project some years ago and I wasn't able to get access to dropped calls.
Might sound surprising at first. Well, I was working with the customer relations management team and, as the engineers explained to me, they said, "Keith, customers don't have dropped calls, towers do." So I had to involve the engineers to get at that data. Now, as it turns out, it wasn't a top-10 variable, but it was in the top 50 and it did get incorporated into the final model. If you do that with dozens of variables, it's going to pay off in the end.
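To make the dropped-calls example concrete, here is a minimal sketch of what that kind of integration might look like. Everything in it is an assumption for illustration: the field names (`customer_id`, `tower_id`, `dropped`), the sample rows, and the helper functions are hypothetical and do not come from the actual churn project. It assumes the engineering side can supply a per-call log recorded by tower, which is then rolled up to one value per customer and attached to the CRM table.

```python
from collections import defaultdict

# Hypothetical CRM extract: one row per customer.
crm_customers = [
    {"customer_id": "C1", "plan": "basic"},
    {"customer_id": "C2", "plan": "premium"},
    {"customer_id": "C3", "plan": "basic"},
]

# Hypothetical engineering extract: one row per call, logged by tower.
tower_call_log = [
    {"tower_id": "T1", "customer_id": "C1", "dropped": True},
    {"tower_id": "T1", "customer_id": "C1", "dropped": False},
    {"tower_id": "T2", "customer_id": "C1", "dropped": True},
    {"tower_id": "T2", "customer_id": "C2", "dropped": False},
]


def dropped_call_counts(call_log):
    """Roll tower-level call records up to a dropped-call count per customer."""
    counts = defaultdict(int)
    for call in call_log:
        if call["dropped"]:
            counts[call["customer_id"]] += 1
    return counts


def add_dropped_calls(customers, call_log):
    """Left-join the per-customer counts onto the CRM rows.

    Every customer is kept; customers with no dropped calls in the log get 0.
    """
    counts = dropped_call_counts(call_log)
    return [
        {**customer, "dropped_calls": counts.get(customer["customer_id"], 0)}
        for customer in customers
    ]


model_table = add_dropped_calls(crm_customers, tower_call_log)
for row in model_table:
    print(row)
```

The design choice worth noting is the left join: the customer table stays the unit of analysis, and the tower data only contributes a new candidate variable, which is how a field like this could end up as one of many inputs to a churn model.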