Explore the nine big data bottlenecks: store, assess, select, integrate, explore, prepare, model, score, and maintain.
- [Instructor] I'm going to identify nine steps in the building of supervised machine learning models where you want to pause and ask yourself, what volume of data am I processing at this step? Store. First, you have to store the data and steward it so that the organization can access it as needed. Typically, when organizations come up with a big data strategy, they focus almost entirely on data storage and assume that all other phases are done on the entirety of the data.
That assumption is simply not true. The volume of data will change dramatically from step to step in the process. Assess. The data scientist in charge of modeling has to have access to all of the data so that they can assess it. They can't be certain of exactly what they'll need until they take a good look. While the assessment is pretty basic, they might have to look at several years' worth of data. The most important thing they're trying to figure out is, how much data do I need to build this model? The volume of data tends to start going down after this step.
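The course itself is software agnostic, but here is one minimal sketch of that assessment in Python, assuming a hypothetical transaction table with a date column. The point is simply to see how many rows each year of history contributes before deciding how much to use.

```python
# A minimal sketch of the assess step, assuming a hypothetical
# transaction table with a 'date' column. The goal is a quick read
# on how much data exists per year before deciding how much to use.
import pandas as pd

transactions = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-07-15", "2022-01-09",
                            "2022-06-30", "2023-02-11", "2023-08-22"]),
    "customer_id": [1, 2, 1, 3, 2, 3],
    "amount": [25.0, 40.5, 12.9, 88.0, 33.3, 5.5],
})

# Rows and total activity per year of history.
profile = transactions.groupby(transactions["date"].dt.year).agg(
    rows=("amount", "size"),
    total_spend=("amount", "sum"),
)
print(profile)
print(f"approx. size: {transactions.memory_usage(deep=True).sum()} bytes")
```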
Select. The select step is closely tied to the business problem. For instance, you might be sending promotions out via email. Well, you might analyze only loyalty card customers if they are the only ones you have email addresses for. The volume of data will tend to decrease dramatically during this step. Integrate. Based upon the assessment, the basic structure of the modeling data set takes shape. What often makes this computationally intensive is that in order to aggregate the data, you might be searching many millions of rows to find the transactions that belong to the cases of interest.
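As one hedged illustration of select and integrate together, the Python sketch below assumes hypothetical customers and transactions tables: it filters to loyalty members with an email address, then collapses many transaction rows into one modeling row per customer.

```python
# A minimal sketch of the select and integrate steps, with hypothetical
# tables: select only loyalty customers who have an email address, then
# aggregate their transactions down to one modeling row per customer.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "loyalty_member": [True, True, False, True],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4, 4, 4],
    "amount": [25.0, 12.9, 40.5, 88.0, 5.5, 9.9, 14.0],
})

# Select: the business problem (email promotions) dictates the population.
selected = customers[customers["loyalty_member"] & customers["email"].notna()]

# Integrate: scan the (potentially huge) transaction table for the cases
# of interest and collapse it to one row per customer.
modeling_rows = (
    transactions.merge(selected[["customer_id"]], on="customer_id")
    .groupby("customer_id")
    .agg(n_purchases=("amount", "size"), total_spend=("amount", "sum"))
    .reset_index()
)
print(modeling_rows)
```

Notice how the row count shrinks at each stage: the aggregation is expensive to run, but its output is far smaller than its input.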
Explore. Once you've gotten the data much closer to the form that you'll need, you've got to explore it extensively. At this point, the modeler is looking into data quality issues and identifying strong and weak predictors. This process must be done by the modeler and can't be delegated to others, so typically some kind of data sandbox has been created.
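A minimal sketch of that exploration, again with hypothetical column names: check missing values as a data quality signal, and rank candidate predictors by their correlation with the target.

```python
# A minimal sketch of the explore step on a hypothetical modeling table:
# check missing values (a data quality issue) and rank candidate
# predictors by their correlation with the target.
import pandas as pd

df = pd.DataFrame({
    "n_purchases": [2, 5, 1, 7, 3, 6],
    "total_spend": [37.9, 120.0, 9.5, 210.0, 55.0, None],
    "responded":   [0, 1, 0, 1, 0, 1],  # target
})

print(df.isna().mean())  # share of missing values per column
print(df.corr(numeric_only=True)["responded"]  # crude predictor strength
      .drop("responded").abs().sort_values(ascending=False))
```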
Prepare. Data preparation tends to be very labor-intensive. It's not just cleaning and formatting. The most important task is what's called feature engineering, where the modeler experiments with many different formulas to help make the predictions. The modeler has to be intimately involved, and it's highly iterative.
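To make "experimenting with formulas" concrete, here is a hedged Python sketch with hypothetical columns. Each derived column is one experiment, and most will be discarded as the modeler iterates.

```python
# A minimal sketch of feature engineering, with hypothetical columns:
# the modeler tries many derived "formulas" and keeps what predicts well.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n_purchases": [2, 5, 1, 7],
    "total_spend": [37.9, 120.0, 9.5, 210.0],
    "days_since_last": [40, 3, 200, 1],
})

# Candidate features; each is one experiment, most will be discarded.
df["avg_basket"] = df["total_spend"] / df["n_purchases"]
df["log_spend"] = np.log1p(df["total_spend"])
df["is_recent"] = (df["days_since_last"] <= 30).astype(int)
print(df)
```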
Model. At this step, we use those fancy algorithms that are so closely associated with machine learning. One of the themes of this course is that folks worry too much about data volume during modeling. At this point, much of the hard work has been done, and the data sets have become smaller. This is rarely when data volume is the biggest issue. Scoring. Unlike modeling, which is done infrequently, scoring might be done in real time. It's got to be fast. Most teams pay too little attention to the speed requirement at this step, and also forget a seemingly obvious but powerful point: scoring is sometimes done on just a single record.
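The single-record point is easy to show. In this hedged sketch, a hypothetical logistic regression is fit once, infrequently, on a small prepared data set, while scoring handles one incoming record at a time and must be fast.

```python
# A minimal sketch of scoring a single record with a hypothetical
# logistic regression: the fit is infrequent, but predict_proba on
# one row must be fast enough for real-time use.
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({"avg_basket": [19.0, 24.0, 9.5, 30.0],
                  "is_recent": [0, 1, 0, 1]})
y = [0, 1, 0, 1]
model = LogisticRegression().fit(X, y)

# Real-time scoring: one incoming record, not a batch.
one_record = pd.DataFrame({"avg_basket": [22.5], "is_recent": [1]})
print(model.predict_proba(one_record)[0, 1])
```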
Maintain. Folks also underestimate the challenge of this step. If models are going to be routinely refreshed and rebuilt, then you will probably want to automate the entire process. A lot of steps that were done manually at first now have to be transformed into production steps.
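One way to picture that transformation, as a hedged sketch with hypothetical columns: once preparation and modeling are wrapped in a single pipeline, the whole thing can be refit on fresh data by a scheduler instead of by hand.

```python
# A minimal sketch of the maintain step: prep and modeling wrapped in
# one pipeline (hypothetical columns), so a scheduler can refit it
# routinely instead of a person repeating manual steps.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # prep, now automated
    ("model", LogisticRegression()),
])

def refresh(fresh_X: pd.DataFrame, fresh_y) -> Pipeline:
    """Rebuild the model end to end; a scheduler can call this routinely."""
    return pipeline.fit(fresh_X, fresh_y)

X = pd.DataFrame({"avg_basket": [19.0, None, 9.5, 30.0],
                  "is_recent": [0, 1, 0, 1]})
refresh(X, [0, 1, 0, 1])
print(pipeline.predict(X))
```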
Note: This course is software agnostic. The emphasis is on strategy and planning. Examples, calculations, and software results shown are for training purposes only.
- Evaluating the proper amount of data
- Assessing data quality and quantity
- Seasonality and time alignment
- Data preparation challenges
- Data modeling challenges
- Scoring machine-learning models
- Deploying models and adjusting data prep and scoring
- Monitoring and maintenance