- [Voiceover] When you're conducting a data science project, there's an entire sequence of events that has to happen. I refer to this sequence as the data science pipeline. Basically, there are four general categories of tasks. In part one, you're doing planning. In part two, you're doing data preparation. In part three, you're doing modeling, or the statistical analysis of the data. And in part four, you're doing follow-up work. We'll look at each of these very briefly. First, part one, which is planning.
There are four basic tasks involved here. The first one is simply to define the goals: what is it that you're trying to accomplish? That way you can focus your efforts, and you know when you're done. Second is organizing your resources. For instance, what data do you have available, and what machines? More importantly, what people do you have available, and how much time do they have? Third, once you've recruited a data science team, you need to coordinate the efforts of those people.
It's a social task, but it's critical to the success of the project. And finally, in terms of planning, there's the task of scheduling the project. Because data science projects are typically collaborative and done for a client, this is an important aspect that needs some thoughtful attention. Part two is data preparation. This is where you first get the data. It can come from a lot of different sources, and there can be a lot of creativity involved in this.
Next, in step six, you clean the data. That is, you make it so the data fits well into whatever program you're using, you check it for errors and anomalies, and you make sure that what you're working with is valid and reliable. In step seven, you explore the data: see what the distributions are like, and see what the associations look like. And in step eight, you refine the data. You choose the cases you're going to include, you choose the variables you're going to use, and you create the new features you want. That gives you the actual content that you're going to work with in the next section of the data science pipeline.
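To make the clean, explore, and refine steps concrete, here is a minimal sketch in Python using pandas. The file name "customers.csv" and the columns "age" and "spend" are illustrative assumptions, not data from the course.

```python
# A minimal sketch of the clean/explore/refine steps, assuming pandas
# and a hypothetical file "customers.csv" with made-up column names.
import pandas as pd
import numpy as np

# Clean: load the raw data, then check it for errors and anomalies.
df = pd.read_csv("customers.csv")
df = df.drop_duplicates()                   # remove duplicate cases
df = df.dropna(subset=["age", "spend"])     # drop rows missing key fields

# Explore: look at the distributions and the associations.
print(df[["age", "spend"]].describe())      # summary statistics
print(df[["age", "spend"]].corr())          # pairwise correlations

# Refine: choose cases, choose variables, and create new features.
adults = df[df["age"] >= 18]                # choose the cases to include
features = adults[["age", "spend"]].copy()  # choose the variables to use
features["log_spend"] = np.log1p(features["spend"])  # create a new feature
```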
In part three, you do the actual modeling, or the analysis of the data. Step nine is to create the model, or models; do several. Once you've created a model or several, in step 10 you need to validate the models. That is, you need to make sure that each model is accurate and that it's going to generalize well. In step 11, you evaluate the model: how accurate is it, and how much does it actually tell you about the question you're trying to answer? Step 12 is to refine the model.
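Here is a minimal sketch of the create, validate, and evaluate steps using scikit-learn. It assumes the hypothetical `features` table from the previous sketch, with a made-up "churned" column as the outcome; this is one illustrative workflow, not the course's prescribed method.

```python
# A minimal sketch of the create/validate/evaluate steps, assuming
# scikit-learn and the hypothetical `features` table from above.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X = features[["age", "spend", "log_spend"]]
y = features["churned"]   # assumed 0/1 outcome column

# Hold out some data so we can check that the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Create the model (in practice, try several; one is shown here).
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Validate: cross-validation estimates how well it will generalize.
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

# Evaluate: how accurate is it on data it has never seen?
print("Test accuracy:", model.score(X_test, y_test))
```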
Based on the evaluations, you may want to make some tweaks to the model to make it as easy to implement and as informative as possible. Part four is follow-up, and this is where we step back out of the technical realm. This involves presenting the model. You usually have a client, and you're going to have to present the results of your analysis to them in a way that makes sense to them, so they know what to do with it. Following that is deploying the model. If you're developing a predictive model that will be used, for instance, on an e-commerce website, you actually have to put it on the server and set it up so that new customer data comes in and you make predictions.
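As a minimal sketch of what deployment can look like, here is a hypothetical web endpoint built with Flask that serves predictions from the model trained earlier. The file name "churn_model.joblib", the route, and the field names are all illustrative assumptions.

```python
# A minimal sketch of deploying a model behind a web endpoint, assuming
# Flask and a trained model saved with joblib; all names are illustrative.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("churn_model.joblib")   # the model trained earlier

@app.route("/predict", methods=["POST"])
def predict():
    # New customer data comes in as JSON; we return a prediction.
    record = request.get_json()
    X = [[record["age"], record["spend"], record["log_spend"]]]
    return jsonify({"churn_prediction": int(model.predict(X)[0])})

if __name__ == "__main__":
    app.run()   # in production, run this behind a proper WSGI server
```

A client would then POST a JSON record of the new customer's fields to /predict and get a prediction back, which is the "new customer data comes in and you make predictions" loop described above.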
After that, you often have to revisit the model, because scaling is a tricky issue. You develop it on one data set, but now you're implementing it in another, and you often have to make changes so it works well in that setting. And finally, although it may seem trivial, it's important to archive all of the assets that you've used. This includes the data sets at every step, from raw data to cleaned data to final analysis, the code that you used, the presentations, and the notes. That way you can find out what you did before, your client understands it, and if anybody needs to go back and verify the analysis, it becomes possible.
So what are our conclusions from this very brief picture of the data science pipeline? First, data science isn't just technical. The central parts, parts two and three, were technical, but everything before and after was not; it was larger. You can call these contextual skills, and they are critical to the success of the project. Finally, data science fosters diversity. It requires so many elements and so many different perspectives to work well that it's critical to have people with each of these different backgrounds who can look at the data in each of those ways.