Data isn't always ready to use. Sometimes you need to take steps to prep it for ingestion. In this video, learn why it's necessary to explore your data and clean it.
- [Instructor] In this chapter, we're going to start digging into our data to explore it and clean it. But first, we need to answer the question of why. We'll start by talking about exploratory data analysis, and then we'll discuss data cleaning. For both of these sections, I'm going to split the discussion into two portions, the why and the what. That will set the stage to jump into the data in the next few lessons and see some of these concepts in action.

Now, there are three reasons that we do exploratory data analysis, or EDA for short: to understand the shape of the data, to learn which features might be useful, and to inform the cleaning that will come next. You can imagine this stage as laying the foundation for the house that you're going to build on top of it. Without a firm handle on what this data looks like, your foundation is going to be shaky, and the house, or the model, built on top of it will likely end up being suboptimal.

Okay, so what do we actually do during this phase? We could easily do an entire course just on this. There are so many different paths to go down depending on your data. So, to summarize in just a few quick bullet points, this step includes getting counts or distributions of all your variables to understand the shape, and you do this for both the input features and the target variable. Then you look at the data type of each feature: is it an integer, a string, maybe a boolean? Then you check for missing data, look at the correlations between your features, and maybe identify duplicates in the data.

I did just want to mention that you usually head into EDA with a few key questions that you want to answer, but you often allow the data to take you where you need to go, within some constraints. It's important to have structure but be flexible enough to dig into areas that you hadn't planned on looking at before you actually got your hands on the data. Okay, so moving on to data cleaning.
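Before getting into cleaning, here is a minimal sketch of the EDA checks just listed: counts, data types, missing values, correlations, and duplicates. It is an illustration only. The toy rows, the column names, and the helper functions are all hypothetical, and it uses only the Python standard library so that it runs anywhere. In practice, a data analysis library would typically do this work.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy dataset (not from the course); None marks a missing value.
rows = [
    {"age": 22, "income": 30000, "city": "A", "target": 0},
    {"age": 38, "income": 72000, "city": "B", "target": 1},
    {"age": None, "income": 31000, "city": "A", "target": 0},
    {"age": 35, "income": 65000, "city": "B", "target": 1},
    {"age": 35, "income": 65000, "city": "B", "target": 1},  # exact duplicate
]


def value_counts(rows, column):
    """Count how often each value (including None) appears in a column."""
    return Counter(row[column] for row in rows)


def column_types(rows, column):
    """Collect the Python type names seen in a column, ignoring missing values."""
    return {type(row[column]).__name__ for row in rows if row[column] is not None}


def missing_counts(rows):
    """Count missing (None) entries per column."""
    return {col: sum(row[col] is None for row in rows) for col in rows[0]}


def pearson(rows, col_x, col_y):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(r[col_x], r[col_y]) for r in rows
             if r[col_x] is not None and r[col_y] is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


def duplicate_count(rows):
    """Count rows that exactly repeat another row."""
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(count - 1 for count in seen.values())


print("target counts:", dict(value_counts(rows, "target")))
print("types:", {col: column_types(rows, col) for col in rows[0]})
print("missing:", missing_counts(rows))
print("age/income correlation:", round(pearson(rows, "age", "income"), 3))
print("duplicates:", duplicate_count(rows))
```

Each helper maps to one of the bullet points, so the output gives a first look at the shape of both the input features and the target before any cleaning decisions are made.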
Why do we need to do data cleaning? The overarching point here is that machine learning models are not magic. If they were magic, then this course probably wouldn't exist. Machine learning models are algorithms that respond in a systematic way to the data that you give them. If you give a model biased data, you'll get a biased model. If you give it incomplete data, it will return weak predictions. I cannot overstate how important data cleaning is in producing a quality model. It prepares the data in the best way possible to allow the model to pick up on the underlying patterns that we want it to fit to.

Under that overarching theme, you have shaping the data, removing irrelevant rows or columns, and adjusting features so that the model can make use of them. We'll see this in action in the rest of this chapter.

Data cleaning is largely driven by your exploratory data analysis, but there are a few things to consider or look for when you're doing this. The first is anonymizing your data. There are heavy regulations on data privacy, so generally you should be removing any personal identifiers from your data if it isn't publicly available data. The next is encoding categorical variables. We'll see a little bit more of this in future lessons in this chapter. Then you might be filling in missing data. And lastly, if you have skewed features or outliers, then you want to prune or scale them.
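The four cleaning steps mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method used later in the course. The toy rows, the column names, the helper functions, and the income cap of 100000 are all hypothetical. It also assumes the median is a reasonable fill value and that integer codes are an acceptable encoding. Integer codes imply an ordering between categories, so one-hot encoding is a common alternative.

```python
from statistics import median

# Hypothetical raw rows (not from the course); None marks a missing value.
raw = [
    {"name": "Ann", "age": 22, "income": 30000, "city": "A"},
    {"name": "Bob", "age": None, "income": 31000, "city": "B"},
    {"name": "Cy", "age": 38, "income": 72000, "city": "A"},
    {"name": "Di", "age": 35, "income": 900000, "city": "B"},  # income outlier
]


def drop_columns(rows, columns):
    """Remove columns such as personal identifiers."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


def fill_missing_with_median(rows, column):
    """Replace None in a numeric column with the median of the known values."""
    fill = median(row[column] for row in rows if row[column] is not None)
    return [{**row, column: fill if row[column] is None else row[column]}
            for row in rows]


def encode_categorical(rows, column):
    """Map each category to an integer code, assigned in sorted order."""
    codes = {value: i
             for i, value in enumerate(sorted({row[column] for row in rows}))}
    return [{**row, column: codes[row[column]]} for row in rows], codes


def cap_values(rows, column, upper):
    """Limit extreme values by capping a column at an upper bound."""
    return [{**row, column: min(row[column], upper)} for row in rows]


cleaned = drop_columns(raw, {"name"})               # anonymize
cleaned = fill_missing_with_median(cleaned, "age")  # fill in missing data
cleaned, city_codes = encode_categorical(cleaned, "city")  # encode categories
cleaned = cap_values(cleaned, "income", 100000)     # assumed cap for outliers

for row in cleaned:
    print(row)
```

Which columns to drop, how to fill gaps, and where to cap values are exactly the decisions that the exploratory analysis is meant to inform.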