A common mistake when working with data is not cleaning it, which can skew your results. In this video, learn how to identify when your data has values that should not be there and proactively clean your data.
- [Instructor] A common mistake that can occur when working with data is not cleaning your data. This can pose problems. For example, say I have information about students' grades on a particular exam. There were no opportunities to earn extra credit on this exam, and the grades are stored as percentages in a NumPy array, which I've displayed here. As you can see, there are values here that are over 100. This looks weird. Students were not given opportunities to earn extra credit on this exam, as mentioned earlier, but the data indicates that some students' exam grades were reported or entered as higher than 100%. After making this observation, I should not immediately move on to using this data as is to build a model, as the results may be skewed by the weird values in the data. Instead, I should clean the data. Since the values that are above 100 directly conflict with the premise of the exam, I will remove these values from the data. In other words, I will select just the values that are between zero and 100 inclusive, and reassign the variable grades to that selection. Now, I'll display grades to see the updated data. There we go. Make sure to proactively clean your data when you notice that it has values that should not be there.
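The on-screen code is not part of this transcript, so here is a minimal sketch of the selection step described above. It assumes a one-dimensional NumPy array named grades, as in the narration. The specific grade values and the helper name invalid are hypothetical, made up for illustration.

```python
import numpy as np

# Hypothetical exam grades as percentages (illustrative values only,
# not the data shown in the video).
grades = np.array([88.0, 92.5, 104.0, 67.0, 100.0, 110.5, 0.0, 75.0])

# Flag values that conflict with the premise of the exam:
# no extra credit was offered, so nothing should fall outside 0 to 100.
invalid = (grades < 0) | (grades > 100)
print("Out-of-range values:", grades[invalid])

# Select only the values between 0 and 100 inclusive,
# and reassign the variable grades to that selection.
grades = grades[(grades >= 0) & (grades <= 100)]

# Display the updated data.
print(grades)
```

Printing the out-of-range values before dropping them is an optional extra step. It makes it easy to check how many entries are affected and whether they look like data-entry errors before they are removed.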