Understand common data quality problems.
- [Narrator] Now let's take a few moments and consider why we need to check our data before we start doing data analysis. In the ideal world, the data sets we work with are complete and accurate. In the real world, this is rarely the case. The more complicated the data set, the more chances there are for problems to arise. Now, some common problems that occur are missing values, values that are outside the expected range, malformed or poorly formatted values, values that violate business rules, and inconsistent values across tables, such as mismatched or missing customer IDs.
Missing values can occur for a couple of reasons. The source system from which we derive our data set may not require values, in which case there's possibly no way to find the missing data. In other cases, errors can occur when extracting data from the source system. If that's the case, we could find the problematic code and then rerun the extraction program. Values outside the expected range can include things like ages that are less than zero or greater than 120, month numbers that are less than one or greater than 12, dates on transactions that are earlier than the time the business started. Or, if we have codes, like postal codes or zip codes that are outside of the range of defined values.
Malformed data is another problem. Malformed values can occur when working with strings, and some examples include names or descriptions that are truncated, dates that are formatted with invalid characters, and free form text values that have no space between words. Business rules are rules that specify conditions that must be satisfied for the data to make sense. Examples of violated business rules include an order total that is greater than the customer's credit limit.
The date on an order indicates that it is being shipped earlier than the date that it was actually ordered. Or a case where a product ID in an order refers to a product that's no longer available. Another category is inconsistent values. And inconsistent values occur when the relational integrity is not enforced in database. And this can happen for a number of reasons. And the things we may find is that an order may have an ID, order items are associated with an order by using order ID, but in some cases the order item has an order ID that does not exist in the order table.
So that's an example of an inconsistent value. For these reasons it's important to perform data quality checks early in the exploratory data analysis process.
- Exploratory data analysis vs. hypothesis-driven statistical analysis
- Performing data quality checks
- Calculating quartiles
- Using box plot to understand the distribution of values
- Using histograms to understand the frequency of values
- Using chi square to understand the correlation between values