Join Mike Chapple for an in-depth discussion in this video, Types of missing data, part of Cleaning Bad Data in R.
- [Narrator] One of the most common problems that you'll face when performing data analysis is that there may be data missing from your data set. This can be a very troubling problem and requires careful thought. There are three different types of missing data, and the way that you handle each varies. They are data that is missing completely at random, MCAR, data that is missing at random, MAR, and data that is missing not at random, MNAR.
Let's dive into each one of these situations. The first is data that is missing completely at random, MCAR. This situation arises when circumstances cause some data to be missing from your data set, but there is no relationship between the missing data and circumstances that may play a role in your analysis. For example, imagine that you have a spreadsheet of results from a data collection effort and the rows in that spreadsheet are sorted according to a randomly generated identifier.
If something happens to the spreadsheet and you lose the last 100 rows, that's a situation where the data is missing completely at random. The order of the rows was completely dependent upon a random variable, and the only thing that the rows had in common was that their random values were the 100 highest in the data set. Now that might sound like a pretty contrived example, and it is. The reason is that there are very few real-world situations where your data is truly missing completely at random.
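The lost-rows example can be sketched in R. This is only an illustration under assumed values: the data frame `results`, its `id` and `score` columns, the row count, and the seed are all made up for the sketch and are not from the course.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical results: 1,000 rows, each with a randomly assigned identifier
results <- data.frame(
  id    = sample(1000),                    # random permutation of 1..1000
  score = rnorm(1000, mean = 50, sd = 10)  # made-up measurement
)

# Sort by the random identifier, then lose the last 100 rows
results   <- results[order(results$id), ]
remaining <- head(results, -100)

# The loss is unrelated to score, so the two means should be close
mean(results$score)
mean(remaining$score)
```

Because the identifier has nothing to do with `score`, dropping the rows with the 100 highest identifiers leaves a smaller but still representative sample.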
It's much more likely that your data will fit into one of the other two categories. Situations where data is missing completely at random can normally be ignored because they won't bias your analysis, although you will be working with fewer observations. The second category of missing data is data that's missing at random, MAR. This situation occurs when there are some underlying circumstances that explain the way that data is missing, but those circumstances are explained by other variables in the data set.
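As a rough illustration of data that is missing at random, consider a hypothetical survey in which younger respondents skip the income question more often, while age group is recorded for everyone. The column names, probabilities, and distributions below are assumptions chosen only for the sketch.

```r
set.seed(1)  # arbitrary seed for reproducibility

n <- 2000
survey <- data.frame(
  age_group = sample(c("under_30", "30_plus"), n, replace = TRUE)
)

# Made-up incomes (in thousands) that differ by age group
survey$income <- rnorm(
  n,
  mean = ifelse(survey$age_group == "under_30", 40, 60),
  sd   = 8
)

# Missingness depends only on age_group, which is fully observed,
# and not on the income value itself
p_missing <- ifelse(survey$age_group == "under_30", 0.4, 0.1)
survey$income[runif(n) < p_missing] <- NA

# Share of missing income values within each age group
missing_rate <- tapply(is.na(survey$income), survey$age_group, mean)
missing_rate
```

The gap in `missing_rate` between the two groups is the pattern that another variable in the data set explains, which is what separates this case from data that is missing completely at random.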
As with data that is missing completely at random, we can often ignore situations where data is missing at random and still draw valid conclusions from the data set, provided the analysis takes those other variables into account. The most serious situation occurs when data is missing not at random, MNAR. In these cases, we're once again missing some information, but the value of the missing variable is related to the reason that the variable is missing. Imagine, for example, that we're measuring the blood pressure of individuals and we have a meter that is only able to measure blood pressure values up to 180.
If there are individuals in our population with blood pressures over 180, those values will be missing because they were too high to read. Situations where data is missing not at random are very serious because the absence of the missing data will bias your conclusions. So we've talked about understanding the impact of missing data on your analysis based upon the reason that the data is missing. How do we handle situations where we have missing data? We have several options at our disposal, and our response will vary based upon the particular circumstances of our data set and will require some subject matter expertise.
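The blood pressure meter example can be simulated to show why data that is missing not at random is so damaging. This is a minimal sketch: the distribution of the readings, the sample size, and the variable names are assumptions, and only the 180 limit comes from the example above.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical true readings; the distribution is made up
true_bp <- rnorm(500, mean = 150, sd = 25)

# The meter cannot read values above 180, so those come back as NA
measured_bp <- ifelse(true_bp > 180, NA, true_bp)

sum(is.na(measured_bp))          # number of readings lost
mean(true_bp)                    # the mean we would like to estimate
mean(measured_bp, na.rm = TRUE)  # mean of what the meter captured
```

Since only the highest values are lost, the mean of the captured readings comes out lower than the true mean, and simply dropping the `NA` values does not correct for that.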
We may be able to figure out the missing values based upon analysis of other available data. We might be able to interpolate missing values based upon other observations. Or we might be able to simply ignore the missing data.
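Two of these options, ignoring the missing values and interpolating them, can be sketched in base R. The `temps` vector is a made-up series of evenly spaced readings, and linear interpolation with `approx()` is only one of several reasonable choices, not necessarily the approach used in the course.

```r
# Hypothetical evenly spaced readings with two gaps
temps <- c(20.1, 20.5, NA, 21.3, 21.8, NA, 22.6)

# Locate the gaps
which(is.na(temps))

# Option: ignore the missing values when summarizing
mean(temps, na.rm = TRUE)

# Option: interpolate linearly between neighboring observations
idx      <- seq_along(temps)
observed <- !is.na(temps)
filled   <- approx(x = idx[observed], y = temps[observed], xout = idx)$y
filled
```

Whether either option is appropriate depends on why the values are missing. Interpolation like this assumes neighboring observations are informative about the gap, which is a subject matter judgment.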
Where possible, instructor Mike Chapple shows how to correct the issues using R, but the same principles can be applied to any statistical programming language.
- Missing data
- Duplicate rows and values
- Converting data
- Formatting data
- Working with tidy data
- Tidying data sets
- Dealing with suspicious data