Join Mike Chapple for an in-depth discussion in this video, "What you need to know," part of Cleaning Bad Data in R.
- [Instructor] I've designed this course as an introduction to the concepts of cleaning bad data. Many of the concepts discussed in this course were collected by the analytics community in a document called the Quartz Guide to Bad Data. It's definitely worth a read if you'd like to dig into data quality issues in greater detail. You won't need any background in those fields to complete this course. This is, however, an intermediate-level course, and I do assume that you already have a basic knowledge of data analytics. I'll be showing you examples of cleaning data using the R programming language, the RStudio integrated development environment, or IDE, and the Tidyverse libraries.
If you're not familiar with these tools, you have two choices. First, you can simply move ahead in the course and you'll still learn quite a bit. I've designed the course to cover the concepts of data cleaning, and you should be able to follow along with my examples even if you normally use another programming language. Second, you can take the time to develop your data wrangling skills in R first. My course, Data Wrangling in R, available on this site, provides such an introduction.
Where possible, instructor Mike Chapple shows how to correct the issues using R, but the same principles can be applied to any statistical programming language.
- Missing data
- Duplicate rows and values
- Converting data
- Formatting data
- Working with tidy data
- Tidying data sets
- Dealing with suspicious data