Tidy data is structured in a manner that makes data analysis easy. In this video, Mike Chapple explains the basic concepts of tidy data and the tidyverse in R.
- [Instructor] The goal of this course is to help you use R to transform your data sets into a consistent format known as tidy data. You do this through a process known as data wrangling. Data wrangling is the art of taking messy data and manipulating it into a format that is well-suited for analysis. It goes by many other names. Some people call this work data cleaning, data munging, or data preparation. Whatever name you choose to use, it's important to remember that this is not a one-time task.
While it's true that most data projects will involve a lot of data wrangling up front, data wrangling is a continuous process. And as you encounter new data sets, new problems, and new ideas during the course of your project, you'll likely return to perform some new data wrangling. The term "tidy data" describes data that has been put into a standardized format that facilitates future analytic work. Hadley Wickham, a data scientist who is a key developer of many widely used R packages, coined the term "tidy data" in his paper "Tidy Data," which he published in the Journal of Statistical Software in 2014.
Throughout this course, I'll refer back to the principles that Wickham outlined in this paper, as it is considered one of the most important works in the field of data wrangling. I encourage you to go back and read this paper yourself after you complete this course. You'll find that it is full of examples that help illustrate the concepts of tidy data. One quick word of warning: the tidyverse is rapidly evolving. Some of the material that I cover in this course is more recent than that covered in Wickham's paper.
Converting data from its original format into tidy data is difficult, time-consuming work. Why would we want to spend the time and effort required to create tidy data? Well, there are three main reasons. First, tidy data facilitates initial data exploration and analysis work. If our data is in a standardized format, it's much easier to notice trends, anomalies, and other important features of our data sets. Second, tidy data improves our ability to collaborate with others. If our data is in a standardized format, we can easily share it with other people, who will then be able to quickly begin analyzing it without having to go through their own data-wrangling work first.
And finally, if we convert our data to a tidy format, we can take advantage of many R packages that accept tidy data as input without performing additional transformations. This sounds great, right? The trick is that, while tidy data has a consistent format, you'll need to figure out how to convert your existing data into that format. Wickham summed it up best in his paper by quoting Tolstoy, who once said, "Happy families are all alike; every unhappy family is unhappy in its own way." Wickham drew the parallel to tidy data by tweaking this to say, "Tidy data are all alike; every messy data set is messy in its own way." Your job in wrangling data is to develop an understanding of your unique data sets, to figure out how they're messy in their own ways.
You can then use data manipulation tools in R to properly structure your data as tidy data. Once you've done that, a whole world of data analysis tools becomes available to you. Tidy data unlocks a set of tools known as the "tidyverse." The tidyverse consists of a set of R packages that work together to transform, analyze, and visualize tidy data. The tools of the tidyverse can easily share data with each other, and allow you to quickly take advantage of the power of R for your analysis.
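To make the idea of tidy data concrete, here is a minimal sketch in R. The `messy` table, its city names, and its sales figures are invented purely for illustration, and the choice of `pivot_longer()` from the tidyr package is an assumption about tooling rather than something taken from this video. The sketch follows the layout described in Wickham's paper, where each variable is a column and each observation is a row.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical messy data: one row per city, with a separate column per year.
# The year is really a variable, but here it is hidden in the column names.
messy <- tibble(
  city   = c("Austin", "Boston"),
  `2019` = c(10, 20),
  `2020` = c(12, 21)
)

# Reshape to a tidy layout: one row per city-year observation,
# with year and sales each stored in a column of their own.
tidy <- messy %>%
  pivot_longer(
    cols      = c(`2019`, `2020`),
    names_to  = "year",
    values_to = "sales"
  )

# Once the data is tidy, other tidyverse verbs can consume it directly.
totals <- tidy %>%
  group_by(year) %>%
  summarise(total = sum(sales))

print(tidy)
print(totals)
```

After the reshape, `tidy` has four rows, one for each city and year, and the grouped summary works without any further restructuring. That is the kind of hand-off between packages that the tidyverse is meant to make easy.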
- What's tidy data?
- Using the tidyverse
- Working with tibbles
- Subsetting and filtering tibbles
- Importing data into R
- Making wide datasets long with gather()
- Making long datasets wide with spread()
- Converting data types in R
- Detecting outliers
- Manipulating strings in R with stringr