Tibbles are the core data structure of the tidyverse. All of the tidyverse packages are optimized to work with tibbles to facilitate the smooth exchange of information. Tibbles also offer analyst-friendly features that are easier to use than base R data f
- [Narrator] Tibbles are the core data structure of the tidyverse, and are used to facilitate the display and analysis of information in tidy format. If you've used R in the past, you're already familiar with the concept of a data frame. Data frames are the most common data structure used to store complex data sets in base R. They allow analysts to combine variables of differing data types in the same data structure for ease of analysis. Tibbles have a few advantages over data frames that make them much easier to use.
First, all tidyverse packages are designed to use tibbles for both input and output where applicable. This standardized data format makes it easy for our developers to string together functions from different tidyverse packages without having to perform intermediate data conversions. Second, tibbles include functionality that makes it much easier to print and display output than basic data frames. I'll show you that in more detail in a moment.
Third, tibbles don't make the assumptions that you commonly run into when creating data frames. For example, if you've created data frames before, you probably know that data frames often convert character strings to factors (this was the default behavior before R 4.0), and analysts often have to override this setting. Tibbles don't try to make this conversion automatically. Tibbles also don't change your variable names or create row names, as data frames tend to do.
I'd like to introduce you to three different ways that you can create tibbles. First, you can use the as_tibble function to convert an existing data frame into a tibble.
If you already have data in R, and are just now moving to the tidyverse, you'll probably create tibbles this way. Second, you can use the tibble function to build your own tibble from scratch. And third, you can use the tidyverse's data import packages to create tibbles from external data sources, such as CSV files or databases. I'll cover this in the next section of this course. Now let's take a look at how we can build tibbles using the first two approaches in R.
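Before moving to the console, the point about tibbles not making assumptions can be checked with a small comparison. This is only a sketch: the column name `eye color` and its values are made up for illustration, and `stringsAsFactors = TRUE` is set explicitly to reproduce the older data frame default described above.

```r
library(tibble)

# A base R data frame: the space in the column name is replaced with a dot,
# and the strings become a factor because stringsAsFactors = TRUE is requested
# (this was the default before R 4.0).
df <- data.frame(`eye color` = c("blue", "brown"), stringsAsFactors = TRUE)

# The same data as a tibble: the name and the character type are left alone.
tb <- tibble(`eye color` = c("blue", "brown"))

names(df)        # "eye.color"
names(tb)        # "eye color"
class(df[[1]])   # "factor"
class(tb[[1]])   # "character"
```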
I'm now at a blank console in RStudio. And for now, I'm just going to work with some of the data sets that are built into R as examples. In the next section of this course, I'll show you how you can pull your own data sets into R and work with them in tibbles. R contains a built-in data set called CO2 that contains some data about carbon dioxide uptake in some grass plants. You can take a look at this data, stored in a data frame, by simply typing CO2 at the console and pressing Enter.
Now, you might not be familiar with this data, and the contents aren't that relevant. If I scroll up, you can see that we have several variables here: plant, type, treatment, concentration, and uptake. Looking at this, I can tell that some of these are alphabetic characters and others are numbers, but I don't know what data types these elements are stored in yet. Also, I had to scroll up, and it's kind of difficult to work with this information as it's presented right now.
Let me try converting this information into a tibble. I'm going to create a tibble called CO2_tibble. So let's go ahead and type CO2_tibble, that's the name of my new tibble, and assign to it the output of the as_tibble function, using the CO2 data frame that we just looked at as our input. When I go ahead and run that, you can see I get no output, but now if I go and type CO2_tibble, and look at my result, I have a tibble in front of me.
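The conversion just described can be written out as follows. This is a minimal sketch that assumes the tibble package (or the full tidyverse) is installed; loading it with `library(tibble)` is an addition here, since the narration does not show that step.

```r
library(tibble)

# Convert the built-in CO2 data frame into a tibble.
# The assignment itself prints nothing.
CO2_tibble <- as_tibble(CO2)

# Typing the name prints the tibble: a header with its dimensions,
# the column types, and only the first ten rows.
CO2_tibble
```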
Now I'd like to point out a few things here. First, the first row of this output says "A tibble". It's telling me that my data is not a normal data frame, and that this is a tibble. Right next to that, I can see the size of my tibble: 84 x 5. That means I have 84 observations of five different variables. Now notice that R doesn't print the entire tibble. I didn't have to scroll back up to see this header information. Instead, it's only showing me the first ten rows of the tibble, and then on the last line it's saying "with 74 more rows."
Usually, just showing the first 10 rows of the tibble is enough to give me an idea of what the tibble contains. Finally, the third row of the output tells me the data type of each one of these variables. I can see that plant is an ordered factor, type is a factor, treatment is a factor, and concentration and uptake are both doubles. That's very convenient to see there, and we didn't see that when we printed the data frame by itself. Let's say that I want to print more than the default 10 rows. I can use the print function to customize how the tibble appears on the console.
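The column types shown in the tibble header can also be checked programmatically. The sketch below is one way to do that with base R's `sapply`, which is not part of the narration. Note that in the data set itself the columns are named `Plant`, `Type`, `Treatment`, `conc`, and `uptake`.

```r
library(tibble)

CO2_tibble <- as_tibble(CO2)

# Column names exactly as they are stored in the data set.
names(CO2_tibble)

# First class of each column: an ordered factor, two factors,
# and two numeric (double) columns.
col_classes <- sapply(CO2_tibble, function(col) class(col)[1])
col_classes
```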
If I just go ahead and type print, and then supply it CO2_tibble as input, you'll see that I get the same output as the default: just the first 10 rows of the tibble. However, if I type print CO2_tibble, and then specify that I'd like to set the argument n to 20, this time I get the first 20 rows of the tibble. If I want to see the entire tibble, I can use the same style of command, but instead of specifying n equals 20, I can say n equals Inf for infinite.
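Written out, the three print calls look like this. The sketch assumes the tibble package is loaded and recreates `CO2_tibble` so that it runs on its own.

```r
library(tibble)

CO2_tibble <- as_tibble(CO2)

# Default printing: the first 10 rows.
print(CO2_tibble)

# The first 20 rows.
print(CO2_tibble, n = 20)

# All rows; Inf must be capitalized in R.
print(CO2_tibble, n = Inf)
```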
Show me an infinite number of rows from the tibble. Now of course there are only 84 rows, but this prints the entire tibble.
I can also build my own tibble from vectors. I'm going to start by just creating a few vectors that I can work with. First, let me build a vector called name. And I'm going to use the c function to just put some values into this name vector. I'm going to put in Mike, Renee, Matt, Chris, and Ricky.
And now I'm going to build a vector called birth year, that contains some years of birth for these individuals. 2000, 2001, 2002, 2003, and 2004. And then I'll create a vector called eye color, just to give me some interesting information to work with. And I'll fill that with strings for the color of these people's eyes, we'll say blue, brown, hazel, brown, and blue.
I now have three vectors containing some information about these people, but they aren't related to each other yet. I can use the tibble function to take the information from these vectors and combine it into a tibble. So I'm going to create a tibble called people, and assign to it the output of the tibble function, called with those three vectors as input: name, birth year, and eye color. And if I print this tibble, you can see that I've created a tibble with three variables and five observations.
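Put together, the steps for building a tibble from vectors might look like the sketch below. The values come from the narration, but the exact spellings `birth_year` and `eye_color` are an assumption, since the transcript only gives the spoken names and R identifiers cannot contain spaces.

```r
library(tibble)

# Three separate vectors describing the same five people.
name <- c("Mike", "Renee", "Matt", "Chris", "Ricky")
birth_year <- c(2000, 2001, 2002, 2003, 2004)   # assumed identifier spelling
eye_color <- c("blue", "brown", "hazel", "brown", "blue")   # assumed identifier spelling

# Combine the vectors into one tibble; each column takes the name
# of the vector it was built from.
people <- tibble(name, birth_year, eye_color)

# Prints a 5 x 3 tibble; the strings stay as character columns.
people
```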
Those are the basics of building and printing tibbles in R using the tidyverse.