
From the course: R Essential Training: Wrangling and Visualizing Data

R's built-in datasets


- [Instructor] Perhaps the best way to get up and running quickly with R is to explore the built-in sample datasets that come installed with R. To get to these, you open the datasets package. Let me show you how to do this. I'm using this script right here, the 03_01 built-in datasets R script. What we need to do is come down here and load the package. Now, the datasets package comes with R; it's part of the default installation. However, it isn't necessarily loaded and active in memory in your session. So by using library(datasets), we load it and make it available. You can also use require(). I'm going to run that command, and now it's available to us.

Now, let's get a little bit of help on the datasets package, and I do that by using ?datasets. When I run that command, you'll see over here that we get this help information, and it talks about the datasets package. Now, it's not telling us very much right there, so there's another way to get better information: library(help = "datasets"). When we run that command, it gives us a list of all of the datasets that are included in that package. There are a little over a hundred, I believe. It gives you their titles and a very short description of what's involved in each one.

But there's a lot more that we can do than that. Let's close that window and come back here. Let's get an interactive list, something that tells us more about each of them, where you can get a complete description. Come to the help viewer and click on the home icon. Then come down here to Packages, under Reference. Now, your list of packages will be a little different from mine because I have installed a bunch of different ones. But come down here to datasets.
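The loading and help steps described so far can be sketched in a few lines of R. This is a minimal illustration, not the course's own script: the variable names loaded and index are made up for this example, and the data() call is an extra, non-interactive way to get the same index that library(help = "datasets") shows.

```r
# Attach the datasets package. This is harmless if it is already attached.
library(datasets)

# require() also attaches a package, but it returns TRUE or FALSE
# instead of stopping with an error when the package is missing.
loaded <- require(datasets)

# The help pages open in a viewer or pager, so only show them
# when R is being used interactively.
if (interactive()) {
  print(help("datasets"))            # same page as ?datasets
  print(library(help = "datasets"))  # index of datasets with titles
}

# A non-interactive alternative: data() returns the dataset index
# as an object whose $results matrix has "Item" and "Title" columns.
index <- data(package = "datasets")$results
head(index[, c("Item", "Title")])
nrow(index)  # number of index entries; this varies by R version
```

Keeping the index in an object makes it possible to search the titles with ordinary tools such as grepl() rather than scrolling through the help window.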
And when we click on that link, it opens up an interactive webpage right here in the viewer, and you can come down here and see what's in the different datasets. So for instance, we have cars right here. This tells us that we have 50 observations on 2 variables, and it gives some examples of what you can do with that dataset.

Now, let's take a look at a few very common datasets that are used not just in this course. I've seen these in so many different places in the data world that it's nice to know they exist right here in the R datasets package. One of the most common is the iris dataset, that means iris flowers, and it's attributed to either Fisher or Anderson or both. Let's do ?iris to get a little bit of information on this one. That's going to open up right here. It says, "Edgar Anderson's Iris Data," also known as Fisher's. It's 50 flowers from each of three species of iris, with four measurements on each. If you want to see the actual dataset, we just call its name, iris. Once we do that, this is the dataset. It's very frequently used to model categorization systems, or classification, where you ask, "Based on the measurements, can we decide which of these three species a flower falls into?" We'll be using iris occasionally for demonstrations, and I'm sure you'll encounter it in other places.

Another one is a dataset about survival in the sinking of the Titanic, the ship. We can get information about it by doing ?Titanic, and it tells us it has a few different variables, all of them categorical. Then we can see the data by simply calling Titanic. I'll open this up, and here you see it's broken down into tables. This is a different way of representing data in R.
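A short sketch of how these datasets can be inspected from the console, in addition to reading their help pages. It assumes only base R; the name titanic_df is made up for this example, and as.data.frame() is one common way to restructure a table, not necessarily the approach taken later in the course.

```r
library(datasets)

# cars: the help page describes 50 observations on 2 variables.
dim(cars)

# iris: four numeric measurements plus a Species factor.
str(iris)
table(iris$Species)  # number of flowers per species

# Titanic is stored as a multi-way table of counts, not a data frame.
class(Titanic)
dimnames(Titanic)    # the categorical variables and their levels

# Converting the table gives one row per combination of categories,
# with the count in a Freq column.
titanic_df <- as.data.frame(Titanic)
head(titanic_df)
```

The table form prints as a set of cross-tabulations, while the data frame form is usually easier to filter or pass to modeling functions.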
And it's very convenient for certain kinds of analysis. For others you need to restructure the data, and we'll talk more about that elsewhere.

Another one is Anscombe's quartet, which is four very small datasets that are, in certain ways, identical. They have the same means and standard deviations, and the same correlation and regression coefficients. But when you graph them, they're dramatically different. They exist to let you know it's really, really important to graph your data. If you want to see the entire dataset, this is all of it: 11 rows and eight columns.

Now, there are a lot of other datasets, and I'm going to show you some of the others. Some of them are enormous, with 30,000 data points, and they can be used for sophisticated machine learning tasks. What you'll find is that there are datasets well adapted for almost any procedure you might want to do, as well as additional special datasets that come in the packages you can add to R. I'm going to show you more about that in the next movie.
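The point about Anscombe's quartet can be checked directly. The following is a minimal sketch that assumes the built-in anscombe data frame with columns x1 to x4 and y1 to y4; the name stats and the plot layout are illustrative choices, not taken from the course.

```r
library(datasets)

dim(anscombe)  # rows and columns of the full dataset

# Compute the same summary statistics for each of the four x/y pairs.
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- coef(lm(y ~ x))  # intercept and slope of a simple regression
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    cor = cor(x, y),
    intercept = unname(fit[1]), slope = unname(fit[2]))
})
colnames(stats) <- paste0("set", 1:4)
round(stats, 2)  # the four columns agree closely

# Plot the four sets side by side to see how different they look.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i))
}
par(op)  # restore the previous plotting layout
```

The table of statistics and the four scatterplots together make the contrast between numerical summaries and the shape of the data visible.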
