From the course: R for Data Science: Lunch Break Lessons

Dealing with NA

- [Instructor] In your work with R, you're going to come across a value called NA. It's something you'll need to deal with, because it can gum up your research and results. So let's take a look at NA and how to work with it.

NA stands for Not Available, and it appears any time there is a missing value. So if you import a CSV file with missing entries, or do a calculation that involves a missing value, you'll get NA. It looks like this: capital N, capital A. You can test for it with is.na, and you give it something that might be an NA. In this case, we'll feed it NA, and it comes back as TRUE, because NA is, in fact, NA. You can test other values too. Let's use is.na with NaN, which is Not a Number, and that's different from NA. This also comes back as TRUE, because is.na treats NaN as a missing value as well. Now let's test something that is absolutely not NA: is.na(1). In this case, it should come back as FALSE, and it does, because one is not a missing value. Be careful: is.na("NA"), with NA in quotes, comes back as FALSE, because "NA" in quotes is a string, not a Not Available value. NA is a value unto itself.

You can also test the contents of a vector. Let's set up something called test_vector, and into test_vector I'm going to put the values one, two, three, NA, and five. I hit Return, and you can see in the Environment that I now have a test_vector with the values one, two, three, NA, and five. So let's test that. If I type is.na with the name of the vector I've just created, test_vector, and hit Return, I get FALSE, FALSE, FALSE, TRUE, FALSE. The TRUE indicates the position of the NA in that vector: the values are one, two, three, NA, five, and the result is FALSE, FALSE, FALSE, TRUE, FALSE. There are other tests related to NA.
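The is.na checks described so far can be sketched in R roughly as follows. The call to is.nan is not part of the lesson; it is added here only to illustrate that NA and NaN are different values even though is.na reports TRUE for both.

```r
# Testing single values for NA
is.na(NA)      # TRUE: NA is a missing value
is.na(NaN)     # TRUE: is.na also treats NaN as missing
is.nan(NA)     # FALSE: NA is not NaN (extra check, not in the lesson)
is.na(1)       # FALSE: 1 is an ordinary number
is.na("NA")    # FALSE: "NA" in quotes is just a string

# Testing every element of a vector
test_vector <- c(1, 2, 3, NA, 5)
is.na(test_vector)  # FALSE FALSE FALSE TRUE FALSE
```

The fourth TRUE lines up with the fourth position of test_vector, which is where the NA sits.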
One related test is called anyNA. You give it a vector, so we'll give it our test_vector, and when I hit Return, what I see is TRUE. This answers the question: are any values of test_vector NA? In this case, the answer is TRUE.

Some functions have NA handling built in. Let's look at one of them, called mean. It calculates, no surprise, the mean of a value or a vector. So we'll give it our test_vector and hit Return. What I get, surprisingly, is NA. That tells me test_vector has an NA in it. If I try to calculate the mean of a vector with an embedded NA, mean comes back and says, "I don't know what to do with this, so I'm going to give you NA as a result." You can tell mean to ignore the NA by typing mean, test_vector, comma, na.rm equals TRUE. na.rm stands for NA remove, and TRUE says yes, I want you to remove NAs. When I hit Return, now I get the mean of test_vector with the NA values removed. Keep in mind that those NA values may have significance of their own, so you can't necessarily just remove them, but if you do need to, this is a really quick way to do it.

Sometimes you'll want to convert an NA to a zero or another value, and there's a shortcut for that. It's called ifelse. I give ifelse a true-or-false condition, and it checks that condition for each element. In this case, I'm saying: where a value in test_vector is NA, return a zero; where it's not, return the value from test_vector. When I run that, what you get back is one, two, three, zero, and five. So ifelse has converted the NA to a zero in the result it returns; test_vector itself is unchanged.

There's another way to do this, called subsetting, which is a standard R technique. I'll type in test_vector, and then I'll type in a subset. In this case, I'm going to select anything in test_vector that is NA.
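As a rough sketch, the anyNA, mean, and ifelse steps just described might look like this in R. The variable name replaced is made up for this illustration; the lesson simply prints the result of ifelse.

```r
test_vector <- c(1, 2, 3, NA, 5)

# Is there at least one NA anywhere in the vector?
anyNA(test_vector)                 # TRUE

# mean() returns NA when the input contains an NA
mean(test_vector)                  # NA

# na.rm = TRUE drops the NA values before averaging
mean(test_vector, na.rm = TRUE)    # 2.75

# ifelse() checks the condition element by element:
# where the value is NA, use 0; otherwise keep the original value.
# "replaced" is a hypothetical name used only in this sketch.
replaced <- ifelse(is.na(test_vector), 0, test_vector)
replaced                           # 1 2 3 0 5
```

Because the result is stored under a separate name here, test_vector still contains its NA afterwards.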
So I type test_vector, bracket, is.na, and then the name of what I'm searching, which is test_vector, close bracket, and into those positions I assign zero. Now, when I hit Return, I want you to watch the Environment on the right-hand side. Right now test_vector contains one, two, three, NA, five. When I hit Return, you'll see that test_vector now contains one, two, three, zero, five. So that has found any NAs in test_vector and substituted a zero for each.

So that's NA. Again, you'll occasionally run into it embedded in data that you're using. It has significance all its own, but if you're trying to perform calculations around it, there are several strategies you can use. Just a reminder: assigning into a subset this way permanently changes the values in test_vector.
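The subsetting replacement can be sketched in R as shown here. Keeping a copy first is not something the lesson does; original_vector is a hypothetical name, added only to show one way to hold on to the NA positions in case they matter later.

```r
test_vector <- c(1, 2, 3, NA, 5)

# Optional: keep an untouched copy before overwriting anything.
# "original_vector" is a hypothetical name, not from the lesson.
original_vector <- test_vector

# Select the positions where test_vector is NA and assign 0 to them
test_vector[is.na(test_vector)] <- 0

test_vector               # 1 2 3 0 5
anyNA(test_vector)        # FALSE: the NA has been overwritten
anyNA(original_vector)    # TRUE: the copy still has its NA
```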
