Join Mike Chapple for an in-depth discussion in this video, Converting dates, part of Cleaning Bad Data in R.
- [Instructor] Date and time values can be some of the trickiest data to manipulate in R. Fortunately, the Lubridate package makes it easy to work with dates and times. Lubridate is an important package, and it is installed with the rest of the tidyverse, but in tidyverse releases before 2.0.0 it's not part of the core tidyverse. Therefore, while you don't need to install it separately, you do need to load it separately in your R scripts with those releases, and loading it explicitly does no harm in newer ones. Let's cover two of the most important tasks that you can perform with Lubridate: deconstructing dates and times, and constructing dates and times.
If you have an existing date or date/time value, you can use functions from within Lubridate to pull out individual components of that date and time. For example, the year function extracts the year from a date, the month function extracts only the month value, and the day function extracts the day of the month. You can get more complex and derive some values from a date. For example, the wday function tells you the day of the week while the yday function returns the day of the year.
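The extraction functions named above can be sketched as follows. The date used here is an illustrative assumption, not one taken from the video, and the weekday numbering assumes Lubridate's default setting, in which the week starts on Sunday.

```r
library(lubridate)

# Illustrative date (an assumption, not from the video): leap day 2020
d <- ymd("2020-02-29")

year(d)   # 2020
month(d)  # 2
day(d)    # 29
wday(d)   # 7: Saturday, when the week is counted from Sunday (the default)
yday(d)   # 60: 31 days in January plus 29 in February
```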
I'll show you some examples of these functions in a minute. There are similar functions for deconstructing times. For example, the hour function returns the hour of the day, the minute function returns the minute portion of the time, and the second function similarly provides the number of seconds in a time. These functions can all help you pull out specific values from a date or date/time object. You may also need to construct a date value from its component parts. For example, you might have the month, day and year in a string, and want to convert that string to a date format.
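As a minimal sketch of the time functions just described, the example below pulls the hour, minute, and second out of a single date/time value. The timestamp is an assumption chosen for illustration, and the time zone is set to UTC explicitly so the result does not depend on the machine's local settings.

```r
library(lubridate)

# Illustrative timestamp (an assumption); base R's as.POSIXct builds the value
dt <- as.POSIXct("2018-04-01 14:30:15", tz = "UTC")

hour(dt)    # 14
minute(dt)  # 30
second(dt)  # 15
```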
This is a little more complicated than it sounds because the same date may be written different ways in different parts of the world. Look at all these dates. These are all different ways of writing April 1st, 2018. Lubridate provides date construction functions to help you build a date or date/time variable out of text strings. Let's start with simple dates. The functions differ based upon the order in which the numbers appear in the string, and all have three-letter names corresponding to the order of the date elements.
For example, if April 1st, 2018 was written this way, you could use the mdy function to read it. That stands for month, day, year, and it's the order of the string. April is the month, one is the day, and 2018 is the year. However, if the date were written like this, you'd want to use the ymd function, or for this date, you'd want to use the dmy function. You can see the confusion here. If you didn't tell R the order of the date elements, it wouldn't be able to tell if this string was April 1st or January 4th.
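The three constructors can be compared side by side. The strings shown on screen in the video are not part of the transcript, so the three spellings of April 1st, 2018 below are assumptions chosen to match each function's element order.

```r
library(lubridate)

# Assumed spellings of April 1st, 2018 (the on-screen strings are not in the transcript)
us_style  <- mdy("04/01/2018")  # month, day, year
iso_style <- ymd("2018-04-01")  # year, month, day
eu_style  <- dmy("01/04/2018")  # day, month, year

# All three values are the same Date
us_style == iso_style
iso_style == eu_style
```

Because each function fixes the element order up front, the same calendar day comes back regardless of how the string was written.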
Finally, you can tack a time onto the end of a date, and get a date/time data element by simply adding underscore hms to the end of the mdy, ymd, or dmy functions. Let's try these in R. I'll begin by loading the tidyverse and Lubridate, and then read in a data set containing some Mexican weather readings. Let's take a look at what that gives us. This data set looks like it contains over 33,000 records that include temperature readings from Mexican weather stations.
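For the date/time variants mentioned above, a minimal sketch might look like this. The input strings are illustrative assumptions. When no time zone is supplied, these functions return values in UTC.

```r
library(lubridate)

# Illustrative strings (assumptions): the same instant in three element orders
dt_mdy <- mdy_hms("04/01/2018 14:30:15")
dt_ymd <- ymd_hms("2018-04-01 14:30:15")
dt_dmy <- dmy_hms("01/04/2018 14:30:15")

class(dt_mdy)     # a POSIXct date/time rather than a plain Date
dt_mdy == dt_ymd  # TRUE
dt_ymd == dt_dmy  # TRUE
```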
We have the station number, the element (whether it's a maximum or minimum temperature), the value (the temperature reading), and the date. Let's go ahead and try to extract different elements of the date into the tibble. I'd like to add year, month, and day columns. Let's start with year. I'm going to create a new variable called year in the weather tibble, and generate it by applying the year function to the date field that's already in that tibble. I'll then do the same thing for month and for day.
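The three assignments described above might be written as follows. The transcript does not give the data file or the exact column names, so the small tibble below is a hypothetical stand-in with invented values; only the idea of a date column feeding year, month, and day comes from the video.

```r
library(tibble)
library(lubridate)

# Hypothetical stand-in for the weather data: names and values are assumptions
weather <- tibble(
  station = c("S001", "S001", "S002"),
  element = c("TMAX", "TMIN", "TMAX"),
  value   = c(31.0, 14.5, 28.0),
  date    = ymd(c("2010-01-01", "2010-01-01", "2010-02-15"))
)

# Add one column per date component
weather$year  <- year(weather$date)
weather$month <- month(weather$date)
weather$day   <- day(weather$date)

print(weather)
```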
Go ahead and run those three lines, and now let's take another look at the weather tibble. As you can see I've now added three columns, year, month and day, that contain those specific elements of the date of the reading. Remember that the wday function allows you to determine the day of the week for a particular date. Let's go ahead and just check what day of the week April 1st, 2018 was. I'm going to use the wday function, and then put in the date, 2018-04-01.
When I do that, I get back the value of one. That tells me that this was the first day of the week, Sunday. I can also use the yday function on the date, and determine that April 1st, 2018 was the 91st day of the year. Now remember that Lubridate also allows us to build dates. If I write my dates in the standard American format, I can use the mdy function to convert them to dates.
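The weekday and day-of-year checks described above can be reproduced with a short sketch. It parses the string with ymd first, which is an assumption about how the date was supplied, and it relies on Lubridate's default of counting the week from Sunday.

```r
library(lubridate)

april_first <- ymd("2018-04-01")

wday(april_first)                # 1: Sunday under the default week start
wday(april_first, label = TRUE)  # the same day as a labelled factor
yday(april_first)                # 91: 31 + 28 + 31 days, plus 1
```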
Let's go ahead and convert April 1st, 2018 using the mdy function, and I get back a return value in the standard date format. It's also okay if I abbreviate the year. If I just wrote 04-01-18, Lubridate figures out that I meant 2018. If the date is written in a European format instead, I can use the dmy function to convert the date, so if I put in that same string, instead of getting back April 1st, 2018, now I get January 4th, 2018 because I specified the day first, followed by the month and then the year.
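A short sketch of those conversions, using the 04-01-18 string from the narration. The four-digit spelling 04-01-2018 is an assumption about what was typed first.

```r
library(lubridate)

mdy("04-01-2018")  # "2018-04-01"
mdy("04-01-18")    # "2018-04-01": the two-digit year is read as 2018
dmy("04-01-18")    # "2018-01-04": same string, but the day is read first
```

The last two calls show why the element order has to be stated: the same text yields April 1st under mdy and January 4th under dmy.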
Those are the basics of working with the Lubridate package in R.
Where possible, instructor Mike Chapple shows how to correct the issues using R, but the same principles can be applied to any statistical programming language.
- Missing data
- Duplicate rows and values
- Converting data
- Formatting data
- Working with tidy data
- Tidying data sets
- Dealing with suspicious data