Base R is everything that is installed on your machine when you install R. The tidyverse is a collection of packages for data importing, cleaning, and wrangling. The tidyverse provides alternatives to many basic operations from base R, though it does not replace base R itself. The tidyverse can be considered to be a collection of verbs: read, select, filter, group_by, full_join, sample, and more.
- [Instructor] So, what is the tidyverse? Well, the tidyverse is both a collection of R packages and an approach to how to do data science effectively and reproducibly with the R language. Understanding what that means requires us to cover a little bit of what is installed on your computer when you install R. For now, we're not going to install R on our machines, as that would require us to talk about operating systems and get a little bit distracted. First, let's discuss what the R language, Base R, and the tidyverse are in turn.
The R language is an extremely popular scripting language used by millions of people around the world. Primarily, it's used for data analysis, modeling and visualization, what we commonly call data science and that's also what we'll be focusing on in this course. Many people think of R as statistical software but it's fundamentally not and it's a little bit upsetting when people say that. R is a programming language that has been adopted and curated by people interested in doing data science as flexibly as possible and without having to think about the actual programming side of things, what a computer's doing behind the scenes.
That's what makes it a really great scripting language to use for data science. R lives and breathes at the Comprehensive R Archive Network, abbreviated to CRAN. This is where we will install R from later, as well as most of the other R packages that we'll talk about. When you download R from CRAN, you've actually installed Base R. Base R includes all of the necessary gubbins or machinery for your computer to be able to run R code.
It also installs standard R packages like stats, utils and graphics. These packages allow you to start using R immediately on your machine. Most R courses in our library use Base R for data manipulation. The Base R way of doing things involves code that looks very much like this: iris[iris$Species == "virginica", ]. We start with iris[iris$Species to access the Species column, then we have a double equals to say Species is equal to virginica, then a comma followed by nothing to say we want all of the columns, and then the final closing square bracket.
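The subsetting expression described above can be written out in full. This is a minimal sketch using the iris dataset that ships with R; the variable names virginica and virginica_petals are chosen here for illustration only.

```r
# iris is bundled with R (in the datasets package), so nothing needs installing.
# Before the comma: keep only rows where Species equals "virginica".
# After the comma: left empty, which means "keep all columns".
virginica <- iris[iris$Species == "virginica", ]

nrow(virginica)  # 50 rows
ncol(virginica)  # 5 columns

# The same bracket-and-dollar style also selects columns by name.
virginica_petals <- iris[iris$Species == "virginica",
                         c("Petal.Length", "Petal.Width")]
head(virginica_petals)
```

Note that forgetting the trailing comma changes the meaning of the expression, which is one reason this style can be hard for newcomers to read.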
So, with Base R, there are lots of dollar signs and lots of square brackets. Now, it's possible to do every single thing you could possibly imagine with Base R because it is a true and complete programming language, but you would have to write a lot of code yourself. Most people jump straight into using R packages to make their lives easier and their work more reproducible, so what are R packages? R packages are self-contained collections of functions and/or datasets that provide you with the ability to do any number of things, from analyzing data and visualizing data to potentially even generating reports with R, which is what R Markdown allows us to do.
Now, CRAN has over 10,000 packages available at the time of recording this course, and this comprehensive range of packages is part of what makes R such a popular scripting language. Odds are there are one or two packages that would make your life with R a little bit easier, i.e. they're designed to do the kind of analysis or data visualization that is important in your domain. Now that we know these things, we can talk about what the tidyverse is.
The tidyverse is an ecosystem of R packages designed to work consistently and interdependently together to provide a flexible and easy-to-understand workflow for doing data science with the R language. The fundamental building block of the tidyverse is the concept of tidy data which this course will introduce slowly through worked examples. The tidyverse has been in development since early 2014 and is becoming increasingly mature but it's important to understand what the tidyverse is not.
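To make the contrast with Base R concrete, here is a hedged sketch of the same virginica filter written with a tidyverse verb. It assumes the dplyr package (one of the tidyverse packages) is already installed, for example via install.packages("tidyverse"); the names base_result and tidy_result are illustrative only.

```r
# Assumption: dplyr is installed on this machine.
library(dplyr)

# Base R version, for comparison.
base_result <- iris[iris$Species == "virginica", ]

# tidyverse version: filter() is a verb that keeps rows matching a condition,
# and the pipe (%>%) passes iris in as the first argument to filter().
tidy_result <- iris %>%
  filter(Species == "virginica")

# Both approaches keep the same number of rows.
nrow(base_result) == nrow(tidy_result)
```

The tidyverse version reads as a sentence (take iris, then filter it), which is the kind of consistency the packages aim for. The base R version remains perfectly valid.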
The tidyverse should never be considered a replacement for Base R. If you're new to the R language, it will remain crucial to understand the Base R way of doing things. You may work with older packages that don't sit particularly well with the tidyverse, or you may be unable to even install the tidyverse in some places. This typically happens in commercial or server applications and particularly in pharmaceutical sciences.
It's also important to understand the tidyverse is not finished and likely never will be. As new ideas and people come to use the tidyverse, the range of things it does will only increase and sometimes it may fundamentally change. The tidyverse is also not a closed shop. It's developed openly by the RStudio developers including chief data scientist Hadley Wickham on GitHub. If you have ideas or contributions to add to the tidyverse, then have a look at package repositories or join conversations on Twitter using the hashtag tidyverse.
Now that we understand more about what the tidyverse is, we should really consider why we would want to use the tidyverse before launching into setting ourselves up to use it.
This course introduces the core concepts of the tidyverse as compared to the traditional base R. It focuses on the novice user and those unfamiliar with the pipe (%>%) operator. After covering these R basics, instructor Martin Hadley progresses to importing and filtering data from Excel, CSV, and SPSS files, and summarizing and tabulating data in the tidyverse. Then learn how to identify if data is too wide or long and convert it if necessary, and conduct nonstandard evaluation. By the end of the course, you should be able to integrate the tidyverse into your R workflow and leverage a variety of new tools for importing, filtering, visualizing, and modeling research and statistical data.
- Understanding the pipe (%>%) operator
- Importing .xlsx and .csv files
- Filtering and summarizing data sets
- Using tidyr to convert wide and long data sets
- Non-standard evaluation and programming with the tidyverse