Join Barton Poulson for an in-depth discussion in this video, R, part of Data Science Foundations: Fundamentals.
- [Voiceover] After spreadsheets, the first tool I wanna talk about in data science is R. That's the statistical programming language that goes by the single-letter name R. It can easily be argued that R is the language of data science. Take a look again at the KDnuggets poll that we saw previously. This is a survey of data mining professionals, and R is the single most commonly used tool, almost twice as much as everything else. And it's 50% more used than Python, which is considered its major competitor with its specialized statistical packages.
There's a few reasons for this. Number one, R is free and open source. That's an advantage, because some of the proprietary programs can be extremely expensive. Second, R is optimized for vector operations. That makes it possible for R to work through an entire collection of data without having to write explicit for loops, which saves a lot of time. Third, R has a great community behind it. There's an immense amount of support, and you can almost always find help on anything you're trying to do in R. And finally and most importantly, there are currently over 7,000 packages available for free for R, and these tremendously extend the utility of R.
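To make the point about vector operations concrete, here is a minimal sketch. The temperature values and variable names are made up for illustration and do not come from the video. It converts a whole vector in one expression and then does the same thing with an explicit for loop, so you can see what the vectorized form saves you.

```r
# Hypothetical data: five temperatures in degrees Celsius
celsius <- c(12.5, 18.0, 21.3, 9.8, 30.1)

# Vectorized: one expression is applied to every element at once
fahrenheit <- celsius * 9 / 5 + 32

# The same calculation written as an explicit loop, for comparison
fahrenheit_loop <- numeric(length(celsius))
for (i in seq_along(celsius)) {
  fahrenheit_loop[i] <- celsius[i] * 9 / 5 + 32
}

print(fahrenheit)

# Both approaches should give the same result
print(all.equal(fahrenheit, fahrenheit_loop))
```

The loop needs a pre-allocated result and an index variable, while the vectorized line needs neither, and that is the time saving being described.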
Let's take a quick look at some of the R interfaces, that is, how do you actually work in R? Now, R comes with its own Integrated Development Environment, or IDE. And it's fine, it's a command line interface, but it's not totally consistent from one operating system to another, and it's got separate windows floating around, and that's a little problematic. You can also run R from the terminal, or from the command line. On the Mac, you just open up the terminal, and you type R, and you're ready to go.
But the most common is one called RStudio, and that's what you see right here. It's just a window that overlays on R, so you have to install R separately. But it organizes it, and it makes it a lot easier to work with. It's also consistent from one operating system to another. Now, there is another choice. There's something called Jupyter, which people know best through Python and IPython. Jupyter is a great way of working with code and sharing. On the other hand, in my experience, it's not quite yet ready for prime time with R, but I'm sure that will change in the very near future.
And again, as a reminder, in every one of these, the interface is command line, you're typing lines of code like you see here on the top left. People sometimes talk about RStudio as being a GUI, or a graphical user interface. Well, that's just for running commands and for copy and paste. In order to do analysis, you still have to type lines of code. Let's take a quick look at some of those commands you get in the lines of code. First off, you can enter the commands in the console. Or, you can save and selectively run them from scripts, and this is a script that I have right here.
Now, R's a little different from other languages. You may see, for instance, that we don't have a semicolon at the end of the command. On the other hand, the white space here is not meaningful the way it is in Python, and just the way of doing things is a little unusual. But once you work with R for a while, it makes sense, and it turns out to be a very effective way of working with data. Now the R output. The graphs are shown in a separate window, and text and numbers like this are shown in the console.
And that means they don't stick around, they'll disappear, unless you specifically save them. On the other hand, those results can be written to files, to make them more permanent. But the important thing about R is the existence of packages. So for instance, there's something called CRAN at cran.rstudio.com, and that stands for the Comprehensive R Archive Network. Really, it's where all the packages are. And you go to CRAN, and you can search by Topic or by Task View, which is organized by topic.
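The points above about R's syntax and output can be sketched in a few lines. This is only an illustration: the numbers, the variable names, and the use of tempfile() for the output paths are assumptions made so the example runs anywhere, and they are not taken from the script shown in the video. In real work you would write to a path of your own choosing.

```r
# No semicolon is needed at the end of a statement
scores <- c(4, 8, 15, 16, 23, 42)

# Indentation and line breaks inside the parentheses are not significant
total <- sum(
      scores
  )

# Text and numbers print to the console and are not kept anywhere
print(total)
print(summary(scores))

# Writing the same summary to a file makes it permanent.
# tempfile() is an assumption for the demo; use a real path in practice.
out_file <- tempfile(fileext = ".txt")
writeLines(capture.output(summary(scores)), out_file)

# A graph can be saved too, by opening a file-based graphics device
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
hist(scores)
invisible(dev.off())  # close the device so the file is written

print(file.exists(out_file))
print(file.exists(plot_file))
```

Here capture.output() collects what would have gone to the console, and pdf() sends the histogram to a file instead of the separate graphics window.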
So here I have a little clip of the packages for learning Bayesian statistics, and it's actually a very long list, this is just the first part of it. And ideally, any package you download is going to come with sample data sets, it's going to come with a user manual, and it will come with vignettes, or demonstrations of how it works. Another option is a site called Crantastic, with an exclamation mark. It's at crantastic.org. It's an alternative interface, because it still links back to CRAN, but it shows popularity, it shows recency, and it can be a great way of exploring some possible packages to use in your own work.
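As a rough sketch of what working with a package looks like, the lines below use only packages that ship with R, so nothing needs to be downloaded. The install line is commented out because it needs a network connection, and "somePackage" is a placeholder, not a real package name. Which vignettes are listed depends on your R installation, so that part is shown without any expected output.

```r
# Installing from CRAN needs a network connection, so it is commented out.
# "somePackage" is a placeholder, not a real package name.
# install.packages("somePackage")

# Sample data sets: list what the built-in 'datasets' package provides
pkg_data <- data(package = "datasets")$results
print(head(pkg_data[, c("Item", "Title")]))

# Load one of those sample data sets and look at the first rows
data(mtcars)
print(head(mtcars, 3))

# Vignettes: list the ones bundled with an installed package.
# 'grid' ships with R; the result may be empty on a minimal install.
vig <- vignette(package = "grid")$results
print(vig[, "Item"])

# User manual: help(package = "datasets") opens the package's help index
# when run in an interactive session.
```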
And so what are our conclusions on this very brief presentation of R? First, R is central to data science. Second, it's a command line interface, and it's something where you're gonna be typing lines of code. And it gets its great power from the thousands of packages that are freely available to expand its capabilities and its utility.
- The demand for data science
- Roles and careers
- Ethical issues in data science
- Sourcing data
- Exploring data through graphs and statistics
- Programming with R, Python, and SQL
- Data science in math and statistics
- Data science and machine learning
- Communicating with data