Before we get started working with data, I wanted to take a couple of minutes and give a little background on R, and some context on how it's used today. This is a good thing to do, because for many people, R is something of a mythical beast. They have heard of it, and they have been told that they should use it, but they don't really know why or how. The problem is that it's very hard to leave the comfort of familiar approaches to data, like SPSS, or SAS, or even more frequently, Excel, without understanding a little better what's to be gained by the exercise.
Let me start out with a little bit of history. R was originally developed by Ross Ihaka and Robert Gentleman who were both statistics professors at the University of Auckland in New Zealand. Ross wrote a short paper about the history of R called, R: Past and Future History. That is available on the R project Web site. Their original goal was to develop software for their students to use, but when they made their first public announcement of R's development in 1993, they were encouraged to make it an open source project.
Now, as a note, let me just add that R is not a statistics program per se, but a programming language that works very well for statistics, and was developed with that purpose in mind. It was based on S, another single-letter programming language that was also developed for statistical analysis, and which still exists, primarily in its commercial incarnation as S-PLUS. Anyhow, early alpha versions of R were released in 1997, version 1.0 came out in 2000, and 2.0 came out in 2004. Version 3.0 is due in mid 2013.
What's most fascinating to watch is the growth of R, especially compared to programs like SAS, which goes back to the mid 60s, and which has a substantial corporate structure around it, or SPSS, which was also developed in the 60s, and which is now owned and developed by the industry giant IBM. The wonderful r4stats.com Web site which is maintained by Robert A. Muenchen releases data annually on the popularity of several statistical packages, including R, SAS, SPSS, Stata, and several others.
And I'll just remind all of us one more time that unlike SAS and SPSS, which are very expensive and can have very restrictive licensing requirements, R and all of its packages are free, and open source for anyone to download and use. It's true, though, that a lot of people are intimidated by the fact that R is a command line programming language, and that they feel much more comfortable with dropdown menus and dialog boxes. Fortunately, there are several free programs and packages that run as layers or shells over R that can provide just that kind of experience.
However, as the programmers like to say, the command line interface may not really be a bug, but instead, a feature. That is, it makes it much, much easier to keep an explicit record of what actions were performed in an analysis, and to repeat them in the future. It also makes it easier to share those analyses with others, which makes collaboration much easier. Also, it can facilitate the integration of R with other programs and languages, such as packages that allow R to work both ways with Excel -- that is, you can run Excel from R, and you can run R from Excel -- and even integrate R with SAS and SPSS.
And so, for all of these reasons, R should be more than just a shadowy possibility for most people. Instead, as this course will show you, R can be easy, it can be informative, it can be fast, and believe it or not, it can even be fun.
The course continues with examples on how to create charts and plots, check statistical assumptions and the reliability of your data, look for data outliers, and use other data analysis tools. Finally, learn how to get charts and tables out of R and share your results with presentations and web pages.
- What is R?
- Installing R
- Creating bar charts for categorical variables
- Building histograms
- Calculating frequencies and descriptives
- Computing new variables
- Creating scatterplots
- Comparing means