Up and Running with R


with Barton Poulson

 


Join author Barton Poulson as he introduces the R statistical processing language, including how to install R on your computer, read data from SPSS and spreadsheets, and use packages for advanced R functions.

The course continues with examples on how to create charts and plots, check statistical assumptions and the reliability of your data, look for data outliers, and use other data analysis tools. Finally, learn how to get charts and tables out of R and share your results with presentations and web pages.
Topics include:
  • What is R?
  • Installing R
  • Creating bar charts for categorical variables
  • Building histograms
  • Calculating frequencies and descriptives
  • Computing new variables
  • Creating scatterplots
  • Comparing means


Author: Barton Poulson
Subject: Developer, Programming Languages
Software: R
Level: Beginner
Duration: 2h 25m
Released: Apr 04, 2013






Introduction
Welcome
00:00(music playing)
00:04Hi! I'm Barton Poulson, and I'd like to welcome you to Up and Running with R. R is an
00:09open source program and programming language that has become one of the most
00:13powerful choices available for statistical analysis.
00:16In this course, I'll teach you to use charts, such as histograms, bar charts,
00:21scatter plots, and box plots to get the big picture of your data;
00:25descriptive statistics, such as means, standard deviations, and correlations for a
00:29more precise depiction; inferential statistics like regression, T-tests, the
00:34analysis of variance, and the chi- square test to help you determine the
00:37reliability of your results.
00:40Finally, I'll demonstrate how you can create beautiful charts for presentations,
00:44and share your results with other people.
00:46If you're ready to get going, let's get started with Up and Running with R.
Using the exercise files
00:00If you're a premium member of the Lynda.com Online Training Library, then you
00:04have access to the exercise files used throughout this title.
00:08The exercise files are contained in a folder, and there is one R project
00:12folder for each movie.
00:13Inside the R project folder, you'll find the R file, and any other files needed
00:18to follow along with the movie.
00:20If you're a monthly subscriber, or an annual subscriber to Lynda.com, then
00:24you don't have access to the exercise files, but you can follow along from
00:27scratch with your own data.
00:28And with that, let's get started.
1. What is R?
R in context
00:00Before we get started working with data, I wanted to take a couple of minutes
00:04and give a little background on R, and some context on how it's used today.
00:09This is a good thing to do, because for many people, R is something of a mythical beast.
00:14They have heard of it, and they have been told that they should use it, but they
00:18don't really know why or how.
00:20The problem is that it's very hard to leave the comfort of familiar approaches
00:23to data, like SPSS, or SAS, or even more frequently, Excel, without understanding a
00:29little better what's to be gained by the exercise.
00:32Let me start out with a little bit of history.
00:34R was originally developed by Ross Ihaka and Robert Gentleman who were both
00:39statistics professors at the University of Auckland in New Zealand.
00:43Ross wrote a short paper about the history of R called,
00:46R: Past and Future History. That is available on the R project Web site.
00:50Their original goal was to develop software for their students to use, but when
00:55they made their first public announcement of R's development in 1993, they were
01:00encouraged to make it an open source project.
01:02Now, as a note, let me just add that R is not a statistics program per se, but a
01:07programming language that works very well for statistics, and was developed with
01:10that purpose in mind.
01:11It was based on S, another single-letter programming language that was also developed
01:16for statistical analysis, and which still exists, primarily in its incarnation as S+.
01:22Anyhow, early alpha versions of R were released in 1997, version 1.0 came out in
01:282000, and 2.0 came out in 2004.
01:32Version 3.0 is due in mid 2013.
01:35What's most fascinating to watch is the growth of R, especially compared to
01:39programs like SAS, which goes back to the mid 60s, and which has a substantial
01:44corporate structure around it, or SPSS, which was also developed in the 60s, and which
01:49is now owned and developed by the industry giant IBM.
01:52The wonderful r4stats.com Web site which is maintained by Robert A. Muenchen
01:58releases data annually on the popularity of several statistical packages,
02:02including R, SAS, SPSS, Stata, and several others.
02:07And I'll just remind all of us one more time that unlike SAS and SPSS, which are
02:13very expensive, and can have very restrictive licensing requirements, R and all
02:18of its packages are free, and open source for anyone to download and use.
02:23It's true, though, that a lot of people are intimidated by the fact that R is a
02:27command line programming language, and that they feel much more comfortable with
02:31dropdown menus, and dialog boxes.
02:33Fortunately, there are several free programs and packages that run as layers or
02:38shells over R that can provide just that kind of experience.
02:42However, as the programmers like to say, the command line interface may not
02:46really be a bug, but instead, a feature.
02:49That is, it makes it much, much easier to keep an explicit record of what actions
02:54were performed in an analysis, and to repeat them in the future.
02:57It also makes it easier to share those analyses with others, which makes
03:01collaboration much easier.
03:03Also, it can facilitate the integration of R with other programs and languages,
03:08such as packages that allow R to work both ways with Excel -- that is, you can run
03:13Excel from R, and you can run R from Excel --
03:16and even integrate R with SAS, and SPSS.
03:20And so, for all of these reasons, R should be more than just a shadowy
03:24possibility for most people. Instead, as this course will show you, R can be
03:29easy, it can be informative, it can be fast, and believe it or not, it can
03:34even be fun.
2. Getting Started
Installing R on your computer
00:00R is a free download that's available for Windows, Mac, and Linux computers, and
00:05installation is a simple process.
00:07The first thing you need to do is go to the R Web site; that's r-project.org.
00:13From there, you can scroll down to where it says Getting Started, and you see
00:18download R. I'm going to click on that right now.
00:21When you click on that, you get to choose what's called a CRAN Mirror.
00:25CRAN stands for the Comprehensive R Archive Network, and these are servers
00:30that have identical copies of all of the R information, and it's usually
00:34helpful to find one that is physically close to you.
00:37I'm going to scroll down to the United States, and I'm close to UCLA right now,
00:44so I'm going to click on that one.
00:47From there, you have three choices, depending on the operating system of your machine.
00:52You can download R for Linux, for Mac or for Windows, and most people will be
00:57downloading R for Windows.
00:58Let's click on that one first.
01:00From there you have a few different choices.
01:02The one that most people are going to want is base, and then you can simply
01:06download it by clicking on that top link.
01:08I'm going to back up and show you the Mac version.
01:12If you click on Mac, then the one that you want is this one right here that says
01:17package; R 2.15.2 is the current version.
01:21Then one more; I'll back up, and for Linux users, the version that you download
01:27depends on the distribution of Linux that you're using.
Using RStudio
00:00R is a very popular language for working with data, but not everybody wants to
00:05do their work in the R application.
00:07Some people prefer having one window that shows everything they need. Many
00:11people prefer graphical user interfaces, or GUIs, to command line programming.
00:15In addition, the default interface for R looks and acts somewhat differently in
00:21each operating system, which complicates courses like this one.
00:25Fortunately, because R is open source, a number of options to the standard R
00:29environment have been developed.
00:30The list is rather long, but I wanted to just mention one in particular right
00:34now, and that's RStudio.
00:37If you go to the Web, and go to rstudio.com, you have the option of downloading
00:42RStudio, and what this is an IDE, or an integrated development environment for
00:47R. It's simply a layer that goes over the top of R, which has to be installed separately.
00:52It's a free download, so simply come down here to the bottom left, and click on
00:56Download Now, and then choose the version that you want.
01:00You can choose a different one, depending on your operating system, or you can
01:03even install RStudio in a Web browser remotely.
01:07We've already downloaded it and installed it on our computer here, so once it's
01:11installed, you'll have an icon like this on your Desktop.
01:15I'll double-click on that to open up RStudio, and you see that what we have on
01:19the left here is the R console, with the exact same text that shows up when you
01:23open R in the R application.
01:26In fact, again, it's identical here, the coding is identical; this is simply a
01:30different arrangement, but it allows consistency between Mac, Windows, and
01:34Linux. That's important.
01:37Also, it makes it easier to get to the help information, the package information,
01:41and other sources that we will be using throughout the course.
01:44Now, here in the console is the exact same text that shows in the R console when
01:48you open up the R application.
01:49Again, that emphasizes that RStudio is simply a layer over the top.
01:54It allows you to have several windows open simultaneously, organizes them, makes
01:58it easier to deal with things like packages, and the help, and the workspace, and
02:02the history, which we'll talk about later.
02:04But for right now, I want to make it clear that this is the same program
02:08accessing the same R files; they're interchangeable in either one.
02:12There are a couple of advantages to RStudio, aside from the fact that it's consistent
02:16from one platform to another.
02:17For instance, it allows you to divide your work into multiple contexts, each
02:21with their own working directory, workspace, history, and source documents.
02:26And you do that by coming up to the top right to Project, and creating different
02:31projects with different settings.
02:33Another one of the big advantages of RStudio is that it has built in
02:37GitHub integration.
02:38So, if you're going to be using versioning, this is a huge advantage in using this.
02:42Also, it's easier to work with graphics, especially in terms of exporting them in
02:46several different formats, and resizing them.
02:49You will also have the possibility for interactive graphics with the
02:52manipulate package in RStudio.
02:54There are a lot of other options available for most graphical user interfaces,
02:59and other kinds of interfaces that can be laid over the top of R. Some of the
03:03other ones are the R GUI that comes with the precompiled version of R for Windows.
03:08There's another one called R Commander. There's RExcel, which allows you to use R
03:13and R Commander from within Microsoft Excel.
03:16There's Revolution Analytics, which is developed especially for enterprise
03:20use and for big data.
03:22If you want to see a more complete list of what your options are, you can simply
03:25go to the Wikipedia article on the R programming language, which has a section on
03:30graphical user interfaces.
03:32You can also see the Journal of Statistical Software from June of 2012 that
03:36discusses GUIs for R.
03:38RStudio can be one attractive option among many for working with R. It's a good
03:44idea to spend a little time exploring the alternatives, so you can find what
03:47works best for you, and for your own projects.
03:50With that in mind, we'll be using RStudio throughout this course, because it
03:53allows us to have consistency between different platforms.
03:56Those of you who've worked in Java or C++ may be familiar with Eclipse. That also
04:01gives a consistent interface across platforms, so this is a similar idea.
04:05It's important to remember, though, that RStudio is not a replacement for R, but
04:09a layer over the top.
04:10You still need to have R installed on your machine, and RStudio will simply
04:13access that application.
04:15In addition, the files that you create in the editor are saved in the native .R
04:20file format, and are completely interchangeable with R's default interface.
04:23In fact, in creating this course, I have used both RStudio and the default Mac
04:28interface for R, so there's no problem going from one to the other.
04:31Whatever interface you use, you'll still have the same incredibly rich and
04:36flexible experience with R's considerable powers, which is where we'll turn
04:39in the next movie.
Getting started with the R environment
00:00Let's start by taking a look at R when it first opens.
00:04For this course, we'll be using the RStudio interface, so I'll begin by
00:08double-clicking that icon on the Desktop.
00:12If you want to use the default R application, just double-click on
00:15the appropriate icon.
00:16R is the 32-bit version for older computers.
00:19R64 is the 64-bit version, which most people will want, and that will become
00:25the default in R3.0.
00:27Either way, once they're open, they appear identical.
00:29Also, if you prefer to work in other environments, you have other choices.
00:33So, for instance, on a Macintosh, you can open up the terminal, and access R that
00:37way by simply typing the letter R at the command prompt.
00:41Similarly, in Linux, type R at the command line, or you can set it up to use the text
00:46editor of your choice through the preferences or options.
00:49When you first open R, what you get is the console.
00:52That's what I have here on the left, and it comes up with a bunch of boilerplate text.
00:56It tells me, for instance, the version that I'm using, it gives information about
01:01the license, it gives information about contributors, and citation, and also how to
01:05get some demos or help, and how to quit R in the console.
01:09In RStudio, it's easy to resize the windows by simply dragging the dividing line.
01:13Right here, I can make it smaller, or larger, and while the console is where the
01:18action happens in R, it's not the place where you want to be working.
01:22Instead, you want to be working in a script environment, because you can save that.
01:26Also, I want to clear the console first.
01:28On Mac and PC in RStudio, that's just Ctrl+L, or you can go up to Edit, and
01:35down to Clear Console.
01:37I'm going to use the Ctrl+L. It just clears out all the text, and then I'm going
01:42to open up a script.
01:43Now, you can either open a new one by coming up to File > New > Script, or you can
01:52click on this Menu option right here to create a new script.
01:56I've already written a script for this movie, so I'm going to open that by going
01:59up to this icon right here to open an existing file.
02:02I'm going to come down to where I have it; I'm in the Desktop; Exercise Files.
02:09This is chapter 2, movie 3, and there's the file.
02:14I'm going to double-click on that, and it opens up in RStudio.
02:17Now, I want to point out that there's a lot of code in this one, but almost
02:22all of it is comments.
02:23Anything that begins with the hash tag, or the number sign, and shows up in a
02:27light green here is a comment and it's not run.
02:29The actual coding is in the blue and the grey, you'll see.
02:33I can run each line here, and it will show up in the console one at a time.
02:38So, for instance, what I'm going to do is I'm going to come down to line 4 where
02:42I simply have 2 + 2 written,
02:44and as long as I'm anywhere in that line, on the PC I can hit Ctrl+Return, on
02:50the Mac I hit Command+Return, and it will run that line.
02:54So, now what you see in the console on the bottom is all in blue, 2 + 2, that's
02:58the command that I wrote, and then it included the comment after the hash tag,
03:02and then beneath that, it gives the output; the result of this one.
03:06Now, you can tell the command, because it appears after the command prompt, that's
03:09the greater than sign, and the response appears after this index number.
03:13So, the one in the square bracket is the index number for a vector. The idea is
03:19that sometimes it puts out a whole lot of numbers, and it gives you the index
03:23number for the first number in that line.
03:26In fact, I'll show you what it's like if there's more than one line.
03:29I'm going to come down to line number 6 in the script on the top where it says
03:341:100, and what that's going to do is it's going to print the numbers 1 to 100
03:38across several lines.
03:39The cursor is there, so I can just hit Ctrl+Enter on the PC, or Command+Enter on
03:44the Mac, and now you see we have the index numbers.
03:47The first line begins with index number 1, the second line begins with index
03:52number 17, and so on.
03:54So, when you get your output, and you get these little cryptic numbers in the
03:57square brackets, that's just giving you the index number for the vector
04:00that it's dealing with.
04:02Also, you may have noticed that there's no command terminator on these.
04:05For instance, I don't have to put a semicolon or any other mark at the end of the command.
04:10It simply does it one line at a time.
04:13If I have a command that's going to go more than one line, it's in parentheses,
04:17and I'll have examples of that later in this course.
04:19A customary thing, also, whenever you're learning a new language, like learning the
04:23R programming language is to learn how to write "Hello World!"
04:27This one, because it's text, I just put print, and then in parentheses, I put the
04:31text that I want in quotes.
04:33In this case, it's "Hello World!"
04:35So, I press Ctrl+Return on the PC, Command+Return on the Mac, and now I have my "Hello World!"
04:42I'm going to scroll down a little bit in this window.
04:47Because R is a programming language that was intended for working with data, it
04:51also works very well with variables.
04:54In line 11, I'm going to create a variable called x, and I'm going to put into it
04:59the numbers 1 through 5.
05:01Please note I have an assignment operator here; that is the <-, the arrow, and
05:09that's often read as gets, and so I would read this as x gets the numbers 1 to 5.
05:15I'm going to bring the cursor down there, and hit Ctrl+Return on my PC,
05:20Command+Return on the Mac, and you see now that I have x gets 1 to 5, and then it
05:26tells me that it's run that command, but also look off to right side, the top
05:30right; you see there in the workspace, it's telling me that I have now created a
05:34variable called x. It's an integer with five numbers in it.
05:38If I actually want to see the numbers that are in x, all I have to do is enter
05:42the name of the variable, just x, and then I've got this hashtag comment
05:47after it that says display the values in x. So, I'm going to hit Ctrl+Return
05:51to run this line, or Command+Return on the Mac.
05:53Now you see that I have five numbers: 1, 2, 3, 4, 5, and then the index number
05:58for the first one in the vector is 1, which is why that appears at the
06:00beginning of the line.
06:02Also, if I want to have a set of numbers that's not just sequential, but actual
06:06data, I have the option of using a function called concatenate. That's the C here.
06:12This is in line 13.
06:13I'm going to create a variable here called y, and I'm going to specify the
06:17values that I want in it.
06:19This time it's 6, 7, 8, 9, 10, and I put them in parentheses with the function c.
06:24Again, that stands for concatenate, or sometimes called combine, or collection,
06:30because it puts them all together into this one variable.
06:33I have the cursor in line 13.
06:34I'm going to press Ctrl+Return on the PC, or Command+Return on the Mac, and you
06:40see down in the console at the bottom, I now have in blue that that command has
06:44run, and if you look into the workspace on the top right, you'll see that I now
06:49have not just the variable x, which has five values; I now have a variable y,
06:55which is numeric values that also has five values.
06:59If I want to see what's in y, I can go back to the script on the top left here.
07:04My cursor is already at line 14, because in RStudio, any time you run a command, it
07:09bounces down to the next line, which is convenient.
07:11So, I'm going to press Ctrl+Return on the PC, Command+Return on the Mac, and now
07:17it shows me that I have these five values;
07:196, 7, 8, 9, 10, where the index number for the first one in the vector is 1.
07:23One of the really neat things about R is that it allows you to do vector-based
07:28mathematics, which is a way of working with what normally you'd call an array of
07:32data, but it allows you to do operations on them without having to specify for
07:36loops, and so the code can be much simpler here.
07:40So, for instance, I have five numbers in my variable x, I have five numbers in my
07:45variable y, and if I want to add them to each other, where the first one in each
07:50one gets added, the second one in each one gets added, because they have the same
07:53number, all I have to do is write x + y. So, here I'm in line 15.
07:58I'm just going to press Ctrl+Return on the PC, Command+Return on the Mac,
08:02and this time, it not only shows me the command, it automatically outputs the results.
08:07That's because I'm not saving it as a new variable.
08:10I'm just running it.
08:11So, here at the bottom of the console, you see that I now have 7, 9, 11, 13, and
08:1515, and those are the sums of the items in those two variables.
08:20Also, if I want to simply multiply each of the elements in x, I can do that by
08:24writing x * 2, and it will do each element, and it will output it that way.
08:29The cursor is already in line 16 in the script on the top left. I'm going to hit
08:33Ctrl+Return to run that line on the PC, Command+Return on the Mac, and you see
08:38down in the bottom console, it shows that it's run that particular command, x * 2,
08:43and it's got the output here.
08:45It's five numbers; the index number of the first number is 1, and it goes 2, 4, 6, 8, 10.
08:50I just want to mention a couple of things about style and putting things together.
08:54I showed you that the assignment operator when you want to put values into a
08:58variable is this arrowhead, and so you say y gets the concatenation of 6, 7,
09:048, 9, 10 in line 13.
09:06It is possible to do this with an equals sign.
09:09R will run it, but that's considered poor style.
09:11In fact, there are several style manuals that have been written for coding in
09:16R. One of the more interesting ones is written by Google, which is nice because
09:19it's publicly available.
09:20It's short and it's very clear.
09:22I'm going to go to my browser and show you that one.
09:24We have Google's R Style Guide, which talks about ways to name files, it talks
09:29about indentation, and the brackets, about assignment, and I suggest that as you
09:34begin to write your own code in R, you take a few minutes and go through this, so
09:38you can write code that is more readable by others, and will make better sense
09:41for you, and run more smoothly in R.
09:43I'm going to go back to R now, and I'm going to come down to the bottom here,
09:48and clear the console.
09:49I don't need that information anymore.
09:50I'm going to hit Ctrl+L to clear it.
09:53Now, R is conceptually simple, and because it's command line based, you don't
09:56need a lot of menus.
09:57It can be very helpful to keep a few windows open simultaneously, such as we get
10:01to do here in RStudio, where we have the editor window, we have a Console window,
10:05we also have an indication of the variables that are active in the workspace, and
10:09we have access to information on packages, and help in the bottom right.
10:12R is a conceptually simple language, and it's a conceptually simple program.
10:16Because it's command line based, it's easy to save the information here in the
10:19editor, and share it with others.
10:21I encourage you to take a little bit of time to look at the style manual, to
10:25find ways that you can write your own code to make it easiest for you to
10:29understand, and easiest to share with others.
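
For reference, here is a minimal sketch of the kinds of commands demonstrated in this movie. The exercise file itself isn't reproduced; the values are the ones mentioned in the narration, and everything runs in base R.

  # Basic arithmetic; the console prints the result with a [1] index
  2 + 2

  # Print the integers 1 to 100; each output line starts with the index of its first value
  1:100

  # The customary first program
  print("Hello World!")

  # Assignment uses the arrow operator, read as "gets"
  x <- 1:5                 # x gets the integers 1 through 5
  y <- c(6, 7, 8, 9, 10)   # c() combines individual values into one vector
  x                        # display the values in x
  y                        # display the values in y

  # Vectorized arithmetic works element by element, with no loops
  x + y                    # 7 9 11 13 15
  x * 2                    # 2 4 6 8 10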
Reading data from a spreadsheet
00:00R is a flexible program that allows you to get data into it in many different ways.
00:05I'm going to start in RStudio here, and I'm going to open up the script that I
00:10wrote for this movie by clicking on open existing File, by going to Exercise
00:13Files, and opening Exercise 02_04.
00:17We're going to try opening a single dataset in several variations, through
00:21several different routes that researchers would commonly use.
00:25The simplest, but not necessarily the fastest way to get data into R is to enter
00:30it in directly using the editor.
00:33So, for instance, on line 4 here, I'm going to create a variable called x, and
00:37then I have the assignment, that's the arrow, and then I'm going to assign the
00:41numbers 0 through 10 into x. So, that's read as x gets 0 through 10.
00:47The cursor is there,
00:48so I'm just going to press Ctrl+Return on the PC, Command+Return on the Mac,
00:53and you see two things have happened;
00:54number one, in the console below, you see that now in blue it says that it has
00:59read this assignment, and has gone to the command prompt on the next line.
01:03On the top right of the window, under Workspace, under Values, you see that
01:07we've entered a variable called x as an integer variable with 11 values; that's from 0 to 10.
01:13Now, the next thing I'm going to do is on line 5 -- and when you run a command,
01:17the cursor in RStudio automatically goes down to the next one -- I'm going to
01:21just have a single letter here: x. That means I want to print the contents of x in the console.
01:27So, I'm going to hit Ctrl+Return on PC, Command+Return on Mac, and then you see
01:33we have 0 through 10,
01:34and the one in the square brackets on the right is the index number for the
01:38vector of the first item on that line.
01:40Now, there's only one line, so it's just going to be 1,
01:44but that's an indication that we have the response here.
01:46So, it's 0 through 10 on the contents of that one.
01:49So, that's one way to get data in;
01:51if you have sequential data, it's a super easy way to do it.
01:54Let's say, on the other hand, you don't have sequential data.
01:57You have a range of numbers, but they are different things, and they're not in order.
02:01Well, that's what I have on line 7.
02:03I'm going to create a variable called y, then I have the assignment operator;
02:06the arrow that's read as gets.
02:08And then I have c, which is for a concatenate, or you could also say
02:11collection, or combine.
02:12And then I have a series of numbers that I've entered, with a space in between
02:16them, and then a comment at the end that says assigns these values to y.
02:20The cursor is in that line, so I'm just going to press Ctrl+Return,
02:24and on the console, you see that it has read that command.
02:27And on the workspace on the top right, you see that I now have another variable,
02:30it's y, that has got numeric values, and there's 10 of them in this case.
02:34I'm going to run the command on line 8 of the editor, that's just the letter y, to
02:39see what's in it, and they are printed out in the same order that had appeared
02:43in when I entered it.
02:44I'm going to do one other here, and now you see that the cursor has moved down
02:49to line 10. That's ls. It's for list.
02:51It's for listing the objects.
02:52And if I enter that one, it's a way of seeing what's going on in the program.
02:56I'm going to hit Ctrl+Return, or Command+ Return on the Mac, and you see it tells
03:01me I have two objects there; x and the y.
03:03Now, that's the same information that's in the top right window under Workspace.
03:07And in fact, that's one of the nice things about RStudio is having this
03:10Workspace browser right there, so you don't even need to do this normally.
03:14Now, what I'm going to do is I'm going to try to read data from a CSV file.
03:18The idea is that most of the time when people have data, you're not going to
03:22want to enter it one number at a time, one line at a time in R. That's tedious,
03:27it's inefficient, and it's hard to get the structure of the data there.
03:30Instead, most of the time, it's easier to take data that's in the spreadsheet
03:33format, where you have rows and columns;
03:35one column per variable, and one row per case for individual or observation.
03:41The most common way of doing this is in an Excel spreadsheet, or some other spreadsheet.
03:45And while there are packages that are designed to make it possible to read Excel
03:49spreadsheets directly into R, I've found them to be rather cumbersome to use, and
03:53they don't always produce the desired results.
03:55On the other hand, the simplest way in the world is to use what's called a CSV
04:00file; a comma-separated value file, which you can create in Excel.
04:04In fact, to show how this works, what I'm going to do is I'm going to minimize
04:08RStudio for a moment,
04:09and right here on my Desktop you see I've got a folder that has the exercise
04:14file -- that's the script that I'm working on --
04:16and I have two data files.
04:18One is a Microsoft Excel spreadsheet, it's called social_network, and the other
04:23one is an SPSS data document.
04:26I'll get to that one in a minute.
04:27Because in R you're going to have to give references to the specific file
04:31locations, it's often easiest to move these things to the Desktop, and that's
04:34what I'm going to do right now.
04:36I'm going to grab both of these files, and just slide them over to the Desktop.
04:40I'll put them back into the folder afterwards.
04:42Then I'm going to open up the spreadsheet by double-clicking on it, which
04:46brings it up in Excel.
04:47Now, here's the spreadsheet.
04:48What we have in this version is 5 columns.
04:51The first is an ID number, the second is Gender of the respondent, the third
04:55one is the Age of the respondent, and then the fourth and the fifth have to do
05:00with the subjects of the survey, which was about people's preferred social networking sites.
05:04This is from about 3 years ago.
05:07And then the last one is how often they say they log in to that site each week.
05:11Another thing to notice, and this is significant, is that we have missing data in this one.
05:16So, for instance, cell E6, it's right up here;
05:20the person said they did not have a preferred site, and they did not provide
05:23a number for times.
05:25Also, in cell C8, this person didn't provide their age.
05:29Now, this is important, because while it's true that most statistical analyses
05:33are easier if you have a complete data set, it's also true that complete
05:37datasets are not always the case.
05:39And so, I wanted to use this one, because it shows some of the things that you
05:43can do when you have incomplete data.
05:45The first thing that I'm going to do is I'm going to take this file, and I'm
05:49going to save it as a CSV file;
05:51that's comma-separated value.
05:52I am going to come up to File, and go to Save As.
05:56When that comes up, I'm going to move to the Desktop, because I want to save it
06:00to the Desktop, and I'm going to come down to Save as Type where it currently
06:04says Excel Workbook, I'm going to click, and go about halfway down to this one
06:07that says CSV Comma-Delimited -- comma-separated, or comma-delimited.
06:12Now, you also have a choice of saving it as a Tab Delimited Text file, that's
05:17this one right here, in which case it would be a .txt file. That introduces some
05:22extra complexities in getting things into R: you have to be
05:26explicit about whether you have missing values, and what the separators are.
06:30I find it easier to just use a CSV.
06:33So, I'm going to come back to CSV, I'm going to click on that, and save it to the Desktop.
06:38I can just go right ahead;
06:39it's true, it's going to lose some of the formatting.
06:42And I'm going to close that file, say Yes, and Yes, and minimize Excel.
06:49And now you see that I've got this file right here.
06:51This is an Excel CSV file.
06:54Now what I can do is I can open this one up in R.
06:57I'm going to go back to RStudio now, and I'm going to show the next few lines in the script.
07:03It says CSV files.
07:04The first thing is that R takes missing data, which in Excel, or in SPSS is just a
07:08blank, and it replaces it with NA for not available.
07:12Because we're using a CSV file, you don't have to be specific about the
07:15delimiters for missing data.
04:16You don't have to say that two tabs in a row means a missing value.
07:20Also, CSV stands for comma-separated values.
07:23Another thing that I have to put into this command is I have to specify that
07:28there's a header across the top that has the names of each of the variables.
07:31Sometimes you'll have those; sometimes you won't.
07:33If you do, you need to tell it that you have those, so it doesn't try to read
07:37them as regular values.
07:38And then there's an issue here with backslashes in Windows PCs.
07:42Let me show you this first command.
07:44I'm going to go down to line 18, and what I'm going to do is I'm going to take
07:49this spreadsheet, and I'm going to read it into what's called a data frame.
07:53You can just think of that as a matrix that holds data, although matrix and data
07:57frame are actually different, because in a matrix, everything has to be the same
08:00data type, but in a data frame, the columns can be of different types.
08:04But I'm calling it sn, for social network, .csv, because I'm using a
08:08comma-separated value file.
08:10That is the dot there.
08:12Now, a lot of people associate that with a method for an object.
08:16The Google style manual for R that I showed you suggested that you use a dot to
08:20separate words in variable names, and data frame names,
08:23so that's what I'm doing here.
08:25So, I'm creating a data frame called sn, for social network, .csv, then I have the
08:31operator, the arrow and the dash; it means gets.
08:34And then I'm using the function read.csv; that's a built-in function here, and
08:38then I have to specify the path.
08:40Now, normally in a Windows computer, the path looks like this, and unfortunately,
08:44paths get really long, and I'm being explicit about the entire path,
08:48so I have C:\\Users\\Barton Poulson\\Desktop\\social_network.csv, and then I
08:55have this little thing header = T; header = true.
08:59There is a header in there.
09:00Now, the problem is if, I run that one -- I'm just going to have the cursor right
09:05here, and I'll hit Ctrl+Return here on my PC,
09:07and watch what happens.
09:08I get an error message, and that's because in R, when it gets a backslash, it's
09:12trying to read that as what's called an escape character that it uses for
09:16reading special characters, like line returns, or quotation marks.
09:21And so, there's two ways of dealing with that.
09:23One is either you double up the backslashes, so what you're actually doing is
09:27you are called escaping the backslash.
09:29So, the first backslash says something is coming that I need you to read a
09:33special way, and the second one means it's a backslash.
09:36If I do that, let me come down to line 20 here, and press Ctrl+Return on the
09:41PC, and I get this.
09:43Now, you see down at the bottom it says that it's read it, and if you look after
09:48the right in Workspace, now at the top, we have data,
09:50and it says sn.csv; again, that's for social network CSV, and it's 202
09:55observations on 5 variables. So, it's read it.
09:58The other option is here on line 22, and what this one uses is forward slashes.
10:03Now, Macintoshes use forward slashes, but I wouldn't have the C there for the Mac.
10:07But you don't have to rearrange things on a Mac, because the forward slashes
10:10are readable, and by using the forward slashes even in the Windows PC path, it can work also.
10:16So, I'm going to hit Ctrl+Return,
10:18and you see down in the console that it read that one as well.
10:22It had the exact same name, so it just overwrote the same dataset in the workspace.
10:26I'm going to use this one other little command here: str. That is for structure.
10:31And structure is a nice way to double check that things got entered the way
10:35that you wanted them.
10:36So, you put the name of whatever it is you're checking right after it.
10:39I'm doing the structure of this data file.
10:41So, I'm going to hit Ctrl+Return,
10:43and what it tells me is I have a data frame.
10:46I'm looking down in the console with 202 observations of 5 variables, and it
10:50tells me what the variables are.
10:51It tells me what the possible values are, and runs off the first several values,
10:56and so that's a good way of seeing what's going on.
10:59So, that's how you want to read data from an Excel spreadsheet; by saving it as a
11:03CSV, and then using read.csv to get it in after you've made any necessary
11:08accommodations to the file path address.
11:10I also have information in this script about how to read data from an SPSS file,
11:14and we're going to look at that one in the next movie.
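
As a recap, here is a minimal sketch of the commands walked through in this movie. The file path and the values in y are illustrative placeholders; adjust them to match your own machine and data.

  # Entering small amounts of data directly
  x <- 0:10                               # x gets the integers 0 through 10
  y <- c(5, 1, 0, 2, 8, 4, 1, 3, 9, 7)    # ten illustrative values combined with c()
  ls()                                    # list the objects in the workspace

  # Reading a comma-separated value file into a data frame.
  # On Windows, either double the backslashes to escape them...
  sn.csv <- read.csv("C:\\Users\\YourName\\Desktop\\social_network.csv", header = TRUE)

  # ...or use forward slashes, which also work on Windows (and are standard on Mac and Linux)
  sn.csv <- read.csv("C:/Users/YourName/Desktop/social_network.csv", header = TRUE)

  # Check that the data were read as expected: str() reports the observations,
  # the variables, and their types; blank cells come in as NA
  str(sn.csv)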
Reading data from SPSS
00:00In our last movie, we looked at how to get data out of an Excel spreadsheet, and
00:04into R through a CSV file; a comma-separated value file.
00:09In this movie, I want to pick up and talk about how to get data out of an
00:12SPSS file, because that's a very common statistical package used by a lot of researchers.
00:17Now, I'm going to continue with the same script that I have open, that's SPSS
00:21right here at the bottom on line 25.
00:23I'm going to scroll up a little bit.
00:26Now, there's a couple of different ways of dealing with SPSS in R. The one that
00:31I'm going to recommend is actually to use the exact same procedure we used with
00:34Excel; to save it as a CSV file, and then to import it using the read.csv.
00:40I find this to be the simplest, and most straightforward, and have the fewest errors.
00:44The way that you want to do this is by opening it up in SPSS, and then using the
00:49special saving function that they have there.
00:51I'm going to minimize R, and we will get into SPSS to do that.
00:55I'm just going to come down to my data file, and double-click on that.
01:02What you see is that this one looks a little different from the Excel spreadsheet.
01:06It has an extra column.
01:08The first column is the ID number.
01:10The second column is the Gender of the respondents, and that's written as text.
01:14In SPSS, that's referred to as a string variable.
01:17The third column here, though, is redundant with that one.
01:19It's called Female, and it's written as 0s and 1s.
01:23I've done this because very often in SPSS, the practice is to enter even
01:27text variables as numbers, and then put labels on top of those; associate them with those.
01:33Now, I personally like to use 0, 1; an indicator variable for gender where 1
01:38indicates the person that is of that specified gender, 0 indicates they're not,
01:42because I find it a lot easier to read those results for correlation
01:45coefficient for regression.
01:46And which one is 0 and which one is 1 is completely arbitrary.
01:51The first case in this one was a male, so they got a 0; the second one was
01:55female, so they got a 1.
01:56You can see that I have variable names that go over them.
01:59If you come up to the bar, and click on the fourth from the right, this one right
02:04here, you see that it says Value Label.
02:06If I click on that, then you see that the Female, the 0s and 1s have male and
02:10female that goes over the top of them.
02:12So, I'm going to go over to file, come down to Save As, and then from there,
02:19I go to Save As Type.
02:20Now, right now it says the default .sav.
02:23I'm going to come down to Comma delimited, that's .csv.
02:30Then you see up here that, that's the existing one that I created in Excel.
02:34In order to not overwrite that, I'm actually going to change the name here
02:38slightly, and add _.spss.
02:41Then I'm going to click Save, and minimize SPSS.
02:48Now, the second CSV file is the one that we just created in SPSS.
02:52What I'm going to do is I'm going to go back to RStudio now, and I'm going to
02:57run this command right here; it's sn, for social network, .spss.csv, and I'm
03:03going to use the read.csv command.
03:05You see it's mostly the same.
03:06I'm going to just scroll to the end here, and all I need to do is give the
03:11exact file path, and then I need to specify that it has a header for the
03:15variable names at the top.
03:16Go back to the beginning.
03:18And I'm going to run this one now. Just hit Run.
03:20Now you see that it's run, and then in fact, on the right, I now have data.
03:25I have sn.spss.csv and that's worked as well.
03:28I could run the structure to see exactly what it looks like.
03:30Just run that command, and that gives me a description of what it's like.
03:35The CSV I find to be the easiest and most direct way of doing this.
03:39There are actually several packages of code that have been developed to read
03:42files like SPSS files directly into R without translating them into CSV files first.
03:48One of the most recent is called foreign, for reading foreign formats.
03:53There's something interesting that happens.
03:54I'm going to scroll down here.
03:56Now, a package, known as a library in most other forms -- we're going to talk more
04:01about packages in the next movie.
04:03The thing is, it's a little bundle of code that adds functionality, but it has to be installed.
04:07So, the first thing I'm going to do is I'm going to install it, and this is
04:12actually going to download it, and put it into R.
04:14So, I'm just going to run that line; number 32.
04:17Then you see on the bottom, I got a bunch of text in the console that says that
04:21it ran that command, that's in blue, and then it installed it in the red, and
04:25then gave me just some final results in the black.
04:28Plus, if you look in the bottom right here under Packages, it's now installed
04:31one here called foreign.
04:32You notice it doesn't have a checkmark.
04:35Now, I can check it myself.
04:36I can do that manually.
04:37But in order to keep a record of everything, it's nice to do that with the script, I
04:42come back up to the script, to line 33, and I say library(foreign).
04:46That's going to load it.
04:48When I run that, you can also see those checkmarks come on on the bottom right.
04:52Then I'm going to use its own special format here; sn.spss.
04:58Then I have the .f to say I'm using foreign.
05:00That's just for me.
05:01Then I have the function read.spss.
05:05Then I gave the file path.
05:06At the end, I have to specify two extra things.
05:09One is to.data.frame.
05:11That is, I'm taking this SPSS file, and I'm saving it as a data frame.
05:16That's how we store all our data. And also that I wanted to use the value labels
05:20instead of the numbers for the numeric variables that have value labels.
05:24I go back to the beginning, and I'm going to run that line, and
05:27you can see on the workspace on the top right, I've now added another one;
05:31sn.spss.f. There it is right there.
05:34I'm going to run the structure on that one.
05:36Now, you can see that's all loaded the way that I wanted to also.
05:39So, this one is good, it works, but I generally get warning messages.
05:45The warning message is not problematic.
05:47It still went ahead and loaded it.
05:48It still did it the way I wanted.
05:50On the other hand, I'm more comfortable using the CSV, because I don't even have
05:54to install or load a package of code in order to do this.
05:57And so regardless of how you get your data into R, either by using a CSV file
06:03from Excel, or from SPSS, or by using a package like foreign to read it in, you're
06:09going to have a lot of opportunities to work with that, and that's what we will
06:13discuss in the next chapter.
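
Here is a minimal sketch of the two approaches described in this movie. The file names and paths are illustrative stand-ins, and the read.spss() arguments are the ones mentioned in the narration (to.data.frame and use.value.labels).

  # Option 1: save the .sav file as CSV from within SPSS, then read it as before
  sn.spss.csv <- read.csv("C:/Users/YourName/Desktop/social_network_spss.csv",
                          header = TRUE)
  str(sn.spss.csv)

  # Option 2: read the SPSS file directly with the foreign package
  install.packages("foreign")    # download and install the package (only needed once)
  library(foreign)               # load it for this session

  sn.spss.f <- read.spss("C:/Users/YourName/Desktop/social_network.sav",
                         to.data.frame = TRUE,     # store the result as a data frame
                         use.value.labels = TRUE)  # use value labels rather than numeric codes
  str(sn.spss.f)                 # a warning about the file format is common and not a problem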
Using and managing packages
00:00R is a very powerful and flexible program, even with its default installation.
00:05The beauty of R, though, is it can go so much further than its base version by
00:10adding packages or bundles of code that add functionality to R. At the moment,
00:14the Comprehensive R Archive Network or CRAN package repository lists over 4000
00:21packages for R, all of which can be freely downloaded and installed.
00:25The creativity and functionality of these packages is astounding, leading many
00:30people such as myself to tell others that R can do anything.
00:35In this movie, I want to show you how to find out about packages, how to install
00:39them, and how to use them in R.
00:42The first thing to do is to find out about packages that are available.
00:45On the bottom right of the screen here, this is one of the nice things about
00:50RStudio, is you have a list of packages that are already available.
00:53We start from the bootstrap functions under boot, to classification, and it goes
00:57all the way down to the utilities.
01:00These are ones that are installed, but it doesn't mean that they're loaded at the moment.
01:04The checkmark means that they're currently loaded, so the utilities and the
01:07stats are the ones that are loaded in this particular window.
01:09Let's take a look at what some of the options are.
01:12I'm going to go to line 6 in the editor window here, and browseURL; this opens
01:18up the URL in a Web browser.
01:19I'm just going to run that line, and it's going to open up my default browser, and
01:24there you have a large list of categories of packages that are available. And
01:29CRAN, again, stands for Comprehensive R Archive Network.
01:33You can pick a field that you're interested in; say, for instance, under graphics,
01:38and a huge number of choices.
01:39One of the most popular, by the way, is this one right over here: ggplot2. That
01:43stands for the grammar of graphics.
01:46That's a book, and this was written to be based on that book.
01:50It's an incredible package.
01:51I'm going to go back to R. That's a list of topics.
01:55You can also see what's available by name.
01:57In this case, I'm going to go to a specific mirror;
02:00the one that's at UCLA.
02:02I'm going to run this line.
02:04Here we have a very long list of packages.
02:07This is going to be the 4000 packages that are available.
02:13And pretty much everybody should be able to find something of utility for them in here.
02:18The next step in line 9 is to bring up, in the editor, a list of the
02:21available packages.
02:22So, those are going to be the ones that I have already.
02:25I'm going to just run that line.
02:27What this does is it brings up a text file in an editor window; we see right
02:31up here, and this mirrors a lot of what's over on the right, except it does
02:35show ones that are invisible, like the base that you couldn't turn on or off if you wanted to.
02:40Close that, and say, what about the packages that are currently active?
02:43That is, the ones that are already checked.
02:45I can do that with search.
02:46Just run line number 10, and then in the console, it shows me the packages that are there.
02:52It's got 11 listed.
02:53Again, not all of these show up, because some of them are invisible, like the
02:57global environment, but also the ones that are checked off on the right,
03:00you'll see in this list.
03:01Now, if I want to install a new package, say I found one that I really liked,
03:06there are a couple of ways to do this.
03:09For instance, you can come up to the menu, to Tools, to Install Packages.
03:16It brings up this menu; that's one way to do it. Or, you can use the Packages
03:19window here on the right, and just click the one that you want.
03:22But personally, I find it easy to use scripts, and one of the reasons for that
03:26is that it makes the procedure repeatable for other people.
03:30And also, it means that you can run them in larger source scripts, and they
03:33can run automatically.
03:35Now, one that I like is called psych.
03:37And what I'm going to do is I'm going to run this line on number
03:4118; install.packages.
03:43That's the command to download the package.
03:46Then you have to put the name of it in parentheses, and quotation marks. I am
03:50going to run that line; it's going to download the package.
03:54You see that's what we have here on the bottom left in the console.
03:57There's all this text, and it says it ran the command, it downloaded the
04:01package, it's been installed,
04:02and in fact, if you go to the Packages list on the right, and come down, you'll
04:09see that psych is now installed.
04:11It doesn't have a checkmark, because it hasn't been loaded.
04:14That's a separate procedure.
04:15So, what I'm going to do is I'm going to come to line 20, to library("psych").
04:20Now, please note, the quotation marks in library are not necessary, but Google
04:25suggests them as a good format.
04:26It's consistent with installing.
04:28You use the command library to make a package available when you're loading it
04:32in a script, like I am right now.
04:34On the other hand, if you've created a function or a package, sometimes you
04:37use instead require.
04:39Both of them have the same effect of loading the code that's in the package.
04:44I'm just going to use library, because that's the one that I use in scripts.
04:47So, I'm going to run that line, and then you see in the console that it ran
04:52library("psych"), and then you see in the window on the bottom right that I now
04:55have a checkmark next to psych.
04:57Require would do the same thing.
04:58Now, if you want to see the documentation, you can just come down here.
05:02I put library(help = "psych").
05:04That lets it know what I want the help on. I run that line,
05:08and it brings up a window in the editor.
05:10It has a text description, and it has a lot of the information about what goes into it.
05:15It's pretty lengthy.
05:18But you can get even more, and in a different format, if you try a
05:21different approach.
05:22Instead of just doing this one, a lot of programs, and psych is one of them,
05:27have what are called vignettes, and these really are just examples of how to use the package.
05:32So, what I'm going to do right here is I'm going to come to line 28, and I'm
05:36going to use the command vignette, then I'm going to specify it's for the
05:40package psych, so package = "psych".
05:43And if I run that, it brings up an editor window with not much in it.
05:48But if I do a small modification, and say I want to browse vignettes, that's
05:54going to open it up in a browser.
05:56It's going to look like this.
05:57Now, what I have is PDFs, and R codes, and LaTeX.
06:02I can hit on the PDF here, and now I can see a PDF that is nearly 100 pages of
06:09documentation on how to use the psych package.
06:13That can be downloaded and saved.
06:15It can be searched.
06:16It's a wonderful thing.
06:17I'm going to go back to R. You can also bring up a list of all of the vignettes
06:21that are available in all of the packages that are currently installed in R.
06:26That's just vignette().
06:28I'm going to run that line, and here are all the ones;
06:32we have displaylist, sharing, matrix,
06:37and just as we did with the psych vignettes a moment ago, if you want to have
06:41interactive hyperlinked version of this, you just use browseVignettes(). Now I
06:48have the documentation for nearly everything, including, for instance, Sweave.
06:51Now, once you have packages installed, it's important to remember that
06:54everything gets updated frequently in R, and so you're going to want to get
06:59things updated, including your packages that you use.
07:01In RStudio, there's a few different ways to do this.
07:04You come up to Tools to Check for Package Updates.
07:07You can do it there.
07:09You can also come over here, and just click on the green circle to check for
07:12updates, or you can just run this command: update.packages().
07:16Run that one,
07:17and it lets me know that there are some updates.
07:20Cancel those for right now.
07:23Then finally, if you have a package that you no longer need, you have the option
07:27of simply coming over here to the window, unchecking it, and then clicking on the
07:31X to get rid of it if you want, or you can also use this one: detach.
07:36That will also remove the package so it's no longer active.
07:39I'm just going to run that line.
07:41Now you see that the checkmark next to psych has disappeared,
07:44and if I want to get rid of it entirely, I just click on the X.
07:48Anyhow, that's one way that you can add extra functionality to R, and to give
07:54you some more of the flexibility and power to do almost anything that you need to do.
07:59And again, like R itself, these are free, they're open source, and they can make
08:03your analytical life much easier, and much more creative.
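
To summarize, here is a minimal sketch of the package-management commands used in this movie, with psych as the example package; the CRAN Task Views URL is the topic listing referred to in the narration.

  # Browse CRAN's topic listing (Task Views) in a web browser
  browseURL("https://cran.r-project.org/web/views/")

  # See what is installed and what is currently loaded
  library()     # list the installed packages
  search()      # list the packages attached in this session

  # Install, load, and read about a package
  install.packages("psych")      # download and install from CRAN (only needed once)
  library("psych")               # load it; require("psych") has the same effect
  library(help = "psych")        # open the package's documentation index
  vignette(package = "psych")    # list the vignettes supplied with the package
  browseVignettes("psych")       # open hyperlinked vignettes in a browser

  # Keep packages current, and unload one when you no longer need it
  update.packages()                        # check CRAN for newer versions
  detach("package:psych", unload = TRUE)   # remove psych from the search path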
3. Charts and Statistics for One Variable
Creating bar charts for categorical variables
00:00Once the data are entered into R, the first task in any analysis is to examine
00:05the individual variables.
00:07Now, the purpose of this task is threefold:
00:09first, to check that the data were entered correctly; second, to check whether
00:14the data meet the assumptions of the statistical procedures that you've planned
00:17to use; and third, to check for any potentially interesting, or informative
00:21observations, or patterns in the data.
00:24For a categorical variable, such as a respondent's gender, or a company's economic
00:28sector, that is, a nominal or an ordinal variable, the easiest and most
00:33informative way to check the data is to make a bar chart,
00:35and so that's where we turn first.
00:37The unfortunate thing about R is that it's not really set up to do bar charts
00:41from a raw data file.
00:43It wants to do them from a summary data file, where you say, this is the
00:47category, and this is how many people are in that category.
00:50On the other hand, if you have raw data, where you're simply listing category
00:541, 2, 1, 1, 2, 2, 2, there's an easy way to work around it, and that's what I'm
01:00going to show you here.
01:01I'm going to be using the social network data that I've used before, and I'm
01:05going to get that loaded.
01:06The way I'm going to do this is I'm going to use the same read.csv function
01:11that I've used before. That's because I'm dealing with a comma-separated values
01:16spreadsheet, and I'm going to feed it into a data frame called sn, for social network.
01:21I am going to set it up a little bit differently, though, because you may recall
01:25in the previous versions, I specified explicitly the entire file path from C on.
01:31I want to use a shortcut version.
01:33I am going to show you how to set that up.
01:35If you go up to Tools, down to Options, one of the choices you have in the
01:40General window is the Default working directory;
01:43that is, when you're not in a project that explicitly puts it somewhere else.
01:47Even though we have a little tilde here, this actually is currently going to
01:50my Documents folder,
01:52but I'm going to go to Browse, and I'm going to change it temporarily to the
01:56Desktop, because I've copied the files over to the Desktop.
02:00Then I put Select Folder, and now you see it has C:/Users/Barton
02:05Poulson/Desktop, and I can just press OK.
02:08And now I can just have a very short version, where I give just the file name
02:13without the entire file path.
02:15I still need to use the read.csv,
02:17I still need to say that I have a header, but otherwise it's more
02:20abbreviated than that.
02:21So, I'm going to read that in right now, and now that's loaded in, we can move
02:26on to the next part.
02:27You see in the console that it ran, and you see on the top right under workspace
02:31that I now have a data frame, sn, 202 observations with 5 variables.
02:36What I have here now is a comment noting that R's barplot doesn't work with raw data;
02:39it can't chart directly from the categorical variable.
02:42We first have to create a table with frequencies,
02:45and I'm going to use a table function to do this.
02:49In line 25, this is where I create the table.
02:52What I do is I specify the name of the new table, and that's going to be
02:57site.freq: site, because I'm looking at the Web sites that people say are their primary
03:03social networking sites,
03:04and .freq for frequency.
03:06And then I have the assignment operator, gets, and then table is the function.
03:11And then I am specifying in the parentheses the data set, sn, that's my data
03:16frame, with the dollar sign;
03:18I use that to specify which variable I'm using to create the table.
03:22In this case, I'm using site.
03:24Please note the capitalization.
03:26R is case sensitive.
03:28You've got to make sure that the capitalization is the same all the way through.
03:31So, I'm going to run that command,
03:33and now you see it ran down in the console, and on the right, I now have values.
03:38I have a table now with 6 values in it.
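As a rough sketch, the two steps so far look like this, with the file name and variable names as described in this movie:

# Read the comma-separated file from the working directory (here, the Desktop)
sn <- read.csv("social_network.csv", header = TRUE)

# Summarize the raw categorical variable into a frequency table
site.freq <- table(sn$Site)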
03:41What I'm going to do now is create the default bar chart.
03:44This is one where I simply take a barplot, and I just run it exactly as it is.
03:49So, that's barplot, and then you put the table in there, site.freq, and then I run that one.
03:56In the bottom right here, you see that it's opened up, and there are a few
03:59things that are going on.
04:00Number one is it's gray.
04:02It doesn't have any titles.
04:03There's only every other label.
04:05The scale only goes up to 80, and there are some other issues.
04:09You can see it bigger if you want to.
04:11Just come down and click on Zoom.
04:14Now it fills up the whole space, and you can see all of the labels.
04:17There are a lot of options within barplot that allow you to control the color,
04:21the font, the orientation, the order; a ton of things.
04:25I'm actually going to take just a second here to show you how you can find
04:29out more about that.
04:30I've got, here on line 28, the question mark, a space, and then barplot.
04:35This is how you find help on any of R's functions, and I'm just going to run that line.
04:41Now you see it brings up the Help window here that talks about all the
04:45functions and the options available in barplot.
04:47And so, I'm going to show you a few of these.
04:49I'm not going to run through all of them, because there's an enormous number,
04:53especially because barplot feeds into some other more general options, such as
04:58this one here that talks about graphical parameters, which gives you just an
05:02incredible amount of control of things you want to specify.
05:06Mostly I want to show you just this very basic one,
05:09and I'm going to make a few variations on it.
05:12The first thing I'm going to do, and I think it's really important, is to put the
05:16bars in descending order.
05:17Unless there is some sort of inherent and necessary order in your data, a
05:20descending order is a really convenient way to do it.
05:23The way to do that is actually I have to tell it that I'm going to be drawing a
05:27barplot, and I'm going to be using this data,
05:30but I want to order it according to this variable, because theoretically
05:34you could order it according to a different variable, and then I'm going to
05:38use a decreasing order.
05:40So, decreasing = TRUE.
05:42So, I come over here, and I'm going to run this line,
05:45and now you see that it's in decreasing order. That's good.
05:48And if you want to see it bigger, what we have here is a lot of people who
05:52reported using Facebook.
05:53The next biggest was people who said they used None, but they still answered the survey.
05:57Then, you can tell this data is a few years old, because we have people saying they
06:01used MySpace, and then we have LinkedIn, and Twitter with just a couple of
06:04people each, and I'm willing to bet that all those things have changed since
06:07this data was first gathered.
06:08I am going to close that window.
06:11Now, it's better that it's in order, but we still have an issue of the labels,
06:15and the scale is not long enough, and we have no titles.
06:18I'm going to show you some of these other things.
06:21What I'm going to do first is I often like to put bar charts horizontally,
06:26because then the scale is in the same direction that it is on a lot of other analyses.
06:30So, what I do then is I'm going to do barplot, and I'm still going to order
06:34them, except I'm not doing them decreasing, because it needs to be increasing
06:38when you're dealing with horizontal, because it starts at the bottom and goes up.
06:41But this time I have horiz, for horizontal, = TRUE.
06:44So, I'm going to run that command,
06:47and now I have a horizontal one, but
06:48you see I lost even more of the labels.
06:51Now, I also want to do something about the color here.
06:53For instance, Facebook has a distinctive color of blue associated with it, and
06:57so it would be nice to highlight it with that color.
07:00So, what I'm going to do is I'm going to come down here, and I need to create a
07:04vector; a collection of color specifications.
07:08And the way I do that is I first give it a name.
07:11So, it's like a new variable; a new data frame.
07:14I'm calling it fbba, for Facebook blue, ascending, because
07:21if I were doing this as a vertical bar chart, I'd need a descending version instead.
07:25Then I have the assignment operator, and that's the arrow, and then c is for concatenate;
07:30sometimes collection, or combined.
07:32And then, I'm going to have six colors in here.
07:36Five of them are going to be identical; they're going to be gray.
07:39And so, I could write gray, gray, gray, gray, gray, or I can use this other
07:43option; that's rep, and that's for repeat.
07:45And what I do is I put down rep, and then I put in parentheses what it is I want
07:51repeated, and I want the word gray in quotation marks repeated.
07:55And then after a comma, how many times I want to repeat it, and I want it five times.
08:01Then, after the comma, I can put the last color that I want, and I am going to
08:05do that one in particular way.
08:07First off, in order to get the Facebook blue, I want to specify it exactly, and
08:11I've got what are called the RGB codes; the red, green, blue codes.
08:14And that's 59 for red, 89 for green, 152 for blue,
08:17but I also need to tell R that I'm working on a 0 to 255 8-bit color scale.
08:23And so, that's what the maxColorValue is for, and then I finish the command.
08:27This is also the first time, I think, that I've broken code across two lines.
08:32The reason for that is this is a long line of code, but it's all a single
08:36command, and so this is one way of making it easier to follow, by breaking it into pieces.
08:41So, I'm going to highlight both of those lines, and then hit Ctrl+Return to run them.
08:46Now I'm going to do a modified version of the barplot, where I'm adding this
08:52bottom line here that says, col, that's for color, and I'm saying use the
08:57vector fbba, and I'll highlight the whole thing, and I'm going to run it.
09:01And you'll see that in my chart on the bottom right, the top one, which is
09:05Facebook, turned blue.
09:06Now, it doesn't say Facebook, because it's small.
09:09If I click on Zoom, then you can see that it's Facebook.
09:13There are some other issues with this chart.
09:14Number one, I'd like to turn off the borders around the bars.
09:17Also, I need titles;
09:19I like to have a subtitle.
09:21The scale on the bottom goes from 0 to 80, but the bars go farther than that,
09:25so I'd like to change it, so it goes up to 100.
09:28I happen to know that the maximum value is just under 100.
09:31And that's why I'm adding several other arguments to this function.
09:36So, this is the same barplot function, and I'm making a chart of the site frequency.
09:40I'm going to order it by site frequency, and this one says make it horizontal.
09:46This one says use the Facebook color vector.
09:50border = NA; that means no borders at all. xlim; that's the limits for the X.
09:56This one needs to be its own little vector, and so I have c, for concatenate, and I
10:01say it goes from 0 to 100.
10:04And then I have one that says main, and that means the main title.
10:08That one is kind of long.
10:10I didn't want to break it across.
10:12So, let me scroll through here.
10:13And what I'm saying is Preferred Social Networking Site,
10:15and then the \n is a way of inserting a line break in the middle of it.
10:20So, there will be a second line to this one that says, A Survey of 202 Users.
10:25Then xlab at the bottom means the label for x that's going to appear
10:28underneath the scale.
10:30So, when I highlight all of those lines, and run them, you see now the borders
10:36have gone away, the scale has extended to 100, I have a title on the top, and
10:40I have a scale label on the bottom.
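Putting the pieces together, the finished bar chart command looks roughly like this; the color vector follows the description above, and the x-axis label text is a placeholder of mine, since the exact wording isn't spelled out in the movie:

# Five gray bars plus Facebook blue (RGB 59, 89, 152) for the longest bar
fbba <- c(rep("gray", 5),
          rgb(59, 89, 152, maxColorValue = 255))

barplot(site.freq[order(site.freq)],   # ascending order works best horizontally
        horiz  = TRUE,                 # draw the bars sideways
        col    = fbba,                 # gray bars with Facebook blue on top
        border = NA,                   # no borders around the bars
        xlim   = c(0, 100),            # extend the scale to 100
        main   = "Preferred Social Networking Site\nA Survey of 202 Users",
        xlab   = "Number of Users")    # placeholder label text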
10:42If I make this bigger, you can then see all of the site names.
10:47If I wanted to spend some more time on this, I would turn the labels, the
10:51Facebook, and None, so that they were horizontal.
10:54I would probably move Other and None down to the end.
10:57There are a lot of other things that I could do here.
11:01That's why you want to be able to explore the options that come with barplot;
11:06that's why I had the question mark, space, barplot.
11:09And then also the parameters that are the general graphics parameters.
11:12They give you an immense amount of control.
11:14You can basically make this do whatever you want, but this is an example of some
11:18of the modifications that are possible.
11:22There's just one other thing I want to show, and that's how to export these
11:25charts, because right now it's a chart that's just inside R. You see right here,
11:29we've got a really easy thing. It says Export.
11:31This is one of the advantages of using RStudio.
11:34I can say, for instance, save it as a PDF, and I can tell it how big I want it.
11:39Let's say I want it to be 8 inches by 6 inches.
11:43Then I can give that file a name: snPlotpdf.
11:50One of the great things about RStudio is that it gives you options for
11:53exporting your graphics.
11:54So, for instance, let me zoom in on this graphic.
11:57We've got what we need there. I'm going to close it, and I can export it as a PDF.
12:03And that's something that the regular version of R does, but also, I can save
12:06the plot as an image, and I have a lot of choices here, from PNG, JPEG, TIFF, and so on.
12:12I can choose my own width and height, which is hard to do in a regular version
12:15of R. I can view it after I save it, and make it big enough so you can see all the labels.
12:21Anyhow, I'm just going to press Cancel right now.
12:23The idea here is that you have a lot of control over these bar charts, and
12:27that RStudio in particular gives you a lot of options for exporting and
12:30sizing your charts.
12:31Making a chart really is one of the first things you want to do when you're dealing with a
12:35categorical variable: it gives you a feel for your data, shows you
12:39how well you meet the assumptions, confirms the data got entered correctly,
12:43and leads in to the later analyses that you're going to do.
Creating histograms for quantitative variables
00:00In the last movie, I started by saying how important it was to screen the
00:04variables as you enter them by making charts as a way of checking that you
00:09entered them correctly, that you are meeting the assumptions of the statistical
00:13procedures that you intend to use,
00:15and a way of giving you an idea of what's interesting or unusual in your data set.
00:19We looked at bar charts, which are good for categorical variables. When you
00:22have a quantitative variable, something measured at the interval or ratio level,
00:26like age, or time, or income, then you want to use a different approach.
00:32The two most common forms of graphics you want to use in that case are
00:36histograms, like bell curves, and box plots.
00:39In this particular movie, we're going to look at histograms.
00:42Now, the nice thing about histograms is that, unlike bar charts, R has a built-in
00:47function for this one that does not require you to do any sort of pre-processing of the data.
00:51I'm going to use an example here of the social network data that I've used before.
00:56I'm just going to scroll down here, and read in that data set.
01:00You can see on the workspace I've got a data frame, that's sn, for social network.
01:05It's got 202 observations with the 5 variables.
01:08And then I just come right down here, and I'm going to make a histogram of the
01:13variable of age, so I'm going to look at distribution of the age of
01:16respondent, so I use hist, that's the function, and within the parentheses, I
01:21specify the data frame, that's sn, and then the dollar sign, and then I give the variable name.
01:28Now, I should mention, it is possible to use something in R; a function
01:32called attach, which means you attach a data set, and then you can refer to it
01:37in a short-handed way.
01:38You can just give the variable names, because it knows you're referring to that
01:40particular data set.
01:42The problem with attach is it really sets the stage for a lot of really
01:46unfortunate errors, where you have more than one data set open, and that you get
01:50confused about what's doing what.
01:52And so, for instance, when I talked about the Google Style Guide for R, they
01:56just said don't use attach ever.
01:58So, what I'm doing here is I'm explicitly saying what the data frame is, and
02:03what the variable is.
02:04Anyhow, I'm going to make a histogram of age, and all I have to do is run that
02:08one line on line 15.
02:10There we have the default histogram.
02:13You see, for instance, it says histogram, and then it gives my funny title there
02:16on the top, and runs it again at the bottom.
02:18And this is sort of an outline version of what we have.
02:21I'm going to make just a few modifications to this; not very many.
02:25I'm going to come down here. What I'm going to do is -- I tried removing the borders once.
02:30You can do that, but it looks silly, so I left that out.
02:33I'm going to change the color to a beige color; actually, a very light color.
02:37It turns out that light beiges and yellows are good at getting people's attention
02:41without being overwhelming.
02:43You can specify colors in a few different ways.
02:45This one is a named color, so I put col, for color, and then in quotes I put the word beige.
02:51That's referring to a specific one.
02:53There's another way to refer to it, and that is colors in R also have numbers
02:58from 1 to 657, I believe, and the beige is number 18.
03:04The way that you would specify it in that case is with this line.
03:08I would put col, for color, then colors, referring to the full set of
03:12colors, and then in the square brackets, I just give index number 18.
03:16That would get the same color, but I'm going to make it beige, and then I'm going
03:21to put a title on the top.
03:23That's main, that's for the main title, and it's a long one, so I'm just going to
03:27scroll to the end here for a moment.
03:29And the backslash n breaks it into two lines.
03:31I'm going to go back to the beginning, and then I'm going to have an X label at
03:36the bottom that I'm going to put underneath the age, where it's just going to
03:40say age of respondents.
03:41So, what I do now is I highlight these lines, and I run those.
03:44Now you'll see I have a little bit of fill, just to make it pop out a tiny bit.
03:50I have an interpretable title at the top.
03:53I've got a label under the age that makes sense, and that's really enough
03:57for what I need to do.
03:58That's a functional, useful histogram, and again, like bar charts, there are about a
04:02million options that you can have in terms of modifying a histogram in
04:06particular, and the graphics parameters in general.
04:08You can explore those, but this is sufficient for getting started.
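A sketch of the customized histogram described here; the title text is approximated from the narration:

# Histogram of respondent age with a light fill and readable labels
hist(sn$Age,
     col  = "beige",            # named color; colors()[18] gives the same beige
     main = "Ages of Respondents\nSocial Networking Survey of 202 Users",
     xlab = "Age of Respondents")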
04:12By the way, I just wanted to add something about R's color palette.
04:16If you want to, you can actually see the palette by going to this Web address.
04:20I'm going to copy that, and I'm going to go to a Web browser, and we get a large chart.
04:26This is just the beginning of it that talks about what all the colors are.
04:30If you click on the PDF, it's several pages long. It gives the numbers for colors,
04:35and then sorts them, and then gives the individual names for each one of them.
04:39For instance, there's the beige that I used just a moment ago.
04:42You can also get the hex codes, and the RGB codes if you want for that.
04:47I'm going to go back to R now, and just show one other thing.
04:50By writing colors, and then 18 in square brackets, I'm referring to element 18 of that array.
04:53If I run that line, what it does down here is
04:56it says that color number 18 is beige, and then I can also specify several
05:01by putting them in a concatenated array.
05:04When I do that, I run that, and it tells me the colors of each one of those
05:08numbers that I put in.
05:10Anyhow, those are some of the options that you can use in customizing your
05:13histograms as a way of exploring the quantitative data, and getting you ready
05:17for further analyses.
05:19In the next movie, we're going to look at another chart that is very useful for
05:22quantitative variables, and that's the box plot.
Creating box plots for quantitative variables
00:00In the last movie, we looked at how you can use histograms as a way of checking
00:04the nature of a quantitative variable to see whether it got entered correctly,
00:09to see whether it meets the assumptions of the statistical tests that you're
00:13going to perform, and to look for interesting or potentially informative
00:16observations within that variable.
00:18Another graph that I always create when I'm looking at quantitative variables is a box plot.
00:24A box plot is a shorthand way of looking at the distribution.
00:26It highlights outliers, and it gives you an idea for what might be unusual or
00:31exceptional in a distribution.
00:33In this particular data set, I'm going to use the same variables that I used in the last one.
00:39I'm going to open up the social network data again, and then I'm going to
00:43come down to boxplot.
00:44Again, the nice thing is this is a built-in function, and it doesn't require any
00:47preprocessing the way that we had to do with the bar charts.
00:50All I do is I say I want a boxplot, and then I'm using the data frame, or the data
00:55set sn, and the variable Age in that one.
00:59I'm just going to click that.
01:01By default, it makes them horizontal, and there are no labels.
01:05However, you can see that the median age -- that's the thick line through the
01:09middle of the box -- is around 30, and we go down to below 10 years old, and up
01:13to about 70 years old.
01:15I'm going to make a few quick modifications of the boxplot.
01:17Let's scroll down here.
01:20The first thing I'm going to do is I'm going to put some color in it.
01:24I'm going to use the beige again.
01:25It's enough to make the boxplot pop off the page, but without being overwhelming.
01:29I'm also going to add notches to the box plot.
01:32That's a way of actually doing a sort of visual inferential test for the medians of boxes.
01:37I'm going to make it horizontal, because I like to have it in the same scale as
01:41the other variables that I use.
01:43I'm going to add a title across the top with main, and it's going to be two lines.
01:47It has Ages of Respondents, and then the \n splits it into a second line.
01:52Then we get Social Networking Survey of 202 users, and then we're going to
01:57back up a little more.
01:59Then I'm going to have a label on the x-axis for Age of Respondents.
02:02When I highlight those lines, and click run, now what I have is one that looks
02:09much cleaner, and much easier to read.
02:12We have Age of Respondents going across.
02:14We have a title on the top, so we know what it's actually showing us this time.
02:18Also, because it's stretched out the long way, it's easier to see what's
02:22going on in the boxplot.
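For reference, the box plot command described here looks roughly like this:

# Notched, horizontal box plot of respondent age
boxplot(sn$Age,
        col        = "beige",   # light fill
        notch      = TRUE,      # notches give a rough confidence interval for the median
        horizontal = TRUE,      # lay the box out sideways
        main       = "Ages of Respondents\nSocial Networking Survey of 202 Users",
        xlab       = "Age of Respondents")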
02:23The notches there require a little bit of an explanation.
02:25The dark black line in the middle of the notch is the median; 50% of the scores
02:30are above, 50% are below.
02:32The notches indicate basically a confidence interval based on the variation
02:36within the distribution, and it can be used compared to other distributions.
02:40So, for instance, one of the options we could have is to make a separate boxplot
02:45of men, and another one of women, and then we can compare the median age of men
02:50and women, or we could make boxplots for the ages of people who preferred
02:53different social networking sites.
02:56Also, the dotted lines are sometimes called the whiskers, and they go to the
03:00highest and the lowest non-outlier scores in the distribution.
03:04If we had outliers, the whiskers would stop, and they would be marked with
03:07separate circles as a way of highlighting both that they are unusual, and
03:11potentially ignorable, depending on the purposes of our analysis.
03:16Anyhow, I encourage you to try the box plots to explore the alternatives that
03:20are part of the boxplot function itself, and that carry over from the graphical
03:25parameters, the par function, that are available as well, the same way you
03:29can with the bar charts, and with the histograms.
Calculating frequencies
00:00When you're exploring your data to make sure you meet your assumptions, or to
00:04find interesting exceptions, graphics are an excellent first step.
00:08However, most analyses also require the precision of numbers in addition to the
00:12heuristic value of graphics.
00:15Just as we started with graphics for categorical variables, we'll also start
00:18with statistics for categorical variables.
00:21The most common statistics in this case are frequencies, which is what we'll do first.
00:25I am going to use the data set that I have been using so far; social network.
00:28I'm going to come down here, and
00:31because I have it saved in my default location, which I set to be the Desktop, I
00:35can simply run this line to read the CSV file.
00:39I see in the console that that command ran fine.
00:41In the top right in the Workspace, I see that I've now loaded the data set sn,
00:45for social network; it's got 202 observations in 5 variables.
00:49The next thing is to create the default table, and this is a frequency table.
00:54It does it in alphabetical order, and it looks like this when I run it.
00:58What we have is 93 people who indicated that Facebook was their preferred
01:03social networking site, 3 who did LinkedIn, 22 to MySpace, and so on.
01:07Now, this is adequate for getting the numbers.
01:09On the other hand, it would be nice to be able to modify it in a particular way.
01:14This is going to be easiest if I save the table as its own data frame.
01:18That's what I'm going to do in line 15.
01:20So, I'm going to create a new data frame called site.freq, or frequencies of the sites,
01:25and I'm going to use it making the same command here.
01:28So, I'm just going to run it again.
01:30Now you can see that I've created this new data set, and in fact, that shows
01:34up in the Workspace.
01:35It is a table which has six values in it.
01:37Now I'm going to print the table just by writing its name;
01:40just site.freq will print the table, and there it is.
01:43It looks exactly the same as what I had before.
01:45Now what I'm going to do is I'm going to start modifying it just a little bit.
01:50The first thing is I'm going to sort it.
01:52Sorting is kind of a funny thing when it comes to tables.
01:55I'm going to sort it into itself.
01:57I'm replacing this table with a sorted version.
01:59In line 18, you see that I have site.freq.
02:02That's the name of the table.
02:04Then I have the assignment operator, the arrow dash that's read as gets.
02:08Then I say it gets site.freq, but then in square brackets, I put down that
02:13I'm going to order it, and then in parentheses, I put down the basis for the ordering.
02:18In this case, I'm ordering it by the only thing in there, site.freq.
02:21The idea here is that you could order it by another variable.
02:25In this case, I'm also specifying that I want to do it in a decreasing format.
02:30That's why the decreasing equals T for true.
02:32I'm going to run that command, and we see that that ran in the console.
02:37The command is there.
02:38Now I'm going to print the table over again by just doing site.freq.
02:43Now you see that it's sorted in order.
02:45It started at Facebook again, then None.
02:48It goes 93, to 70, to 22, to 11, and so on.
02:52These are the counts, the frequencies; how often each one occurs.
02:55On the other hand, sometimes it's helpful to have the proportions of the
02:59percentages, and that's a very simple thing to do with R's built-in table function.
03:04I'm going to use the prop.table function.
03:07That's proportions.table.
03:09I'm going to say what I need the proportions of, and that's site.freq, which I
03:13saved as a table, so it would work on this one.
03:16I'm just going to run that command, and
03:17now you see that I have the same labels -- Facebook, None, MySpace -- in order, and
03:23I have proportions under them.
03:24Proportions go from 0 to 1, where 0 is 0%, and 1 is 100%.
03:29Now, the one problem with this list is that I've got way too many decimal places.
03:34If I want to get it down to just two decimal places, I've got just one more
03:38command I'm going to run here.
03:39I'm going to take the command I just ran in line 21, and I'm going to wrap it
03:44in round, and that tells R that I want to round it, and then at the very end
03:48of that, you see that I have comma, 2;
03:50that means two decimal places.
03:51So, I'm going to run that command.
03:53That's basically how I want it to look.
03:55Now what I have is proportions.
03:58So, it says that 46% of the respondents indicated that Facebook was their preferred
04:03social networking site.
04:05In this particular data set, 1% chose LinkedIn or Twitter.
04:09Depending on your purposes, you may want to report the proportions, or you may
04:13want to report the counts, or frequencies up here.
04:16Usually, actually, you would want to do both.
04:19The nice thing is that the table command in R makes it simple to do both
04:22of those.
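As a recap, the frequency commands from this movie look roughly like this:

# Frequency counts of the preferred social networking site
site.freq <- table(sn$Site)

# Sort the table from most to least common
site.freq <- site.freq[order(site.freq, decreasing = TRUE)]

# Proportions (0 to 1), rounded to two decimal places
round(prop.table(site.freq), 2)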
Calculating descriptives
00:00In the previous movie, we looked at descriptive statistics for
00:03categorical variables.
00:05In this one, we'll look at some common, and not so common statistics for
00:09quantitative variables, using both R's built-in functions, and some specialized
00:13functions from code packages.
00:16To do this, I'm going to use the same data set: social_network.csv.
00:20I'm going to come down here, and run line 12 to load it into a data frame called
00:25sn, for social network.
00:27We see in the console that that command ran, and on the Workspace on the
00:31right, that that's loaded.
00:33The first thing I'm going to do is simply get the default summary for the
00:37variable age, which is the age of the respondents.
00:41I'm just going to run line 13 here, which says summary.
00:45Then I'm specifying the data frame sn, and then the dollar sign is for the variable age.
00:52By default, what I get is the minimum value.
00:55So, apparently somebody who said they were six years old responded to this
00:59online questionnaire.
01:00Then I have the first quartile value, which is the lowest 25%, then the median,
01:06which is 28 years old, then the mean, which is 31.66, the third quartile, and
01:12then the maximum, which is 70, and 12 people did not respond to the question,
01:17so we have NA's for not available.
01:20An even a quicker way to do this is to get the summary statistics for the entire
01:24data frame at once; the entire table, including the categorical variables.
01:28To do this, all I have to do is run a summary, and then give, in parentheses, the data frame.
01:33Don't even specify a variable.
01:35So, I'm going to run line 14 right now to do that.
01:38I'm going to scroll;
01:40I'll make this bigger by clicking on that right there.
01:43What you see is we have five variables. ID, now, ID is just a sequential one that
01:48goes from 1 to 202, so we can actually ignore that one.
01:51For gender, you see that I have one person who did not respond.
01:54I have 98 who said they were female, and 103 who said they were male;
01:58nearly evenly split.
02:00Then I have my age statistics.
02:01Those are the same as the ones I have right above. Then the number of people who
02:05chose each of the Web sites for their preferred social networking site.
02:10Then the number of times that they say they logged in per week.
02:12This one is an interesting variable, by the way.
02:14We have a lot of people who said they logged in 0 times, and the 25th
02:19percentile score, the first quartile, is 1 time per week, but take a look at the maximum score.
02:24There was one respondent who said that he logged in 704 times per week, which is
02:29physically possible; we did the math.
02:31It's once every 10 minutes for every waking hour during the week, and then 31
02:35people did not respond.
02:37So, this is actually a beautiful summary thing, because it does all the variables,
02:41both quantitative and categorical, in the data set all at once.
02:45Now I'm going to shrink this one back down, and I'm going to do just a couple
02:50of other variations.
02:51One is something that we saw pretty much here.
02:54There's something called Tukey's five number summary.
02:58We basically have it right here.
02:59If we come down to the bottom here, you see that we have the minimum, the first
03:03quartile, the median, the mean, the third quartile, and the max.
03:06If you remove the mean, then you actually have the five number summary, but this
03:10is a really condensed version of it.
03:13So, I'm going to do it just for the age, and run line 20 here, and there we have it.
03:19I'm going to make the console bigger here, so you can see age.
03:23We have 6, 21, 28, it rounds off to 41, and then to 70.
03:29Now, the pros and cons of the five number summary;
03:31the pro is that it's very compact.
03:33It also is nice that it rounds off, and that we don't have the decimal places.
03:37The problem is, of course, that it's not labeled at all.
03:40You have to know what these things are; that they're the minimum, the
03:44first quartile, and so on.
03:45I just want to make you aware, though, that this is an option, and it's something
03:49that's used -- these are the values that are used when drawing box plots that we
03:53did in the other chapter.
03:54Now I'm going to use some alternative descriptive statistics; a really big
03:57set of statistics that includes the mean, the standard deviation, the median, the 10% trimmed mean;
04:05an unusual one: the median absolute deviation from the median, the minimum,
04:09maximum, range, skewness, kurtosis, and the standard error.
04:12I can get all of these at once by using the package psych.
04:17This is an external package, and so we need to download it, but all I have to do is run line 31.
04:25We wait a moment, and now it says that it's installed.
04:28In fact, if I click over here on packages, scroll down a little bit, you'll see
04:32that psych is now there.
04:34It's not checked off, because I haven't loaded it yet.
04:37I have installed it, but I haven't loaded it.
04:39If I run line 32, which says library ("psych"), that will load it, and you see
04:43that it's now checked over there.
04:45I could also check it manually, but I like using the script, because it keeps a
04:49record of everything that happens.
04:50Now all I do is I use the function describe, and I run it for the entire
04:57data frame again, so it's 33.
04:58I'm going to just run 33, and make the console bigger here.
05:04Now what you have is the five variables listed on the left, so
05:10ID, Gender, Age, Site, and Times.
05:12Now, let me point out something really important here;
05:16two of these are categorical variables.
05:19Gender and Site are categorical, and they have asterisks next to them.
05:24It is, however, still going through and calculating numerical summaries for them.
05:28What it's doing is it's taking the levels, and it's putting them down as one, two,
05:33three, and so on, and there are times when, even though it's categorical, these
05:38kinds of summaries can make sense.
05:40So, for instance, if it's an ordinal variable -- first, second, third -- then there
05:45can be times when averaging them makes sense.
05:48If it's a dichotomous variable, like Gender, for instance, if I were to code it as 0
05:53and 1, even though that's a category, if you get the mean, it tells you the
05:57proportion of people who have ones.
05:59Now, because I have a missing value in this one, the missing value gets the one,
06:04the male gets the two, and the female gets the three.
06:08What you do see here is it tells me that I'm about evenly split, but anyhow, the
06:12important thing here is I've got my five variables listed down the side, and I
06:16have all of these things here.
06:18I have options for controlling the way that the median is calculated.
06:21I have options for adjusting the level of trimming on the trimmed mean.
06:26I have options for controlling the way that the skewness and kurtosis are
06:29calculated, but this is also a very nice format -- this is the sort of
06:34thing that I could copy and paste into a paper, and just adjust the font size a
06:40little bit; get it all on one line.
06:41So, this is a great way of describing a quantitative variable.
06:46Between summary, and describe, and also the five-number summary, we've now taken
06:51a good look at the numerical description of each of our quantitative variables,
06:57and that gets us ready for some of the more detailed analyses that we're going
07:01to do in the next movies.
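A sketch of those three approaches; I'm assuming the five-number summary uses base R's fivenum function, since the movie doesn't name the exact call:

# Built-in summaries
summary(sn$Age)    # min, quartiles, mean, max, plus a count of NAs
summary(sn)        # the same for every variable in the data frame
fivenum(sn$Age)    # Tukey's five-number summary (unlabeled)

# Richer descriptives from the psych package
install.packages("psych")   # one-time install
library("psych")
describe(sn)                # n, mean, sd, trimmed mean, mad, skew, kurtosis, se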
4. Modifying Data
Recoding variables
00:00When you've taken a thorough look at your variables, you may find that some of
00:04them may not be in the most advantageous form for your analyses.
00:08Some of them may require, for instance, rescaling to be more interpretable.
00:12Others may require transformations, such as ranking, logarithms, or
00:15dichotomization to work well for your purposes.
00:18In this movie, we're going to look at a small number of ways that you can
00:22quickly and easily recode variables within R. For this one, we're going to be
00:26using the data set we've used before, social network, and I'm going to load that
00:31by simply running line 12 here.
00:34And then I'm going to be using the psych package, because it gives me some
00:39extra options for what I want to do here.
00:41So, I'm going to run line 15 to install it, and then run line 16 to load it.
00:45Now, what I'm going to do right here is I'm going to first take a look at
00:50the variable times; the number of times that people say they log in to their site each week.
00:55The easiest way to do this is with a histogram, because it's a
00:58quantitative variable.
00:59I'm going to run line 19.
01:02What we have here is an extraordinarily skewed histogram.
01:06You see for instance that nearly everybody is in the bottom bar, which says they
01:10log in somewhere between 0 and 100 times per week.
01:14We have somebody in the 100 to 200 range, and then we have another person we saw
01:19before in the 700 to 800 range.
01:21The normal reaction to this might be simply to exclude those two people, because
01:26they are such amazing outliers, and yet, you can do that, but I want you to see
01:30that there are other ways to deal with it.
01:33The first thing I'm going to do is one common transformation; it actually doesn't
01:36change the distribution, it just changes the way that we write the scores, and
01:39that's to turn things into z-scores, or standardized scores.
01:43And what that does is it says how many standard deviations above or below the
01:47mean each score is.
01:49Fortunately, we have a built-in function for that, and it's called scale.
01:53So, what I'm going to do is I'm going to create a new variable called times.z for
01:57z-scores of time, and I'm going to use scale, and then sn for the social network
02:03data frame, and then the variable Times.
02:05So, I'm going to run line 24 here,
02:07and you see that on the right side on your workspace, I have a new variable
02:11that has popped up.
02:12It's actually a double matrix, which is an interesting thing.
02:15I'm going to run line 25, and get a new z distribution; a histogram.
02:20You see, it should look the same as the Times distribution.
02:23It's pretty similar, but it's binned differently.
02:26And so, some of the people who are in the 0 to 100 range, if they were in,
02:30like for instance, the 50 to 100, they got put into a different bin, but you still
02:34see that we have these two incredible outliers here.
02:37I'm going to get a description of the distribution.
02:40This is where I have the trimmed mean, and the median, so on, and so forth.
02:44One of the interesting ones here is at the end of the first line you see
02:48the level of skewness.
02:50Now, a normal distribution has a value of zero for skewness.
02:54This distribution has a level of over 10, which is enormous for skewness.
02:59Even more extreme, on the next line, is kurtosis, which you don't always talk about.
03:03Kurtosis has to do with how peaked or pinched the distribution is, and it's
03:07affected a lot by outliers, and so we end up
03:12having a kurtosis, which for a normal distribution -- a bell curve -- is zero, and here we
03:16have this incredibly high value of 120.
03:19Anyhow, that just gives us some idea of what we're dealing with here, and the
03:23ways that we can transform it.
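The z-score step looks roughly like this, assuming sn has been read in as before and the psych package has been loaded as described at the top of this movie:

library("psych")            # for describe()

# Standardize the login counts: mean 0, standard deviation 1
times.z <- scale(sn$Times)

hist(times.z)               # same shape as the raw variable, just rescaled
describe(times.z)           # note the extreme skew and kurtosis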
03:24Okay, what I'm going to do next: sometimes when you have a distribution with
03:30outliers on the high end, it can be helpful to take the logarithm.
03:33You can take the base 10 logarithm, or the natural logarithm.
03:37I'm using the natural log here, and what I'm going to do is I'm going to create a
03:42new variable here called times.ln0, and this just takes the straight natural
03:46logarithm of the values.
03:48Now, I'm going to do this twice, because there's a reason why this one doesn't work.
03:52I'm going to just show it to you.
03:54I'm going to run line 29, and now you see on the workspace on the right I've got
03:58a new variable, and I'm going to get a histogram.
04:01The histogram is really nice.
04:02You can tell it's almost like a normal distribution.
04:04It's a lot closer, but if I run the describe, I get some very strange things.
04:09The mean comes out as negative infinity, and we have not-a-number values for
04:13all sorts of things, and the descriptions don't work well.
04:16The problem here is that if you do the logarithm, and you have zeros in
04:20your data set, you can't do logarithms for zero.
04:23And so a workaround for this that is adequate is to take all of the scores and add 1.
04:30That's what I'm doing right here.
04:31Now I'm going to create a new variable called times.log1, and what I'm going
04:36to do is I'm going to take the value of Times, and add 1 to it, so there's no more zeros.
04:42The lowest value is going to be 1, the highest is now going to be 705, and I'm
04:47going to take the logarithms of those.
04:48So, I'm going to hit that, and run line 33, and then I'll take a look at the histogram.
04:53You see the histogram is very different, because the last one simply excluded all
04:57the people who said they had zeros.
04:58Now they're in there, and so you can see that the bottom bar has bumped up.
05:02I'm going to run describe now.
05:04Now I actually get values, because I'm not full of infinite values or not a numbers.
05:08If you have zeros, adding 1 can make the difference between being able to
05:13successfully run a logarithm transformation or not.
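A sketch of the two logarithm attempts, with the variable names following the narration:

# Straight natural log fails where Times is zero: log(0) is -Inf
times.ln0 <- log(sn$Times)

# Adding 1 first keeps the zeros in the analysis (the lowest value becomes log(1) = 0)
times.log1 <- log(sn$Times + 1)
hist(times.log1)
describe(times.log1)        # describe (from psych) now returns sensible values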
05:18The next step is to actually rank the numbers, and this forces them into a nearly
05:24uniform distribution.
05:25What I'm going to do here is I'm going to use the ranking function.
05:28I'm going to call it times.rank, and so it's going to convert it into an ordinal
05:32variable from first, to second, to third, to fourth.
05:35If I just run it in its standard form, you see there it's created a new variable
05:39over there; I'm going to get the histogram of that.
05:42Now, what's funny about this histogram is, theoretically, if we have one rank for
05:47each person, there should be a totally flat distribution, and that's obviously
05:50not what we have here.
05:52The reason for that is because we have tied values.
05:54A lot of people put zero, a lot of people put 1, and so on.
05:57I'm going to run the describe just in case.
06:00There are a lot of ways in R for dealing with tied values.
06:03In line 41, you see, for instance, the choices are to give the average rank, to
06:09give the first one, to give a random value, to give the max, the min, and all of
06:14these are used in different circumstances.
06:16I'm going to use random for right now, because what it does is it really flattens
06:21out the distribution, so I'm going to run line 42.
06:23Now it's going to be times.rankr, for random.
06:28Then I said I'm going to rank it, but I'm specifying how I'm going to deal with ties.
06:33So, ties.method; in this case, I'm going to use random.
06:37I run that, and if you look over here in the workspace, I now have that
06:42variable down at the bottom.
06:43I'm going to come back to the editor, and run line 43, and now look; that's totally flat.
06:50If I run describe, you see, for instance, that the mean's 101.5, which is what we
06:54would expect with this distribution, and it's just flat all the way. Skewness is zero.
06:59We have a negative kurtosis, because this is actually what's called a
07:02platykurtic distribution.
07:04Anyhow, that's exactly what we would expect with a totally ranked
07:07distribution with no ties.
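The ranking step, sketched out with the names used in the narration:

# Default ranking: tied values share the average rank, so the histogram stays lumpy
times.rank <- rank(sn$Times)

# Breaking ties at random flattens the distribution completely
times.rankr <- rank(sn$Times, ties.method = "random")
hist(times.rankr)
describe(times.rankr)       # skewness near zero, negative (platykurtic) kurtosis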
07:09The last thing I'm going to do is I'm going to dichotomize the distribution.
07:13Now, a lot of people get very bent out of shape about dichotomization.
07:17They say you should never do this, because you're losing information in the
07:21process, and that's true.
07:22We're going from a ratio level variable down to a nominal or ordinal level variable.
07:30So, we are losing some information.
07:32On the other hand, dichotomization, when you have a very peculiar distribution,
07:37can make it more analytically amenable.
07:40More to the point, it's easier to interpret the results.
07:43I do not feel that it is never appropriate to dichotomize; to split things into two.
07:48I feel there's a time and a place for it. Just use it wisely, know why you're doing
07:51it, and explain why you did it.
07:54Anyhow, it might seem like the appropriate way to do this would be to say, for
07:58instance, if x is less than this value, then put them in this other group, but
08:02that doesn't work properly. You'll get some peculiar results.
08:05Instead, you need to use this one-line function in R; it's called ifelse, and it's
08:10written as one word.
08:11And in line 48, what I'm going to do is create a new variable. It says
08:15time.gt1, because I'm going to dichotomize it on whether they log in more than once per week.
08:23So, GT stands for greater than one.
08:25And then I have the assignment operator, and then I use the function ifelse.
08:30And then what you do is you have in parentheses three arguments.
08:33The first one is a test, and so I'm going to ask: is Times greater than one? sn
08:39is the data frame, the dollar sign says I'm going to use a variable, and
08:44Times is the name of the variable. The
08:48second argument is what to do
08:49if that test is true: give them a one on the variable time.gt1.
08:55If their score on times is not greater than one, so if it's zero or one,
09:00then give them a zero.
09:02So, I'm going to run line 48, and now you can see over here I have got a new
09:07variable, GT1, and then I'm going to get the description of that one by just writing its name.
09:12And what you can see here is it's printed out the entire new variable. It's taken
09:17all the people who said they logged in zero or one times, and it's given them zeros.
09:23Everybody who logged in two or more times got a one, and the people who
09:27didn't respond to the question in the first place still have their NAs, for not available.
09:31And so, that's a form of dichotomization of a distribution that can be done in
09:36a way that advances your purposes, and can be done, I feel, with integrity, if
09:40it's done thoughtfully.
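The dichotomization step looks roughly like this:

# 1 if the respondent logs in more than once per week, 0 otherwise; NAs stay NA
time.gt1 <- ifelse(sn$Times > 1, 1, 0)
time.gt1   # print the new variable to check the coding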
09:41These are some of the options for manipulating the data, and getting it ready for
09:46your analyses, and of course, there's an extraordinary variety of what's
09:51available, but these are some of the most common choices, and hopefully some of
09:55them will be useful for you.
Computing new variables
00:00In the last movie, we looked at ways that you could use R to recode or transform
00:04individual variables to make them more suitable for your analyses.
00:08In this movie, we're going to look at ways that you can combine
00:11multiple variables into new composites, and how those procedures can
00:15work for your purposes.
00:16In this example, I'm actually going to be creating my variables in R. What I'm
00:21going to do down here with line 6 is I'm going to create a new variable called
00:25n1, which stands for normal number one, and I'm going to use the function rnorm,
00:31which means random normal.
00:32So, it's going to be drawing values from the normal distribution, the bell
00:36curve, at random, and I'm going to get a million values. It's going to take about this long.
00:41Now I have a million random values.
00:43Let's get a histogram of those.
00:46There you see it's pretty much a perfect bell curve.
00:49It's symmetrical, it's unimodal; it's great.
00:52Then I'm going to do the procedure again and create another variable called n2.
00:57That's also normal distribution; a million values drawn at random.
01:01You can see in the Workspace I've got that one, and I'm going to get
01:04its histogram as well.
01:05It's essentially identical.
01:07Again, it's a normal distribution,
01:08it's unimodal, and it's got the bell curve shape.
01:11Now what I'm going to do is I'm going to create a composite variable.
01:15This is the point here.
01:17I'm going to do it by simply adding each value from these different vectors.
01:22Now, this is the beautiful thing about R is that it's made for vectors, and so
01:26all I have to do is say that my new variable, which I'm calling n.add, in line
01:3114, it gets n1 + n2, and R knows it's to take the first item in n1, and add it
01:39to the first item in n2, then go to the second item in n1, add it to the second item in n2.
01:46So, I'm going to run that line 14, and you see I have a new thing in the Workspace.
01:51I'll get a histogram for that one.
01:53That's also a bell curve.
01:54The range is a little bit larger, because I'm adding instead of just averaging.
01:58Then I'm going to do one more thing; instead of adding them, I'm actually
02:01going to multiply them.
02:03So, I'm going to call it n.mult, for normal, multiplied.
02:05And again, because we have these vector-based mathematics, I'll just say n1 * n2.
02:11First item in n1 multiplied times the first item in n2, and so on.
02:15I'll create that one.
02:17It shows up in the Workspace, and I get the histogram.
02:19It's going to look a little different this time.
02:22The reason for that -- you see it's really high in the center, it drops down, and
02:26it goes all the way down to -10, and up to 10.
02:30The reason for that is, when you multiply values from two independent unit normal
02:35distributions, you actually get something that approximates what's called a
02:39Cauchy distribution.
02:40It's a very unusual distribution that has a tremendous number of outliers, and
02:44that's what I've got here.
02:46Now, the one statistic where the Cauchy is most distinctive is in kurtosis, which
02:50has to do with how peaked or pinched the distribution is, and is affected a lot
02:55by the presence of outliers.
02:56In order to get kurtosis easily, I'm going to install the package psych. It installs it.
03:04In line 23, it loads it.
03:06From there, I can calculate the kurtosis for each of my four distributions.
03:11Now, for the normal distributions, I expect it to be close to zero.
03:14So, kurtosis for n1 is essentially 0, and also for n2, it's very close to 0.
03:21I'd expect it to be close to zero for the addition one, but for the
03:26multiplied one, I expected it to be a larger value.
03:28In fact, that's nearly six.
03:30So, you can see the other one is very close to zero, and that the major difference
03:34in the fourth one where I multiplied is in the level of kurtosis.
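A sketch of this movie's steps; the object name n.mult and the kurtosi call from the psych package are my guesses at the exact script, so treat them as assumptions:

# Two vectors of a million random draws from the standard normal distribution
n1 <- rnorm(1000000)
n2 <- rnorm(1000000)

# Element-wise composites: R pairs the values up position by position
n.add  <- n1 + n2   # sum of each pair
n.mult <- n1 * n2   # product of each pair

library("psych")
kurtosi(n1)         # near 0 for a normal distribution
kurtosi(n.mult)     # much larger: the product distribution is heavy-tailed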
03:38Anyhow, the idea here is that I've been able to take variables that I created
03:41here, and then combine them in different ways to create new variables.
03:46So, I have these ways of manipulating the data to get these composites, and that's
03:50something that you do, for instance, when you're creating an average score based
03:54on a survey of many different questions.
03:55R makes these vector-based operations very, very easy.
03:59The operations used in this movie are just two options out of an essentially
04:03infinite variety for combining your individual variables into new composite
04:07variables for your analyses.
04:08R makes it very easy to find methods for your own work that can get your data
04:13into exactly the shape that you need.
04:16So, the speed, flexibility, and power of R are especially helpful as you
04:21manipulate data, and get ready for your own analyses.
5. Charts for Associations
Creating simple bar charts of group means
00:00Once you've taken a look at all of your variables individually, and you've gotten
00:04them into the shape that you need for your analyses, the next step is often to
00:08start looking at associations between variables.
00:11A very common form of association is to look at group membership, and how that's
00:15associated with scores on a quantitative outcome.
00:18I'm going to use an example for this to show a couple of different ways of
00:22depicting group distributions by using bar charts, and also by box plots.
00:28For this one, I'm going to be using a data set that is based on Google searches by state.
00:33The idea here is that the Google search data is showing how many standard
00:38deviations above or below the national average each state is in their relative
00:43interest in a search term.
00:45The first thing I'm going to do is I'm going to load a data set called
00:49google_correlate.csv.
00:50I put it into a data frame called Google.
00:52There are 51 observations, because there are 50 states plus D.C. Next,
00:56I'm going to just run to see what the names of the variables are. That's line 7.
01:00What we have is State, that's the name of the state, then the state_code, that's
01:06like CA for California.
01:07Then we have their relative interest in data visualization; so, how often do they
01:13search for that relative to their other searches?
01:15Then we also have searches for Facebook, searches for NBA, and for fun, to put
01:21down whether that state had an NBA team.
01:23Also, the percentage of people in that state with a college degree, whether that
01:28state had a K-12 curriculum for statistics, and the region of the country.
01:34Let's take a closer look at that with structure; that's str.
01:38If I hit that, and make this bigger, it gives you an idea of how many levels there are.
01:43It gives you the first few data values.
01:46So, that's a way to seeing what we're dealing with.
01:48I'm going to clear that out, because it's pretty busy.
01:51Put that back down.
01:53One of the interesting questions might be, do the responses to one of these vary by region?
01:59I thought I'd look at data visualization, and I want to see whether it varies by
02:03regions in the United States.
02:04So, the easiest way to do this is to first create a new data set, a table or frame,
02:10where I split the data by region.
02:13So, what I'm going to do in line 12 is I'm going to create my new object, named for
02:17data visualization, with .reg for region, and then we're going to
02:23get the distributions.
02:24I'm going to use the R function split, and then I tell what it is that I'm going to split.
02:29I'm going to use the data set Google, and the variable data_viz; the dollar
02:34sign joins those two, and I'm going to split it by the variable region that's in
02:38the Google data set.
02:39I'm going to run line 12 now.
02:42You see how that shows up in the Workspace on the right.
02:44So, I have this new list.
02:47Then I'm going to draw boxplots by region.
02:50I'm going to use a boxplot here, and I'm going to go back to my new data frame or
02:55list for interest in data visualization.
02:58I'm also going to color it lavender. There we have it.
03:01What this shows us is the distribution for each region.
03:05So, for instance, you can see here that the box indicates the range of
03:09the middle 50% of states in that region; their relative interest in data visualization.
03:14So, we see that there's a lot of variation in the west, because its boxes are
03:19wider than the others.
03:20There's less variation among the middle 50% in the northeast.
03:24That's because the box is tighter.
03:26But we have outliers in the northeast.
03:28We have one that's unusually low, and one that's unusually high.
03:31Interestingly, the state with the highest relative interest in data
03:35visualization is in the south, and that's where we have a z-score of over three.
03:39You can see the northeast is generally higher than the others, with the exception
03:43of that one outlier.
03:44So, that's one way to get a feel for the variations and distributions by groups.
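A sketch of the split-and-plot step; the data frame and list names here are my own placeholders, since the narration doesn't spell them out exactly:

# Read the state-level Google Correlate data
google <- read.csv("google_correlate.csv", header = TRUE)

# Split the data_viz z-scores into a list with one element per region
viz.by.reg <- split(google$data_viz, google$region)

# One box per region, filled lavender
boxplot(viz.by.reg, col = "lavender")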
03:51Another very common way is to do barplots for means.
03:54That's what I'm going to do down here.
03:56I'm going to create another data set here where I'm going to use means.
04:00And so line 18 says viz.reg,
04:03so visualization, and the .reg is for region, except this time I'm doing the means.
04:08This makes it so I can do the bar chart.
04:11I'm going to use the R function sapply.
04:13Then I'm going to tell it what I'm dealing with, and that's relying on the list
04:19that I got on the last one.
04:20This time I'm going to be calculating the mean.
04:22So, I'm going to do that in 18.
04:25Then I'm going to run a barplot.
04:26And so I'm telling it barplot what it is I'm charting.
04:30I'm going to color it beige, and I'm going to give it a title that's rather long here.
04:34I'll scroll to the end for a moment. There we go.
04:37By the way, this right here means to break it into a new line.
04:41The backslash is the escape character, and n is the new line.
04:44Then this backslash right here means I actually want to print these quotes,
04:49because otherwise it thinks I'm done with the title, and then I have to do it
04:53again at the end of data visualization.
04:55This one, because it's not escaped, means it's the end of the title string.
04:59So, I'm going to go back to the beginning, and I'm going to run that command by
05:03itself, barplot, by highlighting those three lines, and then pressing run.
05:10So, now I've got a barplot.
05:11It shows where the average is for each of these groups.
05:14On the other hand, there is one thing that's missing that would be really nice,
05:18and that is we don't have a zero axis line.
05:20Fortunately, I can add that manually with this abline function.
05:23All I've got to do is put the height. It's at zero.
05:27If I highlight all of that, and run it together, now I get the means plot, and
05:32this time, it has the reference line at zero, which is a lot easier to read.
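Continuing the sketch above, the barplot-of-means steps might look roughly like this; the title text is illustrative rather than the exact title in the exercise file.

    viz.reg <- sapply(viz.by.reg, mean)    # mean z-score for each region
    barplot(viz.reg, col = "beige",
            main = "Mean Relative Interest in Searching for\n\"Data Visualization\" by Region")
    abline(h = 0)                          # zero reference line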
05:38Finally, it would be nice to have the actual numbers that go with each of these things.
05:42What I'm going to do to facilitate this is I'm going to use the psych package again.
05:48The first one installs it, and this one loads it for use.
05:51Then I'm going to do describeBy.
05:53It says, I want to take the variable data_viz, and I want to break it down by region.
06:00This is based on describe.
06:01It just does it categorically.
06:03I'm going to make this one down here bigger.
06:05As you can see that, for each area, I know that there are 12 states in the
06:10midwest, 9 in the northeast, 17 in the south, 13 in the west,
06:14and this gives me the mean for each of these.
06:16So, for instance, you see that the midwest, the mean score is -0.32.
06:20That's what we see over here.
06:23This bar comes down to -0.32.
06:25In the northeast, the mean is 0.45;
06:28it's positive, and we come up here.
06:30Again, these are z-scores indicating relative interest in searching on Google
06:35for data visualization compared to all of the other searches in that area.
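As a rough sketch, assuming the google data frame from above, the describeBy step could be written like this:

    # install.packages("psych")   # first time only
    library(psych)
    describeBy(google$data_viz, google$region)   # n, mean, sd, and so on, broken down by region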
06:39Anyhow, these box plots and these means plots are one way of looking at how a
06:44quantitative variable differs from one group to another, and it can often be an
06:49important step in an analysis.
Creating scatterplots
00:00When you're looking at associations in your data, if you want to look at how two
00:04quantitative variables are associated with each other, the most common approach
00:08is to create a scatterplot.
00:10R gives you some interesting options on how to create scatterplots, and look at
00:14what you have in terms of associations in your data.
00:17For this one, I'm going to be using the Google correlate data that I used in the last movie.
00:21I'm going to load it by running line 6.
00:23I'll create a data frame called Google by reading the csv,
00:27google_correlate.csv, that has a header.
00:30There I have 51 observations.
00:32There's one line for each state, and D.C. We're going to look at the names of
00:37the variables that are in that data set.
00:39We can look at the structure too if we want, just to get an idea of what things look like.
00:44I'm going to make this bigger for just a moment.
00:46Okay, that's pretty busy.
00:47I'm going to just clear it out for right now.
00:50What I want to ask is whether there's an association between the percentage of
00:53people in the state with college degrees, and interest in data visualization as a
00:58search term on Google.
00:59What I'm going to do is create a scatterplot.
01:02The default plot works well.
01:04All I say is plot; that means scatterplot, and I give my variables for X and Y.
01:09I'm going to put degree on the X, and so I say, use degree from the data set
01:13Google, and then I'm going to put data_viz on Y.
01:17So, I run line 13, and there's my plot.
01:20You can see that there's a strong positive association. The higher the number
01:25of people with college degrees, the greater the interest in data visualization as a search topic.
01:30That's actually a really clear trend.
01:33On the other hand, I'm going to clean up this chart a little bit.
01:36I'm going to put a title on the top.
01:38This is lines 15 through 20.
01:40I'm going to do the plot again, except this time I'm going to put a title on the top;
01:45that's main, and then I'm going to put a label on the X axis, xlab, Population
01:50with College Degrees.
01:51Label on the Y axis; Searches for Data Visualization.
01:55Pch here is for representing the points, and I'm going to be using choice number
01:5920, which is a small solid dot.
02:01I'm going to color it in gray.
02:03So, I'm going to highlight those six lines together, and run those.
02:09Now we have this scatterplot with light gray dots, which you can still see the
02:13pattern, but there's less sort of fluff to it.
02:16We have the title on the top, and we have the labels for each axis.
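A sketch of the two plots just described might look like this; the main title here is illustrative, while the axis labels, point character, and color come from the narration.

    google <- read.csv("google_correlate.csv", header = TRUE)
    plot(google$degree, google$data_viz)   # default scatterplot
    plot(google$degree, google$data_viz,
         main = "Searches for Data Visualization by College Degrees",
         xlab = "Population with College Degrees",
         ylab = "Searches for Data Visualization",
         pch  = 20, col = "gray")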
02:19Now I'm going to do one more thing.
02:21When you're looking at an association in the scatterplot, even though we have a
02:25strong positive pattern here, it's really nice to have regression lines.
02:30I can add a regression line with the abline function.
02:32I'm going to use a linear model, that's what this is, and it's going to be
02:38based on the association, where I'm trying to predict data_viz, and then the
02:42tilde means predicting it from the number of degrees, and I'm going to color that line red.
02:48So, I'm just going to run line 23, and this is going to layer on top of the plot
02:52that I have already.
02:53So, you can see that there's a strong positive association if we draw a
02:58straight line through it.
02:59On the other hand, not every association is linear, and sometimes it's helpful
03:03to use a line that matches the shape of the data.
03:07One of those options is called a lowess smoother, and that's what I'm going to do in line 25.
03:13I'm going to add a line, and it's going to be Lowess, and I'm going to be using
03:19it for a degree in data_viz.
03:21Please note that the order of the two variables is different here.
03:24The top one for the regression line, I had to put the Y first, and then the X.
03:28This one, I put the X, and then the Y.
03:30Also, in the top one, I use the tilde to say that the Y is predicted by the
03:35X. This one is simply putting what they are with a comma in between.
03:38I'm going to make this Lowess line blue.
03:41So, I'm going to run line 25, and then I'll just put it on top of that.
03:45A lowess is sort of a moving average, and you can see here that actually it
03:49doesn't deviate tremendously from the linear regression line.
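In code, those two overlays might look roughly like this, layered onto the scatterplot above:

    abline(lm(google$data_viz ~ google$degree), col = "red")     # straight regression line: y ~ x
    lines(lowess(google$degree, google$data_viz), col = "blue")  # lowess smoother: x, then y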
03:53What both of these do is they emphasize the strong positive association between
03:58the percentage of the population in the state who have college degrees, and the
04:02relative interest in searching for data visualization on Google.
04:06These are really good ways of looking at the association between two
04:10quantitative variables, and will lead into regression, which we're going to do
04:14in a later movie.
Creating scatterplot matrices
00:00In the last movie, we looked at how to create a scatterplot to show the
00:04association between two quantitative variables.
00:06On the other hand, sometimes you have several quantitative variables, and you
00:10want to look at the associations between each of them.
00:13One option, in that case, is to create what's called a scatterplot matrix, which
00:17has several scatterplots arranged in rows and columns.
00:20I'm going to use the Google search data.
00:23I'm going to load it in by saying Google gets read.csv, and so on.
00:28Let's just take a look at the variable names in it.
00:31There's what we've got.
00:32State, state_code, data_viz is a search term, Facebook is a search term, NBA is
00:37a search term, whether the State has an NBA team, the percentage of people with
00:41degrees, whether they have a stats_education curriculum in the K through 12 system,
00:46and the region of the U.S.
00:48Now what I'm going to do is I'm going to take each of the quantitative
00:51variables -- data_viz, degree, Facebook, and NBA -- and I'm going to put them into
00:57a scatterplot matrix.
00:58What I'm going to do is first specify that data_viz is the ultimate outcome variable.
01:03I'm just going to stick it on the top left.
01:05Then I'm going to add these other quantitative variables.
01:07I don't have to say Google and then dollar sign for each of these, because I can
01:11specify data separately.
01:12I'm also going to be using solid dots for the data points.
01:16I'm going to put a title on the top that says Simple Scatterplot Matrix.
01:20If I highlight all four of those lines at once, and run those, here's my matrix.
01:24I'm going to zoom in for a moment.
01:27So, what I have here is data_viz on the top left. Going down on the first column,
01:34Data_viz is going to be across the bottom; interest in data visualization.
01:38On the other hand, going across the top row, interest in data visualization is
01:43going to go up on the Y axis.
01:44So, you can see some of these have very strong patterns.
01:47So, we have the column on the left,
01:50the second one down is the association between the data visualization, and the
01:54percentage of people with degrees that we saw before.
01:56It's a very strong pattern.
01:58On the other hand, things like Facebook and data_viz show negative associations.
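A sketch of the simple matrix just described, assuming lowercase variable names as they appear elsewhere in the course:

    pairs(~ data_viz + degree + facebook + nba,
          data = google, pch = 20,
          main = "Simple Scatterplot Matrix")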
02:04That's a nice way to get a look at a whole bunch of things at once, but I want
02:08to show a modified version of this that provides even more information.
02:13To do this, I have to use the psych package.
02:16I'm going to download and install it by running line 16.
02:22Then I'll load it, so I can use it with line 17.
02:26In line 18, I'm going to use what's called the pairs.panels, which is a function
02:31within psych, and I can tell it the data set that I'm going to use is Google.
02:36Then I'm specifying which variables I want by the order that they appear in
02:41the Google data set.
02:42Data_viz is the third one, Degree is the seventh, Facebook is the fourth, and NBA is the fifth.
02:48That's why I'm specifying those with c, the combine (or concatenate) function.
02:53Also, I'm making it so there are no gaps between the panels here.
02:57You see, for instance, in the Simple scatterplot matrix on the right, we've
03:00got the thick bars in between them that unfortunately become visually pretty prominent.
03:04I'm going to get rid of those by putting gap = 0.
03:07This makes an unusual matrix.
03:09So, we'll run that, and take a look.
03:12Then I'm going to zoom in on this one.
03:15What we have here are several things.
03:17First off, we have a histogram for each of the four quantitative variables.
03:22On top of it, we have overlaid, what is called a kernel density estimator.
03:26It's like a normal distribution, but you see it can have bumps in it.
03:30You'll see that on degree.
03:32At the very bottom of that, it's really tiny here, but we have sort of a
03:37dot plot that shows where the actual scores are for each one with these
03:41tiny vertical lines.
03:42Then what we have are the scatterplots.
03:45These are on the bottom left side of the matrix.
03:49We have the scatterplot with the dot for the means of the two variables.
03:53We have a lowess smoother coming through;
03:54that's the curved red line.
03:57Then the ellipse is sort of a confidence interval for the
04:00correlation coefficient.
04:02The longer and narrower the ellipse, the stronger the association;
04:07the rounder the ellipse, the weaker the association.
04:11The numbers that are on the top side are mirror images of these, and those are
04:15correlation coefficients for each one of them.
04:17So, for instance, we can see that the correlation between data_viz, and degree is
04:21positive, and it's 0.75.
04:23In absolute value, correlations go from zero to one.
04:26Zero is no linear relationship, and one is a perfect linear relationship.
04:30They are positive if there is an uphill relationship, and negative if it's downhill.
04:34That's a very strong association.
04:36On the other hand, you can see that interest in data_viz, and interest in NBA as
04:41a search term -- that's the scatterplot
04:43that's in the very bottom left --
04:45it's kind of circular and scattered all over the place.
04:48If you look at the very top right of this matrix, you see the correlation is 0.23.
04:51It's not very strong.
04:53Anyhow, this is a really rich kind of matrix that shows histograms, it shows
04:59dot plots, it shows kernel density estimators, it shows scatterplots with
05:03lowess smoothers, and correlations, and it's one of the great reasons for
05:08using the psych package.
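For reference, the pairs.panels version might be written roughly like this, with the columns selected by position as in the narration:

    # install.packages("psych")   # if not already installed
    library(psych)
    pairs.panels(google[c(3, 7, 4, 5)], gap = 0)   # data_viz, degree, facebook, nba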
05:09Anyhow, this is a variation on the scatterplot matrix, which lets you look at
05:14graphically the association of several quantitative variables at once, and get a
05:18really good feel for the interrelationships within your own data set.
Creating 3D scatterplots
00:00In the last movie, we looked at how you could use scatterplot matrices to show
00:04the associations between several quantitative variables simultaneously by
00:09creating a 2D matrix of scatterplots.
00:12In this movie, I want to look at an interesting variation where you actually use
00:16a 3D scatterplot that rotates in space with the mouse.
00:21To do this, I'm going to use the data set google_correlate that I've used
00:25for the other ones.
00:26I'm going to load it on line 6.
00:28Just get a list of names with line 7.
00:30Then there's actually several ways to do 3D scatterplots in R.
00:34I'm going to be using the package rgl.
00:40I've now downloaded it, and installed it.
00:42Now I'm going to open it to run.
00:44Then what I'm going to do is just run this one set of code.
00:47plot3d is the function, and then you need to list the x, y, and z variables.
00:53So, I've got them all as data_viz from the Google data set degree from the
00:57Google data set, and Facebook.
00:59Those are relative interest as search terms.
01:02Then I'm also adding labels for the x, y, and z axis.
01:05I'm going to color the dots in the scatterplot red, and make them three pixels.
01:11If I highlight all of that code, and run it --
01:14this plot is a little different, because it doesn't open in the bottom right
01:18window, and instead, it opens a new window.
01:20I'm going to come down here and click.
01:22I can make that larger.
01:24What I can do now is click on the mouse, and drive this one around to see the
01:30association in three dimensions.
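A rough sketch of the rgl commands just described; the axis label text is illustrative.

    # install.packages("rgl")   # first time only
    library(rgl)
    plot3d(google$data_viz, google$degree, google$facebook,   # x, y, and z variables
           xlab = "Data Viz", ylab = "Degree", zlab = "Facebook",
           col = "red", size = 3)   # opens a separate window you can rotate with the mouse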
01:36Now, this is a nice heuristic thing, although it usually only works while it's
01:40moving, because as soon as you stop moving, the sense of depth collapses, and it's hard to read what it is.
01:45But it does give interesting possibilities for looking at the associations
01:49between three variables, so we can try to find the strongest association.
01:53We've got a data point way up here in the corner.
01:58Anyhow, while it's interesting for exploring, it's hard to report these,
02:02especially in a printed 2D format, but a 3D scatterplot, an interactive spinning
02:06one, can be a potentially informative, and certainly an engaging way of exploring
02:11the relationship between several quantitative variables.
6. Statistics for Associations
Calculating correlations
00:00Once you've looked at the associations between several quantitative variables,
00:04a natural next step is to start looking at the numerical associations between them.
00:09The most common way of doing this is with correlations or Pearson product-moment
00:13correlation coefficients.
00:15In this movie, we're going to look at how to calculate correlations for
00:18individual pairs of variables, as well as to create a matrix for an entire set of variables.
00:23We're going to do this with the google_correlate data.
00:26I'm going to load that right here, and just remind myself of the variable names.
00:31Then what I'm going to do is I'm going to create a new data set that has just
00:36the quantitative variables.
00:38There are several ways to specify these.
00:41What I'm doing is creating g.quant; g is for Google, .quant for quantitative, and
00:46it gets built from the Google data frame;
00:48I'm going to select four variables,
00:50and what I'm using is the concatenate function.
00:53That's the c. I'm selecting the variables by the numbers of the positions where they appear.
00:58That's why I have this names list right here.
01:00So, data_viz is the third one, degree is the seventh, Facebook is the fourth,
01:04and NBA is the fifth.
01:07I'm going to create that new set.
01:09You can see that shows up there in the Workspace under g.quant; 51
01:13observations, one for each state, and for D.C., with these four variables in
01:17that particular order.
01:18The next thing I'm going to do is I'm going to get a correlation matrix for
01:22that entire data set.
01:23R has a built-in function, cor, for correlate, and all I have to do is specify my
01:28data frame here, and I hit run.
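Assuming the google data frame is loaded as described, those two steps might look like this:

    g.quant <- google[c(3, 7, 4, 5)]   # data_viz, degree, facebook, nba
    cor(g.quant)                        # correlation matrix for all four variables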
01:31What I get is a bunch of correlations.
01:33Remember, correlations go from zero, which means no linear association at all,
01:38to positive or negative one, which indicates a perfect linear association.
01:42Positive is an uphill relationship.
01:44Negative is downhill.
01:46On the diagonal, we have ones.
01:48That's a variable correlated with itself.
01:50You see that we have some really strong correlations.
01:52So, for instance, the association between data_viz and degree is 0.745.
01:59That's a very strong correlation.
02:02Also, the association between data_viz and Facebook as interest as search terms
02:07is negative, and very strong. That's -0.63.
02:12So, the more interest there is in searching for Facebook, the less interest there
02:16is in searching for data visualization, and vice versa.
02:19This is a correlation matrix that is without the probability tests associated,
02:26and I want to show you how to deal with those.
02:29Now, the easiest way with a built-in function in R is to do one correlation at a time.
02:34So, you pick one x variable, and one y variable, and then use the function cor.test.
02:40That's correlation test.
02:42What it's going to do is give the correlation coefficient, the hypothesis
02:46test, the p-value associated with that, the confidence interval.
02:50In this one, I'm specifying my variables by saying the variable name, and then with
02:54the dollar sign, also the data set that it comes from.
02:57So, I'm going to run line 17 for cor.test right now, and look at data_viz and degree.
03:03I get a fair amount of printout from this one.
03:05It tells me that it's doing the Pearson's product-moment correlation coefficient,
03:09because there are other choices.
03:11It's telling me the two variables that I'm using.
03:13It's giving me a t-test for the significance test.
03:17The value of t is 7.83, with 49 degrees of freedom, which has to do with the
03:22sample size, and the probability of getting a correlation this big through
03:26random chance is extremely small.
03:28In fact, you see the exponent is -10, so there are a lot of zeroes there.
03:34The 95% confidence interval for this correlation coefficient is from 0.59 --
03:39that's the low end -- to 0.84.
03:41So, it's going to be a high correlation either way.
03:43Then we have the actual sample correlation there at the bottom.
03:46It's 0.7455, which is what you see up in the matrix above also.
03:51That's a good way to do it if you're willing to do one correlation at a time.
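In code, a single test like the one just run might look like this:

    cor.test(google$data_viz, google$degree)   # Pearson r, t-test, p-value, and 95% confidence interval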
03:55On the other hand, if you want to do the entire matrix at once, what you can do
04:00is get a probability matrix by using the package Hmisc.
04:04I'm going to download and install the package Hmisc.
04:09That downloads it and installs it.
04:11Now I'm going to load it. It's okay.
04:13I've got this little information here about some of the changes that it's
04:16making, but that's fine.
04:17I can ignore those.
04:18Now, the only trick here is I'm going to use the function rcorr;
04:23correlations, but the thing is I have to take my data set g.quant, and it has
04:28to become a matrix.
04:31Right now, it's a data frame.
04:32See, a data frame can have lots of different kinds of data in it, that each
04:36variable can be of a different kind, but a matrix has to be all the same kind.
04:39So, what I'm going to do is I'm going to coerce it into being a matrix.
04:43That's the term here.
04:44So, I'm using the function rcorr, and then I put as.matrix, and then I put my
04:50little data frame right here.
04:51That says treat it as a matrix, or coerce it into being a matrix, and then do the correlations.
04:56So, I'm going to run that now; line 25.
04:59Let's make this bigger, so we can see what's going on.
05:02What I have here on the top is the correlation matrix.
05:04It's the same as what I had above earlier.
05:07Let me scroll up a little bit.
05:11There's the correlation matrix.
05:12The two differences are, that one's got a lot of decimal places,
05:16this one has only two, so it's more manageable, plus this one actually says the sample size.
05:20It says n equals 51 there. But the really important part, and the reason
05:24I did this one, is because the second matrix says the P there; these are the
05:30probability values.
05:32If you're doing an inferential test -- and what you're looking for here, for
05:35statistical significance, is a value that's less than 0.05.
05:38For instance, the probability value for the correlation between data_viz and
05:43degree, it comes out as four zeros.
05:46It's not totally zero.
05:47It would just take more decimal places to show it.
05:49The association between data_viz and Facebook, also, lots of zeros. It's significant.
05:54But the association between data_viz and NBA, the correlation, if you look above,
05:59is 0.23, and the probability value is 0.10.
06:03So, that's not statistically significant, nor is the association between degree and NBA.
06:09That's fine, but the idea here is that we can look at what the actual
06:13correlations are for several variables at once, and
06:16using this package, Hmisc, we can also get the probability values associated with each one.
06:22That's a great first step in looking at the statistical associations between the
06:27variables in my data set.
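As a sketch, the Hmisc steps described in this movie might look roughly like this:

    # install.packages("Hmisc")   # first time only
    library(Hmisc)
    rcorr(as.matrix(g.quant))   # correlation matrix, sample sizes, and p-value matrix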
Computing a regression
00:00When you're trying to understand the associations in your data, it's often helpful
00:04to single out one particular outcome variable and see how you can
00:09predict scores on that one using the other variables in your data set.
00:13This is the situation where you would want to use a multiple regression with
00:17multiple predictors for a single quantitative outcome variable.
00:21Fortunately, R makes this extremely simple.
00:24In this example, I'm going to use the google_correlate data that I've had before.
00:28I'm going to load it right here with line 6, and just check the names of the variables.
00:33I just want to mention a couple of things.
00:35I'm going to be predicting interest in data visualization.
00:38This is how common that term is as a Google search term relative to other
00:43searches on a state by state basis. That's my outcome.
00:46I'm going to be using several quantitative variables.
00:49I'm going to be using degree; that is, the percentage of people in the state with
00:53a college degree, and Facebook as a search term, and NBA as a search term, but I'm
00:58going to be throwing in a couple of other interesting ones that normally you
01:02would think would require some extra prep.
01:03Stats_ed, which is my second predictor here, is a yes/no variable, and it's
01:09entered as text; whether they have a curriculum for Statistics education in the
01:14K through 12 system or not.
01:16Also, region; let me scroll over a little bit here.
01:19Region is a categorical variable with four levels on it.
01:23This is going to be an interesting one, because normally, I would need to do some
01:28sort of transformation to make this work, but R is smart, and it takes care of
01:32these things all by itself.
01:33So, what I'm going to do here is I'm going to create a multiple regression model.
01:38On line 9, I'm going to assign it to a variable.
01:41I'm calling it reg1, for regression one.
01:45Then, what I'm going to use is I have the assignment operator, and lm is for linear model.
01:51The first thing I specify in there is my outcome variable; that's data_viz.
01:55Then the tilde sign next to it means as a function of, and then I give
01:59all the predictors.
02:00I have degree + stats_ed + facebook, and so on until I get to the end; I have a comma.
02:08Then I have a single thing that says, all of these variables come from the
02:11data set Google, so I don't have to put the Google dollar sign in front of each one of them.
02:16I'm going to select these three lines, and run those.
02:20That is performing the regression.
02:22Interestingly, it doesn't give me the results.
02:24You can see in the Workspace that it's run it.
02:26I have reg1 over here; it's a linear model, but if I want to see the results, I
02:32need to ask for a summary of the regression.
02:34Remember, I saved it as reg1, and so now I'm going to get the summary of that
02:38just by running line 12.
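A minimal sketch of the model and its summary; the predictor names and their order follow the narration, so the exact formula in the exercise file may differ.

    reg1 <- lm(data_viz ~ degree + stats_ed + facebook + nba + has_nba + region,
               data = google)
    summary(reg1)   # coefficients, significance tests, R-squared, and F-statistic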
02:39I've got a lot of output here, so I'm going to make the console bigger and
02:44come up a little bit.
02:45Here's what's going on.
02:46The first one says what the actual model is.
02:49The function that I'm calling is lm, linear model, and the formula that I'm
02:53using, data_viz, is a function of all of these other things put together, and
02:57they all come from the Google data set.
02:59Now, the residuals are a way of assessing how well the model fits the data.
03:04There are situations where you would want to use those.
03:06Normally, you would plot them instead, but there they are.
03:09It's the ones below that that are particularly interesting; it's the coefficients.
03:14The column on the left gives the name of the variable, and you see that we have
03:18degree, and then stats_edyes.
03:21What it's done is it's taken my stats_ed variable, which had two values, yes and
03:26no, and it's automatically coded yes as one, and no as zero.
03:33Then we have Facebook, and NBA, which are fine, because those are both
03:36quantitative variables.
03:38Then whether they have an NBA team -- I don't expect that to be associated, but
03:42I've included it just because I could.
03:44Then I have three region variables.
03:47The reason there are three when there are actually four regions is because in
03:50order to avoid multicollinearity, you have to leave one of them out.
03:54Otherwise, there's a perfect association between the predictors.
03:58Then we look at the column that says Estimate.
04:00These are the actual regression coefficients.
04:04Then we have standard error for them, and then the t value as the inferential test.
04:08The probability value on the end is the significance test.
04:11Fortunately, it's also putting asterisks next to the ones that are
04:15statistically significant.
04:17The intercept is significant, which means the intercept is not zero;
04:20we don't really care about that.
04:21What we do have however, are two statistically significant predictors that we
04:25can use to predict interest in data visualization as a Google search topic.
04:29The first one is degree.
04:31States that have a higher proportion of people with college degrees also show a
04:37higher interest in searching for data visualization.
04:40The other one that is statistically significant within this context is Facebook,
04:44except this time it's negative.
04:46States that show a higher interest in searching for Facebook show a lower
04:51interest in searching for data visualization.
04:54This particular regression model is what's called a simultaneous entry; that
04:58takes all these variables, and it throws them in there all at once.
05:02It highlights the ones that are statistically significant within the context of
05:05that entire collection.
05:07What we have here is just two: more degrees, more interest in data_viz; more
05:12interest in facebook, less interest in data_viz.
05:14Then at the very bottom here, we also have some summaries for the entire model.
05:19We have the residual standard error.
05:21I'm not really worried about that right now.
05:23The multiple R-squared is an important one, because that tells us what proportion
05:28of the variance in the dependent or outcome variable, that's data_viz, can be
05:32predicted by the combination of these other variables.
05:37My multiple R-squared is 0.65,
05:39so 65% of the variance in data visualization as a relative search term from
05:45state to state could be predicted from these other variables.
05:49The adjusted R-squared has to do with the relationship of predictors to sample
05:52size, and because I actually do have a small sample -- it's a state by state
05:56analysis -- it's a little smaller, but still, it's a good prediction model.
06:00Then I also have the F-statistic, which can be used as an inferential test for
06:03that R-squared, and just confirms that it's statistically significant.
06:07Anyhow, this is the simplest possible version of a multiple regression that you
06:12can do in R. It's a way of taking several variables, both quantitative, and
06:17dichotomous predictors, and multiple category predictors, and throwing them in
06:22there. R processes them appropriately, and we're able to get a prediction of a
06:26single quantitative outcome, and it's a great way to start looking at the
06:30important associations in your data.
Creating crosstabs for categorical variables
00:00When you're looking at the associations in your data set, a lot of times you're
00:04going to want to look at the associations between two categorical variables, and
00:08that's when you want to use a cross tabulation, and usually a chi square test of significance.
00:14That's the simplest possible version of it.
00:17In this example, I'm going to be using the social network data, though I need to
00:21mention, I did make one modification to it.
00:23There was one case that did not have information on gender.
00:27Since I'm using gender here as a predictor variable, I wanted to have that
00:31missing case out, so I deleted the one case.
00:33So, we're going to go from 202 cases to 201.
00:36I'm going to list the names of the variables.
00:40We have ID, gender, age, their preferred social networking Web site, and the
00:45number of times that they log in per week.
00:48I'm going to be looking at the association between gender and site to see, for
00:52instance, if men and women report different Web sites as their preferred method
00:56for social networking.
00:58The easiest way to do this is by creating a contingency table.
01:02I'm going to call it sn.tab.
01:04That's for social network dot tabulation or table.
01:08I'm using the table function that's part of R. All I need to say is what my two
01:13variables are; two categorical variables, and I'm using gender, and the sn -- the
01:18dollar sign means it's from the sn data set -- and I'm using Site.
01:22So, I'm just going to run line number 11.
01:25You see that the table shows up in the Workspace there on the right.
01:28Then on line 12, I just have sn.tab.
01:31That's just going to print it out.
01:33So, there I have the number of men and women who report Facebook, LinkedIn,
01:37MySpace, None, Other, and Twitter.
01:40Looking at this, you can see there's a couple of interesting things.
01:43First off, identical numbers of men and women prefer Facebook.
01:47LinkedIn, Twitter, and Other are so small as to be negligible here.
01:51Again, this data set is a few years old.
01:53You see that MySpace has a much higher number of women reporting it as their
01:58primary method, and then for None, there's a lot more men who say they use None.
02:04These work in with some expected patterns.
02:06Now, these are just the frequencies or the counts; the cell frequencies.
02:10On the other hand, it can be really nice to get marginal frequencies, which are
02:14the totals for the rows and the columns, and it can also be nice to get
02:17percentages or proportions.
02:19So, what I'm going to do is I'm going to scroll down here.
02:22First, just get the marginal frequencies.
02:24I'm going to get the row frequencies, and that's going to be just the number of men and women.
02:29So, I have 98 women and I have 103 men.
02:33They both have 46 in Facebook, and since the groups are closely balanced anyhow,
02:37so that's essentially the same.
02:39Now I'm going to look at the column marginal frequencies, and that tells me the
02:43overall number of people who prefer each social networking site.
02:47We've seen this before when we've done bar charts for this variable, but now a
02:52more interesting one is to get the proportions of people within each cell, and
02:57also the proportions who report using each one of these.
03:01To do this, I'm going to use prop.table.
03:04That's proportions for the table, but I'm wrapping it in the round function to limit
03:09the number of decimal places.
03:10It gives a huge number by default, and I only want two.
03:14What I'm doing with each one of these is, to get the cell percentages,
03:17I'm doing prop.table right here, and it tells that I want to use sn.tab,
03:22that's the table for social network as my data set, and I'm wrapping it in
03:27round to two decimal places.
03:30So, I'm going to run line 20.
03:3223% of respondents in this data set are women who said they like Facebook.
03:371% are men who said they like Twitter.
03:41All together, these cell percentages add up to 100.
03:44Now let's look at the row percentages.
03:47Similar procedure, but now what they do is they add up to 100 going across.
03:53Say, for instance, we had dramatically different numbers of men and women.
03:56This would allow us to compare the relative interest in each of these sites, even
04:00with unbalanced marginal frequencies.
04:03You can see, for instance, that for MySpace, the numbers mirror what we saw earlier.
04:0818% of the women like MySpace, whereas only 4% of the men.
04:12Then finally, line 22, let's just do a similar thing going in the other direction.
04:17Now these percentages add up going down.
04:20So we see, for instance, that for MySpace, 82% of the people who said they like
04:24MySpace were female; 18% were male.
04:27So, these are ways of looking at the data in several different dimensions.
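Assuming the sn data frame is loaded, a sketch of the table, marginal, and proportion steps might look like this (the variable name case is assumed):

    sn.tab <- table(sn$gender, sn$site)     # contingency table: gender by preferred site
    sn.tab
    margin.table(sn.tab, 1)                 # row totals (gender)
    margin.table(sn.tab, 2)                 # column totals (site)
    round(prop.table(sn.tab), 2)            # cell proportions
    round(prop.table(sn.tab, 1), 2)         # row proportions
    round(prop.table(sn.tab, 2), 2)         # column proportions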
04:31The last thing that I'm going to do is I'm going to actually do an inferential
04:35test to see if the distribution of preferred networking sites differs by gender.
04:42This is a statistical significance test, and I'm using chi square in
04:45this particular case.
04:46The function for this is chisq.test, because we're doing the inferential
04:52test, and then the data set is the tabulation, or the table that I'm working with, sn.tab.
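That single step, in code:

    chisq.test(sn.tab)   # Pearson's chi-squared test of independence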
04:58I hit that one, and it's doing the Pearson's Chi-squared test.
05:02It tells me what data I'm using, and then it's doing the X-squared here.
05:06So, the value for chi squared is 13.2076, with 5 degrees of freedom. The
05:11probability value, and that's the one that I'm really interested in here, is 0.02.
05:15That's less than 0.05, which is the standard cutoff for
05:19statistical significance.
05:20So, this tells me that the variations between men and women in their preferred
05:25social networking sites, those are bigger than we would expect by chance; that
05:28they, in fact, are likely reliable differences between men and women in what they prefer.
05:33This shows up in terms of women are much more likely to prefer MySpace than men
05:37are, and men are much more likely to report that they have no preferred site.
05:41This warning message on the bottom, it says that chi squared approximation may
05:45be incorrect; that's going to have to do, because I have a relatively small
05:49sample, and I have some, what are called, sparsely populated cells.
05:53Normally, for a chi square to be reliable, you're going to want to have an
05:58expected frequency of five or ten cases per cell; not observed frequencies, but
06:02expected frequencies, which is a different thing.
06:04But mostly, I may want to exclude some of these social networking sites from the
06:08analysis, or combine them, so I can bump up the expected frequencies, and better
06:13meet the requirements of the chi square.
06:15That being said, I still have good evidence that suggests that there are gender
06:19differences in preferred social networking site by using the cross tabulated
06:22data, and the chi squared test for significance.
Comparing means with the t-test
00:00One common inferential test is to compare two groups on a single
00:04quantitative outcome.
00:05While there are several ways to do this, the most common is to use a T-test.
00:09In this particular example, we're going to show that this is a very simple thing
00:13to do in R. I'm going to use the google_correlate data that I've used before.
00:17I'm going to load that.
00:19I'm just going to bring up the list of names.
00:21What I'm going to do here, just for fun, is I'm going to look at interest in NBA
00:26as a search term, and see if that differs between states that have NBA basketball
00:32teams, and states that don't.
00:35So, all I need to do is come down here to line 10.
00:38I'm using the function t.test. That makes sense.
00:40I'm saying what my outcome variable is.
00:43That's NBA; that means as a search term, and then the predictor is whether they have an NBA team,
00:49so has_nba is a yes/no variable.
00:50I'm just going to run line 10 here, and maximize this one.
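The whole test is a one-liner; here is a sketch, assuming lowercase variable names:

    t.test(nba ~ has_nba, data = google)   # Welch two-sample t-test by default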
00:55You see here that it's telling me that it's using the Welch Two Sample t-test.
00:59That is something that allows for unequal variances between samples.
01:03It tells me the data that it's using.
01:04It's using the variable NBA as the outcome, by whether a state has an NBA team.
01:10The value for t is -4.745, and then the degrees of freedom is 37.105.
01:17You get a fractional degree of freedom when you use the Welch test.
01:20The p-value is 3.071 times ten to a negative power, so there are a lot of zeroes in front of it.
01:26What this tells us is that it's statistically significant; that there is in fact
01:30a reliable difference.
01:31You can get that by looking at the 95% confidence interval.
01:34You see that the difference between the two groups ranges somewhere
01:38between -1.6, and -0.6.
01:43Since this is based on something that has an average of zero for the nation,
01:47that's a reasonable size, and then in fact, it gives us the means for the two groups.
01:52The groups that do not have an NBA team, the states that don't have one, their
01:57average interest in NBA as a search score is negative. It's -0.5.
02:03That means that they as a group are half a standard deviation below the mean
02:08in searching for NBA.
02:09On the other hand, the number on the right, the 0.62, that is the z-score for
02:15search interest in NBA for states that do have a team.
02:18So, they're a little more than half a standard deviation above the mean.
02:22Anyhow, this is a very simple procedure.
02:25It compares two groups, those who do or do not have NBA teams, on a single
02:30quantitative outcome, and in this case, we found a statistically significant
02:33difference between the two groups.
Comparing means with an analysis of variance (ANOVA)
00:00When you're looking at associations in your data, the final test that we want to
00:03look at right now is comparing several groups on a single quantitative outcome.
00:09If you're comparing just two, you would use a t-test, but when you have more
00:12than two, you usually want to use an analysis of variance, or ANOVA instead.
00:16For this example, I'm going to use the google_correlate data that we've used before.
00:21I'm going to load it, and just get a list of the variable names.
00:25The first test that I'm going to do is what's called a one way ANOVA.
00:30That is where you're comparing several groups, but on a single factor.
00:34So, what I'm going to do here is I'm going to look at interest in data
00:37visualization by region.
00:40I have four regions.
00:41The way I set this up is, first, I'm going to create a model here that I call anova1.
00:47By the way, I'm using the assignment operator <- to save this.
00:50The function is aov, for analysis of variance.
00:54Then I specify the outcome variable, which is data_viz, and then the tilde; it
00:59could be read as a function of, or as predicted by, region.
01:03Then I have the comma.
01:05That says both of these came from the data set Google.
01:07That way I don't have to put Google dollar sign in front of each one of these.
01:11I run the model by simply hitting run on line 10.
01:14You can see that it showed up there in the in the Workspace.
01:16I have this model anova1.
01:18Then I'm going to get a summary of this model by running 11.
01:22What I have here is I have the degrees of freedom for the model, based on region.
01:28There are four regions, so there's three degrees of freedom, and I have the residuals.
01:32Then I have what's called the sum of squares, and the mean squares, and I have the F value.
01:37The F, which is 1.059, is the inferential test.
01:40The last one, Pr(>F) is the probability value.
01:45If that value is less than 0.05, then I usually have a statistically significant
01:50difference between my groups.
01:52Now, this one is 0.376.
01:54That's much higher than 0.05.
01:56What this tells me is, while there is a difference between the means of these
01:59four regions, there's about 38% chance of getting a difference that big just
02:04through random error, and so that's considered just random fluctuation.
02:07This tells us that even though there are differences between the means, it's not
02:11considered statistically significant or reliable.
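A sketch of the one-way model and its summary, using the names from the narration:

    anova1 <- aov(data_viz ~ region, data = google)
    summary(anova1)   # degrees of freedom, sums of squares, F value, and p-value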
02:14Now, that's a one way analysis of variance, where I'm using a single
02:17classification variable.
02:18What's really common, for instance, in experimental research is to do a two way
02:22classification, or a factorial design.
02:25Now, there's two different ways to specify this, and I'm going to show you both of them.
02:29They give the exact same results.
02:30The first one, I'm going to save as the model that I'm calling anova2a, and
02:35I use the same function, aov, for analysis of variance, and I specify my
02:40outcome variable, data_viz, and then the tilde to say that it's a function
02:44of, or predicted by.
02:45Then I'm going to use region again, and I'm going to throw into it whether a
02:50state has a stats education curriculum in the K through 12 system.
02:55Then I'm also going to have the interaction between those two.
02:58So, the region colon stats_ed is a way of specifying the interaction.
03:03That's an important thing when you do a factorial analysis of variance.
03:06Then the last line says, and all these variables come from the data set Google.
03:10So, I'm going to run that model by highlighting those three lines, and I press in run.
03:16You can see that it showed up in the Workspace there.
03:18I'm going to get the summary by running line 18.
03:21What we have here is several lines. One is for region.
03:26It says, is there a difference by region all by itself?
03:29The second is for stats_ed.
03:31Is there a difference by stats_ed all by itself?
03:34The third one is the interaction of region and statistics ed, and it says,
03:39does the average score for region depend on whether they have stats education or not?
03:44Actually, what you see here is we have the degrees of freedom, then the sum of
03:48squares, the mean squares, and then the F value.
03:50The F value is the inferential test.
03:52In the last column, the Pr is the p-value; the probability value.
03:56If those are less than 0.05, then it's statistically significant.
03:59You can see that none of them are.
04:01So, really, this tells us that these two predictors, region, and the presence or
04:06absence of a stats education curriculum, and their interaction are not
04:09significantly associated with interest in data visualization on Google.
04:14I'm just going to show you the exact same test in a different way, because there
04:17are two ways to specify it.
04:19This one, I think, is a little easier.
04:21This one, the model is anova2b, because it's my second ANOVA, but I'm setting it
04:26up in the second way, so that's the B. I use aov, for analysis of variance,
04:31data_viz is my outcome, the tilde for predicted by, or as a function of.
04:35This time, instead of spelling out all three, I just say region*stats_ed.
04:40So, this by that, and both come from data set google.
04:44I highlight those three lines, and run them.
04:46That shows up in the Workspace on the right, and then I'm going to get the
04:49summary for that one.
04:50I'm going to make this console a bit bigger right now.
04:54You can see that I have the exact same results between the two different ones.
04:58I just find the second version of the analysis of variance, for me, easier to
05:02set up, although the earlier one is more explicit, where you're spelling out the
05:06main effect and interaction.
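For reference, the two equivalent ways of specifying the factorial model might be sketched like this:

    anova2a <- aov(data_viz ~ region + stats_ed + region:stats_ed, data = google)  # explicit form
    anova2b <- aov(data_viz ~ region * stats_ed, data = google)                    # shorthand form
    summary(anova2a)
    summary(anova2b)   # identical results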
05:09Anyhow, in this particular case, these effects were not
05:12statistically significant.
05:13Analysis of variance can be a really good way at looking at group differences on
05:17a quantitative variable.
05:19In experimental research, it's often the analysis that is of the
05:22greatest interest.
Conclusion
Next steps
00:00Thanks for joining me on Up and Running with R. Before we go, I want to give you
00:05a few tips on directions you can take to better understand R, and how you can use
00:10it in your data analysis.
00:11Now, I've actually saved an R script that has this information in it, just as
00:16text and comments, but let me give you some ideas.
00:19First off, there are additional courses on the Lynda.com Online Training Library
00:24that would be worth investigating.
00:25One, for instance, is Interactive Data Visualization with Processing; a language
00:30that is command line, like R, but developed specifically for creating graphics.
00:36Another one is SPSS Statistics Essential Training.
00:39SPSS is another very common statistical package.
00:43That course, which goes into greater depth on a lot of the statistical
00:46procedures, can give you information about what the procedures would look like
00:49even when they're conducted in R. Similarly, Lynda.com has a collection of
00:55courses on the use of databases, such as SQL, MySQL, or MongoDB.
01:01In addition, there are a number of books that can be useful.
01:04One is R in a Nutshell:
01:06A Desktop Quick Reference (2e) by Joseph Adler.
01:10Also, the R Cookbook by Paul Teetor is a great reference for practical examples
01:16of working with data.
01:18In a similar vein, the R Graphics Cookbook by Winston Chang gives detailed
01:23information on producing graphs, and modifying them, with the tremendous
01:27flexibility offered in the R programming language.
01:30In fact, there's a very long list of books available at the R project Web site.
01:34Just see the URL that's in the script for this movie.
01:38Also, there are a couple of books available that are specific to RStudio, which
01:43we've been using in this course.
01:44One is Getting Started with RStudio by John Verzani, and the other is Learning
01:50RStudio for R Statistical Computing by Mark P.J. van der Loo and Edwin de Jonge.
01:56There are also a number of Web sites that provide very active and comprehensive
02:01support for R. The most significant of these is the R project Web site itself,
02:05r-project.org, which is a tremendous resource, and a gateway for other sources.
02:11They also publish the R Journal.
02:13That's an open access refereed journal of the R project for statistical
02:17computing, and that's available at journal.r-project.org.
02:21In addition, there are hundreds of Web sites.
02:24One of the nicest is the Web site r-bloggers.com, which is a compilation of Web sites.
02:30That is, it's news and tutorials about R contributed by over 400 bloggers.
02:34It's a very active Web site with 200 to 300 posts per month.
02:38There's also a specialized search site.
02:41It's rseek.org by Sasha Goodman, and that allows you to specifically search
02:46information relevant to R. Also, StackOverflow has discussions on R. Just search
02:53for the questions that are tagged with R. At Wikibooks, there's an R Programming
02:58Wikibook available also.
03:00You can see the URL available in the script.
03:02In terms of software, you might also want to look at Rcpp.
03:07Those of you who are comfortable with C++ can find a package for integrating R with C++,
03:14written by Dirk Eddelbuettel and Romain Francois, that gives vastly improved
03:19speed for large calculations.
03:22There's a series of tutorials by Hadley Wickham available for this through
03:26github that you can find on the URL in this script.
03:30Finally, there are also support groups and events available for people who
03:34use R. The most significant is useR!, that's with a capital R and an exclamation
03:39point, which is an international conference that takes place in June or July of each year.
03:44Many large cities have local R user groups, and you can see a complete list of
03:48these at Revolution Analytics at the URL provided in this script.
03:53No matter how you decide to pursue it, and the purposes that you use R for, I
03:57think you'll find that there is tremendous potential, flexibility, and the
04:01opportunity to adapt R to whatever purposes you have, and I think you'll be
04:07extraordinarily pleased with what you can accomplish with R. Happy computing!