Introduction

Welcome

(music playing)

Hi! I'm Barton Poulson, and I'd like to welcome you to Up and Running with R. R is an open source program and programming language that has become one of the most powerful choices available for statistical analysis. In this course, I'll teach you to use charts, such as histograms, bar charts, scatter plots, and box plots, to get the big picture of your data; descriptive statistics, such as means, standard deviations, and correlations, for a more precise depiction; and inferential statistics, like regression, t-tests, the analysis of variance, and the chi-square test, to help you determine the reliability of your results. Finally, I'll demonstrate how you can create beautiful charts for presentations and share your results with other people. If you're ready to get going, let's get started with Up and Running with R.
Using the exercise files

If you're a premium member of the Lynda.com Online Training Library, then you have access to the exercise files used throughout this title. The exercise files are contained in a folder, and there is one R project folder for each movie. Inside the R project folder, you'll find the R file and any other files needed to follow along with the movie. If you're a monthly or annual subscriber to Lynda.com, then you don't have access to the exercise files, but you can follow along from scratch with your own data. And with that, let's get started.
1. What is R?

R in context

Before we get started working with data, I want to take a couple of minutes to give a little background on R, and some context on how it's used today. This is a good thing to do because, for many people, R is something of a mythical beast: they have heard of it, and they have been told that they should use it, but they don't really know why or how. The problem is that it's very hard to leave the comfort of familiar approaches to data, like SPSS, or SAS, or even more frequently Excel, without understanding a little better what's to be gained by the exercise.

Let me start out with a little bit of history. R was originally developed by Ross Ihaka and Robert Gentleman, who were both statistics professors at the University of Auckland in New Zealand. Ross wrote a short paper about the history of R called "R: Past and Future History," which is available on the R Project website. Their original goal was to develop software for their students to use, but when they made the first public announcement of R's development in 1993, they were encouraged to make it an open source project.

Now, as a note, let me just add that R is not a statistics program per se, but a programming language that works very well for statistics and was developed with that purpose in mind. It was based on S, another single-letter programming language that was developed for statistical analysis, and which still exists, primarily in its incarnation as S+. Anyhow, early alpha versions of R were released in 1997, version 1.0 came out in 2000, and 2.0 came out in 2004. Version 3.0 is due in mid-2013.

What's most fascinating to watch is the growth of R, especially compared to programs like SAS, which goes back to the mid-60s and has a substantial corporate structure around it, or SPSS, which was also developed in the 60s and is now owned and developed by the industry giant IBM. The wonderful r4stats.com website, maintained by Robert A. Muenchen, releases data annually on the popularity of several statistical packages, including R, SAS, SPSS, Stata, and several others. And I'll just remind all of us one more time that, unlike SAS and SPSS, which are very expensive and can have very restrictive licensing requirements, R and all of its packages are free and open source for anyone to download and use.

It's true, though, that a lot of people are intimidated by the fact that R is a command-line programming language, and they feel much more comfortable with dropdown menus and dialog boxes. Fortunately, there are several free programs and packages that run as layers or shells over R and can provide just that kind of experience. However, as programmers like to say, the command-line interface may not really be a bug but, instead, a feature. That is, it makes it much, much easier to keep an explicit record of what actions were performed in an analysis, and to repeat them in the future. It also makes it easier to share those analyses with others, which makes collaboration much easier. It can also facilitate the integration of R with other programs and languages, such as packages that allow R to work both ways with Excel -- that is, you can run Excel from R, and you can run R from Excel -- and even integrate R with SAS and SPSS. And so, for all of these reasons, R should be more than just a shadowy possibility for most people. Instead, as this course will show you, R can be easy, it can be informative, it can be fast, and, believe it or not, it can even be fun.
2. Getting Started

Installing R on your computer

R is a free download that's available for Windows, Mac, and Linux computers, and installation is a simple process. The first thing you need to do is go to the R website; that's r-project.org. From there, you can scroll down to where it says Getting Started, and you see "download R." I'm going to click on that right now. When you click on that, you get to choose what's called a CRAN mirror. CRAN stands for the Comprehensive R Archive Network, and these mirrors are servers that hold identical copies of all of the R material; it's usually helpful to find one that is physically close to you. I'm going to scroll down to the United States, and I'm close to UCLA right now, so I'm going to click on that one.

From there, you have three choices, depending on the operating system of your machine. You can download R for Linux, for Mac, or for Windows, and most people will be downloading R for Windows. Let's click on that one first. From there you have a few different choices. The one that most people are going to want is "base," and then you can simply download it by clicking on that top link. I'm going to back up and show you the Mac version. If you click on Mac, then the one that you want is this one right here that says "package"; R 2.15.2 is the current version. Then one more: I'll back up, and for Linux users, the version that you download depends on the distribution of Linux that you're using.
Using RStudio

R is a very popular language for working with data, but not everybody wants to do their work in the R application. Some people prefer having one window that shows everything they need. Many people prefer graphical user interfaces, or GUIs, to command-line programming. In addition, the default interface for R looks and acts somewhat differently in each operating system, which complicates courses like this one. Fortunately, because R is open source, a number of alternatives to the standard R environment have been developed. The list is rather long, but I want to mention one in particular right now, and that's RStudio.

If you go to the web, to rstudio.com, you have the option of downloading RStudio. This is an IDE, or integrated development environment, for R. It's simply a layer that goes over the top of R, which has to be installed separately. It's a free download, so simply come down here to the bottom left, click on Download Now, and then choose the version that you want. You can choose a different one depending on your operating system, or you can even run RStudio remotely in a web browser. We've already downloaded and installed it on our computer here, so once it's installed, you'll have an icon like this on your Desktop.

I'll double-click on that to open up RStudio, and you see that what we have on the left here is the R console, with the exact same text that shows up when you open R in the R application. In fact, the coding is identical; this is simply a different arrangement, but it allows consistency between Mac, Windows, and Linux, and that's important. Also, it makes it easier to get to the help information, the package information, and other sources that we will be using throughout the course. Again, the fact that the console shows the same text as the R application emphasizes that RStudio is simply a layer over the top. It allows you to have several windows open simultaneously, organizes them, and makes it easier to deal with things like packages, the help, the workspace, and the history, which we'll talk about later. But for right now, I want to make it clear that this is the same program accessing the same files; your work is interchangeable in either one.

There are a couple of other advantages to RStudio, aside from the fact that it's consistent from one platform to another. For instance, it allows you to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents. You do that by coming up to the top right, to Project, and creating different projects with different settings. Another one of the big advantages of RStudio is that it has built-in GitHub integration, so if you're going to be using versioning, this is a huge advantage. Also, it's easier to work with graphics, especially in terms of exporting them in several different formats and resizing them. You also have the possibility of interactive graphics with the manipulate package in RStudio.

There are a lot of other options available, mostly graphical user interfaces and other kinds of interfaces that can be laid over the top of R. Some of the other ones are the R GUI that comes with the precompiled version of R for Windows; R Commander; RExcel, which allows you to use R and R Commander from within Microsoft Excel; and Revolution Analytics' distribution, which is elaborately developed for enterprise use and big data. If you want to see a more complete list of your options, you can simply go to the Wikipedia article on the R programming language, which has a section on graphical user interfaces. You can also see the Journal of Statistical Software from June of 2012, which discusses GUIs for R.

RStudio is one attractive option among many for working with R. It's a good idea to spend a little time exploring the alternatives, so you can find what works best for you and for your own projects. With that in mind, we'll be using RStudio throughout this course, because it allows us to have consistency between different platforms. Those of you who've worked in Java or C++ may be familiar with Eclipse; that also gives a consistent interface across platforms, so this is a similar idea. It's important to remember, though, that RStudio is not a replacement for R, but a layer over the top. You still need to have R installed on your machine, and RStudio will simply access that installation. In addition, the files that you create in the editor are saved in the native .R file format and are completely interchangeable with R's default interface. In fact, in creating this course, I have used both RStudio and the default Mac interface for R, so there's no problem going from one to the other. Whatever interface you use, you'll still have the same incredibly rich and flexible experience with R's considerable powers, which is where we'll turn in the next movie.
Getting started with the R environment

Let's start by taking a look at R when it first opens. For this course, we'll be using the RStudio interface, so I'll begin by double-clicking that icon on the Desktop. If you want to use the default R application, just double-click on the appropriate icon. R is the 32-bit version for older computers; R64 is the 64-bit version, which most people will want, and which will become the default in R 3.0. Either way, once they're open, they appear identical. Also, if you prefer to work in other environments, you have other choices. For instance, on a Macintosh you can open up the Terminal and access R that way by simply typing the letter R at the command prompt. Similarly, in Linux, type R at the command line, or you can set things up to use the text editor of your choice through the preferences or options.

When you first open R, what you get is the console. That's what I have here on the left, and it comes up with a bunch of boilerplate text. It tells me, for instance, the version that I'm using, it gives information about the license, contributors, and citation, and also how to get some demos or help, and how to quit R in the console. In RStudio, it's easy to resize the windows by simply dragging the dividing line right here; I can make it smaller or larger. And while the console is where the action happens in R, it's not the place where you want to be working. Instead, you want to be working in a script environment, because you can save that.

Also, I want to clear the console first. On Mac and PC in RStudio, that's just Ctrl+L, or you can go up to Edit and down to Clear Console. I'm going to use Ctrl+L; it just clears out all the text. Then I'm going to open up a script. You can either open a new one by coming up to File > New > Script, or you can click on this menu option right here to create a new script. I've already written a script for this movie, so I'm going to open that by going up to this icon right here to open an existing file. I'm going to come down to where I have it; I'm in the Desktop, Exercise Files. This is chapter 2, movie 3, and there's the file. I'm going to double-click on that, and it opens up in RStudio.

Now, I want to point out that there's a lot of code in this one, but almost all of it is comments. Anything that begins with the hash, or number sign, and shows up in light green here is a comment, and it's not run. The actual code is in the blue and the grey, you'll see. I can run each line here, and it will show up in the console one at a time. So, for instance, I'm going to come down to line 4, where I simply have 2 + 2 written, and as long as I'm anywhere in that line, on the PC I can hit Ctrl+Return, or on the Mac Command+Return, and it will run that line. Now, what you see in the console on the bottom is, all in blue, 2 + 2 -- that's the command that I wrote, and it included the comment after the hash sign -- and then beneath that, it gives the output; the result of this one. You can tell the command because it appears after the command prompt, the greater-than sign, and the response appears after an index number. The 1 in the square brackets is the index number for a vector. The idea is that sometimes R puts out a whole lot of numbers, and it gives you the index number for the first number in each line.

In fact, I'll show you what it's like if there's more than one line. I'm going to come down to line 6 in the script on the top, where it says 1:100, and what that's going to do is print the numbers 1 to 100 across several lines. The cursor is there, so I can just hit Ctrl+Enter on the PC, or Command+Enter on the Mac, and now you see we have the index numbers. The first line begins with index number 1, the second line begins with index number 17, and so on. So, when you get your output and you see these little cryptic numbers in the square brackets, that's just giving you the index number for the vector that it's dealing with.

Also, you may have noticed that there's no command terminator on these. For instance, I don't have to put a semicolon or any other mark at the end of the command; R simply runs it one line at a time. If I have a command that's going to go across more than one line, it's in parentheses, and I'll have examples of that later in this course. A customary thing, also, whenever you're learning a new language, like the R programming language, is to learn how to write "Hello World!" Because this is text, I just put print, and then in parentheses I put the text that I want in quotes -- in this case, "Hello World!" So, I press Ctrl+Return on the PC, Command+Return on the Mac, and now I have my "Hello World!"
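
Here is a minimal sketch of those first few commands as they might appear in the script; the inline comments are mine rather than the exact ones in the exercise file.

    # Run each line with Ctrl+Return (PC) or Command+Return (Mac)

    2 + 2                  # Prints [1] 4; the [1] is the index of the first value on the line

    1:100                  # Prints the integers 1 through 100 across several lines,
                           # with the index of the first value at the start of each line

    print("Hello World!")  # Text must be quoted and passed to print()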
I'm going to scroll down a little bit in this window. Because R is a programming language that was intended for working with data, it also works very well with variables. In line 11, I'm going to create a variable called x, and I'm going to put into it the numbers 1 through 5. Please note I have an assignment operator here; that is the <-, the arrow, and it's often read as "gets," so I would read this as "x gets the numbers 1 to 5." I'm going to bring the cursor down there and hit Ctrl+Return on my PC, Command+Return on the Mac, and you see now that I have x gets 1 to 5, and it tells me that it's run that command. But also look at the top right: there in the workspace, it's telling me that I have now created a variable called x; it's an integer with five numbers in it. If I actually want to see the numbers that are in x, all I have to do is enter the name of the variable, just x, and then I've got a hash comment after it that says "display the values in x." So, I'm going to hit Ctrl+Return to run this line, or Command+Return on the Mac. Now you see that I have five numbers -- 1, 2, 3, 4, 5 -- and the index number for the first one in the vector is 1, which is why that appears at the beginning of the line.

Also, if I want to have a set of numbers that's not just sequential, but actual data, I have the option of using a function called concatenate; that's the c here. This is in line 13. I'm going to create a variable called y, and I'm going to specify the values that I want in it. This time it's 6, 7, 8, 9, 10, and I put them in parentheses with the function c. Again, that stands for concatenate, or sometimes combine or collection, because it puts them all together into this one variable. I have the cursor in line 13. I'm going to press Ctrl+Return on the PC, or Command+Return on the Mac, and you see down in the console at the bottom, I now have in blue that that command has run, and if you look at the workspace on the top right, you'll see that I now have not just the variable x, which has five values, but also a variable y, with numeric values, that has five values as well. If I want to see what's in y, I can go back to the script on the top left here. My cursor is already at line 14, because in RStudio, any time you run a command, it bounces down to the next line, which is convenient. So, I'm going to press Ctrl+Return on the PC, Command+Return on the Mac, and now it shows me that I have these five values, 6, 7, 8, 9, 10, where the index number for the first one in the vector is 1.

One of the really neat things about R is that it allows you to do vector-based mathematics, which is a way of working with what you'd normally call an array of data, but it lets you do operations on the whole thing without having to write for loops, so the code can be much simpler. So, for instance, I have five numbers in my variable x and five numbers in my variable y, and if I want to add them to each other -- where the first element in each gets added, the second element in each gets added, and so on, because they have the same number of elements -- all I have to do is write x + y. So, here I'm in line 15. I'm just going to press Ctrl+Return on the PC, Command+Return on the Mac, and this time, it not only shows me the command, it automatically outputs the results. That's because I'm not saving it as a new variable; I'm just running it. So, here at the bottom of the console, you see that I now have 7, 9, 11, 13, and 15, and those are the sums of the items in those two variables. Also, if I want to simply multiply each of the elements in x, I can do that by writing x * 2, and it will work on each element and output it that way. The cursor is already in line 16 in the script on the top left. I'm going to hit Ctrl+Return to run that line on the PC, Command+Return on the Mac, and you see down in the bottom console that it's run that particular command, x * 2, and it's got the output here: five numbers, where the index number of the first number is 1, and it goes 2, 4, 6, 8, 10.
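
Put together, the variable and vector-math commands described above look roughly like this in a script (again, the comments here are mine, not the exercise file's):

    x <- 1:5                # x "gets" the integers 1 through 5
    x                       # Display the values in x: [1] 1 2 3 4 5
    y <- c(6, 7, 8, 9, 10)  # c() concatenates (combines) values into one vector
    y                       # [1] 6 7 8 9 10
    x + y                   # Element-wise addition, no loop needed: [1] 7 9 11 13 15
    x * 2                   # Each element multiplied by 2: [1] 2 4 6 8 10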
I just want to mention a couple of things about style and putting things together. I showed you that the assignment operator, when you want to put values into a variable, is this arrowhead, so you say "y gets the concatenation of 6, 7, 8, 9, 10" in line 13. It is possible to do this with an equals sign; R will run it, but that's considered poor style. In fact, there are several style manuals that have been written for coding in R. One of the more interesting ones is written by Google, which is nice because it's publicly available, it's short, and it's very clear. I'm going to go to my browser and show you that one. We have Google's R Style Guide, which talks about ways to name files, about indentation and brackets, and about assignment, and I suggest that as you begin to write your own code in R, you take a few minutes and go through this, so you can write code that is more readable by others, makes better sense for you, and runs more smoothly in R.
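
For example, both of these lines assign the same values, but the first form is the one the style guides recommend:

    y <- c(6, 7, 8, 9, 10)  # Preferred: the <- assignment operator
    y = c(6, 7, 8, 9, 10)   # Works, but considered poor style in R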
I'm going to go back to R now, come down to the bottom here, and clear the console; I don't need that information anymore, so I'm going to hit Ctrl+L to clear it. Now, R is a conceptually simple language and a conceptually simple program, and because it's command-line based, you don't need a lot of menus. It can be very helpful to keep a few windows open simultaneously, as we get to do here in RStudio, where we have the editor window, the console window, an indication of the variables that are active in the workspace, and access to information on packages and help in the bottom right. Also, because it's command-line based, it's easy to save the information here in the editor and share it with others. I encourage you to take a little bit of time to look at the style manual, to find ways to write your own code that make it easiest for you to understand, and easiest to share with others.
Reading data from a spreadsheet

R is a flexible program that allows you to get data into it in many different ways. I'm going to start in RStudio here, and I'm going to open up the script that I wrote for this movie by clicking on Open Existing File, going to Exercise Files, and opening Exercise 02_04. We're going to try opening a single dataset in several variations, through several different routes that researchers would commonly use.

The simplest, but not necessarily the fastest, way to get data into R is to enter it directly using the editor. So, for instance, on line 4 here, I'm going to create a variable called x, then I have the assignment -- that's the arrow -- and then I'm going to assign the numbers 0 through 10 into x. That's read as "x gets 0 through 10." The cursor is there, so I'm just going to press Ctrl+Return on the PC, Command+Return on the Mac, and you see two things have happened. Number one, in the console below, you see that now, in blue, it says that it has read this assignment and has gone to the command prompt on the next line. And on the top right of the window, under Workspace, under Values, you see that we've entered a variable called x as an integer variable with 11 values; that's 0 to 10. Now, the next thing I'm going to do is on line 5 -- and when you run a command, the cursor in RStudio automatically goes down to the next one -- and there I just have a single letter: x. That means I want to print the contents of x in the console. So, I'm going to hit Ctrl+Return on the PC, Command+Return on the Mac, and then you see we have 0 through 10, and the 1 in the square brackets is the index number, within the vector, of the first item on that line. Now, there's only one line, so it's just going to be 1, but that's an indication that we have the response here. So, that's one way to get data in; if you have sequential data, it's a super easy way to do it.

Let's say, on the other hand, you don't have sequential data. You have a range of numbers, but they are different things, and they're not in order. Well, that's what I have on line 7. I'm going to create a variable called y, then I have the assignment operator, the arrow that's read as "gets," and then I have c, which is for concatenate -- you could also say collection, or combine. Then I have a series of numbers that I've entered, with a space in between them, and then a comment at the end that says "assigns these values to y." The cursor is in that line, so I'm just going to press Ctrl+Return, and in the console you see that it has read that command. And in the workspace on the top right, you see that I now have another variable, y, with numeric values, and there are 10 of them in this case. I'm going to run the command on line 8 of the editor, which is just the letter y, to see what's in it, and the values are printed out in the same order they appeared in when I entered them.

I'm going to do one other thing here, and now you see that the cursor has moved down to line 10. That's ls; it's for list -- for listing the objects -- and it's a way of seeing what's going on in the program. I'm going to hit Ctrl+Return, or Command+Return on the Mac, and you see it tells me I have two objects there: the x and the y. Now, that's the same information that's in the top right window under Workspace, and in fact, that's one of the nice things about RStudio: having that Workspace browser right there means you don't normally even need to do this.
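
As a sketch, the direct-entry commands described here would look something like this; the specific values in y are placeholders, since the movie doesn't read them out.

    x <- 0:10                               # x gets the integers 0 through 10
    x                                       # Print the contents of x
    y <- c(5, 8, 3, 9, 1, 4, 7, 2, 6, 10)   # Ten non-sequential values (placeholder numbers)
    y                                       # Print the contents of y
    ls()                                    # List the objects in the workspace: "x" "y"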
Now, what I'm going to do is try to read data from a CSV file. The idea is that most of the time when people have data, you're not going to want to enter it one number at a time, one line at a time, in R. That's tedious, it's inefficient, and it's hard to get the structure of the data that way. Instead, most of the time, it's easier to take data that's in spreadsheet format, where you have rows and columns: one column per variable, and one row per case, individual, or observation. The most common way of doing this is in an Excel spreadsheet, or some other spreadsheet. And while there are packages that are designed to make it possible to read Excel spreadsheets directly into R, I've found them to be rather cumbersome to use, and they don't always produce the desired results. On the other hand, the simplest way in the world is to use what's called a CSV file -- a comma-separated values file -- which you can create in Excel.

In fact, to show how this works, I'm going to minimize RStudio for a moment, and right here on my Desktop you see I've got a folder that has the exercise file -- that's the script that I'm working on -- and two data files. One is a Microsoft Excel spreadsheet called social_network, and the other one is an SPSS data document; I'll get to that one in a minute. Because in R you're going to have to give references to the specific file locations, it's often easiest to move these things to the Desktop, and that's what I'm going to do right now. I'm going to grab both of these files and just slide them over to the Desktop; I'll put them back into the folder afterwards. Then I'm going to open up the spreadsheet by double-clicking on it, which brings it up in Excel.

Now, here's the spreadsheet. What we have in this version is 5 columns. The first is an ID number, the second is the Gender of the respondent, the third is the Age of the respondent, and the fourth and fifth have to do with the subject of the survey, which was about people's preferred social networking sites -- this is from about 3 years ago -- with the last one being how often they say they log in to that site each week. Another thing to notice, and this is significant, is that we have missing data in this one. So, for instance, in cell E6, right up here, the person said they did not have a preferred site, and they did not provide a number of times. Also, in cell C8, this person didn't provide their age. Now, this is important, because while it's true that most statistical analyses are easier if you have a complete data set, it's also true that complete datasets are not always the case. And so, I wanted to use this one, because it shows some of the things that you can do when you have incomplete data.

The first thing that I'm going to do is take this file and save it as a CSV file; that's comma-separated values. I'm going to come up to File and go to Save As. When that comes up, I'm going to move to the Desktop, because I want to save it there, and I'm going to come down to Save as Type, where it currently says Excel Workbook. I'm going to click and go about halfway down to the one that says CSV (Comma delimited) -- comma-separated, or comma-delimited. Now, you also have a choice of saving it as a Tab Delimited Text file, that's this one right here, in which case it would be a .txt file. That introduces some extra complexities in getting things into R, in that you have to be explicit about what marks the missing values and what the separators are. I find it easier to just use a CSV. So, I'm going to come back to CSV, click on that, and save it to the Desktop. I can just go right ahead; it's true, it's going to lose some of the formatting. And I'm going to close that file, say Yes, and Yes, and minimize Excel. And now you see that I've got this file right here; this is an Excel CSV file. Now what I can do is open this one up in R.
I'm going to go back to RStudio now and show the next few lines in the script; it says CSV files. The first thing is that R takes missing data, which in Excel or in SPSS is just a blank, and replaces it with NA, for "not available." Because we're using a CSV file, you don't have to be specific about the delimiters for missing data; you don't have to say that two tabs in a row mean a value is missing. Also, CSV stands for comma-separated values. Another thing that I have to put into this command is that there's a header across the top that has the names of each of the variables. Sometimes you'll have those; sometimes you won't. If you do, you need to tell R that you have them, so it doesn't try to read them as regular values. And then there's an issue here with backslashes on Windows PCs.

Let me show you the first command. I'm going to go down to line 18, and what I'm going to do is take this spreadsheet and read it into what's called a data frame. You can just think of that as a matrix that holds data, although a matrix and a data frame are actually different, because in a matrix everything has to be the same data type, while in a data frame the columns can be of different types. I'm calling it sn, for social network, .csv, because I'm using a comma-separated values file. That is the dot there. Now, a lot of people associate the dot with a method for an object, but the Google style manual for R that I showed you suggests using a dot to separate words in variable names and data frame names, so that's what I'm doing here. So, I'm creating a data frame called sn.csv, for social network CSV; then I have the operator, the arrow made of the less-than sign and the dash, which means "gets"; and then I'm using the function read.csv -- that's a built-in function -- and I have to specify the path. Now, normally on a Windows computer, the path looks like this, and unfortunately, paths get really long. I'm being explicit about the entire path, so I have C:\Users\Barton Poulson\Desktop\social_network.csv, and then I have this little thing, header = T; header = TRUE. There is a header in there.

Now, the problem is, if I run that one -- I'm just going to have the cursor right here, and I'll hit Ctrl+Return here on my PC -- watch what happens: I get an error message. That's because when R gets a backslash, it tries to read it as what's called an escape character, which it uses for reading special characters like line returns or quotation marks. And so, there are two ways of dealing with that. One is to double up the backslashes, so what you're actually doing is escaping the backslash: the first backslash says "something is coming that I need you to read a special way," and the second one means "it's a backslash." If I do that -- let me come down to line 20 here and press Ctrl+Return on the PC -- I get this. Now, you see down at the bottom it says that it's read it, and if you look over on the right in the Workspace, at the top we now have Data, and it says sn.csv -- again, that's for social network CSV -- and it's 202 observations of 5 variables. So, it's read it.

The other option is here on line 22, and what this one uses is forward slashes. Now, Macintoshes use forward slashes, though I wouldn't have the C: there on a Mac. But you don't have to rearrange things on a Mac, because the forward slashes are readable, and by using forward slashes even in the Windows PC path, it works as well. So, I'm going to hit Ctrl+Return, and you see down in the console that it read that one too. It had the exact same name, so it just overwrote the same dataset in the workspace.

I'm going to use one other little command here: str. That is for structure, and structure is a nice way to double-check that things got entered the way you wanted. You put the name of whatever it is you're checking right after it; I'm doing the structure of this data file. So, I'm going to hit Ctrl+Return, and what it tells me, looking down in the console, is that I have a data frame with 202 observations of 5 variables, and it tells me what the variables are, what the possible values are, and runs off the first several values. So that's a good way of seeing what's going on.
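
Here's a rough sketch of those read.csv calls; the path is the one described in the movie, so adjust it for your own machine.

    # Fails on Windows: single backslashes are read as escape characters
    # sn.csv <- read.csv("C:\Users\Barton Poulson\Desktop\social_network.csv", header = TRUE)

    # Works: escape each backslash by doubling it
    sn.csv <- read.csv("C:\\Users\\Barton Poulson\\Desktop\\social_network.csv", header = TRUE)

    # Also works, on Windows and Mac alike: forward slashes
    sn.csv <- read.csv("C:/Users/Barton Poulson/Desktop/social_network.csv", header = TRUE)

    str(sn.csv)  # Check the structure: 202 obs. of 5 variables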
So, that's how you read data from an Excel spreadsheet: by saving it as a CSV, and then using read.csv to get it in, after you've made any necessary accommodations to the file path. I also have information in this script about how to read data from an SPSS file, and we're going to look at that in the next movie.
Reading data from SPSS

In our last movie, we looked at how to get data out of an Excel spreadsheet and into R through a CSV file, a comma-separated values file. In this movie, I want to pick up and talk about how to get data out of an SPSS file, because that's a very common statistical package used by a lot of researchers. Now, I'm going to continue with the same script that I have open; that's SPSS right here at the bottom, on line 25. I'm going to scroll up a little bit.

There are a couple of different ways of dealing with SPSS in R. The one that I'm going to recommend is actually to use the exact same procedure we used with Excel: save it as a CSV file, and then import it using read.csv. I find this to be the simplest and most straightforward route, and it has the fewest errors. The way you want to do this is by opening the file in SPSS, and then using the saving function that it has there. I'm going to minimize R, and we'll get into SPSS to do that. I'm just going to come down to my data file and double-click on that.

What you see is that this one looks a little different from the Excel spreadsheet; it has an extra column. The first column is the ID number. The second column is the Gender of the respondents, and that's written as text; in SPSS, that's referred to as a string variable. The third column here, though, is redundant with that one. It's called Female, and it's written as 0s and 1s. I've done this because very often in SPSS, the practice is to enter even text variables as numbers, and then associate labels with those numbers. Now, I personally like to use a 0/1 indicator variable for gender, where 1 indicates the person is of the specified gender and 0 indicates they're not, because I find it a lot easier to read those results for a correlation coefficient or a regression. And which one is 0 and which one is 1 is completely arbitrary. The first case in this one was a male, so they got a 0; the second one was female, so they got a 1. You can see that I have variable names that go over them. If you come up to the bar and click on the fourth icon from the right, this one right here, you see that it says Value Labels. If I click on that, you see that for Female, the 0s and 1s have the "male" and "female" labels that go over the top of them.

So, I'm going to go over to File, come down to Save As, and then from there I go to Save as Type. Right now it shows the default, .sav. I'm going to come down to Comma delimited; that's .csv. Then you see up here that that's the existing one that I created in Excel. In order not to overwrite that, I'm actually going to change the name slightly and add "_spss." Then I'm going to click Save and minimize SPSS.

Now, the second CSV file is the one that we just created in SPSS. What I'm going to do is go back to RStudio, and I'm going to run this command right here; it's sn, for social network, .spss.csv, and I'm going to use the read.csv command. You see it's mostly the same. I'm going to just scroll to the end here, and all I need to do is give the exact file path, and then specify that it has a header with the variable names at the top. I go back to the beginning, and I'm going to run this one now; just hit Run. Now you see that it's run, and in fact, on the right, I now have data: I have sn.spss.csv, and that's worked as well. I can run the structure to see exactly what it looks like -- just run that command -- and that gives me a description of what it's like. The CSV route I find to be the easiest and most direct way of doing this.

There are actually several packages of code that have been developed to read files like SPSS files directly into R, without translating them into CSV files first. One of these is called foreign, for reading foreign file formats. There's something interesting that happens here; I'm going to scroll down. Now, a package -- known as a library in most other languages; we're going to talk more about packages in the next movie -- is a little bundle of code that adds functionality, but it has to be installed. So, the first thing I'm going to do is install it, and this is actually going to download it and put it into R. I'm just going to run that line, number 32. Then you see on the bottom that I get a bunch of text in the console: it says it ran that command, in blue, then that it installed it, in red, and then it gives me some final results in black. Plus, if you look in the bottom right here under Packages, there's now one installed called foreign. You notice it doesn't have a checkmark. Now, I can check it myself, manually, but in order to keep a record of everything, it's nice to do that with the script, so I come back up to the script, to line 33, and I say library(foreign). That's going to load it. When I run that, you can also see the checkmark come on in the bottom right.

Then I'm going to use its own special name here: sn.spss, and then I have the .f to say I'm using foreign; that's just for me. Then I have the function read.spss, and I give the file path. At the end, I have to specify two extra things. One is to.data.frame; that is, I'm taking this SPSS file and saving it as a data frame, which is how we store all our data. The other is that I want to use the value labels instead of the numbers for the numeric variables that have value labels. I go back to the beginning, and I'm going to run that line, and you can see in the workspace on the top right that I've now added another one: sn.spss.f. There it is right there. I'm going to run the structure on that one, and you can see that it's all loaded the way I wanted as well.
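
As a sketch, the foreign-based route described above looks roughly like this; the file path follows the pattern used earlier in the course and is a placeholder for your own.

    install.packages("foreign")   # Download and install the package (one time)
    library("foreign")            # Load it for this session

    # read.spss() reads the .sav file directly; to.data.frame = TRUE stores it as a
    # data frame, and use.value.labels = TRUE keeps labels like "male"/"female"
    # instead of the underlying numeric codes.
    sn.spss.f <- read.spss("C:/Users/Barton Poulson/Desktop/social_network.sav",
                           to.data.frame = TRUE,
                           use.value.labels = TRUE)

    str(sn.spss.f)  # Confirm the structure loaded as expected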
So, this one is good, and it works, but I generally get warning messages. The warning message is not problematic: R still went ahead and loaded the data, and it still did it the way I wanted. On the other hand, I'm more comfortable using the CSV, because I don't even have to install or load a package of code in order to do it. And so, regardless of how you get your data into R -- either by using a CSV file from Excel or from SPSS, or by using a package like foreign to read it in directly -- you're going to have a lot of opportunities to work with that data, and that's what we will discuss in the next chapter.
Using and managing packages

R is a very powerful and flexible program, even with its default installation. The beauty of R, though, is that it can go so much further than its base version by adding packages, or bundles of code that add functionality to R. At the moment, the Comprehensive R Archive Network, or CRAN, package repository lists over 4,000 packages for R, all of which can be freely downloaded and installed. The creativity and functionality of these packages is astounding, leading many people, such as myself, to tell others that R can do anything. In this movie, I want to show you how to find out about packages, how to install them, and how to use them in R.

The first thing to do is to find out about the packages that are available. On the bottom right of the screen here -- and this is one of the nice things about RStudio -- you have a list of packages that are already available. It starts from the bootstrap functions under boot, goes through classification, and continues all the way down to the utilities. These are ones that are installed, but that doesn't mean they're loaded at the moment. The checkmark means that a package is currently loaded, so the utilities and the stats are the ones that are loaded in this particular window.

Let's take a look at what some of the options are. I'm going to go to line 6 in the editor window here, and browseURL; this opens up a URL in a web browser. I'm just going to run that line, and it's going to open up my default browser, and there you have a large list of categories of packages that are available. CRAN, again, stands for Comprehensive R Archive Network. You can pick a field that you're interested in -- say, for instance, graphics -- and see a huge number of choices. One of the most popular, by the way, is this one right over here: ggplot2. That stands for the grammar of graphics; that's a book, and the package was written to be based on it. It's an incredible package. I'm going to go back to R. That's a list of topics; you can also see what's available by name. In this case, I'm going to go to a specific mirror, the one that's at UCLA, and I'm going to run this line. Here we have a very long list of packages; this is the 4,000 packages that are available, and pretty much everybody should be able to find something of utility for them in here.

The next step, on line 9, is to bring up an editor listing of the packages available locally; those are the ones that I have already. I'm going to just run that line. What this does is bring up a text file in an editor window, which we see right up here, and this mirrors a lot of what's over on the right, except it also shows the ones that are invisible, like base, which you couldn't turn on or off if you wanted to. Close that, and ask, what about the packages that are currently active -- that is, the ones that are already checked? I can do that with search. Just run line 10, and then in the console it shows me the packages that are there; it's got 11 listed. Again, not all of these show up on the right, because some of them are invisible, like the global environment, but the ones that are checked off on the right you'll see in this list.
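
A rough sketch of those look-around commands; the exact URLs in the exercise file may differ, and the ones shown here (the CRAN task views page and the UCLA mirror's package index) are my assumptions.

    browseURL("http://cran.r-project.org/web/views/")   # CRAN task views: packages grouped by topic
    browseURL("http://cran.stat.ucla.edu/web/packages/available_packages_by_name.html")  # All packages, by name
    library()   # Editor-style listing of the packages installed locally
    search()    # Packages (and other environments) currently loaded in this session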
Now, if I want to install a new package -- say I found one that I really liked -- there are a couple of ways to do this. For instance, you can come up to the menu, to Tools, to Install Packages; that brings up a dialog, and it's one way to do it. Or, you can use the Packages window here on the right and just click the one that you want. But personally, I find it easier to use scripts, and one of the reasons for that is that it makes the procedure repeatable for other people. It also means that you can run them in larger source scripts, and they can run automatically.

Now, one package that I like is called psych. What I'm going to do is run this line, number 18: install.packages. That's the command to download the package, and then you put the name of it in parentheses and quotation marks. I am going to run that line; it's going to download the package. You see that's what we have here on the bottom left in the console: there's all this text, and it says it ran the command, downloaded the package, and installed it. In fact, if you go to the Packages list on the right and come down, you'll see that psych is now installed. It doesn't have a checkmark, because it hasn't been loaded; that's a separate procedure.

So, what I'm going to do is come to line 20, to library("psych"). Now, please note, the quotation marks in library are not necessary, but Google suggests them as a good format; it's consistent with installing. You use the command library to make a package available when you're loading it in a script, like I am right now. On the other hand, if you've created a function or a package, sometimes you use require instead. Both of them have the same effect of loading the code that's in the package. I'm just going to use library, because that's the one that I use in scripts. So, I'm going to run that line, and you see in the console that it ran library("psych"), and in the window on the bottom right, I now have a checkmark next to psych. Require would do the same thing.
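
Put as script lines, installing and loading look like this:

    install.packages("psych")  # Download and install the psych package from CRAN
    library("psych")           # Load it; the quotes are optional but consistent style
    # require("psych")         # Alternative loader, often used inside functions and packages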
| | 04:58 | Now, if you want to see the documentation,
you can just come down here.
| | 05:02 | I put library(help = "psych").
| | 05:04 | That lets it know what I want
the help on. I run that line,
| | 05:08 | and it brings up a window in the editor.
| | 05:10 | It has a text description, and it has a lot
of the information about what goes into it.
| | 05:15 | It's pretty lengthy.
| | 05:18 | But you can get even more, and
in a different format, if you try a
| | 05:21 | different approach.
| | 05:22 | Instead of just doing this one, a lot
of programs, and psych is one of them,
| | 05:27 | have what are called vignettes, and these
really are just examples of how to use the package.
| | 05:32 | So, what I'm going to do right here is
I'm going to come to line 28, and I'm
| | 05:36 | going to use the command vignette,
then I'm going to specify it's for the
| | 05:40 | package psych, so package = "psych".
| | 05:43 | And if I run that, it brings up an
editor window with not much in it.
| | 05:48 | But if I do a small modification, and
say I want to browse vignettes, that's
| | 05:54 | going to open it up in a browser.
| | 05:56 | It's going to look like this.
| | 05:57 | Now, what I have is PDFs,
and R codes, and LaTeX.
| | 06:02 | I can hit on the PDF here, and now I
can see a PDF that is nearly 100 pages of
| | 06:09 | documentation on how to
use the psych package.
| | 06:13 | That can be downloaded and saved.
| | 06:15 | It can be searched.
| | 06:16 | It's a wonderful thing.
| | 06:17 | I'm going to go back to R. You can also
bring up a list of all of the vignettes
| | 06:21 | that are available in all of the
packages that are currently installed in R.
| | 06:26 | That's just vignette().
| | 06:28 | I'm going to run that line,
and here are all the ones;
| | 06:32 | we have displaylist, sharing, matrix,
| | 06:37 | and just as we did with the psych
vignettes a moment ago, if you want to have
| | 06:41 | an interactive, hyperlinked version of this,
you just use browseVignettes(). Now I
| | 06:48 | have the documentation for nearly
everything, including, for instance, Sweave.
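Collected in one place, the documentation and vignette commands just demonstrated are all base R utilities along these lines:

    library(help = "psych")      # the package's description and index, in the editor
    vignette(package = "psych")  # list the vignettes that ship with psych
    browseVignettes("psych")     # open that list in a browser, with PDF, R code, and source
    vignette()                   # vignettes from every installed package
    browseVignettes()            # hyperlinked version of that full list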
| | 06:51 | Now, once you have packages
installed, it's important to remember that
| | 06:54 | everything gets updated frequently in
R, and so you're going to want to get
| | 06:59 | things updated, including
your packages that you use.
| | 07:01 | In RStudio, there are a few
different ways to do this.
| | 07:04 | You can go up to Tools, then
Check for Package Updates.
| | 07:07 | You can do it there.
| | 07:09 | You can also come over here, and just
click on the green circle to check for
| | 07:12 | updates, or you can just run
this command: update.packages().
| | 07:16 | Run that one,
| | 07:17 | and it lets me know that
there are some updates.
| | 07:20 | Cancel those for right now.
| | 07:23 | Then finally, if you have a package that
you no longer need, you have the option
| | 07:27 | of simply coming over here to the window,
unchecking it, and then clicking on the
| | 07:31 | X to get rid of it if you want, or
you can also use this one: detach.
| | 07:36 | That will unload the
package so it's no longer active.
| | 07:39 | I'm just going to run that line.
| | 07:41 | Now you see that the checkmark
next to psych has disappeared,
| | 07:44 | and if I want to get rid of it
entirely, I just click on the X.
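The updating and unloading commands mentioned here; the exact detach call isn't shown on screen, so the second line below is the usual form rather than a quote from the script:

    update.packages()                        # check CRAN for newer versions of installed packages
    detach("package:psych", unload = TRUE)   # unload psych so it's no longer active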
| | 07:48 | Anyhow, that's one way that you can
add extra functionality to R, and to give
| | 07:54 | you some more of the flexibility and power
to do almost anything that you need to do.
| | 07:59 | And again, like R itself, these are free,
they're open source, and they can make
| | 08:03 | your analytical life much
easier, and much more creative.
3. Charts and Statistics for One Variable
Creating bar charts for categorical variables| 00:00 | Once the data are entered into R, the
first task in any analysis is to examine
| | 00:05 | the individual variables.
| | 00:07 | Now, the purposes of this task are threefold:
| | 00:09 | first, to check that the data were
entered correctly; second, to check whether
| | 00:14 | the data meet the assumptions of the
statistical procedures that you've planned
| | 00:17 | to use; and third, to check for any
potentially interesting, or informative
| | 00:21 | observations, or patterns in the data.
| | 00:24 | For a categorical variable, such as a
respondent's gender, or a company's economic
| | 00:28 | sector, that is, a nominal or an
ordinal variable, the easiest and most
| | 00:33 | informative way to check the
data is to make a bar chart,
| | 00:35 | and so that's where we turn first.
| | 00:37 | The unfortunate thing about R is that
it's not really set up to do bar charts
| | 00:41 | from a raw data file.
| | 00:43 | It wants to do them from a summary
data file, where you say, this is the
| | 00:47 | category, and this is how
many people are in that category.
| | 00:50 | On the other hand, if you have raw
data, where you're simply listing category
| | 00:54 | 1, 2, 1, 1, 2, 2, 2, there's an easy way
to work around it, and that's what I'm
| | 01:00 | going to show you here.
| | 01:01 | I'm going to be using the social
network data that I've used before, and I'm
| | 01:05 | going to get that loaded.
| | 01:06 | The way I'm going to do this is I'm
going to use the same read.csv function
| | 01:11 | that I've used before. That's because
I'm dealing with a comma-separated values
| | 01:16 | spreadsheet, and I'm going to feed it into
a data frame called sn, for social network.
| | 01:21 | I am going to set it up a little bit
differently, though, because you may recall
| | 01:25 | in the previous versions, I specified
explicitly the entire file path from C on.
| | 01:31 | I want to use a shortcut version.
| | 01:33 | I am going to show you how to set that up.
| | 01:35 | If you go up to Tools, down to Options,
one of the choices you have in the
| | 01:40 | General window is the
Default working directory;
| | 01:43 | that is, when you're not in a project
that explicitly puts it somewhere else.
| | 01:47 | Even though we have a little tilde here,
this actually is currently going to
| | 01:50 | my Documents folder,
| | 01:52 | but I'm going to go to Browse, and I'm
going to change it temporarily to the
| | 01:56 | Desktop, because I've copied
the files over to the Desktop.
| | 02:00 | Then I put Select Folder, and now
you see it has C:/Users/Barton
| | 02:05 | Poulson/Desktop, and I can just press OK.
| | 02:08 | And now I can just have a very short
version, where I give just the file name
| | 02:13 | without the entire file path.
| | 02:15 | I still need to use the read.csv,
| | 02:17 | I still need to say that I have a
header, but otherwise it's more
| | 02:20 | abbreviated than that.
| | 02:21 | So, I'm going to read that in right now,
and now that's loaded in, we can move
| | 02:26 | on to the next part.
| | 02:27 | You see in the console that it ran, and
you see on the top right under workspace
| | 02:31 | that I now have a data frame, sn,
202 observations with 5 variables.
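With the default working directory pointed at the folder that holds the file, the shortened read probably looks like this; the file name is inferred from the data set, and setwd() is the scripted equivalent of the Tools > Options route:

    # setwd("C:/Users/Barton Poulson/Desktop")            # or set it through the Options dialog
    sn <- read.csv("social_network.csv", header = TRUE)   # no full path needed now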
| | 02:36 | What I have here now is a bunch of
comments noting that R doesn't make bar charts from raw data;
| | 02:39 | it can't do it directly from
the categorical variable.
| | 02:42 | We first have to create
a table with frequencies,
| | 02:45 | and I'm going to use a
table function to do this.
| | 02:49 | In line 25, this is where I create the table.
| | 02:52 | What I do is I specify the name of
the new table, and that's going to be
| | 02:57 | site, because I'm looking at the
Web sites that people say are their primary
| | 03:03 | social networking sites;
| | 03:04 | .freq for frequency.
| | 03:06 | And then I have the assignment
operator, gets, and then table is the function.
| | 03:11 | And then I am specifying in the
parentheses the data set, sn, that's my data
| | 03:16 | frame, with the dollar sign;
| | 03:18 | I use that to specify which
variable I'm using to create the table.
| | 03:22 | In this case, I'm using site.
| | 03:24 | Please note the capitalization.
| | 03:26 | R is capitalization sensitive.
| | 03:28 | You've got to make sure that the
capitalization is the same all the way through.
| | 03:31 | So, I'm going to run that command,
| | 03:33 | and now you see it ran down in the
console, and on the right, I now have values.
| | 03:38 | I have a table now
with 6 values in it.
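The frequency table line as described, assuming the column is named Site:

    site.freq <- table(sn$Site)   # counts for each social networking site
    site.freq                     # print the table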
| | 03:41 | What I'm going to do now is
create the default bar chart.
| | 03:44 | This is one where I simply take a
barplot, and I just run it exactly as it is.
| | 03:49 | So, that's barplot, and then you put the table
in there, site.freq, and then I run that one.
| | 03:56 | In the bottom right here, you see
that it's opened up, and there are a few
| | 03:59 | things that are going on.
| | 04:00 | Number one is it's gray.
| | 04:02 | It doesn't have any titles.
| | 04:03 | There's only every other label.
| | 04:05 | The scale only goes up to 80,
and there are some other issues.
| | 04:09 | You can see it bigger if you want to.
| | 04:11 | Just come down and click on Zoom.
| | 04:14 | Now it fills up the whole space,
and you can see all of the labels.
| | 04:17 | There are a lot of options within
barplot that allow you to control the color,
| | 04:21 | the font, the orientation,
the order; a ton of things.
| | 04:25 | I'm actually going to take just a
second here to show you how you can find
| | 04:29 | out more about that.
| | 04:30 | I've got, here on line 28, the
question mark, a space, and then barplot.
| | 04:35 | This is how you find help on any of R's
functions, and I'm just going to run that line.
| | 04:41 | Now you see it brings up the Help
window here that talks about all the
| | 04:45 | functions and the options available in barplot.
| | 04:47 | And so, I'm going to show you a few of these.
| | 04:49 | I'm not going to run through all of
them, because there's an enormous number,
| | 04:53 | especially because barplot feeds into
some other more general options, such as
| | 04:58 | this one here that talks about
graphical parameters, which gives you just an
| | 05:02 | incredible amount of control of
things you want to specify.
| | 05:06 | Mostly I want to show you
just this very basic one,
| | 05:09 | and I'm going to make a few variations on it.
| | 05:12 | The first thing I'm going to do, and I
think it's really important, is to put the
| | 05:16 | bars in descending order.
| | 05:17 | Unless there is some sort of
inherent and necessary order in your data, a
| | 05:20 | descending order is a
really convenient way to do it.
| | 05:23 | The way to do that is actually I have
to tell it that I'm going to be drawing a
| | 05:27 | barplot, and I'm going to be using this data,
| | 05:30 | but I want to order it according to
this variable, because theoretically
| | 05:34 | you could order it according to a
different variable, and then I'm going to
| | 05:38 | use a decreasing order.
| | 05:40 | So, decreasing = TRUE.
| | 05:42 | So, I come over here, and
I'm going to run this line,
| | 05:45 | and now you see that it's in
decreasing order. That's good.
| | 05:48 | And if you want to see it bigger,
what we have here is a lot of people who
| | 05:52 | reported using Facebook.
| | 05:53 | The next biggest was people who said they
used None, but they still answered the survey.
| | 05:57 | Then, you can tell this is a few years
older, because we have people saying they
| | 06:01 | used MySpace, and then we have
LinkedIn, and Twitter with just a couple of
| | 06:04 | people each, and I'm willing to bet
that all those things have changed since
| | 06:07 | this data was first gathered.
| | 06:08 | I am going to close that window.
| | 06:11 | Now, it's better that it's in order,
but we still have an issue of the labels,
| | 06:15 | and the scale is not long
enough, and we have no titles.
| | 06:18 | I'm going to show you
some of these other things.
| | 06:21 | What I'm going to do first is I
often like to put bar charts horizontally,
| | 06:26 | because then the scale is in the same
direction that it is on a lot of other analyses.
| | 06:30 | So, what I do then is I'm going to do
barplot, and I'm still going to order
| | 06:34 | them, except I'm not doing them
decreasing, because it needs to be increasing
| | 06:38 | when you're dealing with horizontal,
because it starts at the bottom and goes up.
| | 06:41 | But this time I have
horiz, for horizontal, set to TRUE.
| | 06:44 | So, I'm going to run that command,
| | 06:47 | and now I have a horizontal one, but
| | 06:48 | you see I lost even more of the labels.
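The three variations so far, reconstructed from the narration; the ordering is done by indexing the table with order():

    barplot(site.freq)                                        # default: gray bars, alphabetical order
    barplot(site.freq[order(site.freq, decreasing = TRUE)])   # bars in descending order
    barplot(site.freq[order(site.freq)], horiz = TRUE)        # horizontal; largest bar ends up on top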
| | 06:51 | Now, I also want to do
something about the color here.
| | 06:53 | For instance, Facebook has a distinctive
color of blue associated with it, and
| | 06:57 | so it would be nice to
highlight it with that color.
| | 07:00 | So, what I'm going to do is I'm going
to come down here, and I need to create a
| | 07:04 | vector; a collection of color specifications.
| | 07:08 | And the way I do that is
I first give it a name.
| | 07:11 | So, it's like a new variable;
a new data frame.
| | 07:14 | I'm calling it fbba, for Facebook blue,
and then a, for ascending, because
| | 07:21 | if I were doing this as a vertical
bar chart, I'd need to go descending instead.
| | 07:25 | Then I have the assignment operator, and
that's the arrow, and then c is for concatenate;
| | 07:30 | sometimes collection, or combined.
| | 07:32 | And then, I'm going to
have six colors in here.
| | 07:36 | Five of them are going to be
identical; they're going to be gray.
| | 07:39 | And so, I could write gray, gray, gray,
gray, gray, or I can use this other
| | 07:43 | option; that's rep,
and that's for repeat.
| | 07:45 | And what I do is I put down rep, and
then I put in parentheses what it is I want
| | 07:51 | repeated, and I want the word
gray in quotation marks repeated.
| | 07:55 | And then after a comma, how many times I
want to repeat it, and I want it five times.
| | 08:01 | Then, after the comma, I can put the
last color that I want, and I am going to
| | 08:05 | do that one in particular way.
| | 08:07 | First off, in order to get the Facebook
blue, I want to specify it exactly, and
| | 08:11 | I've got what are called the RGB
codes; the red, green, blue codes.
| | 08:14 | And that's 59 for red, 89
for green, 152 for blue,
| | 08:17 | but I also need to tell R that I'm
working on a 0 to 255 8-bit color scale.
| | 08:23 | And so, that's what the maxColorValue
is for, and then I finish the command.
| | 08:27 | This is also the first time, I think,
that I've broken code across two lines.
| | 08:32 | The reason for that is this is a
long line of code, but it's all a single
| | 08:36 | command, and so this is one way of making it
easier to follow, by breaking it into pieces.
| | 08:41 | So, I'm going to highlight both of those
lines, and then hit Ctrl+Return to run them.
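The color vector being built here, split across two lines the same way:

    fbba <- c(rep("gray", 5),                          # five gray bars
              rgb(59, 89, 152, maxColorValue = 255))   # plus one Facebook blue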
| | 08:46 | Now I'm going to do a modified
version of the barplot, where I'm adding this
| | 08:52 | bottom line here that says, col,
that's for color, and I'm saying use the
| | 08:57 | vector fbba, and I'll highlight the
whole thing, and I'm going to run it.
| | 09:01 | And you'll see that in my chart on
the bottom right, the top one, which is
| | 09:05 | Facebook, turned blue.
| | 09:06 | Now, it doesn't say Facebook,
because it's small.
| | 09:09 | If I click on Zoom, then you
can see that it's Facebook.
| | 09:13 | There are some other issues with this chart.
| | 09:14 | Number one, I'd like to turn
off the borders around the bars.
| | 09:17 | Also, I need titles;
| | 09:19 | I like to have a subtitle.
| | 09:21 | The scale on the bottom goes from 0 to
80, but the bars go farther than that,
| | 09:25 | so I'd like to change it,
so it goes up to 100.
| | 09:28 | I happen to know that the
maximum value is just under 100.
| | 09:31 | And that's why I'm adding several
other arguments to this function.
| | 09:36 | So, this is the same barplot function, and
I'm making a chart of the site frequency.
| | 09:40 | I'm going to order it by site frequency,
and this one says make it horizontal.
| | 09:46 | This one says use the Facebook color vector.
| | 09:50 | border = NA; that means no borders at all.
xlim; that's the limits for the x-axis.
| | 09:56 | This one needs to be its own little
vector, and so I have c, for concatenate, and I
| | 10:01 | say it goes from 0 to 100.
| | 10:04 | And then I have one that says
main, and that means the main title.
| | 10:08 | That one is kind of long.
| | 10:10 | I didn't want to break it across.
| | 10:12 | So, let me scroll through here.
| | 10:13 | And what I'm saying is
Preferred Social Networking Site,
| | 10:15 | and then the \n is a way of
inserting a line break in the middle of it.
| | 10:20 | So, there will be a second line to this
one that says, A Survey of 202 Users.
| | 10:25 | Then xlab at the bottom means the
label for x that's going to appear
| | 10:28 | underneath the scale.
| | 10:30 | So, when I highlight all of those lines,
and run them, you see now the borders
| | 10:36 | have gone away, the scale has extended
to a 100, I have a title on the top, and
| | 10:40 | I have a scale label on the bottom.
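The full call with every argument he lists; the title is as read out, and the x label wording is my assumption:

    barplot(site.freq[order(site.freq)],
            horiz  = TRUE,
            col    = fbba,
            border = NA,          # no borders around the bars
            xlim   = c(0, 100),   # extend the scale to 100
            main   = "Preferred Social Networking Site\nA Survey of 202 Users",
            xlab   = "Number of Users")   # x label text is an assumption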
| | 10:42 | If I make this bigger, you can
then see all of the site names.
| | 10:47 | If I wanted to spend some more time
on this, I would turn the labels, the
| | 10:51 | Facebook, and None,
so that they were horizontal.
| | 10:54 | I would probably move
Other and None down to the end.
| | 10:57 | There are a lot of other
things that I could do here.
| | 11:01 | That's why you want to be able to
explore the options that come with barplot;
| | 11:06 | that's why I had the
question mark, space, barplot.
| | 11:09 | And then also the parameters that
are the general graphics parameters.
| | 11:12 | They give you an immense amount of control.
| | 11:14 | You can basically make this do whatever
you want, but this is an example of some
| | 11:18 | of the modifications that are possible.
| | 11:22 | There's just one other thing I want
to show, and that's how to export these
| | 11:25 | charts, because right now it's a chart
that's just inside R. You see right here,
| | 11:29 | we've got a really easy thing.
It says Export.
| | 11:31 | This is one of the advantages
of using RStudio.
| | 11:34 | I can say, for instance, save it as a
PDF, and I can tell it how big I want it.
| | 11:39 | Let's say I want it
to be 8 inches by 6 inches.
| | 11:43 | Then I can give that file a name:
snPlot.pdf.
| | 11:50 | One of the great things about RStudio
is that it gives you options for
| | 11:53 | exporting your graphics.
| | 11:54 | So, for instance, let me
zoom in on this graphic.
| | 11:57 | We've got what we need there. I'm going
to close it, and I can export it as a PDF.
| | 12:03 | And that's something that the regular
version of R does, but also, I can save
| | 12:06 | the plot as an image, and I have a lot of
choices here, from PNG, JPEG, TIFF, and so on.
| | 12:12 | I can choose my own width and height,
which is hard to do in a regular version
of R. I can view it after I save it, and make
it big enough so you can see all the labels.
| | 12:21 | Anyhow, I'm just going
to press Cancel right now.
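The Export button is the route shown in the movie; if you would rather do the same thing from code, the base-graphics equivalent would be something like this, with the file name and size taken from the dialog:

    pdf("snPlot.pdf", width = 8, height = 6)   # open a PDF device, 8 by 6 inches
    barplot(site.freq[order(site.freq)], horiz = TRUE, col = fbba, border = NA)
    dev.off()                                  # close the device to write the file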
| | 12:23 | The idea here is that you have a lot
of control over these bar charts, and
| | 12:27 | that RStudio in particular gives
you a lot of options for exporting and
| | 12:30 | sizing your charts.
| | 12:31 | Making the chart really is one of the first things
you want to do when you're dealing with a
| | 12:35 | categorical variable: you get a feel for your data, you see
| | 12:39 | how well you meet the assumptions, you see whether it got entered correctly,
| | 12:43 | and it leads in to the later
analyses that you're going to do.
| Creating histograms for quantitative variables| 00:00 | In the last movie, I started by
saying how important it was to screen the
| | 00:04 | variables as you enter them by making
charts as a way of checking that you
| | 00:09 | entered them correctly, that you are
meeting the assumptions of the statistical
| | 00:13 | procedures that you intend to use,
| | 00:15 | and a way of giving you an idea of what's
interesting or unusual in your data set.
| | 00:19 | We looked at bar charts, which are
good for categorical variables. When you
have a quantitative variable, something
that's measured at the interval or ratio level,
| | 00:26 | like age, or time, or income,
then you want to use a different approach.
| | 00:32 | The two most common forms of
graphics you want to use in that case are
| | 00:36 | histograms, like bell curves, and box plots.
| | 00:39 | In this particular movie,
we're going to look at histograms.
| | 00:42 | Now, the nice thing about histograms is
that, unlike box plots, R has a built-in
| | 00:47 | function for this one that does not require
you to do any sort of pre-parsing of the data.
| | 00:51 | I'm going to use an example here of the
social network data that I've used before.
| | 00:56 | I'm just going to scroll down
here, and read in that data set.
| | 01:00 | You can see on the workspace I've got a
data frame, that's sn, for social network.
| | 01:05 | It's got 202 observations
with the 5 variables.
| | 01:08 | And then I just come right down here,
and I'm going to make a histogram of the
| | 01:13 | variable of age, so I'm going to
look at distribution of the age of
| | 01:16 | respondent, so I use hist, that's
the function, and within the parentheses, I
| | 01:21 | specify the data frame, that's sn, and then the
dollar sign, and then I give the variable name.
| | 01:28 | Now, I should mention, it is
possible to use something in R; a function
| | 01:32 | called attach, which means you attach
a data set, and then you can refer to it
| | 01:37 | in a short-handed way.
| | 01:38 | You can just give the variable names,
because it knows you're referring to that
| | 01:40 | particular data set.
| | 01:42 | The problem with attach is it
really sets the stage for a lot of really
| | 01:46 | unfortunate errors, where you have more
than one data set open, and that you get
| | 01:50 | confused about what's doing what.
| | 01:52 | And so, for instance, when I talked
about the Google Style Manual for R, they
| | 01:56 | just said don't use attach ever.
| | 01:58 | So, what I'm doing here is I'm explicitly
saying what the data frame is, and
| | 02:03 | what the variable is.
| | 02:04 | Anyhow, I'm going to make a histogram
of age, and all I have to do is run that
| | 02:08 | one line on line 15.
| | 02:10 | There we have the default histogram.
| | 02:13 | You see, for instance, it says histogram,
and then it gives my funny title there
| | 02:16 | on the top, and runs it again at the bottom.
| | 02:18 | And this is sort of an
outline version of what we have.
| | 02:21 | I'm going to make just a few
modifications to this; not very many.
| | 02:25 | I'm going to come down here. At one point
I tried removing the borders.
| | 02:30 | You can do that, but it
looks silly, so I left that out.
| | 02:33 | I'm going to change the color to a
beige color; actually, a very light color.
| | 02:37 | It turns out that light beiges and yellows
are good at getting people's attention
| | 02:41 | without being overwhelming.
| | 02:43 | You can specify colors in a few different ways.
| | 02:45 | This one is a named color, so I put col, for
color, and then in quotes I put the word beige.
| | 02:51 | That's referring to a specific one.
| | 02:53 | There's another way to refer to it, and
that is colors in R also have numbers
| | 02:58 | from 1 to 657, I believe,
and the beige is number 18.
| | 03:04 | The way that you would specify
it in that case is with this line.
| | 03:08 | I would put col, for color, and then
refer to colors(), the built-in set of
| | 03:12 | colors, and then in the square
brackets, I just give index number 18.
| | 03:16 | That would get the same color, but I'm
going to make it beige, and then I'm going
to put a title across the top.
| | 03:23 | That's main, for the main title,
and it's a long one, so I'm just going to
| | 03:27 | scroll to the end here for a moment.
| | 03:29 | And the backslash n
breaks it into two lines.
| | 03:31 | I'm going to go back to the beginning, and
then I'm going to have an X label at
| | 03:36 | the bottom that I'm going to put
underneath the age, where it's just going to
| | 03:40 | say age of respondents.
| | 03:41 | So, what I do now is I
highlight these lines, and I run those.
| | 03:44 | Now you'll see I have a little bit of fill,
just to make it pop out a tiny bit.
| | 03:50 | I have an interpretable title at the top.
| | 03:53 | I've got a label under the age that
makes sense, and that's really enough
| | 03:57 | for what I need to do.
| | 03:58 | That's a functional, useful histogram,
and again, like box plots, there's about a
| | 04:02 | million options that you can have
in terms of modifying a histogram in
| | 04:06 | particular, and the
graphics parameters in general.
| | 04:08 | You can explore those, but this
is sufficient for getting started.
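The two histogram calls as described; the wording of the two-line title is a guess rather than a quote from the script:

    hist(sn$Age)   # default histogram
    hist(sn$Age,
         col  = "beige",   # the same color as colors()[18]
         main = "Distribution of Ages\nRespondents to the Social Networking Survey",   # title text assumed
         xlab = "Age of Respondents")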
| | 04:12 | By the way, I just wanted to add
something about R's color palette.
| | 04:16 | If you want to, you can actually see
the palette by going to this Web address.
| | 04:20 | I'm going to copy that, and I'm going to
go to a Web browser, and we get a large chart.
| | 04:26 | This is just the beginning of it that
talks about what all the colors are.
| | 04:30 | If you click on the PDF, it's several
pages long. It gives the numbers for colors,
| | 04:35 | and then sorts them, and then gives the
individual names for each one of them.
| | 04:39 | For instance, there's the beige
that I used just a moment ago.
| | 04:42 | You can also get the hex codes, and
the RGB codes if you want for that.
| | 04:47 | I'm going to go back to R now,
and just show one other thing.
| | 04:50 | By writing colors(), that
refers to the array, with 18 in brackets as the index.
| | 04:53 | If I run that line,
what it does down here is
| | 04:56 | it says that color number 18 is beige,
and then I can also specify several
| | 05:01 | by putting them in a concatenated array.
| | 05:04 | When I do that, I run that, and it
tells me the colors of each one of those
| | 05:08 | numbers that I put in.
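The palette lookups being run here; the particular indices in the second line are just examples, not necessarily the ones used in the movie:

    colors()[18]               # returns "beige"
    colors()[c(18, 200, 566)]  # several at once, by index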
| | 05:10 | Anyhow, those are some of the options
that you can use in customizing your
| | 05:13 | histograms as a way of exploring the
quantitative data, and getting you ready
| | 05:17 | for further analyses.
| | 05:19 | In the next movie, we're going to look
at another chart that is very useful for
| | 05:22 | quantitative variables,
and that's the box plot.
| Creating box plots for quantitative variables| 00:00 | In the last movie, we looked at how you
can use histograms as a way of checking
| | 00:04 | the nature of a quantitative variable
to see whether it got entered correctly,
| | 00:09 | to see whether it meets the assumptions
of the statistical tests that you're
| | 00:13 | going to perform, and to look for
interesting or potentially informative
| | 00:16 | observations within that variable.
| | 00:18 | Another graph that I always create when I'm
looking at quantitative variables is a box plot.
| | 00:24 | A box plot is a shorthand way
of looking at the distribution.
| | 00:26 | It highlights outliers, and it gives
you an idea for what might be unusual or
| | 00:31 | exceptional in a distribution.
| | 00:33 | In this particular data set, I'm going to use
the same variables that I used in the last one.
| | 00:39 | I'm going to open up the social
network data again, and then I'm going to
| | 00:43 | come down to boxplot.
| | 00:44 | Again, the nice thing is this is a built-in
function, and it doesn't require any
| | 00:47 | preprocessing the way that we
had to do with the bar charts.
| | 00:50 | All I do is I say I want a boxplot, and
then I'm using the data frame, or the data
| | 00:55 | set sn, and the variable
Age in that one.
| | 00:59 | I'm just going to click that.
| | 01:01 | By default, it makes them
vertical, and there are no labels.
| | 01:05 | However, you can see that the median
age -- that's the thick line through the
| | 01:09 | middle of the box -- is around 30, and
we go down to below 10 years old, and up
| | 01:13 | to about 70 years old.
| | 01:15 | I'm going to make a few quick
modifications of the boxplot.
| | 01:17 | Let's scroll down here.
| | 01:20 | The first thing I'm going to do is
I'm going to put some color in it.
| | 01:24 | I'm going to use the beige again.
| | 01:25 | It's enough to make the boxplot pop off
the page, but without being overwhelming.
| | 01:29 | I'm also going to add
notches to the box plot.
| | 01:32 | That's a way of actually doing a sort of
visual inferential test for the medians of boxes.
| | 01:37 | I'm going to make it horizontal, because
I like to have it in the same scale as
| | 01:41 | the other variables that I use.
| | 01:43 | I'm going to add a title across the top
with main, and it's going to be two lines.
| | 01:47 | It has Ages of Respondents, and then
the \n splits it into a second line.
| | 01:52 | Then we get Social Networking Survey
of 202 users, and then we're going to
| | 01:57 | back up a little more.
| | 01:59 | Then I'm going to have a label on
the x-axis for Age of Respondents.
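Put together, the boxplot call being described probably looks like this, with the title lines as read out in the narration:

    boxplot(sn$Age,
            col        = "beige",
            notch      = TRUE,        # notched medians for a rough visual comparison
            horizontal = TRUE,
            main       = "Ages of Respondents\nSocial Networking Survey of 202 Users",
            xlab       = "Age of Respondents")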
| | 02:02 | When I highlight those lines, and click
run, now what I have is one that looks
| | 02:09 | much cleaner, and much easier to read.
| | 02:12 | We have Age of Respondents going across.
| | 02:14 | We have a title on the top, so we know
it actually is showing us this time.
| | 02:18 | Also, because it's stretched out the
long way, it's easier to see what's
| | 02:22 | going on in the boxplot.
| | 02:23 | The notches there require a
little bit of an explanation.
| | 02:25 | The dark black line in the middle of
the notch is the median; 50% of the scores
| | 02:30 | are above, 50% are below.
| | 02:32 | The notches indicate basically a
confidence interval based on the variation
| | 02:36 | within the distribution, and it can be
used compared to other distributions.
| | 02:40 | So, for instance, one of the options we
could have is to make a separate boxplot
| | 02:45 | of men, and another one of women, and
then we can compare the median age of men
| | 02:50 | and women, or we could make boxplots
for the ages of people who preferred
| | 02:53 | different social networking sites.
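Those comparisons can be made with boxplot's formula interface; a sketch of the two he mentions, assuming the columns are named Age, Gender, and Site:

    boxplot(Age ~ Gender, data = sn, notch = TRUE, horizontal = TRUE)   # men versus women
    boxplot(Age ~ Site,   data = sn, notch = TRUE, horizontal = TRUE)   # by preferred site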
| | 02:56 | Also, the dotted lines are sometimes
called the whiskers, and they go to the
| | 03:00 | highest and the lowest non-outlier
scores in the distribution.
| | 03:04 | If we had outliers, the whiskers
would stop, and they would be marked with
| | 03:07 | separate circles as a way of highlighting
both that they are unusual, and
| | 03:11 | potentially ignorable, depending
on the purposes of our analysis.
| | 03:16 | Anyhow, I encourage you to try the box
plots to explore the alternatives that
| | 03:20 | are part of the box plot function
itself, and that carry in from the graphical
| | 03:25 | parameters, the par function, that
are available as well, the same way you
| | 03:29 | can with the bar charts,
and with the histograms.
| Calculating frequencies| 00:00 | When you're exploring your data to
make sure you meet your assumptions, or to
| | 00:04 | find interesting exceptions,
graphics are an excellent first step.
| | 00:08 | However, most analyses also require the
precision of numbers in addition to the
| | 00:12 | heuristic value of graphics.
| | 00:15 | Just as we started with graphics for
categorical variables, we'll also start
| | 00:18 | with statistics for categorical variables.
| | 00:21 | The most common statistics in this case
are frequencies, which is what we'll do first.
| | 00:25 | I am going to use the data set that I
have been using so far; social network.
| | 00:28 | I'm going to come down here, and
| | 00:31 | because I have it saved in my default
location, which I set to be the Desktop, I
| | 00:35 | can simply run this line
to read the CSV file.
| | 00:39 | I see in the console that
that command ran fine.
| | 00:41 | In the top right in the Workspace, I
see that I've now loaded the data set sn,
| | 00:45 | for social network; it's got 202
observations in 5 variables.
| | 00:49 | The next thing is to create the
default table, and this is a frequency table.
| | 00:54 | It does it in alphabetical order,
and it looks like this when I run it.
| | 00:58 | What we have is 93 people who
indicated that Facebook was their preferred
| | 01:03 | social networking site, 3 who did
LinkedIn, 22 to MySpace, and so on.
| | 01:07 | Now, this is adequate
for getting the numbers.
| | 01:09 | On the other hand, it would be nice to
be able to modify it in a particular way.
| | 01:14 | This is going to be easiest if I
save the table as its own data frame.
| | 01:18 | That's what I'm going to do in line 15.
| | 01:20 | So, I'm going to create a new data frame
called site.freq, or frequencies of the sites,
| | 01:25 | and I'm going to use it
making the same command here.
| | 01:28 | So, I'm just going to run it again.
| | 01:30 | Now you can see that I've created
this new data set, and in fact, that shows
| | 01:34 | up in the Workspace.
| | 01:35 | It is a table which has six values in it.
| | 01:37 | Now I'm going to print the
table just by writing its name;
| | 01:40 | just site.freq will print
the table, and there it is.
| | 01:43 | It looks exactly the same
as what I had before.
| | 01:45 | Now what I'm going to do is I'm going
to start modifying it just a little bit.
| | 01:50 | The first thing is I'm going to sort it.
| | 01:52 | Sorting is kind of a funny
thing when it comes to tables.
| | 01:55 | I'm going to sort it into itself.
| | 01:57 | I'm replacing this table
with a sorted version.
| | 01:59 | In line 18, you see that I have site.freq.
| | 02:02 | That's the name of the table.
| | 02:04 | Then I have the assignment operator,
the arrow dash that's read as gets.
| | 02:08 | Then I say it gets site.freq, but
then in square brackets, I put down that
| | 02:13 | I'm going to order it, and then in
parentheses, I put down the basis for the ordering.
| | 02:18 | In this case, I'm ordering it by
the only thing in there, site.freq.
| | 02:21 | The idea here is that you
could order it by another variable.
| | 02:25 | In this case, I'm also specifying that
I want to do it in a decreasing format.
| | 02:30 | That's why the decreasing equals T for true.
| | 02:32 | I'm going to run that command, and
we see that that ran in the console.
| | 02:37 | The command is there.
| | 02:38 | Now I'm going to print the table
over again by just doing site.freq.
| | 02:43 | Now you see that it's sorted in order.
| | 02:45 | It started at Facebook again, then None.
| | 02:48 | It goes 93, to 70, to 22, to 11, and so on.
| | 02:52 | These are the counts, the
frequencies; how often each one occurs.
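The table-and-sort sequence just walked through, in script form:

    site.freq <- table(sn$Site)                                   # frequency counts, alphabetical by default
    site.freq <- site.freq[order(site.freq, decreasing = TRUE)]   # sort the table into itself
    site.freq                                                     # print the sorted counts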
| | 02:55 | On the other hand, sometimes it's
helpful to have the proportions of the
| | 02:59 | percentages, and that's a very simple
thing to do with R's built-in table function.
| | 03:04 | I'm going to use the prop.table function.
| | 03:07 | That's proportions.table.
| | 03:09 | I'm going to say what I need the
proportions of, and that's site.freq, which I
| | 03:13 | saved as a table,
so it would work on this one.
| | 03:16 | I'm just going to run that command, and
| | 03:17 | now you see that I have the same labels --
Facebook, None, MySpace -- in order, and
| | 03:23 | I have proportions under them.
| | 03:24 | Proportions go from 0 to 1,
where 0 is 0%, and 1 is 100%.
| | 03:29 | Now, the one problem with this list is
that I've got way too many decimal places.
| | 03:34 | If I want to get it down to just two
decimal places, I've got just one more
| | 03:38 | command I'm going to run here.
| | 03:39 | I'm going to take the command I just
ran in line 21, and I'm going to wrap it
| | 03:44 | with round, which tells R that I
want to round it, and then at the very end
| | 03:48 | of that, you see that I have comma, 2;
| | 03:50 | that means two decimal places.
| | 03:51 | So, I'm going to run that command.
| | 03:53 | That's basically how I want it to look.
| | 03:55 | Now what I have is proportions.
| | 03:58 | So, it says that 46% of the respondents
indicated that Facebook was their preferred
| | 04:03 | social networking site.
| | 04:05 | In this particular data set,
1% chose LinkedIn or Twitter.
| | 04:09 | Depending on your purposes, you may
want to report the proportions, or you may
| | 04:13 | want to report the counts,
or frequencies up here.
| | 04:16 | Usually, actually,
you would want to do both.
| | 04:19 | The nice thing is that the table
command in R makes it simple to do both
| | 04:22 | of those.
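And the proportion steps from the end of this movie, for reference:

    prop.table(site.freq)             # proportions, from 0 to 1
    round(prop.table(site.freq), 2)   # the same thing, rounded to two decimal places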
| Calculating descriptives| 00:00 | In the previous movie, we
looked at descriptive statistics for
| | 00:03 | categorical variables.
| | 00:05 | In this one, we'll look at some
common, and not so common statistics for
| | 00:09 | quantitative variables, using both R's
built-in functions, and some specialized
| | 00:13 | functions from code packages.
| | 00:16 | To do this, I'm going to use the
same data set: social_network.csv.
| | 00:20 | I'm going to come down here, and run
line 12 to load it into a data frame called
| | 00:25 | sn, for social network.
| | 00:27 | We see in the console that that
command ran, and on the Workspace on the
| | 00:31 | right, that that's loaded.
| | 00:33 | The first thing I'm going to do is
simply get the default summary for the
| | 00:37 | variable age, which is
the age of the respondents.
| | 00:41 | I'm just going to run line
13 here, which says summary.
| | 00:45 | Then I'm specifying the data frame sn, and
then the dollar sign is for the variable age.
| | 00:52 | By default, what I get is the minimum value.
| | 00:55 | So, apparently somebody who said they
were six years old responded to this
| | 00:59 | online questionnaire.
| | 01:00 | Then I have the first quartile value,
which is the lowest 25%, then the median,
| | 01:06 | which is 28 years old, then the mean,
which is 31.66, the third quartile, and
| | 01:12 | then the maximum, which is 70, and 12
people did not respond to the question,
| | 01:17 | so we have NA's for not available.
| | 01:20 | An even quicker way to do this is to
get the summary statistics for the entire
| | 01:24 | data frame at once; the entire table,
including the categorical variables.
| | 01:28 | To do this, all I have to do is run a summary,
and then give, in parentheses, the data frame.
| | 01:33 | Don't even specify a variable.
| | 01:35 | So, I'm going to run line
14 right now to do that.
| | 01:38 | I'm going to scroll;
| | 01:40 | I'll make this bigger by
clicking on that right there.
| | 01:43 | What you see is we have five variables.
ID, now, ID is just a sequential one that
| | 01:48 | goes from 1 to 202, so we
can actually ignore that one.
| | 01:51 | For gender, you see that I have
one person who did not respond.
| | 01:54 | I have 98 who said they were
female, and 103 who said they were male;
| | 01:58 | nearly evenly split.
| | 02:00 | Then I have my age statistics.
| | 02:01 | Those are the same as the ones I have
right above. Then the number of people who
| | 02:05 | chose each of the Web sites for
their preferred social networking site.
| | 02:10 | Then the number of times that
they say they logged in per week.
| | 02:12 | This one is an interesting variable,
by the way.
| | 02:14 | We have a lot of people who said
they logged in 0 times, and the 25th
| | 02:19 | percentile score, the first quartile, is 1 time
per week, but take a look at the maximum score.
| | 02:24 | There was one respondent who said that
he logged in 704 times per week, which is
| | 02:29 | physically possible; we did the math.
| | 02:31 | It's once every 10 minutes for every
waking hour during the week, and then 31
| | 02:35 | people did not respond.
| | 02:37 | So, this is actually a beautiful summary
thing, because it does all the variables,
| | 02:41 | both quantitative and
categorical, in the data set all at once.
| | 02:45 | Now I'm going to shrink this one back
down, and I'm going to do just a couple
| | 02:50 | of other variations.
| | 02:51 | One is something that we saw pretty much here.
| | 02:54 | There's something called
Tukey's five number summary.
| | 02:58 | We basically have it right here.
| | 02:59 | If we come down to the bottom here, you
see that we have the minimum, the first
| | 03:03 | quartile, the median, the mean,
the third quartile, and the max.
| | 03:06 | If you remove the mean, then you
actually have the five number summary, but this
| | 03:10 | is a really condensed version of it.
| | 03:13 | So, I'm going to do it just for the age,
and run line 20 here, and there we have it.
| | 03:19 | I'm going to make the console
bigger here, so you can see age.
| | 03:23 | We have 6, 21, 28, it
rounds off to 41, and then to 70.
| | 03:29 | Now, the pros and cons of the
five number summary;
| | 03:31 | the pro is that it's very compact.
| | 03:33 | It also is nice that it rounds off, and
that we don't have the decimal places.
| | 03:37 | The problem is, of course,
that it's not labeled at all.
| | 03:40 | You have to know what these things
are; that they're the minimum, the
| | 03:44 | first quartile, and so on.
| | 03:45 | I just want to make you aware, though,
that this is an option, and it's something
| | 03:49 | that's used -- these are the values that
are used when drawing box plots that we
| | 03:53 | did in the other chapter.
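The summary commands shown so far; the five-number line isn't named on screen, but base R's fivenum() is presumably the function behind it:

    summary(sn$Age)   # min, quartiles, mean, max, plus the count of NA's
    summary(sn)       # the same summary for every variable in the data frame
    fivenum(sn$Age)   # Tukey's five-number summary, unlabeled: 6 21 28 41 70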
| | 03:54 | Now I'm going to use some alternative
descriptive statistics; a really big
| | 03:57 | set of statistics that includes the mean, the
standard deviation, the median, the 10% trimmed mean;
| | 04:05 | an unusual one: the median absolute
deviation from the median, the minimum,
| | 04:09 | maximum, range, skewness,
kurtosis, and the standard error.
| | 04:12 | I can get all of these at
once by using the package psych.
| | 04:17 | This is an external package, and so we need to
download it, but all I have to do is run line 31.
| | 04:25 | We wait a moment, and now
it says that it's installed.
| | 04:28 | In fact, if I click over here on packages,
scroll down a little bit, you'll see
| | 04:32 | that psych is now there.
| | 04:34 | It's not checked off,
because I haven't loaded it yet.
| | 04:37 | I have installed it,
but I haven't loaded it.
| | 04:39 | If I run line 32, which says library
("psych"), that will load it, and you see
| | 04:43 | that it's now checked over there.
| | 04:45 | I could also check it manually, but I
like using the script, because it keeps a
| | 04:49 | record of everything that happens.
| | 04:50 | Now all I do is I use the function
describe, and I run it for the entire
| | 04:57 | data frame again, so it's 33.
| | 04:58 | I'm going to just run 33, and
make the console bigger here.
| | 05:04 | Now what you have is the five
variables listed on the left, so
| | 05:10 | ID, Gender, Age, Site, and Times.
| | 05:12 | Now, let me point out something
really important here;
| | 05:16 | two of these are categorical variables.
| | 05:19 | Gender and Site are categorical,
and they have asterisks next to them.
| | 05:24 | It is, however, still going through and
calculating numerical summaries for them.
| | 05:28 | What it's doing is it's taking the levels,
and it's putting them down as one, two,
| | 05:33 | three, and so on, and there are times
when, even though it's categorical, these
| | 05:38 | kinds of summaries can make sense.
| | 05:40 | So, for instance, if it's an ordinal
variable -- first, second, third -- then there
| | 05:45 | can be times when averaging
them makes sense.
| | 05:48 | If it's a dichotomous variable, like Gender,
for instance, if I were to code it as 0
| | 05:53 | and 1, even though that's a category,
if you get the mean, it tells you the
| | 05:57 | proportion of people who have ones.
| | 05:59 | Now, because I have a missing value in
this one, the missing value gets the one,
| | 06:04 | the male gets the two, and
the female gets the three.
| | 06:08 | What you do see here is it tells me that
I'm about evenly split, but anyhow, the
| | 06:12 | important thing here is I've got my
five variables listed down the side, and I
| | 06:16 | have all of these things here.
| | 06:18 | I have options for controlling the
way that the median is calculated.
| | 06:21 | I have options for adjusting the
level of trimming on the trimmed mean.
| | 06:26 | I have options for controlling the
way that the skewness and kurtosis are
| | 06:29 | calculated, but this is also a very nice
format -- this is the sort of
| | 06:34 | thing that I could copy and paste into
a paper, and just adjust the font size a
| | 06:40 | little bit; get it all on one line.
| | 06:41 | So, this is a great way of
describing a quantitative variable.
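The psych steps for this movie in one place:

    install.packages("psych")   # once per machine
    library("psych")            # once per session
    describe(sn)                # n, mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se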
| | 06:46 | Between summary, and describe, and also
the five-number summary, we've now taken
| | 06:51 | a good look at the numerical description
of each of our quantitative variables,
| | 06:57 | and that gets us ready for some of the
more detailed analyses that we're going
| | 07:01 | to do in the next movies.
4. Modifying Data
Recoding variables| 00:00 | When you've taken a thorough look at
your variables, you may find that some of
| | 00:04 | them may not be in the most
advantageous form for your analyses.
| | 00:08 | Some of them may require, for instance,
rescaling to be more interpretable.
| | 00:12 | Others may require transformations,
such as ranking, logarithms, or
| | 00:15 | dichotomization to work well
for your purposes.
| | 00:18 | In this movie, we're going to look
at a small number of ways that you can
| | 00:22 | quickly and easily recode variables
within R. For this one, we're going to be
| | 00:26 | using the data set we've used before,
social network, and I'm going to load that
| | 00:31 | by simply running a line 12 here.
| | 00:34 | And then I'm going to be using the
psych package, because it gives me some
| | 00:39 | extra options for what I want to do here.
| | 00:41 | So, I'm going to run line 15 to
install it, and then run line 16 to load it.
| | 00:45 | Now, what I'm going to do right here
is I'm going to first take a look at
| | 00:50 | the variable times; the number of times that
people say they log in to their site each week.
| | 00:55 | The easiest way to do this is
with a histogram, because it's a
| | 00:58 | quantitative variable.
| | 00:59 | I'm going to run line 19.
| | 01:02 | What we have here is an
extraordinarily skewed histogram.
| | 01:06 | You see for instance that nearly everybody
is in the bottom bar, which says they
| | 01:10 | log in somewhere between
0 and 100 times per week.
| | 01:14 | We have somebody in the 100 to 200 range,
and then we have another person we saw
| | 01:19 | before in the 700 to 800 range.
| | 01:21 | The normal reaction to this might be
simply to exclude those two people, because
| | 01:26 | they are such amazing outliers, and yet,
you can do that, but I want you to see
| | 01:30 | that there are other ways to deal with it.
| | 01:33 | The first thing I'm going to do is one
common transformation; it actually doesn't
| | 01:36 | change the distribution, it just
changes the way that we write the scores, and
| | 01:39 | that's to turn things into z-scores,
or standardized scores.
| | 01:43 | And what that does is it says how
many standard deviations above or below the
| | 01:47 | mean each score is.
| | 01:49 | Fortunately, we have a built-in
function for that, and it's called scale.
| | 01:53 | So, what I'm going to do is I'm going to
create a new variable called times.z for
| | 01:57 | z-scores of time, and I'm going to use
scale, and then sn for the social network
| | 02:03 | data frame, and then the variable Times.
| | 02:05 | So, I'm going to run line 24 here,
| | 02:07 | and you see that on the right side on
your workspace, I have a new variable
| | 02:11 | that has popped up.
| | 02:12 | It's actually a double matrix,
which is an interesting thing.
| | 02:15 | I'm going to run line 25, and get
a new z distribution; a histogram.
| | 02:20 | You see, it should look the
same as the Times distribution.
| | 02:23 | It's pretty similar, but
it's binned differently.
| | 02:26 | And so, some of the people who are in
the 0 to 100 in range, if they were in,
| | 02:30 | like for instance, the 50 to 100 range, they
got put into a different bin, but you still
| | 02:34 | see that we have these two
incredible outliers here.
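The z-score step as described; scale() centers and standardizes by default:

    times.z <- scale(sn$Times)   # z-scores: (x - mean) / sd
    hist(times.z)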
| | 02:37 | I'm going to get a
description of the distribution.
| | 02:40 | This is where I have the trimmed
mean, and the median, so on, and so forth.
| | 02:44 | One of the interesting ones here is
at the end of the first line you see
| | 02:48 | the level of skewness.
| | 02:50 | Now, a normal distribution has
a value of zero for skewness.
| | 02:54 | This distribution has a level of
over 10, which is enormous for skewness.
| | 02:59 | Even more is on the next line is
kurtosis, which you don't always talk about.
| | 03:03 | One of the things that affects kurtosis,
which has to do with sort of how peaked
| | 03:07 | or pinched the distribution is; it's
affected a lot by outliers, and so we end up
| | 03:12 | having a kurtosis, which for a normal
distribution for a bell curve is zero, and we
| | 03:16 | have this incredibly high value of 120.
| | 03:19 | Anyhow, that just gives us some idea
of what we're dealing with here, and the
| | 03:23 | ways that we can transform it.
| | 03:24 | Okay, what I'm going to do next is
sometimes when you have a distribution with
| | 03:30 | outliers on the high end, it can
be helpful to take the logarithm.
| | 03:33 | You can take the base 10
logarithm, or the natural logarithm.
| | 03:37 | I'm using the natural log here, and what
I'm going to do is I'm going to create a
| | 03:42 | new variable here called times.ln0,
and this just takes the straight natural
| | 03:46 | logarithm of the values.
| | 03:48 | Now, I'm going to do this twice, because
there's a reason why this one doesn't work.
| | 03:52 | I'm going to just show it to you.
| | 03:54 | I'm going to run line 29, and now you
see on the workspace on the right I've got
| | 03:58 | a new variable, and I'm going
to get a histogram.
| | 04:01 | The histogram is really nice.
| | 04:02 | You can tell it's almost
like a normal distribution.
| | 04:04 | It's a lot closer, but if I run the
describe, I get some very strange things.
| | 04:09 | The mean, we have sort of this negative
infinity, and we have not a number for
| | 04:13 | all sort of things, and the
descriptions don't work well.
| | 04:16 | The problem here is that if you do
the logarithm, and you have zeros in
| | 04:20 | your data set, you can't
do logarithms for zero.
| | 04:23 | And so a workaround for this that is
adequate is to take all of the scores and add 1.
| | 04:30 | That's what I'm doing right here.
| | 04:31 | Now I'm going to create a new variable
called times.log1, and what I'm going
| | 04:36 | to do is I'm going to take the value of
Times, and add 1 to it, so there's no more zeros.
| | 04:42 | The lowest value is going to be 1, the
highest is now going to be 705, and I'm
| | 04:47 | going to take the logarithms of those.
| | 04:48 | So, I'm going to hit that, and run line 33,
and then I'll take a look at the histogram.
| | 04:53 | You see the histogram is very different,
because the last one simply excluded all
| | 04:57 | the people who said they had zeros.
| | 04:58 | Now they're in there, and so you can
see that the bottom bar has bumped up.
| | 05:02 | I'm going to run describe now.
| | 05:04 | Now I actually get values, because it's not
full of infinite values or NaNs.
| | 05:08 | If you have zeros, adding 1 can make
the difference between being able to
| | 05:13 | successfully run a
logarithm transformation or not.
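The two logarithm attempts; the second adds 1 first so the zeros don't produce minus infinity, and the variable names are my best reading of the narration:

    times.ln0 <- log(sn$Times)       # breaks on the zeros: log(0) is -Inf
    times.ln1 <- log(sn$Times + 1)   # add 1 to every score, then take the natural log
    hist(times.ln1)
    describe(times.ln1)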
| | 05:18 | The next step is to actually rank the
numbers, and this forces them into a nearly
| | 05:24 | uniform distribution.
| | 05:25 | What I'm going to do here is I'm
going to use the ranking function.
| | 05:28 | I'm going to create times.rank with the rank function, and so
it's going to convert it into an ordinal
| | 05:32 | variable from first, to
second, to third, to fourth.
| | 05:35 | If I just run it in its standard form,
you see there it's created a new variable
| | 05:39 | over there; I'm going to
get the histogram of that.
| | 05:42 | Now, what's funny about this histogram
is, theoretically, if we have one rank for
| | 05:47 | each person, there should be a totally
flat distribution, and that's obviously
| | 05:50 | not what we have here.
| | 05:52 | The reason for that is
because we have tied values.
| | 05:54 | A lot of people put zero, a
lot of people put 1, and so on.
| | 05:57 | I'm going to run the describe just in case.
| | 06:00 | There are a lot of ways in R
for dealing with tied values.
| | 06:03 | In line 41, you see, for instance, the
choices are to give the average rank, to
| | 06:09 | give the first one, to give a random
value, to give the max, the min, and all of
| | 06:14 | these are used in different circumstances.
| | 06:16 | I'm going to use random for right now,
because what it does is it really flattens
| | 06:21 | out the distribution, so
I'm going to run line 42.
| | 06:23 | Now it's going to be
times.rankr, for random.
| | 06:28 | Then I said I'm going to rank it, but I'm
specifying how I'm going to deal with ties.
| | 06:33 | So, ties.method; in this
case, I'm going to use random.
| | 06:37 | I run that, and if you look over here
in the workspace, I now have that
| | 06:42 | variable down at the bottom.
| | 06:43 | I'm going to come back to the editor, and
run line 43, and now look; that's totally flat.
| | 06:50 | If I run describe, you see, for instance,
that the mean's 101.5, which is what we
| | 06:54 | would expect with this distribution, and
it's just flat all the way. Skewness is zero.
| | 06:59 | We have a negative kurtosis,
because this is actually what's called a
| | 07:02 | platykurtic distribution.
| | 07:04 | Anyhow, that's exactly what we
would expect with a totally ranked
| | 07:07 | distribution with no ties.
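The ranking steps; ties.method = "random" is the option that flattens the distribution:

    times.rank  <- rank(sn$Times)                           # ties get the average rank by default
    times.rankr <- rank(sn$Times, ties.method = "random")   # break ties at random
    hist(times.rankr)                                       # now essentially flat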
| | 07:09 | The last thing I'm going to do is I'm
going to dichotomize the distribution.
| | 07:13 | Now, a lot of the people get very
bent out of shape about dichotomization.
| | 07:17 | They say you should never do this,
because you're losing information in the
| | 07:21 | process, and that's true.
| | 07:22 | We're going from a ratio level variable
down to a nominal or ordinal level variable.
| | 07:30 | So, we are losing some information.
| | 07:32 | On the other hand, dichotomization,
when you have a very peculiar distribution,
| | 07:37 | can make it more analytically amenable.
| | 07:40 | More to the point, it's
easier to interpret the results.
| | 07:43 | I do not feel that it is never appropriate
to dichotomize; to split things into two.
| | 07:48 | I feel there's a time and a place for it.
Just use it wisely, know why you're doing
| | 07:51 | it, and explain why you did it.
| | 07:54 | Anyhow, it might feel like the natural
way to do this would be to say, for
| | 07:58 | instance, if x is less than this value,
then put them in this other group, but
| | 08:02 | that doesn't work properly.
You'll get some peculiar results.
| | 08:05 | Instead, you need to use this one line
function in R; it's called if else, and it's
| | 08:10 | written as one word.
| | 08:11 | And in line 48, what I'm going to
do is create a new variable. It says
| | 08:15 | time.gt1, because I'm going to dichotomize it
on whether they log in more than once per week.
| | 08:23 | So, GT stands for greater than one.
| | 08:25 | And then I have the assignment operator,
and then I use the function ifelse.
| | 08:30 | And then what you do is you have
in parentheses three arguments.
| | 08:33 | The first one is a test, and so I'm
going to say is times greater than one; sn
| | 08:39 | is the data frame, and the dollar sign
it says I'm going to use a variable, then
| | 08:44 | times is the name of the variable,
and if that's greater than one, then the
| | 08:48 | second argument is what to do
| | 08:49 | if that test is true; then give
them a one on the variable time.gt1.
| | 08:55 | If their score on times is not
greater than one, so if it's zero or one,
| | 09:00 | then give them a zero.
| | 09:02 | So, I'm going to run line 48, and now
you can see over here I have got a new
| | 09:07 | variable, GT1, and then I'm going to get the
description of that one by just writing its name.
| | 09:12 | And what you can see here is it's
printed out the entire variable. It's taken
| | 09:17 | all the people who said they logged in
zero or one times, and it's given them zeros.
| | 09:23 | Everybody who logged in two or more
times got a one, and the people who
| | 09:27 | didn't respond to the question in the first
place still have their NAs for not available.
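The dichotomization line itself, with its three arguments in order: the test, the value if true, the value if false:

    time.gt1 <- ifelse(sn$Times > 1, 1, 0)   # more than once a week = 1, zero or one = 0, NA stays NA
    time.gt1                                 # print it to check the recoding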
| | 09:31 | And so, that's a form of dichotomization
of a distribution that can be done in
| | 09:36 | a way that advances your purposes, and
can be done, I feel, with integrity, if
| | 09:40 | it's done thoughtfully.
| | 09:41 | These are some of the options for
manipulating the data, and getting it ready for
| | 09:46 | your analyses, and of course, there's
an extraordinary variety of what's
| | 09:51 | available, but these are some of the
most common choices, and hopefully some of
| | 09:55 | them will be useful for you.
| Computing new variables| 00:00 | In the last movie, we looked at ways
that you could use R to recode or transform
| | 00:04 | individual variables to make them
more suitable for your analyses.
| | 00:08 | In this movie, we're going to
look at ways that you can combine
| | 00:11 | multiple variables into new
composites, and how those procedures can
| | 00:15 | work for your purposes.
| | 00:16 | In this example, I'm actually going to
be creating my variables in R. What I'm
| | 00:21 | going to do down here with line 6 is
I'm going to create a new variable called
| | 00:25 | n1, which stands for normal number one,
and I'm going to use the function rnorm,
| | 00:31 | which means random normal.
| | 00:32 | So, it's going to be drawing values
from the normal distribution, the bell
curve, at random, and I'm going to get a million
values. It only takes a moment.
| | 00:41 | Now I have a million random values.
| | 00:43 | Let's get a histogram of those.
| | 00:46 | There you see it's pretty
much a perfect bell curve.
| | 00:49 | It's symmetrical, it's
unimodal; it's great.
| | 00:52 | Then I'm going to do the procedure
again and create another variable called n2.
| | 00:57 | That's also normal distribution;
a million values drawn at random.
| | 01:01 | You can see in the Workspace
I've got that one, and I'm going to get
| | 01:04 | its histogram as well.
| | 01:05 | It's essentially identical.
| | 01:07 | Again, it's a normal distribution,
| | 01:08 | it's unimodal, and it's got
the bell curve shape.
| | 01:11 | Now what I'm going to do is I'm
going to create a composite variable.
| | 01:15 | This is the point here.
| | 01:17 | I'm going to do it by simply adding
each value from these different vectors.
| | 01:22 | Now, this is the beautiful thing about
R: it's made for vectors, and so
| | 01:26 | all I have to do is say that my new
variable, which I'm calling n.add, in line
| | 01:31 | 14, gets n1 + n2, and R knows to
take the first item in n1, and add it
| | 01:39 | to the first item in n2, then go to the
second item in n1, and add it to the second item in n2.
| | 01:46 | So, I'm going to run that line 14, and you
see I have a new thing in the Workspace.
| | 01:51 | I'll get a histogram for that one.
| | 01:53 | That's also a bell curve.
| | 01:54 | The range is a little bit larger,
because I'm adding instead of just averaging.
| | 01:58 | Then I'm going to do one more thing;
instead of adding them, I'm actually
| | 02:01 | going to multiply them.
| | 02:03 | So, I'm going to call it n.mult, for normal multiplied.
| | 02:05 | And again, because we have this vector-
based math, I'll just say n1 * n2.
| | 02:11 | First item in n1 multiplied by
the first item in n2, and so on.
| | 02:15 | I'll create that one.
| | 02:17 | It shows up in the Workspace,
and I get the histogram.
| | 02:19 | It's going to look a little
different this time.
| | 02:22 | The reason for that -- you see it's
really high in the center, it drops down, and
| | 02:26 | it goes all the way
down to -10, and up to 10.
| | 02:30 | The reason for that is, when you multiply
values from two independent standard normal
| | 02:35 | distributions, you get a sharply peaked,
heavy-tailed distribution (it's often loosely
| | 02:39 | compared to a Cauchy distribution,
| | 02:40 | though strictly the Cauchy arises from the
ratio of two normals); it's a very unusual shape
| | 02:44 | with a tremendous number of outliers,
and that's what I've got here.
| | 02:46 | Now, the statistic where this distribution
is most distinctive is kurtosis, which
| | 02:50 | has to do with how peaked or pinched
the distribution is, and is affected a lot
| | 02:55 | by the presence of outliers.
| | 02:56 | In order to get kurtosis easily, I'm going
to install the package psych. It installs it.
| | 03:04 | In line 23, it loads it.
| | 03:06 | From there, I can calculate the
kurtosis for each of my four distributions.
| | 03:11 | Now, for the normal distributions,
I expect it to be close to zero.
| | 03:14 | So, kurtosis for n1 is essentially 0,
and also for n2, it's very close to 0.
| | 03:21 | I'd expect it to be close to
zero for the added one, but for the
| | 03:26 | multiplied one, I
expect it to be a larger value.
| | 03:28 | In fact, it's nearly six.
| | 03:30 | So, you can see the added one is also very
close to zero, and that the major difference
| | 03:34 | in the fourth one, where I
multiplied, is in the level of kurtosis.
| | 03:38 | Anyhow, the idea here is that I've
been able to take variables that I created
| | 03:41 | here, and then combine them in
different ways to create new variables.
| | 03:46 | So, I have these ways of manipulating the
data to get these composites, and that's
| | 03:50 | something that you do, for instance,
when you're creating an average score based
| | 03:54 | on a survey of many different questions.
| | 03:55 | R makes these vector-based
operations very, very easy.
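As a rough sketch of the steps in this movie (the object names follow the narration, and the kurtosis function in the psych package is spelled kurtosi()):

n1 <- rnorm(1000000)          # one million random draws from a standard normal
n2 <- rnorm(1000000)          # a second, independent set of draws
hist(n1)                      # roughly a perfect bell curve
n.add  <- n1 + n2             # element-wise sum; still approximately normal
n.mult <- n1 * n2             # element-wise product; sharply peaked, heavy tails
hist(n.mult)

install.packages("psych")     # provides kurtosi() and describe()
library("psych")
kurtosi(n1)                   # near 0 for a normal distribution
kurtosi(n.add)                # also near 0
kurtosi(n.mult)               # much larger (around 6), reflecting the heavy tails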
| | 03:59 | The operations used in this movie are
just two options out of an essentially
| | 04:03 | infinite variety for combining your
individual variables into new composite
| | 04:07 | variables for your analyses.
| | 04:08 | R makes it very easy to find methods
for your own work that can get your data
| | 04:13 | into exactly the shape that you need.
| | 04:16 | So, the speed, flexibility, and power
of R are especially helpful as you
| | 04:21 | manipulate data, and get
ready for your own analyses.
| | Collapse this transcript |
|
|
5. Charts for AssociationsCreating simple bar charts of group means| 00:00 | Once you've taken a look at all of your
variables individually, and you've gotten
| | 00:04 | them into the shape that you need for
your analyses, the next step is often to
| | 00:08 | start looking at associations
between variables.
| | 00:11 | A very common form of association is to
look at group membership, and how that's
| | 00:15 | associated with scores
on a quantitative outcome.
| | 00:18 | I'm going to use an example for this
to show a couple of different ways of
| | 00:22 | depicting group distributions by
using bar charts, and also by box plots.
| | 00:28 | For this one, I'm going to be using a data
set that is based on Google searches by state.
| | 00:33 | The idea here is that the Google
search data is showing how many standard
| | 00:38 | deviations above or below the national
average each state is in their relative
| | 00:43 | interest in a search term.
| | 00:45 | The first thing I'm going to do is
I'm going to load a data set called
| | 00:49 | google_correlate.csv.
| | 00:50 | I put it into a data frame called Google.
| | 00:52 | There are 51 observations, because there
are 50 states plus D.C. Next,
| | 00:56 | I'm going to just run to see what the
names of the variables are. That's line 7.
| | 01:00 | What we have is State, that's the name
of the state, then the state_code, that's
| | 01:06 | like CA for California.
| | 01:07 | Then we have their relative interest in
data visualization; so, how often do they
| | 01:13 | search for that relative
to their other searches?
| | 01:15 | Then we also have searches for Facebook,
searches for NBA, and, just for fun,
whether that state has an NBA team.
| | 01:23 | Also, the percentage of people in that
state with a college degree, whether that
| | 01:28 | state had a K-12 curriculum for
statistics, and the region of the country.
| | 01:34 | Let's take a closer look at
that with structure; that's str.
| | 01:38 | If I hit that, and make this bigger, it
gives you an idea of how many levels each factor has.
| | 01:43 | It gives you the first few data values.
| | 01:46 | So, that's a way of
seeing what we're dealing with.
| | 01:48 | I'm going to clear that
out, because it's pretty busy.
| | 01:51 | Put that back down.
| | 01:53 | One of the interesting questions might be, do
the responses to one of these vary by region?
| | 01:59 | I thought I'd look at data visualization,
and I want to see whether it varies by
| | 02:03 | regions in the United States.
| | 02:04 | So, the easiest way to do this is to
first create a new object -- a list --
| | 02:10 | where I split the data by region.
| | 02:13 | So, what I'm going to do in line 12
is create a new object, named for
| | 02:17 | searching for data visualization,
with .reg for region, and then we're going to
| | 02:23 | get the distributions.
| | 02:24 | I'm going to use the R function split, and
then I tell what it is that I'm going to split.
| | 02:29 | I'm going to use the data set Google,
and the variable data_viz; the dollar
| | 02:34 | sign joins those two, and I'm going to
split it by the variable region that's in
| | 02:38 | the Google data set.
| | 02:39 | I'm going to run line 12 now.
| | 02:42 | You see how that shows up in
the Workspace on the right.
| | 02:44 | So, I have this new list.
| | 02:47 | Then I'm going to draw boxplots by region.
| | 02:50 | I'm going to use a boxplot here, and I'm
going to go back to my new data frame or
| | 02:55 | list for interest in data visualization.
| | 02:58 | I'm also going to color it
lavender. There we have it.
| | 03:01 | What this shows us is the
distribution for each region.
| | 03:05 | So, for instance, you can see here
that the box indicates the range of
| | 03:09 | the middle 50% of states in that region;
their relative interest in data visualization.
| | 03:14 | So, we see that there's a lot of
variation in the west, because its box is
| | 03:19 | wider than the others.
| | 03:20 | There's less variation among
the middle 50% in the northeast.
| | 03:24 | That's because the box is tighter.
| | 03:26 | But we have outliers in the northeast.
| | 03:28 | We have one that's unusually
low, and one that's unusually high.
| | 03:31 | Interestingly, the state with the
highest relative interest in data
| | 03:35 | visualization is in the south, and
that's where we have a z-score of over three.
| | 03:39 | You can see the northeast is generally
higher than the others, with the exception
| | 03:43 | of that one outlier.
| | 03:44 | So, that's one way to get a feel for the
variations and distributions by groups.
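As a sketch of those two steps (the data frame and column names follow the narration and may differ slightly from the exercise files):

google  <- read.csv("google_correlate.csv", header = TRUE)
viz.reg <- split(google$data_viz, google$region)  # a list with one vector of scores per region
boxplot(viz.reg, col = "lavender")                # one box per region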
| | 03:51 | Another very common way is
to do barplots for means.
| | 03:54 | That's what I'm going to do down here.
| | 03:56 | I'm going to create another object
here where I'm going to use means.
| | 04:00 | And so line 18 says viz.reg,
| | 04:03 | so visualization, and the .reg is for
region, except this time I'm computing the means.
| | 04:08 | This makes it so I can do the bar chart.
| | 04:11 | I'm going to use the R function sapply.
| | 04:13 | Then I'm going to tell it what I'm
dealing with, and that's relying on the list
| | 04:19 | that I got on the last one.
| | 04:20 | This time I'm going to be
calculating the mean.
| | 04:22 | So, I'm going to do that in 18.
| | 04:25 | Then I'm going to run a barplot.
| | 04:26 | And so I'm telling it
barplot what it is I'm charting.
| | 04:30 | I'm going to color it beige, and I'm going
to give it a title that's rather long here.
| | 04:34 | I'll scroll to the end
for a moment. There we go.
| | 04:37 | By the way, this right here
means to break it into a new line.
| | 04:41 | The backslash is the escape
character, and n is the new line.
| | 04:44 | Then this backslash right here means I
actually want to print these quotes,
| | 04:49 | because otherwise it thinks I'm done
with the title, and then I have to do it
| | 04:53 | again at the end of data visualization.
| | 04:55 | This last one, because it's not escaped,
marks the end of the title string.
| | 04:59 | So, I'm going to go back to the beginning,
and I'm going to run that command by
| | 05:03 | itself, barplot, by highlighting
those three lines, and then pressing run.
| | 05:10 | So, now I've got a barplot.
| | 05:11 | It shows where the average
is for each of these groups.
| | 05:14 | On the other hand, there is one thing
that's missing that would be really nice,
| | 05:18 | and that is we don't have
a zero axis line.
| | 05:20 | Fortunately, I can add that
manually with this abline function.
| | 05:20 | All I've got to do is specify
the height, h, and set it to zero.
| | 05:27 | If I highlight all of that, and run it
together, now I get the means plot, and
| | 05:32 | this time, it has the reference line
at zero, which is a lot easier to read.
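A sketch of the bar chart of group means described here, reusing the viz.reg list from the split above (the exact object name for the means and the title text are assumptions):

viz.reg.m <- sapply(viz.reg, mean)   # mean data_viz score for each region
barplot(viz.reg.m,
        col  = "beige",
        main = "Mean Google Search Interest in \"Data Visualization\"\nby Region")
abline(h = 0)                        # add the zero reference line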
| | 05:38 | Finally, it would be nice to have the
actual numbers that go with each of these things.
| | 05:42 | What I'm going to do to facilitate this is
I'm going to use the psych package again.
| | 05:48 | The first one installs it,
and this one loads it for use.
| | 05:51 | Then I'm going to do describeBy.
| | 05:53 | It says, I want to take the variable data_viz,
and I want to break it down by region.
| | 06:00 | This is based on describe.
| | 06:01 | It just does it separately for each category.
| | 06:03 | I'm going to make
this one down here bigger.
| | 06:05 | As you can see, for each region, I
know that there are 12 states in the
| | 06:10 | midwest, 9 in the northeast,
17 in the south, 13 in the west,
| | 06:14 | and this gives me
the mean for each of these.
| | 06:16 | So, for instance, you see that the
midwest, the mean score is -0.32.
| | 06:20 | That's what we see over here.
| | 06:23 | This bar comes down to -0.32.
| | 06:25 | In the northeast, the mean is 0.45;
| | 06:28 | it's positive, and we come up here.
| | 06:30 | Again, these are z-scores indicating
relative interest and searching on Google
| | 06:35 | for data visualization compared to
all of the other searches in that area.
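A sketch of that call, assuming the same column names as above:

library("psych")
describeBy(google$data_viz, google$region)   # n, mean, sd, and so on for each region separately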
| | 06:39 | Anyhow, these box plots and these
means plots are one way of looking at how a
| | 06:44 | quantitative variable differs from one
group to another, and it can often be an
| | 06:49 | important step in an analysis.
| | Collapse this transcript |
| Creating scatterplots| 00:00 | When you're looking at associations in
your data, if you want to look at how two
| | 00:04 | quantitative variables are associated
with each other, the most common approach
| | 00:08 | is to create a scatterplot.
| | 00:10 | R gives you some interesting options on
how to create scatterplots, and look at
| | 00:14 | what you have in terms of
associations in your data.
| | 00:17 | For this one, I'm going to be using the Google
correlate data that I used in the last movie.
| | 00:21 | I'm going to load it by running line 6.
| | 00:23 | I'll create a data frame
called Google by reading the csv,
| | 00:27 | google_correlate.csv, that has a header.
| | 00:30 | There I have 51 observations.
| | 00:32 | There's one line for each state, and D.C.
We're going to look at the names of
| | 00:37 | the variables that are in that data set.
| | 00:39 | We can look at the structure too if we want,
just to get an idea of what things look like.
| | 00:44 | I'm going to make this
bigger for just a moment.
| | 00:46 | Okay, that's pretty busy.
| | 00:47 | I'm going to just clear it
out for right now.
| | 00:50 | What I want to ask is whether there's
an association between the percentage of
| | 00:53 | people in the state with college degrees,
and interest in data visualization as a
| | 00:58 | search term on Google.
| | 00:59 | What I'm going to do is
create a scatterplot.
| | 01:02 | The default plot works well.
| | 01:04 | All I say is plot; that means scatterplot,
and I give my variables for X and Y.
| | 01:09 | I'm going to put degree on the X, and
so I say, use degree from the data set
| | 01:13 | Google, and then I'm going
to put data_viz on Y.
| | 01:17 | So, I run line 13, and there's my plot.
| | 01:20 | You can see that there's a strong
positive association. The higher the number
| | 01:25 | of people with college degrees, the greater the
interest in data visualization as a search topic.
| | 01:30 | That's actually a really clear trend.
| | 01:33 | On the other hand, I'm going to
clean up this chart a little bit.
| | 01:36 | I'm going to put a title on the top.
| | 01:38 | This is lines 15 through 20.
| | 01:40 | I'm going to do the plot again, except
this time I'm going to put a title on the top;
| | 01:45 | that's main, and then I'm going to put
a label on the X axis, xlab, Population
| | 01:50 | with College Degrees.
| | 01:51 | Label on the Y axis;
Searches for Data Visualization.
| | 01:55 | Pch here is for representing the points,
and I'm going to be using choice number
| | 01:59 | 20, which is a small solid dot.
| | 02:01 | I'm going to color it in gray.
| | 02:03 | So, I'm going to highlight those
six lines together, and run those.
| | 02:09 | Now we have this scatterplot with
light gray dots, which you can still see the
| | 02:13 | pattern, but there's less
sort of fluff to it.
| | 02:16 | We have the title on the top, and
we have the labels for each axis.
| | 02:19 | Now I'm going to do one more thing.
| | 02:21 | When you're looking at an association
in the scatterplot, even though we have a
| | 02:25 | strong positive pattern here, it's
really nice to have regression lines.
| | 02:30 | I can add a regression line with abline.
| | 02:32 | I'm going to use a linear model,
that's what this is, and it's going to be
| | 02:38 | based on the association, where I'm
trying to predict data_viz, and then the
| | 02:42 | tilde means predicting it from the number
of degrees, and I'm going to color that line red.
| | 02:48 | So, I'm just going to run line 23, and
this is going to layer on top of the plot
| | 02:52 | that I have already.
| | 02:53 | So, you can see that there's a strong
positive association if we draw a
| | 02:58 | straight line through it.
| | 02:59 | On the other hand, not every association
is linear, and sometimes it's helpful
| | 03:03 | to use a line that
matches the shape of the data.
| | 03:07 | One of those options is called a lowess
smoother, and that's what I'm going to do in line 25.
| | 03:13 | I'm going to add a line, and it's going
to be Lowess, and I'm going to be using
| | 03:19 | it for a degree in data_viz.
| | 03:21 | Please note that the order of the
two variables is different here.
| | 03:24 | The top one for the regression line, I
had to put the Y first, and then the X.
| | 03:28 | This one, I put the X, and then the Y.
| | 03:30 | Also, in the top one, I use the tilde
to say that the Y is predicted by the
| | 03:35 | X. This one is simply putting what
they are with a comma in between.
| | 03:38 | I'm going to make this
Lowess line blue.
| | 03:41 | So, I'm going to run line 25, and
then I'll just put it on top of that.
| | 03:45 | A lowess is sort of a moving average,
and you can see here that actually it
| | 03:49 | doesn't deviate tremendously
from the linear regression line.
| | 03:53 | What both of these do is they emphasize
the strong positive association between
| | 03:58 | the percentage of the population in the
state who have college degrees, and the
| | 04:02 | relative interest in searching
for data visualization on Google.
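A sketch of the whole sequence in this movie (the axis labels and title are paraphrased from the narration, and the column names are assumptions):

plot(google$degree, google$data_viz,
     main = "Searches for Data Visualization by College Degrees",
     xlab = "Population with College Degrees",
     ylab = "Searches for Data Visualization",
     pch  = 20,                       # small solid dot
     col  = "gray")
abline(lm(google$data_viz ~ google$degree), col = "red")    # straight-line regression fit
lines(lowess(google$degree, google$data_viz), col = "blue") # locally weighted (lowess) smoother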
| | 04:06 | These are really good ways of
looking at the association between two
| | 04:10 | quantitative variables, and will lead
into regression, which we're going to do
| | 04:14 | in a later movie.
| | Collapse this transcript |
| Creating scatterplot matrices| 00:00 | In the last movie, we looked at how
to create a scatterplot to show the
| | 00:04 | association between two
quantitative variables.
| | 00:06 | On the other hand, sometimes you have
several quantitative variables, and you
| | 00:10 | want to look at the
associations between each of them.
| | 00:13 | One option, in that case, is to create
what's called a scatterplot matrix, which
| | 00:17 | has several scatterplots
arranged in rows and columns.
| | 00:20 | I'm going to use the Google search data.
| | 00:23 | I'm going to load it in by
saying Google gets read.csv, and so on.
| | 00:28 | Let's just take a look at
the variable names in it.
| | 00:31 | There's what we've got.
| | 00:32 | State, state_code, data_viz is a search
term, Facebook is a search term, NBA is
| | 00:37 | a search term, whether the State has an
NBA team, the percentage of people with
| | 00:41 | degrees, whether they have a stats_education
curriculum in the K through 12 system,
| | 00:46 | and the region of the U.S.
| | 00:48 | Now what I'm going to do is I'm
going to take each of the quantitative
| | 00:51 | variables -- data_viz, degree, Facebook,
and NBA -- and I'm going to put them into
| | 00:57 | a scatterplot matrix.
| | 00:58 | What I'm going to do is first specify
that data_viz is the ultimate outcome variable.
| | 01:03 | I'm just going to stick it on the top left.
| | 01:05 | Then I'm going to add these
other quantitative variables.
| | 01:07 | I don't have to say Google and then dollar
sign for each of these, because I can
| | 01:11 | specify data separately.
| | 01:12 | I'm also going to be using
solid dots for the data points.
| | 01:16 | I'm going to put a title on the top
that says Simple Scatterplot Matrix.
| | 01:20 | If I highlight all four of those lines
at once, and run those, here's my matrix.
| | 01:24 | I'm going to zoom in for a moment.
| | 01:27 | So, what I have here is data_viz on the
top left. Going down on the first column,
| | 01:34 | Data_viz is going to be across the
bottom; interest in data visualization.
| | 01:38 | On the other hand, going across the top
row, interest in data visualization is
| | 01:43 | going to go up on the Y axis.
| | 01:44 | So, you can see some of
these have very strong patterns.
| | 01:47 | So, in the column on the left,
| | 01:50 | the second panel down is the association
between data visualization and the
| | 01:54 | percentage of people with
degrees that we saw before.
| | 01:56 | It's a very strong pattern.
| | 01:58 | On the other hand, things like Facebook
and data_viz show negative associations.
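A sketch of the simple matrix, using the formula interface to base R's pairs() (the lowercase variable names are an assumption about the CSV):

pairs(~ data_viz + degree + facebook + nba,
      data = google,
      pch  = 20,
      main = "Simple Scatterplot Matrix")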
| | 02:04 | That's a nice way to get a look at a
whole bunch of things at once, but I want
| | 02:08 | to show a modified version of this
that provides even more information.
| | 02:13 | To do this,
I have to use the psych package.
| | 02:16 | I'm going to download and
install it by running line 16.
| | 02:22 | Then I'll load it,
so I can use it with line 17.
| | 02:26 | In line 18, I'm going to use what's
called the pairs.panels, which is a function
| | 02:31 | within psych, and I can tell it the data
set that I'm going to use is Google.
| | 02:36 | Then I'm specifying which variables I
want by the order that they appear in
| | 02:41 | the Google data set.
| | 02:42 | Data_viz is the third one, Degree is the seventh,
Facebook is the fourth, and NBA is the fifth.
| | 02:48 | That's why I'm specifying those with c,
the combine (concatenate) function.
| | 02:53 | Also, I'm making it so there are
no gaps between the panels here.
| | 02:57 | You see, for instance, in the Simple
scatterplot matrix on the right, we've
| | 03:00 | got the thick bars in between them that
unfortunately become visually pretty prominent.
| | 03:04 | I'm going to get rid of
those by putting gap = 0.
| | 03:07 | This makes an unusual matrix.
| | 03:09 | So, we'll run that, and take a look.
| | 03:12 | Then I'm going to zoom in on this one.
| | 03:15 | What we have here are several things.
| | 03:17 | First off, we have a histogram for
each of the four quantitative variables.
| | 03:22 | On top of it, we have overlaid, what
is called a kernel density estimator.
| | 03:26 | It's like a normal distribution,
but you see it can have bumps in it.
| | 03:30 | You'll see that on degree.
| | 03:32 | At the very bottom of that, it's
really tiny here, but we have sort of a
| | 03:37 | dot plot that shows where the actual
scores are for each one with these
| | 03:41 | tiny vertical lines.
| | 03:42 | Then what we have are
the scatterplots.
| | 03:45 | These are on the bottom left
side of the matrix.
| | 03:49 | We have the scatterplot with the dot
for the means of the two variables.
| | 03:53 | We have a lowess smoother
coming through;
| | 03:54 | that's the curved red line.
| | 03:57 | Then the ellipse is sort of
a confidence interval for the
| | 04:00 | correlation coefficient.
| | 04:02 | The longer and narrower the
ellipse, the stronger the association;
| | 04:07 | the rounder it is, the weaker the association.
| | 04:11 | The numbers that are on the top side
are mirror images of these, and those are
| | 04:15 | correlation coefficients
for each one of them.
| | 04:17 | So, for instance, we can see that the
correlation between data_viz, and degree is
| | 04:21 | positive, and it's 0.75.
| | 04:23 | Correlations go from zero to positive or negative one.
| | 04:26 | Zero is no linear relationship, and
one, in absolute value, is a perfect linear relationship.
| | 04:30 | They are positive if there is an uphill
relationship, and negative if it's downhill.
| | 04:34 | That's a very strong association.
| | 04:36 | On the other hand, you can see that
interest in data_viz, and interest in NBA as
| | 04:41 | a search term -- that's the scatterplot
| | 04:43 | that's in the very bottom left --
| | 04:45 | it's kind of circular and
scattered all over the place.
| | 04:48 | If you look at the very top right of this
matrix, you see the correlation is 0.23.
| | 04:51 | It's not very strong.
| | 04:53 | Anyhow, this is a really rich kind of
matrix that shows histograms, it shows
| | 04:59 | dot plots, it shows kernel density
estimators, it shows scatterplots with
| | 05:03 | lowess smoothers, and its correlations,
and it's one of the great reasons for
| | 05:08 | using the psych package.
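A sketch of that richer matrix; the column positions c(3, 7, 4, 5) follow the narration and are an assumption about this particular CSV:

install.packages("psych")
library("psych")
pairs.panels(google[c(3, 7, 4, 5)],   # data_viz, degree, facebook, nba
             gap = 0)                 # remove the gaps between panels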
| | 05:09 | Anyhow, this is a variation on the
scatterplot matrix, which lets you look at
| | 05:14 | graphically the association of several
quantitative variables at once, and get a
| | 05:18 | really good feel for the interrelationships
within your own data set.
| | Collapse this transcript |
| Creating 3D scatterplots| 00:00 | In the last movie, we looked at how you
could use scatterplot matrices to show
| | 00:04 | the associations between several
quantitative variables simultaneously by
| | 00:09 | creating a 2D matrix of scatterplots.
| | 00:12 | In this movie, I want to look at an
interesting variation where you actually use
| | 00:16 | a 3D scatterplot that
rotates in space with the mouse.
| | 00:21 | To do this, I'm going to use the
data set google_correlate that I've used
| | 00:25 | for the other ones.
| | 00:26 | I'm going to load it on line 6.
| | 00:28 | Just get a list of names with line 7.
| | 00:30 | Then there's actually several
ways to do 3D scatterplots in R.
| | 00:34 | I'm going to be using the package rgl.
| | 00:40 | I've now downloaded it, and installed it.
| | 00:42 | Now I'm going to load it so I can use it.
| | 00:44 | Then what I'm going to do is
just run this one set of code.
| | 00:47 | plot3d is the function, and then you
need to list the x, y, and z variables.
| | 00:53 | So, I've got them all as data_viz from
the Google data set, degree from the
| | 00:57 | Google data set, and Facebook.
| | 00:59 | Those are relative interest
as search terms.
| | 01:02 | Then I'm also adding
labels for the x, y, and z axis.
| | 01:05 | I'm going to color the dots in the
scatterplot red, and make them three pixels.
| | 01:11 | If I highlight all of that code,
and run it --
| | 01:14 | this plot is a little different,
because it doesn't open in the bottom right
| | 01:18 | window, and instead,
it opens a new window.
| | 01:20 | I'm going to come down here and click.
| | 01:22 | I can make that larger.
| | 01:24 | What I can do now is click on the
mouse, and drive this one around to see the
| | 01:30 | association in three dimensions.
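A sketch of that call with the rgl package (the axis labels and point size are assumptions based on the narration):

install.packages("rgl")
library("rgl")
plot3d(google$data_viz, google$degree, google$facebook,  # x, y, and z variables
       xlab = "Data Viz", ylab = "Degree", zlab = "Facebook",
       col  = "red",
       size = 3)   # opens in its own window; click and drag to rotate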
| | 01:36 | Now, this is a nice heuristic thing,
although it usually only works while it's
| | 01:40 | moving, because as soon as you stop moving, the
sense of depth collapses, and it's hard to read what it is.
| | 01:45 | But it does give interesting
possibilities for looking at the associations
| | 01:49 | between three variables, so we can
try to find the strongest association.
| | 01:53 | We've got a data point
way up here in the corner.
| | 01:58 | Anyhow, while it's interesting for
exploring, it's hard to report these,
| | 02:02 | especially in a printed 2D format, but a
3D scatterplot, an interactive spinning
| | 02:06 | one, can be a potentially informative,
and certainly an engaging way of exploring
| | 02:11 | the relationship between
several quantitative variables.
| | Collapse this transcript |
|
|
6. Statistics for AssociationsCalculating correlations| 00:00 | Once you've looked at the associations
between several quantitative variables,
| | 00:04 | a natural next step is to start looking
at the numerical associations between them.
| | 00:09 | The most common way of doing this is
with correlations or Pearson product-moment
| | 00:13 | correlation coefficients.
| | 00:15 | In this movie, we're going to look
at how to calculate correlations for
| | 00:18 | individual pairs of variables, as well as to
create a matrix for an entire set of variables.
| | 00:23 | We're going to do this with
the google_correlate data.
| | 00:26 | I'm going to load that right here, and
just remind myself of the variable names.
| | 00:31 | Then what I'm going to do is I'm going
to create a new data set that has just
| | 00:36 | the quantitative variables.
| | 00:38 | There are several ways to specify these.
| | 00:41 | What I'm doing is I'm going to put g
for Google, .quant, for quantitative, and
| | 00:46 | that gets from the Google data frame;
| | 00:48 | I'm going to select four variables,
| | 00:50 | and what I'm using is
the concatenate function.
| | 00:53 | That's the c. I'm selecting the variables
by the position where they appear.
| | 00:58 | That's why I have this
names list right here.
| | 01:00 | So, data_viz is the third one, degree
is the seventh, Facebook is the fourth,
| | 01:04 | and NBA is the fifth.
| | 01:07 | I'm going to create that new set.
| | 01:09 | You can see that shows up there
in the Workspace under g.quant; 51
| | 01:13 | observations, one for each state, and
for D.C., with these four variables in
| | 01:17 | that particular order.
| | 01:18 | The next thing I'm going to do is I'm
going to get a correlation matrix for
| | 01:22 | that entire data set.
| | 01:23 | R has a built-in function, cor, for
correlate, and all I have to do is specify my
| | 01:28 | data frame here, and I hit run.
| | 01:31 | What I get is a bunch of correlations.
| | 01:33 | Remember, correlations go from zero,
which means no linear association at all,
| | 01:38 | to positive or negative one, which
indicates a perfect linear association.
| | 01:42 | Positive is an uphill relationship.
| | 01:44 | Negative is downhill.
| | 01:46 | On the diagonal, we have ones.
| | 01:48 | That's a variable correlated with itself.
| | 01:50 | You see that we have some
really strong correlations.
| | 01:52 | So, for instance, the association
between data_viz and degree is 0.745.
| | 01:59 | That's a very strong correlation.
| | 02:02 | Also, the association between data_viz
and Facebook as interest as search terms
| | 02:07 | is negative, and very strong.
That's -0.63.
| | 02:12 | So, the more interest there is in
searching for Facebook, the less interest there
| | 02:16 | is in searching for data
visualization, and vice versa.
| | 02:19 | This is a correlation matrix that is
without the probability tests associated,
| | 02:26 | and I want to show you
how to deal with those.
| | 02:29 | Now, the easiest way with a built-in
function in R is to do one correlation at a time.
| | 02:34 | So, you pick one x variable, and one y
variable, and then use the function cor.test.
| | 02:40 | That's correlation test.
| | 02:42 | What it's going to do is give the
correlation coefficient, the hypothesis
| | 02:46 | test, the p-value associated
with that, the confidence interval.
| | 02:50 | In this one, I'm specifying my variables
by saying the variable name, and then with
| | 02:54 | the dollar sign, also the
data set that it comes from.
| | 02:57 | So, I'm going to run line 17 for cor.test
right now, and look at data_viz and degree.
| | 03:03 | I get a fair amount of
printout from this one.
| | 03:05 | It tells me that it's doing the Pearson's
product-moment correlation coefficient,
| | 03:09 | because there are other choices.
| | 03:11 | It's telling me the two
variables that I'm using.
| | 03:13 | It's giving me a t-test
for the significance test.
| | 03:17 | The value of t is 7.83, with 49
degrees of freedom, which has to do with the
| | 03:22 | sample size, and the probability of
getting a correlation this big through
| | 03:26 | random chance is extremely small.
| | 03:28 | In fact, you see that the exponent goes to
-10, so there are a lot of zeroes in front of it.
| | 03:34 | The 95% confidence interval for this
correlation coefficient is from 0.59 --
| | 03:39 | that's the low end -- to 0.84.
| | 03:41 | So, it's going to be a
high correlation either way.
| | 03:43 | Then we have the actual sample
correlation there at the bottom.
| | 03:46 | It's 0.7455, which is what you
see up in the matrix above also.
| | 03:51 | That's a good way to do it if you're
willing to do one correlation at a time.
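A sketch of both calls so far (the column positions used to build g.quant are an assumption about this CSV):

g.quant <- google[c(3, 7, 4, 5)]            # just the quantitative variables
cor(g.quant)                                # correlation matrix, no significance tests
cor.test(google$data_viz, google$degree)    # one pair: r, t-test, p-value, and 95% CI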
| | 03:55 | On the other hand, if you want to do
the entire matrix at once, what you can do
| | 04:00 | is get a probability matrix
by using the package Hmisc.
| | 04:04 | I'm going to download and
install the package Hmisc.
| | 04:09 | That downloads it and installs it.
| | 04:11 | Now I'm going to load it. It's okay.
| | 04:13 | I've got this little information
here about some of the changes that it's
| | 04:16 | making, but that's fine.
| | 04:17 | I can ignore those.
| | 04:18 | Now, the only trick here is I'm
going to use the function rcorr;
| | 04:23 | correlations, but the thing is I have
to take my data set g.quant, and it has
| | 04:28 | to become a matrix.
| | 04:31 | Right now, it's a data frame.
| | 04:32 | See, a data frame can have lots of
different kinds of data in it, that each
| | 04:36 | variable can be of a different kind,
but a matrix has to be all the same kind.
| | 04:39 | So, what I'm going to do is I'm
going to coerce it into being a matrix.
| | 04:43 | That's the term here.
| | 04:44 | So, I'm using the function rcorr, and
then I put as.matrix, and then I put my
| | 04:50 | little data frame right here.
| | 04:51 | That says treat it as a matrix, or coerce it into
being a matrix, and then do the correlations.
| | 04:56 | So, I'm going to run that now; line 25.
| | 04:59 | Let's make this bigger, so
we can see what's going on.
| | 05:02 | What I have here on the top
is the correlation matrix.
| | 05:04 | It's the same as what I had above earlier.
| | 05:07 | Let me scroll up a little bit.
| | 05:11 | There's the correlation matrix.
| | 05:12 | The two differences are that the earlier
one's got a lot of decimal places,
| | 05:16 | this one has only two, so it's more manageable,
plus this one actually gives the sample size.
| | 05:20 | It says n = 51 there. But
the really important part, and the reason
| | 05:24 | I did this one, is because the second
matrix says the P there; these are the
| | 05:30 | probability values.
| | 05:32 | If you're doing an inferential test --
and what you're looking for here, for
| | 05:35 | statistical significance, is
a value that's less than 0.05.
| | 05:38 | For instance, the probability value
for the correlation between data_viz and
| | 05:43 | degree, it comes out as four zeros.
| | 05:46 | It's not totally zero.
| | 05:47 | It's just going to take
longer to show up.
| | 05:49 | The association between data_viz and
Facebook, also, lots of zeros. It's significant.
| | 05:54 | But the association between data_viz
and NBA, the correlation, if you look above,
| | 05:59 | is 0.23, and the probability value is about 0.10.
| | 06:03 | That's greater than 0.05, so it's not statistically significant,
nor is the association between degree and NBA.
| | 06:09 | That's fine, but the idea here is
that we can look at what the actual
| | 06:13 | correlations are for
several variables at once, and
| | 06:16 | using this package, Hmisc, we can also get the
probability values associated with each one.
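A sketch of that step; rcorr() wants a matrix rather than a data frame, so the data frame is coerced with as.matrix():

install.packages("Hmisc")
library("Hmisc")
rcorr(as.matrix(g.quant))   # prints the r's, the n, and a matrix of p-values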
| | 06:22 | That's a great first step in looking at
the statistical associations between the
| | 06:27 | variables in my data set.
| | Collapse this transcript |
| Computing a regression| 00:00 | When you're trying to understand the
associations in your data, it's helpful a
| | 00:04 | lot of times to single out one particular
outcome variable, and then
| | 00:09 | to see how you can predict scores on that
one using the other variables in your data set.
| | 00:13 | This is the situation where you would
want to use a multiple regression with
| | 00:17 | multiple predictors for a
single quantitative outcome variable.
| | 00:21 | Fortunately, R makes this extremely simple.
| | 00:24 | In this example, I'm going to use the
google_correlate data that I've had before.
| | 00:28 | I'm going to load it right here with line 6,
and just check the names of the variables.
| | 00:33 | I just want to mention a couple of things.
| | 00:35 | I'm going to be predicting
interest in data visualization.
| | 00:38 | This is how common that term is as a
Google search term relative to other
| | 00:43 | searches on a state by
state basis. That's my outcome.
| | 00:46 | I'm going to be using
several quantitative variables.
| | 00:49 | I'm going to be using degree; that is,
the percentage of people in the state with
| | 00:53 | the degree, and Facebook as a search
term, and NBA as a search term, but I'm
| | 00:58 | going to be throwing in a couple of
other interesting ones that normally you
| | 01:02 | would think would
require some extra prep.
| | 01:03 | Stats_ed, which is my second predictor
here, is a yes/no variable, and it's
| | 01:09 | entered as text; whether they have a
curriculum for Statistics education in the
| | 01:14 | K through 12 system or not.
| | 01:16 | Also, region; let me
scroll over a little bit here.
| | 01:19 | Region is a categorical
variable with four levels on it.
| | 01:23 | This is going to be an interesting one,
because normally, I would need to do some
| | 01:28 | sort of transformation to make this
work, but R is smart, and it takes care of
| | 01:32 | these things all by itself.
| | 01:33 | So, what I'm going to do here is I'm
going to create a multiple regression model.
| | 01:38 | On line 9, I'm going to
assign it to a variable.
| | 01:41 | I'm calling it reg1, for regression one.
| | 01:45 | Then, what I'm going to use is I have the
assignment operator, and lm is for linear model.
| | 01:51 | The first thing I specify in there is
my outcome variable; that's data_viz.
| | 01:55 | Then the tilde sign next to it means
as a function of, and then I give
| | 01:59 | all the predictors.
| | 02:00 | I have degree + stats_ed + facebook, and so
on until I get to the end; I have a comma.
| | 02:08 | Then I have a single thing that says,
all of these variables come from the
| | 02:11 | data set Google, so I don't have to put the
Google dollar sign in front of each one of them.
| | 02:16 | I'm going to select these
three lines, and run those.
| | 02:20 | That is performing the regression.
| | 02:22 | Interestingly, it doesn't
give me the results.
| | 02:24 | You can see in the Workspace
that it's run it.
| | 02:26 | I have reg1 over here; it's a linear
model, but if I want to see the results, I
| | 02:32 | need to ask for a summary
of the regression.
| | 02:34 | Remember, I saved it as reg1, and so
now I'm going to get the summary of that
| | 02:38 | just by running line 12.
| | 02:39 | I've got a lot of output here, so I'm
going to make the console bigger and
| | 02:44 | come up a little bit.
| | 02:45 | Here's what's going on.
| | 02:46 | The first one says what
the actual model is.
| | 02:49 | The function that I'm calling is lm,
linear model, and the formula that I'm
| | 02:53 | using, data_viz, is a function of all
of these other things put together, and
| | 02:57 | they all come from the Google data set.
| | 02:59 | Now, the residuals are a way of
assessing how well the model fits the data.
| | 03:04 | There are situations where
you would want to use those.
| | 03:06 | Normally, you would plot them
instead, but there they are.
| | 03:09 | It's the ones below that that are particularly
interesting; it's the coefficients.
| | 03:14 | The column on the left gives the name
of the variable, and you see that we have
| | 03:18 | degree, and then stats_edyes.
| | 03:21 | What it's done is it's taken my stats_ed
variable, which had two values, yes and
| | 03:26 | no, and it's automatically decided to
put it as yes as a one, and no as a zero.
| | 03:33 | Then we have Facebook, and NBA,
which are fine, because those are both
| | 03:36 | quantitative variables.
| | 03:38 | Then whether they have an NBA team -- I
don't expect that to be associated, but
| | 03:42 | I've included it just because I could.
| | 03:44 | Then I have three region variables.
| | 03:47 | The reason there are three when there
are actually four regions is because in
| | 03:50 | order to avoid multicollinearity,
you have to leave one of them out.
| | 03:54 | Otherwise, there's a perfect
association between the predictors.
| | 03:58 | Then we look at the column
that says Estimate.
| | 04:00 | These are the actual
regression coefficients.
| | 04:04 | Then we have standard error for them, and
then the t value as the inferential test.
| | 04:08 | The probability value on the
end is the significance test.
| | 04:11 | Fortunately, it's also putting
asterisks next to the ones that are
| | 04:15 | statistically significant.
| | 04:17 | The intercept is significant,
which means the intercept is not zero;
| | 04:20 | we don't really care about that.
| | 04:21 | What we do have however, are two
statistically significant predictors that we
| | 04:25 | can use to predict interest in data
visualization as a Google search topic.
| | 04:29 | The first one is degree.
| | 04:31 | States that have a higher proportion of
people with college degrees also show a
| | 04:37 | higher interest in
searching for data visualization.
| | 04:40 | The other one that is statistically
significant within this context is Facebook,
| | 04:44 | except this time it's negative.
| | 04:46 | States that show a higher interest
in searching for Facebook show a lower
| | 04:51 | interest in searching for
data visualization.
| | 04:54 | This particular regression model is
what's called a simultaneous entry; that
| | 04:58 | takes all these variables, and
it throws them in there all at once.
| | 05:02 | It highlights the ones that are statistically
significant within the context of
| | 05:05 | that entire collection.
| | 05:07 | What we have here is just two: more
degrees, more interest in data_viz; more
| | 05:12 | interest in facebook,
less interest in data_viz.
| | 05:14 | Then at the very bottom here, we also
have some summaries for the entire model.
| | 05:19 | We have the residual standard error.
| | 05:21 | I'm not really worried
about that right now.
| | 05:23 | The multiple R-squared is an important
one, because that tells us what proportion
| | 05:28 | of the variance in the dependent or
outcome variable, that's data_viz, can be
| | 05:32 | predicted by the
combination of these other variables.
| | 05:37 | My multiple R-squared is 0.65,
| | 05:39 | so 65% of the variance in data
visualization as a relative search term from
| | 05:45 | state to state could be
predicted from these other variables.
| | 05:49 | The adjusted R-squared has to do with
the relationship of predictors to sample
| | 05:52 | size, and because I actually do have a
small sample -- it's a state by state
| | 05:56 | thing -- it's a little smaller, but
still, it's a good prediction model.
| | 06:00 | Then I also have the F-statistic, which
can be used as an inferential test for
| | 06:03 | that R-squared, and just confirms
that it's statistically significant.
| | 06:07 | Anyhow, this is the simplest possible
version of a multiple regression that you
| | 06:12 | can do in R. It's a way of taking
several variables, both quantitative, and
| | 06:17 | dichotomous predictors, and multiple
category predictors, and throwing them in
| | 06:22 | there. R processes them appropriately,
and we're able to get a prediction of a
| | 06:26 | single quantitative outcome, and it's
a great way to start looking at the
| | 06:30 | important associations in your data.
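A sketch of the simultaneous-entry model described above; the predictor names follow the narration and are assumptions about the column names in the CSV:

reg1 <- lm(data_viz ~ degree + stats_ed + facebook + nba + has_nba + region,
           data = google)   # one outcome, several predictors of mixed types
summary(reg1)               # coefficients, significance tests, and R-squared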
| | Collapse this transcript |
| Creating crosstabs for categorical variables| 00:00 | When you're looking at the associations
in your data set, a lot of times you're
| | 00:04 | going to want to look at the associations
between two categorical variables, and
| | 00:08 | that's when you want to use a cross tabulation,
and usually a chi square test of significance.
| | 00:14 | That's the simplest possible version of it.
| | 00:17 | In this example, I'm going to be using
the social network data, though I need to
| | 00:21 | mention, I did make one
modification to it.
| | 00:23 | There was one case that did
not have information on gender.
| | 00:27 | Since I'm using gender here as a
predictor variable, I wanted to have that
| | 00:31 | missing case out,
so I deleted the one case.
| | 00:33 | So, we're going to go from 202 cases to 201.
| | 00:36 | I'm going to list the
names of the variables.
| | 00:40 | We have ID, gender, age, their
preferred social networking Web site, and the
| | 00:45 | number of times that they log in per week.
| | 00:48 | I'm going to be looking at the association
between gender and site to see, for
| | 00:52 | instance, if men and women report different
Web sites as their preferred method
| | 00:56 | for social networking.
| | 00:58 | The easiest way to do this is
by creating a contingency table.
| | 01:02 | I'm going to call it sn.tab.
| | 01:04 | That's for social network
dot tabulation or table.
| | 01:08 | I'm using the table function that's
part of R. All I need to say is what my two
| | 01:13 | variables are; two categorical variables,
and I'm using gender from sn -- the
| | 01:18 | dollar sign means it's from the
sn data set -- and I'm using Site.
| | 01:22 | So, I'm just going to run line number 11.
| | 01:25 | You see that the table shows up in
the Workspace there on the right.
| | 01:28 | Then on line 12, I just have sn.tab.
| | 01:31 | That's just going to put it out.
| | 01:33 | So, there I have the number of men and
women who report Facebook, LinkedIn,
| | 01:37 | MySpace, None, Other, and Twitter.
| | 01:40 | Looking at this, you can see
there's a couple of interesting things.
| | 01:43 | First off, identical numbers of
men and women prefer Facebook.
| | 01:47 | LinkedIn, Twitter, and Other are
so small as to be negligible here.
| | 01:51 | Again, this data set is a few years old.
| | 01:53 | You see that MySpace has a much higher
number of women reporting it as their
| | 01:58 | primary method, and then for None,
there's a lot more men who say they use None.
| | 02:04 | These work in with some expected patterns.
| | 02:06 | Now, these are just the frequencies
or the counts; the cell frequencies.
| | 02:10 | On the other hand, it can be really
nice to get marginal frequencies, which are
| | 02:14 | the totals for the rows and the
columns, and it can also be nice to get
| | 02:17 | percentages or proportions.
| | 02:19 | So, what I'm going to do is
I'm going to scroll down here.
| | 02:22 | First, just get the marginal frequencies.
| | 02:24 | I'm going to get the row frequencies, and
that's going to be just the number of men and women.
| | 02:29 | So, I have 98 women and I have 103 men.
| | 02:33 | They both have 46 in
Facebook, and they're closely balanced anyhow,
| | 02:37 | so that's essentially the same.
| | 02:39 | Now I'm going to look at the column
marginal frequencies, and that tells me the
| | 02:43 | overall number of people who
prefer each social networking site.
| | 02:47 | We've seen this before when we've done
bar charts for this variable, but now a
| | 02:52 | more interesting one is to get the
proportions of people within each cell, and
| | 02:57 | also the proportions who
report using each one of these.
| | 03:01 | To do this, I'm going to use prop.table.
| | 03:04 | That's proportions for the table, but
I'm wrapping it up in the round function to limit
| | 03:09 | the number of decimal places.
| | 03:10 | It gives a huge number of decimal places by
default, and I only want two.
| | 03:14 | What I'm doing with each one of
these is, to get the cell percentages,
| | 03:17 | I'm doing prop.table right here, and
it tells that I want to use sn.tab,
| | 03:22 | that's the table for social network
as my data set, and I'm wrapping it in
| | 03:27 | round to two decimal places.
| | 03:30 | So, I'm going to run line 20.
| | 03:32 | 23% of respondents in this data set
are women who said they like Facebook.
| | 03:37 | 1% are men who said they like Twitter.
| | 03:41 | All together, these
cell percentages add up to 100.
| | 03:44 | Now let's look at the row percentages.
| | 03:47 | Similar procedure, but now what they
do is they add up to 100 going across.
| | 03:53 | Say, for instance, we had dramatically
different numbers of men and women.
| | 03:56 | This would allow us to compare the
relative interest in each of these sites, even
| | 04:00 | with unbalanced marginal frequencies.
| | 04:03 | You can see, for instance, that MySpace,
the numbers mirror what we saw earlier.
| | 04:08 | 18% of the women like MySpace,
whereas only 4% of the men.
| | 04:12 | Then finally, line 22, let's just do a
similar thing going in the other direction.
| | 04:17 | Now these percentages add up going down.
| | 04:20 | So we see, for instance, that for
MySpace, 82% of the people who said they like
| | 04:24 | MySpace were female; 18% were male.
| | 04:27 | So, these are ways of looking at the
data in several different dimensions.
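A sketch of this whole sequence, assuming the social-network data frame is called sn with columns Gender and Site (the exact capitalization is an assumption):

sn.tab <- table(sn$Gender, sn$Site)   # cell counts: gender by preferred site
margin.table(sn.tab, 1)               # row totals: number of women and men
margin.table(sn.tab, 2)               # column totals: each site overall
round(prop.table(sn.tab), 2)          # cell proportions (all cells sum to 1)
round(prop.table(sn.tab, 1), 2)       # row proportions (each row sums to 1)
round(prop.table(sn.tab, 2), 2)       # column proportions (each column sums to 1)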
| | 04:31 | The last thing that I'm going to do is
I'm going to actually do an inferential
| | 04:35 | test to see if the distribution of
preferred networking sites differs by gender.
| | 04:42 | This is a statistical significance test,
and I'm using chi square in
| | 04:45 | this particular case.
| | 04:46 | The function for this is chisq.test,
because we're doing the inferential
| | 04:52 | test, and the data I give it is the
tabulation, or the table I'm working with, sn.tab.
| | 04:58 | I hit that one, and it's doing
the Pearson's Chi-squared test.
| | 05:02 | It tells me what data I'm using, and
then it's doing the X-squared here.
| | 05:06 | So, the value for chi squared is 13.2076,
and with 5 degrees of freedom. The
| | 05:11 | probability value, and that's the one that
I'm really interested in here, is 0.02.
| | 05:15 | That's less than 0.05,
which is the standard cutoff for
| | 05:19 | statistical significance.
| | 05:20 | So, this tells me that the variations
between men and women in their preferred
| | 05:25 | social networking sites, those are
bigger than we would expect by chance; that
| | 05:28 | they, in fact, are likely reliable
differences between men and women in what they prefer.
| | 05:33 | This shows up in terms of women are
much more likely to prefer MySpace than men
| | 05:37 | are, and men are much more likely to
report that they have no preferred site.
| | 05:41 | This warning message on the bottom
says that the chi-squared approximation may
| | 05:45 | be incorrect; that has to
do with the fact that I have a relatively small
| | 05:49 | sample, and I have some, what are
called, sparsely populated cells.
| | 05:53 | Normally, for a chi square to be reliable,
you're going to want to have an
| | 05:58 | expected frequency of at least five or ten cases
per cell; not observed frequencies, but
| | 06:02 | expected frequencies,
which is a different thing.
| | 06:04 | But mostly, I may want to exclude some
of these social networking sites from the
| | 06:08 | analysis, or combine them, so I can bump
up the expected frequencies, and better
| | 06:13 | meet the requirements of the chi square.
| | 06:15 | That being said, I still have good
evidence that suggests that there are gender
| | 06:19 | differences in preferred social
networking site by using the cross tabulated
| | 06:22 | data, and the chi squared test
for significance.
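A sketch of the test, including a quick way to check the expected frequencies behind that warning:

chisq.test(sn.tab)            # Pearson's chi-squared test of independence
chisq.test(sn.tab)$expected   # expected cell counts; small values explain the warning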
| | Collapse this transcript |
| Comparing means with the t-test| 00:00 | One common inferential test is
to compare two groups on a single
| | 00:04 | quantitative outcome.
| | 00:05 | While there are several ways to do this,
the most common is to use a T-test.
| | 00:09 | In this particular example, we're going
to show that this is a very simple thing
| | 00:13 | to do in R. I'm going to use the google
_correlate data that I've used before.
| | 00:17 | I'm going to load that.
| | 00:19 | I'm just going to bring up
the list of names.
| | 00:21 | What I'm going to do here, just for fun,
is I'm going to look at interest in NBA
| | 00:26 | as a search term, and see if that differs
between states that have NBA basketball
| | 00:32 | teams, and states that don't.
| | 00:35 | So, all I need to do is
come down here to line 10.
| | 00:38 | I'm using the function t.test.
That makes sense.
| | 00:40 | I'm saying what my outcome variable is.
| | 00:43 | That's NBA; that means as a search term, and
then the predictor is whether they have an NBA team,
| | 00:49 | so has_nba is a yes/no variable.
| | 00:50 | I'm just going to run line
10 here, and maximize this one.
| | 00:55 | You see here that it's telling me that
it's using the Welch Two Sample t-test.
| | 00:59 | That is something that allows for
unequal variances between samples.
| | 01:03 | It tells me the data that it's using.
| | 01:04 | It's using the variable NBA as the
outcome, by whether a state has an NBA team.
| | 01:10 | The value for t is -4.745, and
then the degrees of freedom is 37.105.
| | 01:17 | You get fractional degrees of
freedom when you use the Welch test.
| | 01:20 | The p-value is 3.071 with a large negative
exponent, so there are a lot of zeros in front of it.
| | 01:26 | What this tells us is that it's statistically
significant; that there is in fact
| | 01:30 | a reliable difference.
| | 01:31 | You can get that by looking
at the 95% confidence interval.
| | 01:34 | You see that the difference
between the two groups ranges somewhere
| | 01:38 | between -1.6, and -0.6.
| | 01:43 | Since this is based on something that
has an average of zero for the nation,
| | 01:47 | that's a reasonable size, and then in fact,
it gives us the means for the two groups.
| | 01:52 | The groups that do not have an NBA team,
the states that don't have one, their
| | 01:57 | average interest in NBA as a
search score is negative. It's -0.5.
| | 02:03 | That means that they as a group are
half a standard deviation below the mean
| | 02:08 | in searching for NBA.
| | 02:09 | On the other hand, the number on the
right, the 0.62, that is the z-score for
| | 02:15 | search interest in NBA for
states that do have a team.
| | 02:18 | So, they're a little more than half
a standard deviation above the mean.
| | 02:22 | Anyhow, this is a very simple procedure.
| | 02:25 | It compares two groups, those who do
or do not have NBA teams, on a single
| | 02:30 | quantitative outcome, and in this case,
we found a statistically significant
| | 02:33 | difference between the two groups.
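A sketch of that call; the formula reads "NBA search interest as a function of whether the state has an NBA team" (column names are assumptions):

t.test(google$nba ~ google$has_nba)   # Welch two-sample t-test by default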
| | Collapse this transcript |
| Comparing means with an analysis of variance (ANOVA)| 00:00 | When you're looking at associations in
your data, the final test that we want to
| | 00:03 | look at right now is comparing several
groups on a single quantitative outcome.
| | 00:09 | If you're comparing just two, you
would use a t-test, but when you have more
| | 00:12 | than two, you usually want to use an
analysis of variance, or ANOVA instead.
| | 00:16 | For this example, I'm going to use the
google_correlate data that we've used before.
| | 00:21 | I'm going to load it, and just
get a list of the variable names.
| | 00:25 | The first test that I'm going to
do is what's called a one way ANOVA.
| | 00:30 | That is where you're comparing
several groups, but on a single factor.
| | 00:34 | So, what I'm going to do here is
I'm going to look at interest in data
| | 00:37 | visualization by region.
| | 00:40 | I have four regions.
| | 00:41 | The way I set this up is, first, I'm going
to create a model here that I call anova1.
| | 00:47 | By the way, I'm using the
assignment operator <- to save this.
| | 00:50 | The function is aov, for analysis of variance.
| | 00:54 | Then I specify the outcome variable,
which is data_viz, and then the tilde; it
| | 00:59 | could be read as a function
of, or as predicted by, region.
| | 01:03 | Then I have the comma.
| | 01:05 | That says both of these
came from the data set Google.
| | 01:07 | That way I don't have to put Google
dollar sign in front of each one of these.
| | 01:11 | I run the model by
simply hitting run on line 10.
| | 01:14 | You can see that it showed up
there in the in the Workspace.
| | 01:16 | I have this model anova1.
| | 01:18 | Then I'm going to get a
summary of this model by running 11.
| | 01:22 | What I have here is I have the degrees
of freedom for the model, based on region.
| | 01:28 | There are four regions, so there are three
degrees of freedom, and I have the residuals.
| | 01:32 | Then I have what's called the sum of squares,
and the mean squares, and I have the F value.
| | 01:37 | The F, which is 1.059, is the inferential test.
| | 01:40 | The last one, Pr(>F) is the probability value.
| | 01:45 | If that value is less than 0.05, then I
usually have a statistically significant
| | 01:50 | difference between my groups.
| | 01:52 | Now, this one is 0.376.
| | 01:54 | That's much higher than 0.05.
| | 01:56 | What this tells me is, while there is a
difference between the means of these
| | 01:59 | four regions, there's about a 38% chance
of getting a difference that big just
| | 02:04 | through random error, and so that's
considered just random fluctuation.
| | 02:07 | This tells us that even though there are
differences between the means, it's not
| | 02:11 | considered statistically
significant or reliable.
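A sketch of the one-way model:

anova1 <- aov(data_viz ~ region, data = google)   # one quantitative outcome, one factor
summary(anova1)                                   # F test for differences among the regions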
| | 02:14 | Now, that's a one way analysis of
variance, where I'm using a single
| | 02:17 | classification variable.
| | 02:18 | What's really common, for instance, in
experimental research is to do a two way
| | 02:22 | classification, or a factorial design.
| | 02:25 | Now, there's two different ways to specify
this, and I'm going to show you both of them.
| | 02:29 | They give the exact same results.
| | 02:30 | The first one, I'm going to save as
the model that I'm calling anova2a, and
| | 02:35 | I use the same function, aov, for
analysis of variance, and I specify my
| | 02:40 | outcome variable, data_viz, and then
the tilde to say that it's a function
| | 02:44 | of, or predicted by.
| | 02:45 | Then I'm going to use region again,
and I'm going to throw into it whether a
| | 02:50 | state has a stats education
curriculum in the K through 12 system.
| | 02:55 | Then I'm also going to have the
interaction between those two.
| | 02:58 | So, the region colon stats_ed is a
way of specifying the interaction.
| | 03:03 | That's an important thing when you
do a factorial analysis of variance.
| | 03:06 | Then the last line says, and all these
variables come from the data set Google.
| | 03:10 | So, I'm going to run that model by
highlighting those three lines and pressing Run.
| | 03:16 | You can see that it showed
up in the Workspace there.
| | 03:18 | I'm going to get the summary
by running line 18.
| | 03:21 | What we have here is
several lines. One is for region.
| | 03:26 | It says, is there a
difference by region all by itself?
| | 03:29 | The second is for stats_ed.
| | 03:31 | Is there a difference by
stats_ed all by itself?
| | 03:34 | The third one is the interaction of
region and statistics ed, and it says,
| | 03:39 | does the average score for region depend
on whether they have stats education or not?
| | 03:44 | Actually, what you see here is we have
the degrees of freedom, then the sum of
| | 03:48 | squares, the mean squares,
and then the F value.
| | 03:50 | The F value is the inferential test.
| | 03:52 | In the last column, the Pr is
the p-value; the probability value.
| | 03:56 | If those are less than 0.05,
then it's statistically significant.
| | 03:59 | You can see that none of them are.
| | 04:01 | So, really, this tells us that these
two predictors, region, and the presence or
| | 04:06 | absence of a stats education
curriculum, and their interaction are not
| | 04:09 | significantly associated with
interest in data visualization on Google.
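As a sketch, the fully spelled-out factorial model described above would be approximately:

    # Two-way (factorial) ANOVA with both main effects and their interaction
    anova2a <- aov(data_viz ~ region + stats_ed + region:stats_ed,
                   data = google)
    summary(anova2a)   # one row each for region, stats_ed, and region:stats_ed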
| | 04:14 | I'm just going to show you the exact
same test in a different way, because there
| | 04:17 | are two ways to specify it.
| | 04:19 | This one, I think, is a little easier.
| | 04:21 | This one, the model is anova2b, because
it's my second ANOVA, but I'm setting it
| | 04:26 | up in the second way, so that's the B.
I use aov, for analysis of variance,
| | 04:31 | data_viz is my outcome, the tilde
for predicted by, or as a function of.
| | 04:35 | This time, instead of spelling out
all three, I just say region*stats_ed.
| | 04:40 | So, this by that, and both
come from the data set google.
| | 04:44 | I highlight those three lines, and run them.
| | 04:46 | That shows up in the Workspace on the
right, and then I'm going to get the
| | 04:49 | summary for that one.
| | 04:50 | I'm going to make this
console a bit bigger right now.
| | 04:54 | You can see that I have the exact same
results between the two different ones.
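The shorthand version described here is roughly:

    # Equivalent model: region*stats_ed expands to
    # region + stats_ed + region:stats_ed
    anova2b <- aov(data_viz ~ region * stats_ed, data = google)
    summary(anova2b)   # same table as anova2a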
| | 04:58 | I just find that the second version
of the analysis of variance is, for me, easier to
| | 05:02 | set up, although the earlier one is
more explicit, because you spell out the
| | 05:06 | main effects and the interaction.
| | 05:09 | Anyhow, in this particular
case, these effects were not
| | 05:12 | statistically significant.
| | 05:13 | Analysis of variance can be a really
good way of looking at group differences on
| | 05:17 | a quantitative variable.
| | 05:19 | In experimental research, it's
often the analysis that is of the
| | 05:22 | greatest interest.
| | Collapse this transcript |
|
|
ConclusionNext steps| 00:00 | Thanks for joining me on Up and Running
with R. Before we go, I want to give you
| | 00:05 | a few tips on directions you can take to
better understand R, and how you can use
| | 00:10 | it in your data analysis.
| | 00:11 | Now, I've actually saved an R script
that has this information in it, just as
| | 00:16 | text and comments, but
let me give you some ideas.
| | 00:19 | First off, there are additional courses
on the Lynda.com Online Training Library
| | 00:24 | that would be worth investigating.
| | 00:25 | One, for instance, is Interactive Data
Visualization with Processing; a language
| | 00:30 | that is command line, like R, but
developed specifically for creating graphics.
| | 00:36 | Another one is SPSS
Statistics Essential Training.
| | 00:39 | SPSS is another very
common statistical package.
| | 00:43 | That course, which goes into greater
depth on a lot of the statistical
| | 00:46 | procedures, can give you information
about what the procedures would look like
| | 00:49 | even when they're conducted in R.
Similarly, Lynda.com has a collection of
| | 00:55 | courses on the use of databases,
such as SQL, MySQL, or MongoDB.
| | 01:01 | In addition, there are a
number of books that can be useful.
| | 01:04 | One is R in a Nutshell:
| | 01:06 | A Desktop Quick Reference (2e)
by Joseph Adler.
| | 01:10 | Also, the R Cookbook by Paul Teetor is a
great reference for practical examples
| | 01:16 | of working with data.
| | 01:18 | In a similar vein, the R Graphics
Cookbook by Winston Chang gives detailed
| | 01:23 | information on producing graphs, and
modifying them, with the tremendous
| | 01:27 | flexibility offered in
the R programming language.
| | 01:30 | In fact, there's a very long list of
books available at the R project Web site.
| | 01:34 | Just see the URL that's in
the script for this movie.
| | 01:38 | Also, there are a couple of books
available that are specific to RStudio, which
| | 01:43 | we've been using in this course.
| | 01:44 | One is Getting Started with RStudio by
John Verzani, and the other is Learning
| | 01:50 | RStudio for R Statistical Computing by
Mark P.J. van der Loo and Edwin de Jonge.
| | 01:56 | There are also a number of Web sites
that provide very active and comprehensive
| | 02:01 | support for R. The most significant of
these is the R project Web site itself,
| | 02:05 | r-project.org, which is a tremendous
resource, and a gateway for other sources.
| | 02:11 | They also publish the R Journal.
| | 02:13 | That's an open access refereed
journal of the R project for statistical
| | 02:17 | computing, and that's
available at journal.r-project.org.
| | 02:21 | In addition, there are hundreds of Web sites.
| | 02:24 | One of the nice things is the Web site
r-bloggers.com, which is a compilation of Web sites.
| | 02:30 | That is, it's news and tutorials
about R contributed by over 400 bloggers.
| | 02:34 | It's a very active Web site
with 200 to 300 posts per month.
| | 02:38 | There's also a specialized search site.
| | 02:41 | It's rseek.org by Sasha Goodman, and
that allows you to specifically search
| | 02:46 | information relevant to R. Also,
StackOverflow has discussions on R. Just search
| | 02:53 | for the questions that are tagged with R.
At Wikibooks, there's an R Programming
| | 02:58 | Wikibook available also.
| | 03:00 | You can see the URL available in the script.
| | 03:02 | In terms of software, you
might also want to look at Rcpp.
| | 03:07 | Those of you who are comfortable with C++
can use that package, written by
| | 03:14 | Dirk Eddelbuettel and Romain Francois,
which integrates C++ with R to give vastly improved
| | 03:19 | speed for large calculations.
| | 03:22 | There's a series of tutorials by
Hadley Wickham available for this through
| | 03:26 | github that you can find
on the URL in this script.
| | 03:30 | Finally, there are also support
groups and events available for people who
| | 03:34 | use R. The most significant is the
useR!, that's with a capital R and an exclamation
| | 03:39 | point, which is an international conference
that takes place in June or July of each year.
| | 03:44 | Many large cities have local R user
groups, and you can see a complete list of
these at Revolution Analytics at
the URL provided in this script.
| | 03:53 | No matter how you decide to pursue it,
or what purposes you use R for, I
| | 03:57 | think you'll find that there is
tremendous potential, flexibility, and the
| | 04:01 | opportunity to adapt R to whatever
you need, and I think you'll be
| | 04:07 | extraordinarily pleased with what you
can accomplish with R. Happy computing!
| | Collapse this transcript |
|
|