Introduction
| Welcome| 00:04 | Hi, I am Bart Poulson, and I
would like to welcome you to SPSS
| | 00:07 | Statistics Essential Training.
| | 00:09 | SPSS is a statistics and
data analysis program from IBM.
| | 00:13 | It's very popular, it's very powerful,
and it's a great way to work with your
| | 00:17 | data for new insights.
| | 00:19 | In this course, I'll demonstrate how
to use charts, such as histograms, bar
| | 00:24 | charts, scatter plots, and box plots
to get the big picture of your data.
| | 00:27 | I will show you how to use
inferential statistics, like T-Tests, analysis of
| | 00:32 | variance, and chi-square to help you
determine the reliability of your results
| | 00:37 | and how they can
generalize to a broader population.
| | 00:40 | I will also show you how to enter and read
data in SPSS and how to check and clean data.
| | 00:47 | If you're new to SPSS, I think you are
going to be amazed with what you can do.
| | 00:50 | If you're an experienced SPSS user,
there will be many new tools and methods
| | 00:55 | that can help you gain even more
insight from your data, and with that in
| | 00:59 | mind, if you're ready to get going,
let's get started with SPSS Statistics
| | 01:04 | Essential Training.
| Using the exercise files| 00:00 | If you are a Premium member of the
lynda.com Online Training library, or if
| | 00:05 | you're watching this tutorial on a DVD,
you will have access to the exercise
| | 00:08 | files used throughout this title.
| | 00:11 | The exercise files are contained in
a folder, and there's one SPSS project
| | 00:15 | folder for each movie.
| | 00:16 | Inside the SPSS project folder,
you'll find a data file and any other files
| | 00:21 | needed to follow along with the movie.
| | 00:23 | In some cases, there are additional
assets, like data files, syntax files, or
| | 00:28 | exported images and HTML files.
| | 00:30 | If you are a Monthly subscriber or
an Annual subscriber of Lynda.com, you
| | 00:34 | won't have access to the exercise
files, but you can follow along from scratch
| | 00:38 | with your own assets.
| Using a different version of the software| 00:00 | Before we get going, let me mention
something about versioning in SPSS.
| | 00:05 | SPSS has been around for over 40
years and has been revised frequently.
| | 00:10 | It has even changed its name a few
times, from Statistical Package for the
| | 00:14 | Social Sciences--hence the initials
SPSS--to just the letters SPSS, to
| | 00:19 | Predictive Analytic Software, or PASW,
and then since it was purchased by IBM a few
| | 00:24 | years ago, it has been
known as IBM SPSS Statistics.
| | 00:28 | Now, the movies for this course were
created with Version 19 of SPSS, and new
| | 00:34 | versions roll along about once per year now.
| | 00:37 | However, only one of the movies in this
course relies on any features that are
| | 00:41 | brand new to this version of SPSS--
that's the movie on automatic linear modeling
| | 00:46 | by the way--and even then, I show how to
do the same things using commands that
| | 00:50 | have been in essentially
every version of SPSS ever made.
| | 00:54 | Everything else in this course relies
on procedures that have been in SPSS for
| | 00:58 | at least several years and several versions.
| | 01:01 | So while this course was created using
the current version of SPSS, it applies
| | 01:06 | almost universally to previous
versions of SPSS and no doubt to future
| | 01:10 | versions of SPSS as well.
1. Getting Started
| Taking a first look at the interface| 00:00 | At first glance, SPSS resembles a
spreadsheet. There are rows and columns of data where
| | 00:06 | each column represents a variable, such
as a customer ID number, a question on a
| | 00:10 | survey, or a city's population, and each row
typically represents a case, which could
| | 00:16 | be a person, a company, an
advertising campaign, or whatever.
| | 00:19 | However, there's a lot more to SPSS than that.
| | 00:23 | First off, SPSS has more than one window.
| | 00:26 | It has two, or possibly three windows.
| | 00:28 | The window we are looking at right now
is the Data Editor window, or Data window,
| | 00:33 | and I have a sample data
set called searches.sav open.
| | 00:37 | This is a data set that contains
information about Google searches for specific
| | 00:41 | terms, such as SPSS or regression, for
each of the 50 states and Washington DC, and
| | 00:47 | I will be using this data set
frequently as a sample during this course.
| | 00:51 | If you look at the tabs on the bottom
left, this is what's called the Data view.
| | 00:56 | The Data view is the one
that looks like a spreadsheet.
| | 00:58 | However, there is also
one called a Variable view.
| | 01:00 | If you click on that, you'll see
that it has information about the variables.
| | 01:06 | The first column is the variable names.
| | 01:09 | Variables in SPSS have to have
single-word names. They can be up to
| | 01:13 | 64 characters, they can have underscores
or dots, and they can be upper- or lowercase.
| | 01:19 | Otherwise they need to be relatively short,
and again they do need to be a single word.
| | 01:24 | The next column is the type of the variable.
| | 01:26 | A string variable for instance is a
text variable, and the state codes like
| | 01:30 | CA or NY are entered as text.
| | 01:32 | Everything else in here is entered as
numbers and they're numeric variables, even
| | 01:36 | though several of them have words laid over
on top of them. I will show you in a moment.
| | 01:40 | The third one is the width of the variable and
the fourth one is the number of decimal places.
| | 01:45 | The next one is what's called the Label.
| | 01:47 | This means although the variables may
have short names, like state_code, the
| | 01:52 | label can be something
that's a little easier to read, like
| | 01:54 | State_code with capitalization. Or if
you go further down to row 18, you see
| | 01:59 | there is one called degree.
| | 02:00 | That's the name of the variable,
but the label is much longer.
| | 02:03 | It is percent of population
with bachelors degree or higher.
| | 02:06 | So label can be much more descriptive,
and since the label is what's going to
| | 02:10 | show up in a chart or in a table, you
want to make that long enough that it's
| | 02:13 | easy to tell what it is.
| | 02:15 | The next column is Values, and I
said that most of these variables are
| | 02:19 | entered as numbers.
| | 02:20 | Now some of them just are numbers.
| | 02:22 | The Google search information is numbers.
| | 02:25 | They tell you how high a particular
search term rates, relatively speaking,
| | 02:30 | compared to all others for a particular state.
| | 02:32 | On the other hand, other variables such
as 15, 16, and 17, which are has NFL, has NBA,
| | 02:38 | and has MLS for Major League Soccer,
| | 02:41 | those are Yes/No variables. Those are
called indicator variables and I enter
them as 0 for no and 1 for yes.
| | 02:47 | So the numbers are what's in the
dataset, but you can see that I tell SPSS in
| | 02:51 | values, if I come over it and click on
that, that 0 equals No and 1 equals Yes, and
| | 02:56 | you can add them and change them
and remove them in this dialog box.
| | 03:00 | The next column is whether you want to
specify explicitly any particular value
| | 03:04 | to indicate missing information.
| | 03:07 | Say for instance a person forgets
to answer a question. You may want to
| | 03:09 | indicate that's an accidental
omission. Perhaps you can give that a 999 to
| | 03:14 | indicate that it's accidental. Or if you
didn't ask a question because it wasn't
| | 03:18 | relevant, you could give a different code
like 888, or whatever you want. Just make
| | 03:22 | sure it doesn't overlap
with the valid information.
| | 03:25 | The next column is simply how wide the
column is in the data set, and I make them
| | 03:29 | 11 spaces by default.
| | 03:31 | Let's scroll over a little bit here.
Then there is alignment within the
| | 03:35 | column: Left, Center, or Right.
| | 03:37 | The last two are specific statistical things.
| | 03:39 | This is what's called the Level of
Measurement, and in SPSS a variable can be
| | 03:44 | nominal, which means it simply
indicates a different group. A string
| | 03:49 | variable where you write words is
nominal, but a 0/1 indicator variable is also
| | 03:54 | nominal, and so is the region of
the United States, which has 4--
| | 03:57 | 1, 2, 3, 4--regions; that can be nominal.
| | 04:00 | A variable can also be ordinal. You can
indicate, for instance, the client with the
| | 04:04 | largest account, then the
second largest, and the third largest.
| | 04:09 | The other choice in SPSS is what's
called a Scale Variable, and you see there
| | 04:13 | is a little ruler next to it.
| | 04:15 | These are variables that are
measured as more or less in set units
| | 04:18 | so you can actually calculate
statistics like an average for them, whereas you
| | 04:22 | can't with a nominal variable.
| | 04:25 | The very last column is called the Role,
and this is a relatively new feature in
| | 04:29 | SPSS. And you specify, for instance,
whether a particular variable is to be used
| | 04:35 | as an input variable, that is, you're
using it to predict values on other things.
| | 04:39 | These are sometimes called
independent variables or predictor variables.
| | 04:43 | A variable can also be a target
variable; that is, it's something
| | 04:47 | that you're trying to explain, like
for instance spending on particular
| | 04:51 | products. Or a variable can be both,
sometimes an input, sometimes a target and
| | 04:56 | you see them marked as both.
| | 04:58 | Finally, a variable can also be marked as none.
| | 05:01 | That means it's not an
input or a target variable;
| | 05:03 | it's simply there, like the state
code, as an identifier or indicator.
| | 05:08 | And so those are the options
in the Variable View window.
| | 05:10 | Let me go back to the Data view now.
| | 05:13 | The next thing to note is you can
actually have a lot of variables in SPSS.
| | 05:17 | It's limited only by its
ability to address the variables.
| | 05:21 | It can address over two billion variables
in two billion cases, which you are unlikely
| | 05:26 | to hit in most situations.
But this is the Data window.
| | 05:30 | Now, what makes this different, also, aside
from the metadata and the Variable
| | 05:34 | window, is that when you run a command
in SPSS, unlike a spreadsheet, it doesn't
| | 05:38 | show up on the same page.
| | 05:39 | For instance, I am going to quickly
make a chart. I'm going to make what's called
| | 05:43 | a histogram for "interest
in SPSS" as a search term.
| | 05:46 | I go up to Graphs, and I click on
something called the Chart Builder, which I will
| | 05:51 | demonstrate more fully in a later movie.
| | 05:53 | I am going to pick a histogram and
drag it up into what's called the Canvas, take the
SPSS variable, and put it down here.
| | 06:02 | Now what's interesting is I have a
lot of options about how I set this up--and we'll save that--
| | 06:05 | but I want to show you two things.
One is I can click OK and go straight from
| | 06:13 | that dialog box, not to the Data
window but to an Output window, and in the
| | 06:19 | Output Window I set it up so that it
gives me the written code that can produce
| | 06:23 | this chart over again.
| | 06:24 | That's the information about the
commands, and there's the chart.
| | 06:28 | But you see this is a separate window.
We had a Data window; now we have an Output window.
| | 06:32 | I am going to go back to the command
for just a moment and show you an
| | 06:36 | optional third window.
| | 06:39 | Right next to the OK button there is
something called Paste, and if I click that,
| | 06:43 | it opens up a window called a Syntax
window, and this is just command-line code.
| | 06:48 | By pasting it, it has taken the written
commands for this particular chart and
| | 06:52 | it's put them in a Syntax window and I
can use it to either modify the commands
| | 06:58 | or I can use it to recreate
the command at a later time.
| | 07:00 | It's a great way of
sharing information with people.
| | 07:04 | So watch, I can simply highlight all
of this and I can come up and press the
| | 07:07 | big green Run button, the Play button.
| | 07:10 | If I hit that, you will see that
it's done it all over again.
| | 07:14 | It's a great way of replicating analyses.
| | 07:16 | For instance, you can set up an
analysis when you have only part of the data, or
| | 07:21 | you can run it
periodically as new data comes in.
| | 07:23 | It's a wonderful feature.
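To give a feel for what that pasted syntax looks like, here is the approximate shape of the GGRAPH/GPL block the Chart Builder generates for a histogram. It is a sketch, not verbatim output--the exact block depends on your SPSS version and settings, and spss is assumed here as the name of the search-interest variable:

  * Approximate Chart Builder output for a histogram of the spss variable (not verbatim).
  GGRAPH
    /GRAPHDATASET NAME="graphdataset" VARIABLES=spss MISSING=LISTWISE
    /GRAPHSPEC SOURCE=INLINE.
  BEGIN GPL
    SOURCE: s = userSource(id("graphdataset"))
    DATA: spss = col(source(s), name("spss"))
    GUIDE: axis(dim(1), label("Interest in SPSS"))
    GUIDE: axis(dim(2), label("Frequency"))
    ELEMENT: interval(position(summary.count(bin.rect(spss))))
  END GPL.

Selecting a block like this in the Syntax window and pressing Run reproduces the chart, which is exactly what makes pasted syntax useful for repeating an analysis on new data.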
| | 07:25 | Now let me show you a couple
of other features here in SPSS.
| | 07:29 | One for instance, is under the
File menu and the Edit and the View.
| | 07:33 | These are common things.
| | 07:34 | The Data menu allows you to do a
number of procedures to modify the data--
| | 07:38 | I'll show this in a few movies--and so
does the Transform menu, to create new variables.
| | 07:43 | You can insert headings and titles
in your output. Analyze is the actual
| | 07:48 | statistical procedures menu,
we will go through that.
| | 07:51 | Now Direct Marketing here is a separate add-in.
| | 07:54 | SPSS has a lot of add-ins that you can
purchase separately to give increased
| | 07:58 | functionality to SPSS, but I
won't be demonstrating those.
| | 08:01 | The techniques that I am going to be
using in this particular course all involve
| | 08:05 | the base procedures that are available in SPSS.
| | 08:08 | The next command is to make graphs, and I
have a whole series of movies about those.
| | 08:13 | Utilities can be a way of getting more
information about the variables or about
| | 08:18 | creating scripts and production jobs,
which are more advanced procedures
| | 08:22 | which we won't be covering in this course.
| | 08:24 | Add-ons gets into some of the other
services that you can purchase that connect
| | 08:29 | with SPSS, such as SPSS Modeler which is
for data mining and SPSS Text Analytics
| | 08:35 | for analyzing open-ended natural
language, like customer comments on a webpage or
| | 08:40 | Twitter feeds--it's a great way to go.
| | 08:42 | And then finally, the Help menu here
gives you a huge amount of information.
| | 08:47 | Let me open up, for example, the
Tutorials, and this opens up in a web browser,
| | 08:52 | although it's a locally stored file.
| | 08:54 | And what you see here is an entire
collection of presentations that SPSS will
| | 08:59 | run through to teach you how to do any
of a number of procedures, and they are
| | 09:03 | very useful for learning how
to use SPSS in even more depth.
| | 09:08 | Back in SPSS, there is also what's
called the Command Syntax Reference.
| | 09:12 | This is a 2500-page searchable PDF
file about the command-line syntax
| | 09:18 | programming that you may be able
to use at a later point for more
| | 09:21 | advanced work.
| | 09:23 | Now there are just a couple more things
I want to show you in SPSS about how
| | 09:27 | to set up the program.
| | 09:28 | If I come back to Edit and go down to
Options, there are a number of things you
| | 09:33 | can do to customize the way SPSS works for you.
| | 09:36 | There's a few in particular I want to point out.
| | 09:38 | One is in this tab called Viewer.
Down at the bottom, on the left, there's a
| | 09:42 | checkbox for Display commands in the
log, and that's the thing that makes it
| | 09:46 | so that SPSS inserts the written
code that produces each analysis, or each
| | 09:51 | display, as you go through.
| | 09:53 | I find it a very helpful thing to do, in
addition to pasting the syntax into a
| | 09:57 | syntax window to be saved separately.
| | 10:00 | The other one that I think is important
is under Output Labels, the second one
| | 10:03 | from the right on the bottom.
| | 10:05 | Output Labels lets you show things as
either the labels that you give them--you
| | 10:09 | may recall for instance we had the
variable called Degree, which had a much
| | 10:13 | longer label about percentage of
population with a bachelors degree or higher.
| | 10:18 | You can either have that long label
show up in the output, in the tables and
| | 10:23 | in the figures or you could have the
short name, which is just degree, or you
| | 10:28 | can have both of them.
| | 10:30 | Similarly, with the Value Labels, like
for instance, I had whether a state had
an NFL team, I had 0 as No, 1 as Yes.
| | 10:39 | Labels means you can have the yes's and
the no's show, but you can also do it
as 0s and 1s, and you can
also do it as 0, No, 1, Yes.
| | 10:49 | And I use one or the other
depending on the situation.
| | 10:52 | It can be a good way to keep track of things.
| | 10:54 | It can also be a way of making things more
presentation-ready to use just the labels.
| | 10:59 | And so, those are the options, and I
encourage you to search through some of
| | 11:02 | those a little bit more to see what else is there.
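These Options settings also have syntax equivalents on the SET command; treat the keywords below as a sketch to check against the Command Syntax Reference for your version:

  * Echo the command syntax into the output log.
  SET PRINTBACK=ON.
  * Show names and labels for variables, and values and labels for codes.
  SET TVARS=BOTH TNUMBERS=BOTH.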
| | 11:05 | So in the organization of SPSS,
there is a superficial similarity to a
| | 11:09 | spreadsheet, but you can see that it
has been developed with an eye towards
| | 11:13 | making statistical graphing and
analysis much faster and more organized.
| | 11:18 | Also, with the option to Paste command
syntax into its own window and save it as
| | 11:23 | part of the output with each procedure,
that makes it much easier to keep track
| | 11:27 | of what you do to share with
others and to repeat analyses.
| | 11:31 | Finally, SPSS's extensive help collection
can make it easy for you to get
| | 11:35 | directions and walkthroughs on
nearly every procedure that SPSS does.
| | 11:39 | In the next video, we will talk about
one other setup process, and that is
| | 11:44 | getting data from an
external spreadsheet into SPSS.
| Reading data from a spreadsheet| 00:00 | While it's possible to enter data
directly into SPSS or download it in the
| | 00:04 | SPSS .sav format, data sets will
often come to you in other formats, such as
| | 00:11 | database files, text files, or
frequently as spreadsheets, and there are
| | 00:15 | actually advantages to this.
| | 00:17 | Files in these other programs, such as
spreadsheets, are usually easier to create
and share than SPSS files are.
| | 00:24 | Also, SPSS is well set up to
import data from each of these formats.
| | 00:29 | In this movie, I will show you how to
work with spreadsheets in Microsoft's .xls
| | 00:34 | and .xlsx format from Excel.
| | 00:37 | At the end of the movie, I will point
you to SPSS's excellent instructions and
| | 00:41 | tutorials on importing data
from other sources as well.
| | 00:44 | I'm going to begin by using a data set
that I downloaded from Yahoo Financial
| | 00:49 | about the 2,800 stocks in the NASDAQ
index. This is called NASDAQ.xls. And what we
| | 00:56 | have here is the Symbol, the Name for
each stock, as well as the LastSale Price
| | 01:01 | before I downloaded,
| | 01:03 | the company's Total Market
Capitalization, the Year of its initial public
| | 01:07 | offering, its Sector, and its Industry.
And if we scroll to the right, you can
| | 01:11 | also see a web link for a summary quote.
| | 01:15 | Now to import this into SPSS,
there are few things I need to do.
| | 01:19 | Number one is I am going to get rid of
some information that I just don't want.
| | 01:22 | The information about the summary
quotes here, I don't need that, so I am just
| | 01:26 | going to come up here and I
am going to delete that column.
| | 01:29 | That makes things a little bit simpler.
| | 01:31 | The second thing is I can't have
variables that mix numbers and letters in them
| | 01:37 | or SPSS treats them entirely as
String variables or Word variables.
| | 01:42 | The most egregious example here is the IPOYear.
| | 01:45 | You see it says 1999 at the top, and
then we have several N/As for Not Available,
| | 01:49 | and what I need to do is I need to get
rid of those N/As so SPSS will treat it
| | 01:55 | strictly as a numerical variable.
| | 01:57 | The easiest way to do that is to sort
the column. I just click on a cell in
| | 02:01 | there and come up to Sort, and I see we
go from 1970 and I can just scroll down.
| | 02:06 | There we go. I see I can select all
of the N/As. I start there and come down
| | 02:15 | to row 2821, I Shift+Click, and
then I can just hit Clear Contents.
| | 02:21 | Now I also need to check the other
two dollar values, the LastSale and the
| | 02:25 | MarketCap, just to double-check.
| | 02:27 | I am going to click on
LastSale and I will sort that. See, it goes down
| | 02:33 | to 1 cent. What's up at the top?
| | 02:36 | Okay, I have a few N/As in there too,
and if I left those in there, those three
values could turn the 2,818 others
into a String variable,
| | 02:45 | so I don't want that. I'll press Clear Contents,
and then I have a few here under MarketCap.
| | 02:49 | I will clear those.
| | 02:51 | I am going to sort MarketCap
separately, just to double-check.
| | 02:58 | And look, we have one more right there.
| | 03:02 | Once we have done that, I
believe we are ready to import this.
| | 03:05 | It's okay that I have N/As in Sector
because that's a text variable anyhow.
| | 03:09 | I am just going to come back over to
the first column, Symbol, column A, and sort
| | 03:16 | that by the Symbol again from top to bottom.
| | 03:20 | So we start at the Australia Acquisition Corp.
| | 03:24 | I am going to save this data set, and
then I need to close it because SPSS can't
| | 03:29 | open it if it's open in Excel.
| | 03:31 | So I am going to close the data set,
minimize this, and here I am in SPSS now.
| | 03:36 | If I just come over to File, to Open,
to Data Set, and I simply navigate to the
| | 03:43 | folder where I have this spreadsheet,
| | 03:45 | now I need to tell SPSS that I
am looking for a spreadsheet,
because right now it's
only looking for .sav files.
| | 03:50 | I come down to spreadsheets, and now it shows
up, and I can just double-click on it to open it.
| | 03:56 | It gives me a suggested range of the
data. If there's more than one worksheet in
| | 04:02 | the spreadsheet, it automatically
suggests the first one; but if you have others,
| | 04:06 | you can navigate to them in this way.
| | 04:08 | But I am going to use data--that's the
Name of the worksheet--cells A1 to G2821.
| | 04:15 | I will just press OK, and there we go.
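For reference, the same Excel import can be driven from syntax with GET DATA. A minimal sketch, assuming the worksheet name and cell range shown in the dialog; the file path is a hypothetical example you would replace with your own:

  * Read the NASDAQ spreadsheet; the path below is a hypothetical example.
  GET DATA
    /TYPE=XLS
    /FILE='C:\Exercise Files\NASDAQ.xls'
    /SHEET=NAME 'data'
    /CELLRANGE=RANGE 'A1:G2821'
    /READNAMES=ON.

READNAMES=ON takes the variable names from the first row of the worksheet, which is what the dialog did here.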
| | 04:19 | You see, for instance, that the
variable names are listed across the top in
| | 04:22 | the blue row and we have the
Symbol, the Name, the LastSale, and the
| | 04:28 | MarketCap, the IPO.
| | 04:30 | Now in IPO I cleared out the N/As,
and those were blank cells in Excel.
| | 04:35 | Here they have dots.
| | 04:36 | A dot is what goes into a
blank numeric cell in SPSS.
| | 04:40 | So actually, that still
indicates that those are missing.
| | 04:42 | I am going to scroll over to the right
for minute and see what else we have. We have
| | 04:47 | Industry. I am going to make that
little skinnier by just dragging it over.
| | 04:51 | I am going to come back, and I
will take the Name, and I will make that
| | 04:54 | skinnier so I can see more of the data.
| | 04:57 | I do need to fix a couple of things.
The LastSale and the MarketCap are both
| | 05:01 | dollar values, and I need to turn them
into dollar values and change the decimal
| | 05:05 | places for both of them.
| | 05:07 | So what I am going to do is I can
either click on the Variable View tab at the
| | 05:10 | bottom left or I can simply
double-click on the name of the variable.
| | 05:13 | I will do that. And I can go to
Type, and tell it it's a Dollar value.
| | 05:19 | And I will click this one down to the
bottom, just two decimal places, and that
| | 05:26 | should do. The LastSale, the highest
value is in the thousands, but I do need
| | 05:32 | to have two decimal places
because they do use the cents.
| | 05:34 | On the other hand,
MarketCap has huge numbers.
| | 05:38 | It goes up to hundreds of
billions, and I don't need decimal places.
| | 05:43 | I am going to tell that one
that it's a Dollar value as well.
| | 05:45 | I will give it room for a lot of
numbers, but no decimal places.
| | 05:51 | I'm going to click OK, and now I can go back to the
Data view and see what we got--and that
| | 05:56 | looks like the correct format. And now I
can simply save this data file as NASDAQ,
| | 06:07 | and we are good to go.
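The formatting and save steps can be captured in syntax too, which is handy if you ever re-import an updated spreadsheet. A sketch, with the dollar widths below as illustrative assumptions rather than required values:

  * Show LastSale with cents and MarketCap with no decimal places (widths are illustrative).
  FORMATS LastSale (DOLLAR10.2) MarketCap (DOLLAR15.0).
  * Save in SPSS's own format; the path is a hypothetical example.
  SAVE OUTFILE='C:\Exercise Files\NASDAQ.sav'.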
| | 06:09 | Now I want to show you that SPSS is able to
import straight from databases or text files.
| | 06:15 | In fact, if you go over to File, you
will see here we have a command for opening
| | 06:19 | from a database or reading text data.
| | 06:22 | Now I am not going to go through those.
| | 06:25 | Instead, right now, I am just going to
point you over to the Help menu, to Tutorial.
| | 06:31 | When you click on that, this will open
up a web browser, even though it's a local
| | 06:34 | file, and in the Tutorial, I want you to see
this one: Reading Data. And in fact, if we
| | 06:41 | open that up, you can see
Reading from a Database.
| | 06:45 | And SPSS has a tutorial that will
walk you through every step that you need,
| | 06:50 | using a similar procedure to get the
data from a database and into SPSS.
| | 06:56 | And so you see, with the proper
preparation, it's a straightforward procedure to
| | 07:01 | get data from one source--a spreadsheet,
a text file, a database--into SPSS, so
| | 07:07 | you can begin exploring your data and
seeing what your numbers can tell you.
2. Charts for One Variable
| Creating bar charts for categorical variables| 00:00 | Once your data is in SPSS, one of the
best ways to understand it is with charts,
| | 00:04 | and the most basic kind of chart is a bar chart.
| | 00:07 | This simply indicates how many
people or cases fall into each
| | 00:11 | particular category.
| | 00:13 | One of the great developments in SPSS
a few versions ago was something called
| | 00:16 | the Chart Builder, which is a
unified interface for nearly every kind of
| | 00:20 | chart that SPSS can make.
| | 00:22 | Now I'm going to show you how to use the
Chart Builder to create a simple bar chart
| | 00:26 | to show frequencies, or how
common particular categories are.
| | 00:31 | I'm using a data set right
here, this is called Movies.sav.
| | 00:35 | This is a data set that I and my
research colleagues put together that included
| | 00:39 | the top grossing movies from each of
several years, as well as movies that won
| | 00:43 | awards in several different
categories, from the Academy Awards.
| | 00:47 | What I'm going to do right here is I'm
simply going to find out how many movies
| | 00:50 | in this are in each different genre.
| | 00:53 | Now this is a text variable, and we're going
to make a bar chart to show the categories.
| | 00:58 | I simply come up to Graphs, to Chart
Builder, and then by default right here
| | 01:03 | it offers bar charts,
| | 01:05 | that's the first one, and I just
want the simplest kind possible.
| | 01:08 | As a general rule, data graphics
are designed to communicate, and they need
| | 01:12 | to communicate clearly, and you want
to use this simplest possible kind of
| | 01:16 | chart that you can make, and
a bar chart is a great one.
| | 01:19 | And all I'm going to do is
I'm going to come over to Genre.
| | 01:22 | Please note it's got the three
little circles that indicate it's a nominal
| | 01:25 | variable, and the A says that it's a
text variable, as opposed to the year it was
| | 01:29 | released, which is also being treated as
a categorical variable but it's got a
| | 01:32 | number underneath it.
| | 01:33 | So I'm going to just take this out of
the variable list and I'm going to drag it
| | 01:37 | into the canvas to right here under X axis.
| | 01:41 | One of the nice things is that the
canvas automatically changes the Y axis on
| | 01:45 | the side to read Count because that's the most
common thing I would want do with a bar chart.
| | 01:49 | Now I have lots of options here.
| | 01:52 | One thing I can do, for instance, is I can
just simply use the gallery to get lots
| | 01:56 | of different kinds of charts.
| | 01:57 | I'm using the basic one.
| | 01:58 | Now if you can't find what you're
looking for in the gallery, you can actually
| | 02:02 | create a chart out of basic elements.
| | 02:04 | It's a lot of work and we're
not going to cover that one.
| | 02:06 | There may be situations you want to
be able to stick an identifier on a
| | 02:10 | particular data point, and you can do
that here. Or you can add titles and notes.
| | 02:15 | So for instance, I'm going to put
Title 1, "Frequency of Movie Genres in the
| | 02:24 | Dataset." Easy enough, and I press Apply.
| | 02:27 | I can make other categories and other
titles as well, but I'm not going to worry
| | 02:31 | about those right now.
| | 02:32 | All I'm going to do now
is come over and press OK,
| | 02:35 | and when I do that, I get a large
amount of output here that is the written
record of the procedure that I just performed.
| | 02:45 | I get this that says GGraph--
| | 02:47 | that's the kind of graph we're
making--the Source, the data set, and then
| | 02:51 | here's the graph itself.
| | 02:53 | And this shows, for instance, that in this
data set, which is based on top grossing
| | 02:56 | movies and award winners, dramas are
more common than anything else, and that
| | 03:00 | thrillers are the least common,
| | 03:03 | mostly because a lot of these
are drawn from award winners, and thrillers
| | 03:06 | win those less frequently than others.
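If all you need is a quick count-per-category chart, the long-standing FREQUENCIES procedure can also produce a bar chart without going through the Chart Builder; a minimal sketch, assuming the variable is simply named genre:

  * Frequency table plus a simple bar chart of counts (assumes the variable is named genre).
  FREQUENCIES VARIABLES=genre
    /BARCHART FREQ
    /ORDER=ANALYSIS.

The result is plainer than a Chart Builder graph, but it pairs the bars with an exact count table, which is often handy.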
| | 03:08 | Now what I want to show you is there
are ways to clean up these charts and to
| | 03:12 | modify them to make them work little better.
| | 03:14 | Aside from the simple fact that
I think this is an ugly color,
| | 03:17 | there are a lot of things that can be
done to make this more communicative.
| | 03:20 | To edit the chart in SPSS, all you've got to
do is you come over it and you double-click.
| | 03:26 | And this brings up the Chart Editor window.
| | 03:29 | When you're doing your charts, you want
to look at the order that the bars appear in.
| | 03:32 | Now by default, SPSS puts them in
alphabetical order, and there may be situations
| | 03:37 | in which that's appropriate.
| | 03:39 | However, it's usually easier to read charts
| | 03:41 | if you sort the data by their values.
In this case, I'd like to have the most
| | 03:45 | common to the least common, and what I'm
going to do here is I just click on the
| | 03:49 | bars and I come over here to the
Properties window and it says Categories, Sort
| | 03:55 | By, at the moment it says Custom.
I just want to put it as Statistic, and I'm
| | 04:00 | going to make it Descending, and I press Apply.
| | 04:04 | And now I see it goes from Drama, the
most common, to Documentary, to Action, to
| | 04:08 | Foreign, to Comedy, to Animated, to Thriller.
| | 04:11 | If I want to change the colors of these--
these are still selected--I come over
to Fill and Border, and I can change it
to a color that I find a little nicer.
| | 04:19 | Now personally, I like to use light
colors because I feel that it's easier to
see them, but they do not dominate the vision.
| | 04:28 | And so I've changed these to
blue with a blue border as well.
| | 04:31 | Also if you want to make these ones
down here larger, these words, you can
| | 04:36 | simply click on them and come over to Text Size.
| | 04:39 | The preferred size is 8 point, which is
really small, especially since most of
| | 04:43 | the time these charts are going to be
used in presentations, like in PowerPoint,
| | 04:47 | where people are going to be
sitting 20, 30, 40 feet away.
| | 04:49 | So you can change these to
be 12 point, for instance.
| | 04:54 | Now what's happened is that SPSS has
automatically changed them to a staggered layout.
| | 04:59 | That's because they'd run into each
other--Documentary is much longer,
and Animated is much longer.
| | 05:03 | One way to deal with this, and something that I
do frequently, is when I have a chart like this,
| | 05:09 | you can actually come up to the button that
says Transpose the chart coordinate system.
| | 05:15 | If I click on that, it switches the
chart so that the labels are on the left and
| | 05:21 | then the bars go off to the right.
| | 05:23 | Now one thing that's happened to this is
that the most common one is down by the
| | 05:28 | bottom where the axis is.
| | 05:29 | That's not helpful in this kind of
chart, and so I'm going to click on the
| | 05:33 | Categories, I'm going to click on the
bars, go back to Categories, and instead of doing
| | 05:36 | them Descending, I'll do them Ascending.
| | 05:40 | And this puts the most common category on
the top and the least common on the bottom.
| | 05:44 | Also it may be that I don't really feel
like I need this word Genre here in the title.
| | 05:49 | What I can do is I can click on that
and I can come over to Labels & Ticks in
| | 05:53 | the Property window and simply
uncheck Display the axis title.
| | 05:58 | I click that, and the way it works in
SPSS is that almost any time you're going to
| | 06:02 | do anything, you then have to apply it.
| | 06:04 | I apply it, and that disappears, and I
find this to be a much cleaner chart.
| | 06:09 | And as a bar chart, it displays very
well the prevalence of each category,
| | 06:15 | it puts the categories into a logical order
from most common to least common, the
| | 06:19 | labels are large enough that I can read them,
and I've been able to work on this very nicely.
| | 06:23 | Now once you've set up a chart in a way
that you've modified it a fair amount,
| | 06:28 | if you want to, you can come back to
the Chart Editor and click on File and
| | 06:34 | actually save this as a chart template.
| | 06:36 | And it gives you the option of
saving all of your settings, except
| | 06:41 | I don't want to save all of the Text Content,
so I will uncheck that, and I can say Continue.
| | 06:47 | And I can simply save it as a Bar
Chart Template Transposed, or whatever you
| | 06:55 | think might be useful for you to find that
template again in the future. I'll click Save,
| | 07:00 | and now I can apply that
template on other charts if I want to.
| | 07:04 | But this is the most basic kind, and
truthfully, one of the most informative kinds
| | 07:08 | of charts, the bar chart, a simple bar
chart, two-dimensional, that communicates
| | 07:13 | the frequency of categories
in a categorical variable.
| Creating pie charts for categorical variables| 00:00 | In the previous movie, I showed you how
to use SPSS's Chart Builder, its unified
| | 00:05 | interface, for nearly every
chart the program can make.
| | 00:07 | And with it, we made a bar chart.
| | 00:10 | In this example, I want to show you
how to make another kind of categorical
| | 00:14 | chart, the pie chart,
| | 00:15 | that's a common choice for categorical variables.
| | 00:17 | The procedure is very
similar to that of bar charts;
| | 00:20 | however, there are a
couple of important differences.
| | 00:23 | These have to do with the demands that pie
charts place on the nature of the data.
| | 00:27 | These are that the data must be
exhaustive and mutually exclusive. Exhaustive
| | 00:31 | means that all the
categories need to cover all of the
| | 00:35 | possibilities and add up to 100%.
| | 00:38 | That may require that you create an Other
response category or a No Response category.
| | 00:42 | Mutually exclusive means that each
person needs to fall into just one category.
| | 00:47 | And while there are many situations
where the condition of mutual exclusivity
| | 00:51 | isn't a problem, for instance a
person can be born in only one country,
| | 00:56 | there are at least as many
situations where it doesn't work--
| | 00:58 | for instance, college attended, as many
people have attended more than one.
| | 01:02 | This can create a real limitation in the
applicability of pie charts. Also, there
| | 01:07 | is another issue in that bar charts
are pretty easy to read, because you
| | 01:10 | simply have to be able to judge
the length or height of a bar.
| | 01:14 | That's a linear measure.
| | 01:16 | Pie charts generally require that a
person be able to judge angles and areas,
| | 01:21 | both of which are rather difficult.
| | 01:23 | And so these are challenges for pie
charts, the demands they place on the data
of being mutually exclusive and
| | 01:28 | exhaustive, and also the interpretability.
| | 01:33 | Nevertheless, they are very common
choices, so I will show you how to do
| | 01:36 | these quickly in SPSS.
| | 01:38 | Like all of the other charts we are
going to do, you want to start by going up
| | 01:41 | to Graphs, to the Chart Builder.
| | 01:44 | From there, on the Gallery list, come
down to Pie. Click on that and just drag
| | 01:49 | the pie up in into the canvas. From
there I'm going to pick Genre and put that
| | 01:55 | down right there, and then I can press OK.
| | 01:57 | Like the basic bar chart,
it's very colorful.
| | 02:01 | You can see that the yellow slice is
the largest of all--that's Drama--and that
| | 02:06 | the purple is probably the next biggest,
and the others are a little bit smaller.
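As with bar charts, there is a quick syntax route for a pie of category counts; again a sketch, assuming the variable is named genre:

  * Pie chart of category frequencies, without the accompanying table (assumes the variable is named genre).
  FREQUENCIES VARIABLES=genre
    /FORMAT=NOTABLE
    /PIECHART FREQ.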
| | 02:10 | Now there are ways to customize the
pie chart in SPSS, but given that I
| | 02:15 | think that pie charts are generally
a little harder to read, I generally
| | 02:19 | encourage you to try bar charts instead.
| | 02:22 | But this is another common option for
depicting the categorical variable in SPSS.
| | 02:28 | So creating a pie chart in SPSS is a
simple affair, and it still gives a lot of
options to control how it looks.
| | 02:34 | However, given the challenges of
reading pie charts, and the restrictions they
| | 02:38 | place on the data, you may want to
consider using a bar chart instead.
| | 02:42 | On the other hand, in your corporate
culture, pie charts may be the lingua
| | 02:45 | franca, they may be what's expected,
and you may want to introduce some
| | 02:49 | variety in your charts, so they
can be a viable option in SPSS.
| Creating histograms for quantitative variables| 00:00 | In the last two movies, we looked at two
different kinds of displays you can use
| | 00:04 | for categorical variables.
| | 00:05 | We looked at bar charts
and we looked at pie charts.
| | 00:08 | On the other hand, you may also have
what SPSS calls a scale variable, also
| | 00:13 | called a quantitative, or measured, variable.
| | 00:15 | So for instance, the percentage of
critics who favorably endorse their movie, or
| | 00:20 | the budget for the movie, or viewer
evaluations, these are all measured as more
| | 00:23 | or less quantities, and a bar chart
and pie chart won't work for these.
| | 00:27 | Instead, there are generally two
kinds of charts that you want to make.
| | 00:31 | The first one that we're going to do
right now is called a histogram, and it's
| | 00:35 | like a bell curve that shows
the distribution of scores.
| | 00:38 | Let's look at that one right now.
| | 00:40 | Come to Graphs, to Chart Builder, and
from here I come down to Histogram.
| | 00:46 | There are a few variations, but the one
that's most informative is the basic one.
| | 00:50 | I grab it out of the gallery and drag it
into the chart canvas, and from there I
| | 00:55 | simply need to tell it what
variable it is that I want to chart.
| | 00:59 | In this case, I'm going to use Budget.
| | 01:01 | I'm going to drag that down into the X axis.
| | 01:04 | Now by the way, this is not the
real data that SPSS is showing.
| | 01:09 | When it uses the canvas it simply puts in
some kind of random data to let you know
| | 01:14 | the kind of chart it's producing--a histogram,
not a pie chart or something.
| | 01:16 | Now I have some options here.
| | 01:18 | One of them is whether I need IDs--I
don't think I do--or Titles, and I'm going
| | 01:23 | to put a title on this one.
| | 01:26 | And I'm going to put
"Budget for Movies in Movie.sav."
| | 01:31 | And I'll press Apply, and then I'll press OK.
| | 01:35 | And the Output window first shows the
code that produces this one, and you can
| | 01:40 | save that to rerun this later if you want to.
| | 01:42 | It shows the name of the
command in SPSS. It's GGraph.
| | 01:46 | It shows the data set that
was used to produce this.
| | 01:49 | That's important, especially if you have
more than one data set open at a time, and
| | 01:52 | this is the chart as this
produced by default in SPSS.
| | 01:56 | It's called a histogram.
| | 01:57 | You can see we have a whole lot of movies
in this data set that have very small budgets.
| | 02:01 | This is $50 million, $100 million, up
to a quarter billion there on the scale.
| | 02:07 | And this tells us that there are about 23
movies with budgets in the lowest range.
| | 02:12 | That makes sense when you consider
these are a lot of award-winning movies,
| | 02:15 | like animated shorts that people may not
have seen and that don't require a huge budget.
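A compact syntax alternative for this kind of picture is the FREQUENCIES histogram; a minimal sketch, assuming the variable is named budget:

  * Histogram of budget, with a normal curve overlaid for comparison (assumes the variable is named budget).
  FREQUENCIES VARIABLES=budget
    /FORMAT=NOTABLE
    /HISTOGRAM NORMAL.

The binning and colors can then be polished in the Chart Editor just as shown next.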
| | 02:21 | On the other hand, this chart is not
particularly attractive, and it's got some
| | 02:25 | communication problems.
| | 02:26 | So what I'm going to do is I'm
going to double-click on the chart.
| | 02:28 | Then I'm going to take this information
right here with the Mean, the Standard
| | 02:32 | Deviation, and Sample Size, and I
don't need that in the chart.
| | 02:34 | I may need that information
elsewhere, but I don't need it here.
| | 02:37 | So I'm going to click on
it and then I hit Delete.
| | 02:39 | Then I see over here I have
frequencies with decimal points on them, and I
| | 02:44 | don't need that there.
| | 02:45 | That's kind of silly. So I can click on
that and then come over here to Number
| | 02:50 | Format and I can put it to zero decimal places.
| | 02:55 | Then, here across the bottom, these are
millions of dollars and truthfully these
| | 02:59 | numbers are hard to read,
because there are so many digits there.
| | 03:03 | What I can do is I can click on that,
and I can come to Number Format, and I can
| | 03:07 | go to Scaling Factor here, and I put
it as Millions, and I press Apply.
| | 03:13 | And now it's much easier to read,
but I need to change this one.
| | 03:16 | It says, Budget. I just click on
that and I'll say, Budget in Millions.
| | 03:21 | Now there are two other
things I want to do here.
| | 03:23 | Number one is, I find this to be in a
very unattractive color, so I'm going to
| | 03:26 | click on it, and since it's money, I
might as well use green for my charts.
| | 03:31 | There is a little curiosity here
about the fact that we have three bars
| | 03:36 | for every $50 million.
| | 03:38 | Now there are some general guidelines for
the number of bins that you should have
| | 03:43 | in a histogram. These are
bins, how wide each bar is.
| | 03:47 | And we've got some gaps here, which
means we might need a few fewer, wider bins to help
| | 03:51 | smooth out the pattern.
| | 03:53 | Again, the idea here is that every
chart, including histograms, is meant to be a
| | 03:58 | simplification, an abstraction of the data.
| | 04:01 | It needs to be informative and
accurate, but it is a simplification.
| | 04:04 | So sometimes reducing the number of
bins can make it easier to see the patterns
| | 04:08 | without getting overwhelmed by
the complexities of the real data.
| | 04:12 | So what I'm going to do is I've got
these selected already, I'm going to come
| | 04:16 | over to the Properties window,
click on Binning, and then I'm going to
| | 04:19 | come down to Custom, Interval Width.
And what I'm going to do is I'm going to
| | 04:24 | make it so that there are two bars
instead of three for each one of these,
| | 04:26 | so they are each 25 million wide.
| | 04:30 | I believe that's 25 million.
| | 04:32 | And now we have just two bars per gap,
and it smoothes things out a little bit.
| | 04:36 | And what you can see is that most of
the movies in this particular data set of
| | 04:40 | award winners and top grossers
have budgets between 0 and 25 million.
| | 04:44 | There are some very low-budget movies.
| | 04:46 | This again, these short movies are some
animated movies, and then we have some very
| | 04:51 | large summer blockbusters with
budgets of $150 million or $200 million.
| | 04:56 | It's a good way of seeing
what the distribution is like.
| | 04:59 | When I'm done modifying the chart,
if I want to, I can come to File and I can
| | 05:04 | save that template and I can use it again later.
| | 05:07 | I'm not going to do that right now.
| | 05:10 | And then when I'm done editing the chart,
I can simply press the X and close the
| | 05:14 | chart, and there's my finished
chart that I can export later.
| | 05:17 | And again, a histogram is the first of
two charts that you should generally use
| | 05:22 | when you're looking at scale data.
| | 05:24 | The other one, which we'll cover in
the next movie, is a box plot, which is
| | 05:27 | ideal for looking at outliers in
| | 05:31 | distributions, which we appear to have in
| | 05:31 | this particular one.
| | 05:32 | But both of these charts are a great
way of getting the feel for the shape of
| | 05:36 | a distribution of a scale
variable, and give you a better idea of how
| | 05:40 | well you meet the statistical
assumptions of tests that you're going to be
| | 05:43 | performing later on them.
| Creating box plots for quantitative variables| 00:00 | When you're looking at what SPSS calls
a scale variable--that's something that
| | 00:04 | can be measured as more or less, like
the percentage of critics who gave a
| | 00:08 | favorable rating to a movie or the
budget or the box office earnings for that
| | 00:12 | movie--you should
generally make two kinds of charts.
| | 00:15 | The first one, which we did in the
last movie, is called a histogram.
| | 00:19 | It's like a bell curve, and it's a
good way of getting a feel for the overall
| | 00:22 | shape of a distribution.
| | 00:24 | The second kind that you should
generally make for a scale variable is called a
box plot, and its primary purpose in
this context is to check for outlying
| | 00:32 | scores, because they can cause a lot of
problems in later statistical analyses.
| | 00:37 | So you need to be able to identify
whether you have outliers and often
| | 00:41 | what those outliers are.
| | 00:43 | So what I'm going to do now is I'm
going to create a box plot for budget, which
| | 00:47 | we used in the last movie on histograms.
| | 00:51 | Come up to Graphs, to the Chart Builder, and
from there I come down to the list, to Boxplot.
| | 00:56 | There are several
different versions of box plots.
| | 00:59 | I am going to choose the simplest one possible.
| | 01:01 | That's this one over here,
which is called a 1-D Boxplot.
| | 01:04 | It's for charting all of the
cases on a single variable.
| | 01:08 | If I wanted to break down budgets by a
genre of film, I could do that over here,
| | 01:14 | under what's called a Simple Boxplot, but it's
grouped, and I will show that in a later movie.
| | 01:18 | But right now I'm simply going to drag
the 1-D Boxplot up to the canvas, and then
| | 01:24 | I'm going to bring in budget to the Y axis.
| | 01:28 | This is the general format of a box plot.
| | 01:30 | I will explain more when we
look at the finished version.
| | 01:34 | But I am going to do a couple of things.
| | 01:37 | Number one is I may want to identify points.
| | 01:40 | If I click on Point ID Label, then I
can actually get the movie name and I can
| | 01:45 | drag that into here,
| | 01:46 | so if I have unusually high or low points,
it will actually tell me what the movie is.
| | 01:51 | It makes life easier.
| | 01:53 | I can also put titles on.
| | 01:54 | I will have a title, and I will
put Boxplot of Movie Budgets.
| | 01:59 | Then I will press Apply, and for both
of these I can now press OK over here.
| | 02:06 | And what comes up is this particular chart.
| | 02:09 | This is the text that is the
syntax that produces the command.
| | 02:13 | This is the name of the command,
| | 02:14 | this is the data set, Movies.sav, and
this is the Boxplot of Movie Budgets.
| | 02:19 | What you have here is budgets ranging from 0--
| | 02:23 | there's actually nothing with 0--
up to $250 million for the movie.
| | 02:27 | This is from a few years ago.
| | 02:30 | And this box right here shows the
quartiles of a distribution, and this is the
| | 02:35 | minimum value of any movie in the data set.
| | 02:40 | This right here is the highest non-
outlying value, and I say non-outlying because
| | 02:45 | we have two outlier movies.
| | 02:47 | In this particular data set
Spiderman 2 and King Kong both had budgets of
| | 02:53 | approximately $200 million.
| | 02:56 | On the other hand, this box down here
shows you the median, that 50% of the
| | 03:01 | movies--there were 61 in this data set,
so 30 of them--had budgets beneath this,
| | 03:06 | which is around $25 or $30
million, and half of them were above.
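Box plots can also be requested through the EXAMINE procedure, which labels extreme points with an ID variable much as the Point ID Label drop zone does; a minimal sketch, with budget and moviename as assumed variable names:

  * Box plot of budget, labeling outliers by movie name (budget and moviename are assumed names).
  EXAMINE VARIABLES=budget
    /ID=moviename
    /PLOT=BOXPLOT
    /STATISTICS=NONE.

The /STATISTICS=NONE keeps the output to just the plot.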
| | 03:11 | Now, I am going to show you a few ways
to modify this chart that I think will
| | 03:16 | make it a little easier to deal with.
| | 03:17 | As with every chart in SPSS, you modify it by
first double-clicking on it to activate it.
| | 03:24 | That brings up the chart in a Chart
Editor window and it brings up a Properties
| | 03:28 | window to the right.
| | 03:29 | Now, one thing that I personally like
to do is I like to turn these charts
| | 03:33 | sideways by coming up to the button
bar and clicking on the button that says
| | 03:37 | "Transpose the chart coordinate system."
| | 03:40 | The reason I do this is because the
other charts that we make, like
| | 03:43 | histograms and like the scatter plots
that we will show later, they have these
| | 03:47 | variables listed across the bottom,
with the lowest value on the left,
| | 03:51 | highest value on the right, and I find it
helpful to be consistent in this particular way.
| | 03:55 | I'd like to change the color of the chart.
| | 03:58 | I click on the box, come over here to
change the fill, and then the border
| | 04:03 | I can change to another color if I want.
| | 04:05 | I can change the way these bars work at the end.
| | 04:08 | These are sometimes called whiskers.
| | 04:10 | They go to the lowest and
the highest non-outlying value.
| | 04:14 | In case you're wondering, outliers are
determined by being more than one and a half times
| | 04:19 | this middle range above or below the box.
| | 04:22 | What we're going to do is I'm going
to change the way these whiskers are.
| | 04:26 | This is just a preference issue.
| | 04:27 | I click on that, and I come over here to
Bar Options, and I am going to change it
| | 04:32 | from a T-bar to what's called a Whisker.
| | 04:34 | It's just a line at the end.
| | 04:37 | And then here, if I want to, I can actually
change the way that these look at the end.
| | 04:43 | I have the movie labels there as well.
| | 04:45 | Finally, if I want to change the Axis
labels here on the bottom, like I did with
| | 04:48 | the histogram where I changed these
to millions of dollars, I click on the
| | 04:51 | numbers, and I come over to the
Properties window, to Number Format, and the
| | 04:56 | Scaling Factor here,
I'm going to put in millions.
| | 04:59 | I am going to press Apply, and
this now gives me millions of dollars.
| | 05:04 | And I need to change this--
| | 05:05 | it says Budget--to say Budget in Millions.
| | 05:08 | I can close the chart, and now I
have a good depiction that the overall
| | 05:15 | distribution is on the low end,
because these are movies that included award
| | 05:19 | winners, that half of the movies have
budgets of 30 million or less, but they go
| | 05:26 | up to about 150 million, and that in
this particular data set we had two outlier
| | 05:30 | movies--Spiderman 2 and King Kong--
that had unusually large budgets, as is
| | 05:34 | common among summer blockbusters.
| | 05:37 | Anyhow, when you're looking at a
scale variable like budget, like viewer
| | 05:41 | evaluations, like time spent on tasks,
like time spent viewing a web site, then
| | 05:47 | you do want to look at both the
overall shape of the distribution with the
| | 05:50 | histogram and you want to check for
outliers, and a box plot is an ideal way
| | 05:55 | to do that.
3. Modifying Data
| Recoding variables| 00:00 | Many times your data won't come in
exactly the form that you need it for analysis.
| | 00:06 | For example, you may have groups
that need to be combined, or you may have
| | 00:09 | outcomes that need to be counted or
scores that need to be reversed to be more
| | 00:12 | interpretable in your results.
| | 00:14 | All of these fall in the
general rubric of recoding variables.
| | 00:18 | There are several ways to do this in SPSS.
| | 00:20 | The first way that I want to show you
in this particular movie is what you might
| | 00:24 | call a manual recode.
| | 00:25 | And the way you do this is by coming
up to the Transform menu and then you
| | 00:31 | select either Recode into Same
Variable or Recode into Different Variables.
| | 00:35 | Now let me give you a quick warning here,
when you recode into the same variable
| | 00:40 | you're overwriting existing data, and
while that may save some space,
| | 00:45 | if you make a mistake in the recode, you
will not be able to go back to what you
| | 00:50 | had before. And for that reason, I
recommend that you almost always recode into a
| | 00:55 | different variable, which is what I
am going to do in this particular case.
| | 00:59 | By the way, the one I'm going to
look at is this one here at the end.
| | 01:03 | It's called In the Past 30 Days Have
You Felt Worthless? and there are several
| | 01:08 | responses that go from Never to Almost
Every Day and what I am going to do in
| | 01:14 | this particular one is I am going
to recode it as people who have never
| | 01:17 | versus at least sometimes.
| | 01:19 | So I am going to be taking all of the
answers above zero and making them into
| | 01:24 | a single Yes code, that they have felt worthless
at least at some point in the past few weeks.
| | 01:30 | So what I do is I start by taking this variable.
| | 01:32 | It goes in as the Numeric Variable and it's
FeelWorthless, and I am going to create a new one--
| | 01:36 | I call it EverFeltWorthless
because the other one asked about how often.
| | 01:39 | This one is going to be "Have you ever?"
and I am going to put in the label for
| | 01:42 | this one and I am going to call it Has
EverFeltWorthless. And I click Change, and
| | 01:48 | now it shows that FeelWorthless will
be recoded into EverFeltWorthless.
| | 01:52 | Then what I need to do is I need to specify
the old and the new values for the recode.
| | 01:57 | Well, what I am going to do in this
one is I am going to take zero, and that's
| | 02:01 | going to stay zero, so those are the
people who said they never felt worthless--
| | 02:05 | that's going to stay that way--but then
what I am going to do is I am going to
| | 02:08 | specify a range, and I am going to put
anything 1 through the highest value, so
| | 02:15 | that's 2s, 3s and 4s, so that
any of those become a 1.
| | 02:20 | Now this new one I am creating is
going to be called an Indicator Variable.
| | 02:23 | That's a 0/1, yes/no variable.
| | 02:26 | It's a good way to do it because it
allows you to also do certain numerical
| | 02:31 | statistical procedures with it.
| | 02:33 | Now if I wanted to set up a more
detailed correspondence, I could.
| | 02:36 | Say for instance, I had a variable in
an opinion survey that was coded as 1
| | 02:41 | strongly disagree, up to 5 strongly
agree, but then it was reverse-coded
| | 02:46 | so that, for instance, in this particular case,
people are talking about what they did not like.
| | 02:51 | In order to make things consistent, I
may need to switch it around, and I may need to
| | 02:54 | switch 1 to 5, 2 to 4,
3 stays the same, 4 to 2, and 5 to 1.
| | 03:00 | I can do that by putting in each one
of these manually, but because I have a
| | 03:04 | pattern here where I am putting 0
stays 0 and everything else goes to a 1, I
| | 03:08 | can do this particular method.
| | 03:10 | 0 stays a 0, but everything else goes to a 1.
| | 03:16 | Now that that's done, I can press OK
and this is the syntax statement.
| | 03:22 | The command is RECODE. It says FeelWorthless (0=0),
then 1 through the highest value equals 1, into
| | 03:28 | the new variable, and then it has a
label for the variable. The name of it
| | 03:33 | is EverFeltWorthless, and then I turn
that into a sentence, or phrase, for the
| | 03:37 | label. And then the EXECUTE
means it actually runs the command.
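For reference, the syntax that this dialog generates looks roughly like the following. This is a sketch built from the names chosen above, and the exact statement SPSS writes out may differ slightly.

    * Manual recode into a different variable.
    RECODE FeelWorthless (0=0) (1 THRU HIGHEST=1) INTO EverFeltWorthless.
    VARIABLE LABELS EverFeltWorthless 'Has Ever Felt Worthless'.
    EXECUTE.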
| | 03:40 | Now, you don't see anything else here because
this doesn't produce a graph. It adds a column.
| | 03:45 | It adds a variable to the data set.
| | 03:46 | So if we go back to the data set and
I go to the end, now you'll see a new
| | 03:52 | variable here called
EverFeltWorthless, and it's made out of 0s and 1s.
| | 03:56 | Now I need to do a couple of
things to clean this up here.
| | 03:58 | Number one is it's got these decimal
places that I don't need, because I don't
| | 04:03 | have any 1.5s, I just have 1s and 0s.
| | 04:04 | So I am going to come down to Variable
view and I am going to change that one to
| | 04:10 | have 0 decimal places.
| | 04:11 | Also, I want to indicate that
the 0 means no and 1 means yes,
| | 04:16 | so I am going to come over to Values,
click on that, click on the little
| | 04:20 | box here, and I am going to type in the value.
I am going to put in a 0 and say that it means No.
| | 04:26 | Click Add, and then I come back
up to 1, and the Label is Yes.
| | 04:31 | Then I click OK, and that adds the labels.
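In syntax, that cleanup amounts to a format change plus a set of value labels; a minimal sketch, assuming the variable name used above:

    * Show whole numbers and label the 0/1 codes.
    FORMATS EverFeltWorthless (F1.0).
    VALUE LABELS EverFeltWorthless 0 'No' 1 'Yes'.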
| | 04:35 | Now I go back here. I can see those on here now.
| | 04:39 | So what I have done is I've taken an
existing variable and if I click back here
| | 04:44 | on the Value Labels button, you can see
that I had 0s and 4s and so on that have
| | 04:50 | all become 0s and 1s.
| | 04:53 | So I've gone from something that had
a very small number of people on the
| | 04:56 | high end to groups that are slightly
larger and easier to work with: people who
| | 05:00 | said they ever felt worthless versus people who never did.
| | 05:03 | This, by the way, is from
the General Social Survey.
| | 05:05 | It's a national survey of people
across the country of all age ranges.
| | 05:09 | And this is one way to use recoding
to get a variable that's more useful in
| | 05:16 | particular analyses.
| | 05:18 | Now in the next couple of videos I am
going to show you how to use something
| | 05:21 | called visual binning and then
something called ranking, and those are two other
| | 05:25 | methods of taking the information that
you have and putting it into a system
| | 05:30 | that would work better for the
analyses that you are going to do.
| Recoding with visual binning| 00:00 | In our last video, we talked about one
method of recoding variables, or taking
| | 00:05 | the data in its existing format and
changing it into another that may be more
| | 00:11 | amenable to a particular
graphic or a statistical analysis.
| | 00:14 | In the last movie, we looked at what
might be called Manual Recode by using the
| | 00:19 | Transform command to recode
into a different variable.
| | 00:24 | In this movie, we are going to look at
another one that's called Visual Binning.
| | 00:27 | It's one of the more attractive features of SPSS.
| | 00:32 | We do this by coming up to Transform,
coming down into Visual Binning. And you
| | 00:39 | take a variable that has a wide range
of scores--in this particular one, I'll
| | 00:43 | take Age and I'll put that into
Variables to Bin and press Continue.
| | 00:50 | And what this shows me is the age range
of the people in this particular sample.
| | 00:55 | It goes from a minimum of 18
to a maximum of 87 years old.
| | 01:00 | This is a national sample of
adults and so this isn't surprising.
| | 01:05 | Now, there may be times when I want
to break this down into groups.
| | 01:09 | For instance, I have one particular
procedure in mind where you take variables
| | 01:14 | like this and break them into
five even groups that are called quintiles,
| | 01:18 | even meaning it's the same
number of people in each group.
| | 01:22 | The Visual Binning is a perfect way to do this.
| | 01:25 | Now I need to do something right here.
| | 01:27 | We are going to be creating a new
variable, and it already wants to call it Age
| | 01:31 | (Binned), for Age split into different bins.
| | 01:33 | I am just going to call that Age_Bin.
| | 01:37 | And then what I do is I need to come
down and have SPSS create cutpoints or
| | 01:44 | different ways of separating the distribution.
| | 01:46 | I come down here to Make Cutpoints,
and I can tell it to make the intervals of
| | 01:54 | even sizes, say for instance the 20 to
30 year olds, the 30 to 40 year olds, and so
| | 01:59 | on, and that's one possibility.
| | 02:00 | And maybe I would want to do that.
| | 02:02 | I could say let's start the first
one at 20 and then do it every 10.
| | 02:06 | The one I'm thinking of is where I
want to create five equal-size groups, so I
| | 02:11 | need four cutpoints to create five groups.
| | 02:14 | See, right here it says,
"N cutpoints produce N+1 intervals."
| | 02:18 | And so what I'm going to do is I am
going to create four cutpoints, and each
| | 02:23 | one of them will have 20% of the sample, because
there are five of them total, so that's 100%.
| | 02:27 | I click OK, and what SPSS
has done here is put in dividers so that each
| | 02:37 | group has the same number of people.
| | 02:39 | Now, some of these dividers will be
closer, some will be further apart, because
| | 02:43 | there aren't as many people in that group.
| | 02:45 | So for instance you see in the 30 to
40 range, they're pretty close because
| | 02:49 | there's a lot of people right there,
| | 02:52 | similarly in the 40 to 47 group.
But we have to go all the way from 62 on up to get the
| | 02:57 | same number of people.
| | 02:59 | Now, these are automatically created.
| | 03:01 | It may be however that I look at them
and I say that yes, these are exactly
| | 03:06 | equal groups, with the same number of
people in each one, but I may want them to
| | 03:09 | be slightly different.
| | 03:11 | Maybe I don't want to have the
last group start at 61, I think that
| | 03:14 | sounds a little silly.
| | 03:15 | Maybe I'd want to change it to be
exactly 60, and I'd want the other ones to
| | 03:19 | change to be slightly different.
| | 03:20 | So I can actually grab them and move them,
ever so slightly, to be what I want them to be.
| | 03:33 | Or I could try typing them in, to make
sure they get exactly where I want them.
| | 03:37 | I could change that to 40. I could
leave the 47 where it is. I can double-click
| | 03:45 | that one and change it to 60,
and the last group is everything higher than that.
| | 03:50 | And now I've got the cutpoints,
and these are approximately equal groups; I
| | 03:53 | changed them only slightly.
| | 03:55 | Another neat thing is this is going
to create a new variable called Age_Bin
| | 03:58 | and these are the values, 1, 2, 3, 4 and 5,
because I have created five different groups.
| | 04:04 | I can also create labels automatically
by clicking on Make Labels right here, and
| | 04:09 | when I do that, it says that the first
group is less than or equal to 30, then
| | 04:13 | 31 to 40, and so on up to 61+. And all I need
to do now is press OK, and it tells me
| | 04:22 | that it has created one
new variable in my data set.
| | 04:28 | This is the history of the command.
If I were to write it out by writing code,
| | 04:33 | this is what I would do.
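The generated code is essentially a RECODE whose ranges are the cutpoints, plus the labels. A rough sketch of what that history looks like, with the cutpoints and label text used purely for illustration:

    * Visual Binning of Age into five groups (illustrative cutpoints).
    RECODE Age (MISSING=COPY) (LO THRU 30=1) (LO THRU 40=2) (LO THRU 47=3)
        (LO THRU 60=4) (LO THRU HI=5) INTO Age_Bin.
    VARIABLE LABELS Age_Bin 'Age (Binned)'.
    VALUE LABELS Age_Bin 1 '<= 30' 2 '31 - 40' 3 '41 - 47' 4 '48 - 60' 5 '61+'.
    FORMATS Age_Bin (F1.0).
    EXECUTE.

Because RECODE uses the first value specification that matches, listing the overlapping LO THRU ranges from the lowest cutpoint upward sorts each age into the right bin.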
| | 04:34 | But if I go back to my data set, I come
to the end, and I see that I have a new
| | 04:43 | variable here called Age_Bin that has
the numbers 1 through 5 in it. And if I go
| | 04:48 | straight above here to the button bar
and click on Value Labels, you can see the
| | 04:52 | label that shows each age group.
| | 04:55 | And so the Visual Binning procedure is
a wonderful feature of SPSS that allows
| | 05:00 | you to create a new variable by
grouping people on another scaled variable.
| | 05:06 | This can save a lot of time when
you're trying to create groups of particular
| | 05:09 | sizes or split things up into
particular intervals, like every 10 years.
| | 05:14 | And so this is the second way we're
looking at in terms of recoding variables.
| | 05:19 | I am going to show you another one
and it is called ranking variables which
| | 05:23 | works in a pretty predictable way.
But between the three of those, you should be
| | 05:27 | able to do a fair amount in terms of
getting the data into the form that you
| | 05:30 | need for your statistical analyses.
| Recoding by ranking cases| 00:00 | In the last two videos we
looked at ways that SPSS offers for
| | 00:04 | recoding variables.
| | 00:06 | For instance you can take a variable
that comes in one particular form, like the
| | 00:10 | words male and female, and recode it
into another variable that has zeros and
| | 00:15 | ones, an indicator variable that's
useful in a lot of other analyses.
| | 00:19 | Or we could do something called Visual
Binning where we take ages and we create
| | 00:24 | groups of ages to get, in this
particular example, categories with approximately
| | 00:28 | the same number of people,
five categories or quintiles.
| | 00:31 | A third option that SPSS offers that I
am going to talk about right now is a
| | 00:35 | particularly popular one.
| | 00:37 | It's called ranking, and all it is
is ranking people from first to last on a
| | 00:42 | particular variable.
| | 00:43 | So for instance in this example I'm
going to take the Age variable again and I'm
| | 00:47 | going to rank people from
the youngest to the oldest.
| | 00:51 | Now what this does is it
numbers people from first to last.
| | 00:56 | Theoretically, it could number people from
one to 349, because that's how many cases I have.
| | 01:01 | However, we do have tied values, people
with the same age, and I'll show you how
| | 01:06 | SPSS deals with that when
doing a recode by ranking scores.
| | 01:10 | What I am going to do is I am going to come
up to Transform and come down to Rank Cases.
| | 01:16 | Then I am going to pick the variable
that I want, in this case it is Age,
| | 01:21 | and move it over here.
| | 01:22 | And you'll see you can do more
than one at a time if you wanted.
| | 01:26 | We could also get summary tables.
| | 01:28 | You also get to decide whether you wanted
the first place, the number one, to be
| | 01:32 | the smallest or the largest value,
and in this case I'm going to give the one to
| | 01:37 | the youngest person,
so I am going to leave it at the smallest value.
| | 01:40 | However, there are several
ways of dealing with rankings.
| | 01:44 | The number one is just a
straight ahead normal Rank.
| | 01:47 | So it would go from 1 to, for example, 349.
| | 01:51 | On the other hand, we can also have
something over here that's called a
| | 01:54 | Fractional rank as a percentage,
and this would be like percentiles.
| | 01:58 | So if you've taken a test, you know
that you can get into the 95th
| | 02:02 | percentile. You may not even know what
the highest score really was, but you
| | 02:06 | know where yours stands relative to others.
| | 02:07 | We can do the same thing with Age here.
| | 02:10 | This would give people percentile
scores on their age. Are they the oldest,
| | 02:13 | youngest, in the 80th percentile, or so on.
| | 02:16 | Similarly, I have the option of creating
Ntiles or quartiles or quintiles, like I
| | 02:21 | did in the last one.
| | 02:22 | I could have done this instead by
telling it to create five equal groups.
| | 02:26 | If I clicked on this one and put 5,
it would do the five equal groups, which was
| | 02:30 | sort of what I was doing in the last one.
| | 02:33 | The Savage score and the Sum of rank
cases as well as the Proportional estimates
| | 02:37 | and Normal scores are rather
sophisticated things, and I don't think that we need
| | 02:40 | to get involved in these.
| | 02:41 | I want to do the simplest
form of ranking at this moment.
| | 02:45 | So I am just going to leave it at the
default, Rank, and press Continue, but I then
| | 02:49 | need to decide what to do with tied scores.
| | 02:52 | I've got a few options.
| | 02:53 | Number one is to give them the mean.
| | 02:55 | So if I have people tied for seventh,
eighth, and ninth, it would give all of
| | 02:59 | them a rank of eighth. Or I could have it give
them all a rank of seventh or all a rank of ninth.
| | 03:07 | And so there are a few different options.
| | 03:09 | I think what I am going to do in this one
is I am going to do them all as the lowest.
| | 03:14 | So it will be ranking them by age
categories in this particular example.
| | 03:18 | The Mean would make sense in other ones,
but for Age, I think assigning the tie to
| | 03:22 | the lower score would be the better choice.
| | 03:24 | So I am going to press Continue.
| | 03:27 | Now I also have an option of breaking
things down by some other category.
| | 03:31 | For instance I could do Gender, where I
have people ranked as oldest to youngest
| | 03:35 | for men and similarly for women.
| | 03:37 | I am not going to do that in this
case, but that is an option.
| | 03:40 | It would still create a single
column of ranks. It's just I would need to
| | 03:44 | separate them later by gender when I did them.
| | 03:46 | So all I need to do now is press OK
and it tells me that it has created a new
| | 03:52 | variable from Age to Rank, and it's
called RAge, R for Rank, Age, and it has a
| | 03:58 | label on there. And if I go back to
the data set, I can see it right here.
| | 04:03 | If I hover over that, I could see
that it's called a Rank of Age, and then
| | 04:07 | here I see the ranks.
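In syntax, this Rank Cases setup corresponds to something like the following sketch, using the options chosen above; SPSS names the new variable RAge by default:

    * Rank Age in ascending order; ties get the lowest rank of the tied set.
    RANK VARIABLES=Age (A)
      /RANK
      /TIES=LOW
      /PRINT=YES.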
| | 04:08 | I can scroll up and down.
| | 04:10 | I see that I don't need those three zeroes after the decimal.
| | 04:13 | If I had used average ranks for the ties,
I would probably need those.
| | 04:16 | So what I am going to do is I am going
to come back over to Variable view, go
| | 04:20 | down to RAge, and just
remove the three decimal places.
| | 04:24 | And when I do that, I have
everybody ranked from youngest to oldest.
| | 04:29 | In fact, if I want to verify how this
works, I can just right-click on this and
| | 04:34 | I can say Sort this ascending,
| | 04:37 | so the lowest scores will be at the top.
| | 04:39 | And you see, for instance, that these
people here all fall into the 30-and-
| | 04:44 | under group, which makes sense
because they should be the youngest.
| | 04:46 | As it goes up, I get people in the 40s
to 61-plus, and that is the highest
| | 04:54 | group, and that is a confirmation that the
rank performed the way I had intended it to.
| | 04:59 | And so the ranking of cases is a third
option for recoding, along with the manual
| | 05:04 | recode that we did earlier,
as well as the Visual Binning.
| | 05:07 | And it can be a good way of making sure
that your data both meet the assumptions
| | 05:12 | of a statistical test and that they fall
into a form that's easier to show in the
| | 05:17 | graphs and analyses.
| | 05:19 | Ultimately, it makes the results easier
to communicate with other people, which
| | 05:23 | is the goal of a statistical analysis.
| Computing new variables| 00:00 | When you enter or import data into SPSS,
you may want to know a person's average
| | 00:05 | score on a series of variables, but it's
usually a good idea to bring in the raw
| | 00:09 | data and not a summarized version.
| | 00:11 | That way you can recode or
modify from the original information.
| | 00:15 | Also some procedures, such as
calculating something called the internal
| | 00:20 | reliability of a questionnaire, those
procedures may require the complete raw data.
| | 00:25 | Once you bring the data in though and
recode it as necessary, you can then
| | 00:28 | compute the average scores, or a
maximum, or spread, or whatever interests you,
| | 00:33 | using SPSS's extremely flexible Compute command.
| | 00:37 | I am going to do this by using the GSS
data set, which asks people whether in
| | 00:42 | the last year they had seen a classical
music or opera performance, whether they had
| | 00:47 | attended a live performance of pop
music, whether they had attended a dance
| | 00:51 | performance, seen a live
drama, or even just read a novel or a poem or
| | 00:56 | a play in the last year, and then whether they saw art.
| | 01:00 | And so we have here a series of sort
of cultural indicators, and one thing we
| | 01:04 | might want to do is add up how many of
these things people say they've done to
| | 01:09 | get a rough index of cultural involvement.
| | 01:13 | One way to do this with Compute
variable is to simply add these up, and I can do
| | 01:17 | that even though it says the words yes
and no here. If I come back up to the
| | 01:22 | button Value Labels and click on it,
you can see that I have zeros and ones
| | 01:27 | underneath, and the nice thing about
that, and this is why we prefer the
| | 01:31 | indicator variables, is
I can simply add them up.
| | 01:34 | I can simply get a sum for these variables
and find out how many of these things people have done.
| | 01:40 | I'm going to first create a space
for this variable. Now you normally
| | 01:44 | don't need to do this.
| | 01:45 | It would simply add the variable at the end.
| | 01:47 | But I'd like the variable to be
right here next to the other ones,
| | 01:51 | so what I am going to do is I'm going to
come to the end of that list and now at
| | 01:56 | Happy, I am going to right click on it
and insert a new variable. And I am going
| | 02:01 | to double-click on that variable to edit it.
| | 02:03 | It will bring up Variable view, and
click right here under Name, backspace, and
| | 02:11 | I'll change name to ArtTotal.
| | 02:15 | I can leave the width at 8.
| | 02:17 | I'll change the decimals to 0, because
these are all integer values, and I'll add
| | 02:22 | a label, Art Forms Participated.
| | 02:25 | I'll also change this over here to a
Scale variable, and I will change it to be
| | 02:31 | both an Input and a Target variable, so
I can use it either way by saying both.
| | 02:36 | Come back to the Data view, and I'll
save the data. And now what I am going to do
| | 02:42 | is I am going to create a command
that will add up these 1, 2, 3, 4, 5, 6
| | 02:49 | variables and create a score here,
| | 02:50 | so it'll go from zero to six.
| | 02:54 | Go to Transform, to Compute Variable, and
it asks me for the Target Variable.
| | 03:00 | Now I've already created it,
| | 03:02 | so I can simply write here ArtTotal.
| | 03:07 | And then it's going to ask for a
numerical expression that's a formula.
| | 03:11 | Now you can get very
sophisticated formulas in SPSS.
| | 03:14 | For instance, I can get an
exponent, or I can do the modulus.
| | 03:21 | In fact in a couple of videos from now,
I am going to show you how to use the
| | 03:25 | logarithmic function as a
way of dealing with outliers.
| | 03:28 | But all I really want right
now is a very, very simple one.
| | 03:31 | All I need is the sum.
| | 03:32 | I'll go to Function group. Then I come
down here in the Functions and Special
| | 03:37 | Variables list till I find Sum, and if
I double-click on that, it adds to the
| | 03:44 | numeric expression and then asks, what
is it that I'm going to be adding up?
| | 03:48 | I can back up and remove those, and then
I can select the variables that I want
| | 03:52 | to be included in the sum.
| | 03:54 | I want this variable, SawClassical,
and I can add each one of these with a
| | 04:00 | comma between them.
| | 04:01 | I can go like this, and I can add
another one. But because these variables are
| | 04:07 | sequential in the data file, I can
actually use a shortcut expression.
| | 04:10 | I can just list the first one, and then I
can put space and write the word "to" and
| | 04:16 | then the last one is "SawArt."
| | 04:19 | And once I have that, it says to add
up the scores on all these variables.
| | 04:23 | Because they are 0/1 indicator
variables, the sum will simply be how many of
| | 04:27 | these things people say
they've done in the last year.
| | 04:30 | If I want to, I can make it so that it
only calculates it for particular cases,
| | 04:33 | for instance for just men or for
people who are over a particular age. I don't
| | 04:38 | need to do that so I am going to
leave it alone. I'll just press OK.
| | 04:41 | And it asks me if I want to
change the existing variable.
| | 04:43 | Now, there's nothing there
because I created a blank variable,
| | 04:46 | so I can just click OK.
| | 04:48 | It writes down that it did COMPUTE,
that the new variable ArtTotal is equal to
| | 04:53 | the sum of SawClassical to SawArt, and
then EXECUTE to actually create it.
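Written out, that syntax is short; a sketch using the variable names above:

    * Add up the 0/1 indicators into a single count from 0 to 6.
    COMPUTE ArtTotal=SUM(SawClassical TO SawArt).
    EXECUTE.

The TO keyword works here only because the indicator variables sit next to each other in the data file.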
| | 04:58 | When I go to the data set, I see a new
variable right here with scores from 0,
| | 05:02 | there is a 5, I don't know if we
have any 6s, I can check that out.
| | 05:06 | But now I've created a new variable
that combines the results of these various
| | 05:12 | cultural indicators to give me a
single variable that I can use in further
| | 05:16 | analyses, a way of correlating with
other variables and trying to get an idea of
| | 05:22 | who might be more or less
involved in arts and cultural activities.
| | 05:26 | And so the Compute command is a very
flexible one, a great way of reshaping
| | 05:31 | the data to get it into the form that
can be most useful for your particular
| | 05:35 | analyses.
| Combining or excluding outliers| 00:00 | When you start looking at your data
one of the problems you might have to
| | 00:03 | deal with is outliers. These are
extreme scores, like somebody who is 7 feet
| | 00:08 | tall or somebody who has 26 children
or unusual categories, like being Nepali
| | 00:14 | or a Latin Poetry Major.
| | 00:16 | Now sometimes these unusual scores or
categories are inherently interesting, like
| | 00:20 | with world records or gifted
and talented programs in schools.
| | 00:24 | In other situations, however, they can
wreak havoc with statistical procedures
| | 00:28 | that might be designed to look at
general patterns, or overall trends.
| | 00:32 | In the latter case, where you may be
interested more in common scores than
| | 00:36 | in uncommon scores, you have a few choices
on how to deal responsibly with the outliers.
| | 00:42 | Now the first question
is how to define outliers.
| | 00:45 | Now we've already looked at one way
of getting a graphical definition of
| | 00:49 | outliers on a scale
variable, and it's with a box plot.
| | 00:52 | I am going to come up to Graphs, to
Chart Builder, to Boxplot. I will drag in
| | 00:59 | the 1D Boxplot, and let's
look at Market Capitalization.
| | 01:04 | Also, because we have convenient stock
symbols over here, I am going to ask for a
| | 01:11 | Point ID so I know who the outliers are.
I will just drag that over here and
| | 01:17 | press OK, and what we see is that the
variable for Market Capitalization is
| | 01:23 | extraordinarily skewed, and in fact
they often call this pathologically skewed.
| | 01:27 | We have Apple here with over $300
billion in market capitalization,
| | 01:31 | Microsoft, Oracle, and Google, and it
just goes down. And we have this huge
| | 01:37 | number of companies that are stuck in
a tiny level of market capitalization
| | 01:41 | relatively speaking.
| | 01:42 | In fact, we have no idea what the
median or the mean is because those other
| | 01:47 | scores all get squished together so much.
| | 01:49 | There are 2800 companies in the
NASDAQ listing, but we have these extreme
| | 01:54 | outliers that are squishing all the others,
| | 01:56 | so that it is not possible to
really see what's going on.
| | 01:59 | So we know that we have
outliers here on a scale variable.
| | 02:02 | Now on a categorical variable, like for
instance ethnicity, what you then have as
| | 02:08 | a definition for categorical outliers
is that any group that has, for instance,
| | 02:13 | less than 10% of the overall sample
would be considered a categorical outlier.
| | 02:18 | In that situation you have the choice of
combining them with other categories and
| | 02:22 | creating a sort of Other category
except that it will be a very heterogeneous
| | 02:27 | group. That, or you simply don't
analyze by that variable in the future.
| | 02:31 | But let's talk about what
to do with a scale variable.
| | 02:35 | Now if you don't have very many
outliers, or they're not very far
| | 02:39 | away, you can leave them in. You could
take them as legitimate values and you
| | 02:44 | could proceed with that
understanding, as long as you communicate it
| | 02:48 | adequately to others.
| | 02:50 | On the other hand, another
choice is to exclude them.
| | 02:54 | Now I don't necessarily mean delete
them permanently from the data set, but you
| | 02:58 | can create a selector. We've done this before.
| | 03:00 | I should just mention right here, this
is $100 billion, and we still have a
| | 03:04 | huge number of companies right there.
| | 03:05 | I am going to select a much smaller number.
| | 03:07 | I am going to go to
$100 million capitalization.
| | 03:10 | So I am going to go to Data, to Select Cases.
| | 03:14 | Select Cases if your market capitalization
is less than 100 million and press Continue.
| | 03:24 | Now I have the option of
just filtering them out.
| | 03:27 | That creates a new variable that
temporarily excludes them, or I could delete them
| | 03:30 | permanently, and I don't want to do that.
| | 03:32 | I am just going to filter them out right now.
| | 03:34 | So I am going to press OK, and it tells
me that it has done that selection. And in
| | 03:38 | fact, if I go back to the data set I will
see that some of these cases got selected out; for instance,
Apple has been selected out.
There is a variable here at the end now.
| | 03:47 | There's a filter variable, and if I
click on the value labels, I can see there
| | 03:51 | are cases that are selected or not selected.
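The selection it has run is essentially a filter; a minimal sketch of the equivalent syntax, assuming the variable is named MarketCap and is stored in millions of dollars:

    * Temporarily keep only companies under $100 million in market capitalization.
    COMPUTE filter_$=(MarketCap < 100).
    FILTER BY filter_$.
    EXECUTE.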
| | 03:54 | And now I am going to go back, and I am
going to do my box plot all over again.
| | 03:59 | All I have to do is press OK, but
this time I don't have any outliers.
| | 04:04 | In fact, this is a
pretty normal-looking box plot.
| | 04:07 | I can see that of the 2800 companies in
the NASDAQ, the median level of market
| | 04:12 | capitalization is around $40 million.
| | 04:15 | The first quartile, the lowest
25%, have $20 million or less, whereas the
| | 04:22 | cutoff for the highest quartile is at about $60 million.
| | 04:25 | There are of course hundreds of
outliers above these, but these give a nice
picture of what you'd call
the small capitalization market.
| | 04:33 | Anyhow, the ability to either combine
groups or to temporarily exclude outliers
| | 04:39 | is one good way of dealing with them,
as long as you can justify your choices.
| | 04:44 | Again, that gets back to a general
statistical principle that you can do
| | 04:48 | whatever you feel is most
appropriate and that serves your purposes in
| | 04:51 | telling an analytical narrative.
You're telling a story about your data, and
| | 04:56 | if temporarily excluding cases or
combining them with other groups serves
| | 04:59 | your purposes best, then go ahead and
do that, as long as you can justify your
| | 05:03 | decision to others.
| | 05:05 | Now, in the next video I will look at
another way that does not exclude the cases.
| | 05:10 | It leaves them all in, but changes them
by doing what's called a transformation,
| | 05:14 | to let you use all of your data and see
if you can still find a way of telling
| | 05:18 | a coherent narrative that way.
| Transforming outliers| 00:00 | In the last video, we talked about a few
relatively simple ways of dealing with outliers,
| | 00:05 | that is, either leaving them in, if it
can be justified; rolling them into other
| | 00:10 | categories, but at the risk of a
heterogeneous group; or deleting them or
selecting them out of the analyses temporarily.
| | 00:17 | Now while these approaches may make
sense if you don't have too many outliers,
| | 00:21 | say for instance no more than 2% or 3%
as a rough estimate, they also do some
| | 00:27 | damage to the data and can cause you to
lose cases, and you may have worked very
| | 00:31 | hard to get those data.
| | 00:33 | So another alternative if you have a
scale variable is to perform a mathematical
| | 00:38 | transformation on the data.
| | 00:40 | What this does is it modifies all
the scores in the variable, generally
| | 00:44 | creating a new variable in
the process, using a set formula.
| | 00:48 | Now people are very familiar with
transformations, such as multiplying or adding
| | 00:52 | or subtracting a certain amount,
and that's taken as common practice.
| | 00:56 | What we're going to be doing in
this case, the most common approach for
| | 01:00 | distributions that have a few
extremely high scores, like the market
| | 01:04 | capitalization one that we looked
at in the last one, is to take the
| | 01:07 | logarithm of the scores.
| | 01:09 | Now you may remember
logarithms from junior high.
| | 01:12 | These have the effect of
bringing in extremely high scores.
| | 01:16 | So for instance, the logarithm of 10
is 1, the logarithm of a 100 is 2, the
| | 01:23 | logarithm of a 1,000 is 3, and it
brings in the scores in a predictable way.
| | 01:29 | And this is a legitimate way of
dealing with outliers, as long as you always
| | 01:35 | specify that you were dealing with
the logarithms from this point on.
| | 01:39 | On the other hand, if you have unusual
scores at the low end of the distribution,
| | 01:44 | you might want to try squaring the
scores, because what that does is it pushes
| | 01:47 | all the scores up but
pushes the higher ones even further.
| | 01:51 | Now in both situations this assumes
that you do not have zeros or negative
| | 01:56 | scores, you have all positive scores.
| | 01:58 | There are other ways of dealing with
those. You can add a constant to them, but we
| | 02:01 | don't need to deal with that right now.
| | 02:04 | What I'm going to do is I'm going to
look at the market capitalization data that
| | 02:08 | we had in our last data set. Now I had filtered the data down to
cases under $100 million in market capitalization.
| | 02:14 | I'm going to undo that filter right now.
| | 02:16 | I'm going to Data, to Select Cases,
to say please use all of them.
| | 02:24 | And so now it just tells me that the
filter is off, and you can see that none of
| | 02:29 | them are selected out anymore.
| | 02:30 | And I'm going to come back here and
let's take another quick look at the box plot
| | 02:36 | for market capitalization that we did before.
| | 02:42 | We have an extremely skewed distribution.
| | 02:45 | Now let's try to find if doing a logarithm
could help make this a little less skewed.
| | 02:52 | What we do is we come to Transform, to
Compute Variables, and I'm going to create
| | 02:58 | a new variable called
LogMarketCap, and that's pretty easy.
| | 03:04 | It is going to be the logarithm
of the market capitalization.
| | 03:08 | Now we have two choices for logarithm.
Log10, this is what's called the base 10 logarithm.
| | 03:13 | It takes the number 10 and raises it to
a particular exponent to get a number,
| | 03:17 | and that exponent is the logarithm.
| | 03:19 | There's also the natural logarithm,
which is based on e, 2.71828 and so on,
| | 03:24 | an irrational number.
| | 03:28 | And while there are very pleasing
aesthetic aspects of the natural logarithm,
| | 03:32 | because it's easier to
interpret the base 10 logarithm,
| | 03:35 | that's the one we usually use.
| | 03:37 | So what I do is I double-click on
that and it brings it up into the numeric
| | 03:40 | expression. I just
double-click on MarketCap and it fills it in so it
| | 03:44 | says LG10(MarketCap).
| | 03:47 | Press OK and it tells me that
it's created a new variable.
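The corresponding syntax is a one-line Compute; a sketch using the names above, where LG10 is the base 10 logarithm function:

    * Base 10 log transformation to pull in the extreme high scores.
    COMPUTE LogMarketCap=LG10(MarketCap).
    EXECUTE.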
| | 03:52 | If I go to the data set, I can
see it right here at the end.
| | 03:57 | You see the numbers are much smaller
now, mostly just single or double digits, but that's
| | 04:01 | because we're dealing with very large
numbers over here, and the logarithm has
| | 04:04 | more to do with the
number of zeros in the number.
| | 04:07 | Now what I'm going to do is I'm going
to go back and create another box plot,
| | 04:11 | but instead of doing market
capitalization this time, I'll do the log of the
| | 04:16 | market capitalization.
| | 04:17 | Just drag that in and
leave everything else the same.
| | 04:22 | And in this case, what's interesting
about it is that we still have outliers, but
| | 04:28 | this time they are symmetrically distributed,
| | 04:30 | in that we have outliers on the high end,
but we also have outliers on the low end.
| | 04:35 | And in fact, the
distribution is remarkably symmetrical.
| | 04:40 | It looks like it's spread out almost
exactly the same amount in each direction.
| | 04:44 | And you can see also that Apple is still
an outlier, but look how close it is now, for
| | 04:48 | instance, to Google, whereas before,
here's Apple and here's Google way down here.
| | 04:54 | So what we've done is we've taken an
extremely asymmetrical, skewed distribution and
| | 04:59 | by taking the logarithm, we've
pulled it in and made it symmetrical.
| | 05:04 | Now there are still outliers, but they
are on both sides and they're not terribly
| | 05:09 | far away like they were before.
| | 05:11 | And so we've taken a variable that
really we might not have been able to deal with
| | 05:15 | before, or would have had to cut an awful lot of
the scores to make it work, but now we
| | 05:20 | can actually leave all of the scores
in, we can use the entire data set, and
| | 05:24 | still come pretty close to meeting the
assumptions of most of these statistical procedures.
| | 05:29 | And so a logarithmic transformation in
this case was a huge help in making our
| | 05:34 | data meet the assumptions that we need
to make it more manageable for analysis.
|
|
4. Working with the Data File | Selecting cases| 00:00 | When you're doing an in-depth
investigation of your data, there are times when
| | 00:04 | you'll want to focus on just some of the cases,
| | 00:07 | for example, all of the men over 50
who visited your website, or clients with
| | 00:11 | outstanding payments, or people
under 16 who have taken the SAT.
| | 00:15 | Now, one way to deal with this is to
sort the data and then delete all the cases
| | 00:20 | that you don't want and
save it as a new data file.
| | 00:23 | This is an option, but it can get
cumbersome, and you do run the risk of
| | 00:26 | multiplying data files or
losing track of what you've got.
| | 00:29 | An easier way is to have SPSS select
the cases of interest, and when this
| | 00:34 | happens, the other cases are still in
the data set, but are temporarily excluded
| | 00:38 | from the procedures, and you can then
switch to different selection criteria or
| | 00:43 | you can return to the entire data set.
| | 00:44 | It's a more flexible and efficient way of
working with interesting subgroups in your data.
| | 00:49 | For this example I am going to be
using the data set Searches.sav, which is
| | 00:54 | information about Google
searches on a state-by-state basis.
| | 00:57 | The first several searches all have to
do with statistical topics, for instance
| | 01:02 | the SPSS Google search term or
regression, and then I have some social media
| | 01:06 | ones, and then I have some sports ones.
| | 01:09 | One that's interesting at the right end
of the data set--so I am going to scroll
| | 01:12 | over--is an indication of whether a
state has an outline for a high school
| | 01:17 | statistics class, and maybe I would
want to restrict my analyses temporarily to
| | 01:23 | states that have this to see, for
instance, if that's associated with their
| | 01:27 | Google search patterns for statistical topics.
| | 01:30 | So the way that I am going to do
this is I am going to select cases.
| | 01:33 | I go up to the Data menu, and then I
come down to the bottom to Select Cases.
| | 01:38 | And the dialog box gives me several options.
| | 01:40 | The first one is to simply include all
the cases, which is what I have right now.
| | 01:44 | The second one is If condition is
satisfied, and the idea here is, say, if they
| | 01:49 | have a score on this variable that is
equal to this, or maybe another one, I can
| | 01:52 | have more than one variable.
| | 01:54 | And this is what I am going to use.
| | 01:56 | I am going to say whether they
have the statistics education.
| | 01:59 | That's going to be statistics_ed = 1.
| | 02:02 | I will show that to you in just a second.
| | 02:04 | I also have an option of
using the random sample of cases.
| | 02:07 | If I have a large data set, sometimes
it's a good idea to try doing an analysis on a
| | 02:11 | small part of it, let's say 20% or 30%
or 40%, and then trying again with other
| | 02:17 | parts of the data to see if
the patterns I found hold there.
| | 02:21 | You can also look for a time, or case
range, for instance all the customers
| | 02:24 | from 2009 or from 2007.
| | 02:27 | And the last one, Use a filter variable,
what happens is when I do a selection,
| | 02:32 | SPSS automatically creates an
indicator variable at the end of the data set.
| | 02:36 | So if I have one already, this
simply gives me the option of using that
| | 02:39 | existing filter variable.
| | 02:41 | The section below that, Output, is
grayed out because I haven't done a selection
| | 02:46 | yet, so I can't use those options.
| | 02:48 | So what I am going to do right now is
I am going to go to select If condition
| | 02:51 | is satisfied, and then I click on the If box
to say what my criteria are for the selection.
| | 02:57 | What I want to use here is the variable
about whether a state has a high school
| | 03:03 | curriculum for statistics.
| | 03:04 | That's near the bottom of
the variable list on the left.
| | 03:07 | I can simply double-click on that
and it puts it up in the Selection box.
| | 03:11 | Now, my selection in this case is very easy.
| | 03:13 | This is a 0, 1 variable.
| | 03:15 | It's called a dichotomous indicator variable.
| | 03:17 | It only has two options. And I just want the 1s,
| | 03:20 | so I am going to type statistics_ed,
which is already there, and I am going to
| | 03:24 | add =1. Once I've got that, I can go to
the bottom and click Continue, and that
| | 03:29 | shows up in my If condition is
satisfied in the selection box.
| | 03:32 | Now, the options at the
bottom in Output show up.
| | 03:36 | The first one is to simply filter out
the unselected cases. It's the default.
| | 03:39 | It's what I am going to use here.
| | 03:40 | But I do have two other options that
allow me to change the data set. The second
| | 03:44 | one, Copy selected cases to a
new data set, does exactly that.
| | 03:46 | It creates a second data set.
| | 03:49 | I have to give a name for that data set.
| | 03:51 | And then if I want to work with just
that one, it can be easier. Or I can get
| | 03:56 | rid of the cases that I didn't select.
| | 03:58 | There may be situations in which I
want to do that. You can call that
| | 04:00 | destructive editing.
| | 04:01 | I usually just filter out the
unselected cases, but it's up to you.
| | 04:06 | So now that I have got my criteria
specified by what I am selecting and what I
| | 04:10 | am going to do with the
unselected cases, I simply press OK.
| | 04:13 | Now the output file shows me the
syntax statements that it has used to
| | 04:16 | create the selection.
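Those statements follow a standard pattern; roughly the following, with the labels treated as illustrative:

    USE ALL.
    COMPUTE filter_$=(statistics_ed = 1).
    VARIABLE LABELS filter_$ 'statistics_ed = 1 (FILTER)'.
    VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
    FORMATS filter_$ (f1.0).
    FILTER BY filter_$.
    EXECUTE.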
| | 04:17 | It doesn't show any charts here,
because we don't have them.
| | 04:19 | But if I go to the data file, you can
see that on the left the row numbers of a
| | 04:24 | lot of the cases are selected out,
because not too many states have a high
| | 04:27 | school statistics curriculum.
| | 04:29 | Also, on the right side you can see
there's a new variable there, Filter_$, that
| | 04:35 | says Selected or Not Selected.
| | 04:37 | That's a 0/1 variable.
| | 04:38 | If I turn off the variable labels with
the button on the menu bar, you can see
| | 04:42 | that those are 0s and 1s underneath,
but I will turn the labels back on now by
| | 04:46 | clicking on the Value Labels button.
| | 04:49 | So anything I do is going to
work only with the cases that I have
| | 04:52 | selected, which in this case are
states with a high school statistics
| | 04:56 | education curriculum.
| | 04:57 | I will make a box plot, for
example, of their SPSS searches.
| | 05:02 | I click on Graphs, to Chart Builder,
and then in the gallery on the bottom I
| | 05:08 | go to Boxplot, and I am simply going to drag
the one-dimensional box plot up into the canvas.
| | 05:14 | And from there, I drag in the
variable from the list that I want.
| | 05:18 | I am going to take SPSS and
drag that into the X axis.
| | 05:22 | Also, because I may have outliers
here, it's nice to have an ID to know
| | 05:27 | what states they are.
| | 05:28 | I can go down to the Group/Point ID
tab, I can select Point ID label on the
| | 05:33 | bottom, and then I need to drag in the
variable that provides the labels.
| | 05:38 | In this one it's the state code.
| | 05:40 | So I come up to the variable list
and drag the state code over, and now I
| | 05:45 | am ready. I click OK.
| | 05:47 | I first get a bunch more code
that's the syntax for what I have done.
| | 05:50 | There is the GGraph command that gives
the data set, and then here is the box plot.
| | 05:56 | This shows the distribution of Google
search patterns in terms of how common
| | 06:02 | that particular search is relative to
others for several different locations,
| | 06:06 | and you can see we have an outlier,
it's Washington, D.C. up at the top, and
| | 06:10 | they search for this term SPSS
much more than other states do.
| | 06:15 | So anyhow, what I have here is a
selection criteria, the ability to temporarily
| | 06:21 | or permanently select a subset of cases
for a more thorough analysis, and this is
| | 06:25 | a great feature of SPSS.
| | 06:27 | It lets you really dive into your
data and get the most out of it.
| | 06:30 | In the next movie we'll look at a
related procedure called Split File
| | 06:34 | that also lets you work with subsets,
but instead of reporting on just one
| | 06:38 | subgroup at a time, it gives the
results for all of them so you can make
| | 06:41 | comparisons between the subgroups.
| Using the Split File command| 00:00 | In the last movie, we took a look at
a really handy procedure for selecting
| | 00:04 | subgroups of your data for a more
focused analysis--that was the Select Cases
| | 00:09 | or filter variable.
| | 00:11 | In this movie, we will explore a related
procedure called Split File that also
| | 00:16 | breaks the data down by subgroups, but
unlike the Select Cases command, it then
| | 00:21 | gives you the results for all of the
subgroups, and it'll let you make explicit
| | 00:25 | comparisons between the groups,
which can be a really handy feature.
| | 00:29 | Now when we left the data set, I had some
of the cases selected and some of them not.
| | 00:35 | You can tell that this is the case
because, obviously, over on the left a bunch
| | 00:39 | of the rows are crossed out.
| | 00:40 | Also, you see that on the right end of
the data set, I have a variable called
| | 00:45 | filter_$, and we have Not Selected and Selected.
| | 00:49 | Also, at the very bottom right of the
screen you see that it says Filter On.
| | 00:54 | This is an indication that the filter,
the selection criterion, is active.
| | 00:59 | So before I go on to do a Split File,
I need to turn off the selection.
| | 01:05 | I go back up to the Data menu, to Select
Cases, and then at the top of the box I
| | 01:11 | simply click on All Cases.
| | 01:14 | I don't have to erase the criterion.
| | 01:16 | It's okay if it's still there and I press OK.
| | 01:19 | And then the output, it tells me that the
filter is off and I'm now using all the cases.
| | 01:23 | If I go back to the data, you can see
that none of the cases are crossed out and
| | 01:28 | that down here on the bottom-right
the Filter On is not there anymore.
| | 01:32 | The variable that created the filter is
still there if I want use it later, but
| | 01:37 | now I am going to create a Split
File where I can compare several groups.
| | 01:42 | To do this, I am going to go back up to
Data and I am going to go down to the bottom
| | 01:46 | to Split File, which is
right next to Select Cases.
| | 01:50 | In this dialog box, I have
three options for Split File.
| | 01:54 | The first one is Analyze all cases, do not
create groups. That's what I have now.
| | 01:59 | That's the default. The next two,
| | 02:01 | Compare groups and Organize output by
groups, determine how things will look if
| | 02:06 | I request several procedures, or
a procedure that has a lot of output.
| | 02:10 | The first one, Compare groups, puts the
results for each step right next to each other.
| | 02:15 | So for instance, if I have tables and
charts, the tables for group 1, then the
| | 02:20 | tables for group 2, then the chart for
group 1 and then the chart for group 2.
| | 02:24 | On the other hand, Organize output by
groups would do the tables and the charts
| | 02:28 | for group 1, then the tables
and the charts for group 2.
| | 02:31 | I'm going to use Compare groups in this
case. It's a personal preference. From
| | 02:35 | time to time, I might use the
other one, and it's up to your judgment.
| | 02:39 | I click on Compare groups and then I
choose the variable that I'm going to use
| | 02:43 | to split the groups.
| | 02:44 | In this one, I'm going to use
the region of the United States.
| | 02:49 | So I need to scroll down on my variable
list and if I make the box wider, you can
| | 02:53 | see, I have Census Bureau Region.
| | 02:56 | That's the label. The variable name is Region.
| | 02:58 | I will just double-click on that,
and there it is, in the Groups Based on box.
| | 03:04 | So I've got the criterion in there, and
by the way, you can put more than one
| | 03:08 | in there if you want to split it by two
variables, but then things get rather complicated.
| | 03:13 | So I'm just going to press OK now, and
now in the Results it tells me that it
| | 03:18 | has sorted the data file by its
region and that it's now going to split
| | 03:21 | things by the region.
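In syntax, the Compare groups choice corresponds to something like this sketch; SPLIT FILE OFF turns it back off later:

    SORT CASES BY Region.
    SPLIT FILE LAYERED BY Region.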
| | 03:23 | If I go back to the data set,
nothing is crossed out, because I'm using
| | 03:27 | everything, but you can see that
Region is sorted here in this column. And if
| | 03:31 | you go to the very bottom right of this
screen, you'll see that it says Split by region,
| | 03:36 | so I know that it's going to be
doing things separately for each group.
| | 03:41 | So what I'm going to do now is I am
going to request some information.
| | 03:45 | I am just going to do histograms.
| | 03:47 | I go to Graphs and to Chart Builder.
| | 03:51 | Now I am going to come down to Histogram.
| | 03:52 | I am going to drag the basic histogram up
into the canvas and then I select the variable.
| | 03:59 | I will use the SPSS Google Search.
| | 04:02 | So I click on that and drag it to
X axis in the canvas and from there, I
| | 04:07 | can simply Press OK.
| | 04:09 | And then in My Results what you
see is I have several histograms.
| | 04:13 | These are very large chunky ones
because there are not a lot of cases in them, but
| | 04:17 | this is for the first one. This is
for the Northeast region of the United
| | 04:20 | States. But this is for the Midwest, and
this has more bars because there's more cases.
| | 04:26 | If we come back up, you see there's only
nine states in the Northeast region, and
| | 04:31 | this one has 12, and we have
the South and then the West.
| | 04:37 | So what it's done is it's done a
procedure but it's done it separately for each
| | 04:42 | of these particular groups.
| | 04:43 | I can get much more complicated
procedures that we'll cover later in the
| | 04:47 | course and break them down by region or by
some other variable, or combination of variables.
| | 04:53 | So the Split File command, along with
the Select Cases command, is a great
| | 04:58 | way to focus on subgroups and get a
deeper understanding of your data, and by
| | 05:03 | comparing the results for one group
to the next, you can see whether the
| | 05:06 | patterns you find hold across groups
or whether you should dive even deeper
| | 05:10 | into your data.
| Merging files| 00:00 | When you are getting ready to analyze
your data, you may have the situation
| | 00:04 | where your data lives in more than one file.
| | 00:07 | Now, SPSS lets you have more than one
file opened, but in a number of procedures the
| | 00:12 | data needs to be in the exact same file.
| | 00:14 | Fortunately, SPSS has a command that
lets you combine data, either by adding new
| | 00:20 | cases that have the same variables or
by adding more variables for the existing
| | 00:26 | cases, and in this movie I am going
to show you how to do both of these.
| | 00:31 | I am beginning with a
data set that's called Search1.sav.
| | 00:36 | This is simply the top-left quadrant of the
data file that we used in the last two movies.
| | 00:42 | I have information on a number of
states about Google search patterns.
| | 00:46 | What I am going to do though, is if you
scroll down, you can see that I only have
| | 00:51 | data through Montana.
| | 00:53 | I have 27 cases here.
| | 00:55 | I want to add the remaining states
using the same variables, and what I have is
| | 01:00 | another data file that has all the
same variables in the same order but has
| | 01:05 | the remaining states.
| | 01:07 | To do that, I come up to Data and I
come down about halfway to Merge Files and
| | 01:14 | this is where it asks me if I want to
add cases--that's more observations with
| | 01:18 | the same variables--or whether I want
to add variables for the same cases.
| | 01:23 | I am going to do both, but on this one
| | 01:24 | I am going to add cases.
| | 01:27 | Now, you can do this with either a data
file that's currently opened--that's the
| | 01:32 | top one, an open data set, but that's
grayed out because I don't have another
| | 01:35 | data set opened right now--or you
can use an external SPSS data file.
| | 01:40 | I have that other data file.
| | 01:42 | It's saved in the folder, and I am just
going to open it up by clicking Browse.
| | 01:46 | This one is just called Search2.
| | 01:48 | I am going to double-click on that and
then the full path shows up right here,
| | 01:54 | and I am just going to click Continue, and so
what it does now is it brings up a dialog box.
| | 01:59 | It attempts to pair the variables
by whether they come from the active
| | 02:03 | data set or from the one that I am
opening, but since I have the exact same
| | 02:07 | variables in both of them,
| | 02:08 | everything is paired up in the two of them.
| | 02:10 | I can scroll down the list and you
see that all the same variables occur.
| | 02:15 | If I wanted to, I can select
Indicate case source as a variable.
| | 02:20 | That's at the bottom of the list.
| | 02:22 | What this would do is it would add a
new variable to the data set, and it would
| | 02:28 | indicate whether the cases came from
the first data set or the cases came
| | 02:32 | from the second data set, and it's
a way of keeping things straight.
| | 02:36 | I don't need it in this case because
there is no overlap and there will be no
| | 02:38 | confusion between the two of them.
| | 02:40 | I am just going to press OK, and I get the
syntax and the results that say it is adding cases.
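The generated syntax for adding cases is essentially the following sketch, with the full path shortened to just the file name:

    ADD FILES /FILE=*
      /FILE='Search2.sav'.
    EXECUTE.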
| | 02:48 | I go back to the data set.
| | 02:49 | Previously, I only went through Montana,
and now you can see that I have added
| | 02:53 | Nebraska all the way down to Wyoming.
| | 02:57 | Now, I have the same variables in the
same order. Now I just have more cases.
| | 03:01 | On the other hand, maybe I have the
cases I want but I want to add more
| | 03:05 | variables, more information about them.
| | 03:08 | What I have right now is
just Google's search history.
| | 03:11 | I can scroll through, and all of
these end with _GS to indicate these are
| | 03:15 | Google Search patterns.
| | 03:16 | But I have other information about
each state that would be useful in
| | 03:20 | analyzing these patterns.
| | 03:22 | So what I am going to do now is I am
going to add new variables to the data set.
| | 03:26 | I go back to where I was before, I go up
to Data, come down again to Merge Files,
| | 03:33 | except this time I select
the second option, Add Variables.
| | 03:38 | Again, I have the option of using an
open data set, but the one I have isn't open,
| | 03:44 | or an external data set.
| | 03:46 | Mine are saved in an external data set,
so I am going to click on Browse and I am
| | 03:50 | going to use Search3.
| | 03:52 | I will just double-click on that.
| | 03:54 | There it is and I click Continue.
| | 03:58 | Now, it's bringing up the data set. There is
one variable that is excluded and it's state.
| | 04:04 | Now, that's the key variable that I used in
both of them as a way of lining things up.
| | 04:09 | You can see for instance that it has
state and then it has a plus in parenthesis.
| | 04:14 | That tells me that it's from
the new data set that I am adding.
| | 04:17 | So it would be redundant;
we don't need it again.
| | 04:21 | All I am going to do now is click OK and it
tells me that it's adding a bunch of new variables.
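Adding variables uses MATCH FILES rather than ADD FILES; a sketch of the same step, again with the path shortened. When cases are matched on a key variable such as state, a /BY state subcommand is added:

    MATCH FILES /FILE=*
      /FILE='Search3.sav'.
    EXECUTE.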
| | 04:26 | I go back to the data set, and previously
we stopped with the Google Searches, the _GS,
| | 04:34 | but now you can see I have
added several new variables--
| | 04:37 | I am going to scroll through them--
from has_NFL, whether a state has an NFL
| | 04:42 | team, through Division.
| | 04:44 | And so what I have done is in the
first example I added new cases to the
| | 04:48 | data set, I added new states, and in the
second example I added new variables.
| | 04:53 | And what this does is it takes three
separate data files and combines them into
| | 04:57 | one, which lets me do more analyses--
compare the relationships between the
| | 05:02 | variables--than I would be able to do otherwise.
| | 05:04 | Now, the data may have been spread out
across several sources, typically in many
| | 05:08 | different locally stored spreadsheets
in an organization, and by merging the
| | 05:13 | cases or the variables, you're able to
get in a much more productive situation of
| | 05:18 | having all of your data in one place.
| | 05:21 | When you have that then it's much
easier to break things down to compare the
| | 05:25 | groups and to examine trends and outcomes.
| | 05:28 | All of these can give you a much
more powerful insight into your data.
| Using the Multiple Response command| 00:00 | It's usually a good idea to enter
your data in its least processed and
| | 00:05 | most disaggregated form,
| | 00:08 | that is, put the raw data in and any
processing you need to do, do in SPSS.
| | 00:13 | That way you can combine things if you
want. On the other hand, if you bring the
| | 00:18 | data into SPSS in an aggregated or
combined or summary form, then you can't
| | 00:23 | break it down later.
| | 00:25 | Now one way of dealing with data that
you want to aggregate, as long as you
| | 00:29 | are dealing with nominal or categorical
variables, is with the Multiple Response function.
| | 00:35 | It's one of the neat tricks in SPSS.
| | 00:37 | This function combines the responses
from several variables and allows you
| | 00:42 | to create frequency tables and cross
tabulations as though they were a single variable.
| | 00:48 | In many circumstances,
this can make life much easier.
| | 00:51 | The first thing to say here is that
you can organize the data in a couple of
| | 00:55 | different ways, and Multiple
Response can deal with either one of them.
| | 00:59 | In this data set, Tickets.sav, I have
hypothetical data about the purchase of
| | 01:06 | season tickets to seven
different kinds of events.
| | 01:09 | I have Baseball and Basketball and
Football as well as the Symphony, the Opera,
| | 01:15 | the Theatre, and the Ballet.
| | 01:17 | And the idea here is we might want to
look at what kinds of season tickets
| | 01:21 | people have, how many they have, and
whether there is, for instance, a difference
| | 01:25 | in the gender and the age and the
overall preferences of the buyer. And again,
| | 01:30 | this is hypothetical data.
| | 01:32 | I have it set up first where I have
each possible event, the three sports and
| | 01:38 | the four cultural events, as indicator variables.
| | 01:41 | So you see here for Baseball we have Yeses
and Nos for whether a person has season
| | 01:46 | tickets to Baseball, and
then to Basketball and Football.
| | 01:50 | Then I have a column that adds up how many
sports events they have season tickets to.
| | 01:54 | The first person has season tickets
to two sporting events, Baseball and
| | 01:58 | Football. The second person has none.
| | 02:01 | And then I have four cultural events.
| | 02:03 | I am going to scroll over a little bit, so you
can see all of it, and I do a similar thing.
| | 02:08 | I add up how many cultural tickets
people have. Then I also have another one,
| | 02:13 | combining both the sports and the
cultural, how many season tickets they have
| | 02:16 | all together. I am being a little
optimistic, but this is how that works.
| | 02:20 | So this is a series of what are
called dichotomous indicator variables.
| | 02:24 | Dichotomous means just two possible
values: yes/no, male/female. And an
| | 02:30 | indicator variable is a 0/1
variable, where 0 is no and 1 is yes.
| | 02:35 | In fact, if I go up to the menu bar and
click on this button for Value Labels,
| | 02:42 | you'll see the 0s and the
1s that are underneath these.
| | 02:44 | I put the Value Labels back on,
| | 02:46 | you can see the Yeses and the Nos.
| | 02:49 | So the indicator variables are one way:
| | 02:51 | I list every possible choice and I
put down a Yes or No for each person.
| | 02:56 | The other way of organizing multiple
response data is by simply having a
| | 03:01 | variable for the maximum number
of choices that a person can have.
| | 03:04 | Now in this hypothetical data set
nobody had more than four sets of season
| | 03:09 | tickets, and so what I have is Tix1, 2, 3
and 4, by whether they have season tickets.
| | 03:16 | There are seven options for each one
of these, and I simply put down the first
| | 03:20 | one, the second one, and if that's
all they have, I put 0s for the rest.
| | 03:24 | You can see actually I have some
people who have no season tickets at all,
| | 03:27 | down about case 16.
| | 03:30 | This is a way that people often do
coding, especially if it's open ended:
| | 03:34 | write down all of your feelings or
your responses to a particular question.
| | 03:38 | But I'll let you know right now, this
kind right here, the Tix1 through 4 where
| | 03:43 | we can have any of the categories
in any of the columns, can get
| | 03:47 | extremely cumbersome.
| | 03:48 | In my experience the indicator
variables, even though we have to have more of
| | 03:52 | them, are more amenable to adding
things up and to doing other analyses.
| | 03:57 | Now with that in mind let me show you
how to set up a Multiple Response format.
| | 04:02 | The first thing you have to do is
define what are called variable sets, the
| | 04:06 | variables that should be treated
as instances of a single category.
| | 04:11 | You go up to Analyze and then you go
down near the bottom to Multiple Response
| | 04:16 | and Define Variable Sets.
| | 04:17 | You'll see I have two other options
beneath that, Frequencies and Crosstabs. They
| | 04:22 | are not available yet,
because I haven't defined any sets.
| | 04:25 | I click on that, and I
am going to do this twice.
| | 04:28 | I am going to do once with the
indicator variables--that's the 0, 1, yes, no
| | 04:32 | variables--and another one with the
multiple choice ones, the four columns for
| | 04:37 | the four kinds of tickets people have.
| | 04:39 | So what I do is I first scroll down
here and I'll pick the three sporting
| | 04:45 | events and put those over here, and
then I'll click the four cultural events,
| | 04:51 | and I'll put those over.
| | 04:52 | And then what it does is it asks me
whether these are dichotomies--that's the
| | 04:56 | 0, 1 for instance--or whether they are
categories, where it's the 1 through 7.
| | 05:01 | This part is the dichotomies.
| | 05:03 | And it says, which one counts as a yes,
because it might be 0, 1, but it might
| | 05:08 | be 1, 2, or something else.
| | 05:09 | I just have to indicate that
it's the 1 that counts as a yes.
| | 05:13 | And then I have to give a name to the
Multiple Response set, and what I am going
| | 05:18 | to call it here is TixDichotomies,
Dichotomous Variables for ticket purchases.
| | 05:29 | And then I click on Add over on the right.
| | 05:33 | And so what this does is it
creates a Multiple Response Set.
| | 05:37 | It's $TixDichotomies. This won't
show up in the data set because this is
| | 05:42 | more like metadata.
| | 05:44 | It's information about the data set that
the computer saves. So I have done this,
| | 05:48 | and I can press Close now.
| | 05:50 | You see the data set does not look
different, but if I now come up to Analyze and
| | 05:57 | back down to Multiple Response,
I now have these two other options of
| | 06:01 | Frequencies and Crosstabs available.
| | 06:04 | What I can do for instance is I can
click on Frequencies, and there is the
| | 06:08 | Multiple Response Set that I just created.
| | 06:10 | All I do is I move it over and I press OK.
| | 06:16 | And I get a table that says, how many
people had purchased each kind of ticket?
| | 06:21 | Now this is the same
thing as the 0, 1 indicator.
| | 06:24 | It's simply telling me how many
people had basketball tickets, how many
| | 06:28 | people had opera tickets.
| | 06:30 | So this is one way of doing it.
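A syntax sketch of the same request (an illustration, not the exact paste from the dialog): it assumes the seven indicator variables are named Baseball through Ballet and coded 0 = No, 1 = Yes, and it uses a shortened set name, $TixDich, in place of $TixDichotomies.

  * Count the value 1 (Yes) across the seven indicator variables (names assumed).
  MULT RESPONSE
    GROUPS=$TixDich 'Ticket purchases'
      (Baseball Basketball Football Symphony Opera Theatre Ballet (1))
    /FREQUENCIES=$TixDich.

The (1) after the variable list is the counted value, the code that counts as a yes.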
| | 06:32 | I can also do cross-tabulations.
| | 06:35 | If I go back to Analyze, to Multiple
Response, to Crosstabs, I can say that I
| | 06:40 | want to look, for instance, at whether
there are gender differences in these.
| | 06:45 | And I can put the Multiple Response
Variable in the Column(s) and gender up here.
| | 06:49 | However, I have to define the gender
variable. I'll define the range and I
| | 06:53 | simply tell it that I have
0s and 1s. Press Continue.
| | 06:58 | Then I can click OK, and this
is called a cross-tabulation.
| | 07:02 | It lets me know the number of men and
women who have season tickets of each kind.
| | 07:05 | We'll go back to crosstabs in a later
movie, but I just wanted you to see that
| | 07:10 | there is an option with
the Multiple Response Set.
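The gender crosstab can be sketched in syntax the same way, again with assumed variable names and with Gender coded 0 and 1 as in the movie; MULT RESPONSE repeats the GROUPS definition inside the command.

  * Cross-tabulate gender against the multiple response set of ticket purchases.
  MULT RESPONSE
    GROUPS=$TixDich 'Ticket purchases'
      (Baseball Basketball Football Symphony Opera Theatre Ballet (1))
    /VARIABLES=Gender(0,1)
    /TABLES=Gender BY $TixDich.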
| | 07:13 | Now, I can also do multiple responses
with the other kind, where I have it open ended
| | 07:18 | and people can put anything for the
first set of tickets they have, the second
set, and so on. Let's look back at the data set.
| | 07:24 | That's these four at the end.
| | 07:25 | I only need four, because four is
the most that anybody purchased.
| | 07:29 | To do this one I come back to Analyze,
back down to Multiple Response, and I am
| | 07:34 | going to define a new variable set.
| | 07:37 | This time I scroll down and I select
these last four, First Season Ticket
| | 07:42 | through Fourth Season Ticket, and
then move those over to Variables in Set.
| | 07:48 | In this case, they are not dichotomies;
they are categories. And I need to tell it
| | 07:52 | the range. There were seven possible
choices, so I need to say it goes from 1 to
| | 07:58 | 7. Then I need to give it a name.
| | 08:01 | Now the last one was
TixDichotomies. I might as well call this one
| | 08:04 | TixCategories. Ticket Categories, this
would be my label, and then I click Add.
| | 08:13 | So that shows up as another response set.
| | 08:16 | I click Close and I can do the frequencies
and the crosstabs again using it this way.
| | 08:22 | So I come back up to Analyze, to
Multiple Response, to Frequencies.
| | 08:29 | Now I used the dichotomies the last time.
I'll just double-click and get that out of there.
| | 08:34 | I'll use the Categories this time and hit OK,
and you see I get the same kind of information.
| | 08:41 | It's just the data was organized differently.
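The category-style columns can be handled the same way; a sketch assuming the four columns are named Tix1 through Tix4 and use codes 1 through 7 for the seven kinds of tickets.

  * Treat Tix1 to Tix4 as one categorical multiple response set with codes 1 through 7.
  MULT RESPONSE
    GROUPS=$TixCats 'Ticket categories' (Tix1 Tix2 Tix3 Tix4 (1,7))
    /FREQUENCIES=$TixCats.

Here (1,7) gives the range of category codes rather than a single counted value.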
| | 08:45 | I can also do the crosstabs the same way.
| | 08:47 | Going up to Analyze, to Multiple Response,
to Crosstabs, so this time I take out
| | 08:53 | the Dichotomies and I put in the Categories.
| | 08:58 | Now I get the same output either way,
which will make it seem that these two
| | 09:02 | methods of creating multiple response
sets are equivalent; however, I'll let you
| | 09:07 | know there is a trade-off.
| | 09:08 | For the Multiple Response set that's
created on the categories, that is, with these
| | 09:12 | multiple choice ones, where
people could put any of the answers,
| | 09:16 | about the only way to use those variables
is with Multiple Response sets, and they
| | 09:21 | are very limited in their application.
| | 09:23 | On the other hand, if you do the
indicator variables, which I had over to the
| | 09:27 | left, these are much more flexible, and
they can be used in other procedures like
| | 09:33 | getting correlations and regression
that we'll do later, which is why I almost
| | 09:38 | always use the indicator variables,
the 0, 1 variables for each choice.
| | 09:42 | The only trouble is if you had,
for instance, a lot of possible responses. You
| | 09:47 | could end up with a huge number of
indicator variables, whereas you would only need a
smaller number of these category columns.
| | 09:55 | On the other hand, if you really have
that many choices, you might be wise to
| | 10:00 | collapse your categories and combine them.
| | 10:02 | Anyhow, the Multiple Response
function in SPSS can be a nice way of dealing
| | 10:07 | with situations where people can choose or
write in more than one answer to a question.
| | 10:12 | The procedure is flexible because
it can use dichotomous indicator
| | 10:16 | variables, that's the 0, 1, for each
possible choice, or a smaller number of
| | 10:21 | categorical variables with
several choices for each.
| | 10:24 | However, the procedure does limit you to
doing just frequencies or crosstabs with
| | 10:29 | other nominal or ordinal variables.
For these reasons I generally recommend that you
| | 10:33 | use the dichotomous indicator variables.
| | 10:35 | But for now the Multiple Response
function is an important tool in your
| | 10:39 | collection for data-analysis strategies.
| | Collapse this transcript |
|
|
5. Descriptive Statistics for One VariableCalculating frequencies| 00:00 | One of the most general commands for
getting descriptive statistics in SPSS, and
| | 00:05 | my personal favorite, is the
Frequencies command in the Analyze menu.
| | 00:09 | This is a great way to get all of the
common descriptive statistics you might
| | 00:13 | want, such as the mean, the standard
deviation and the quartiles--that includes
| | 00:18 | the minimum, the median and the maximum--
for several variables at once, and to
| | 00:22 | get simple charts such as
histograms or bar charts at the same time.
| | 00:27 | I view it as SPSS's one-stop shopping
center for basic statistics for almost
| | 00:32 | any kind of variable.
| | 00:34 | For this example, I'm going to
be using the NASDAQ data set.
| | 00:38 | This is information about all 2,800
stocks listed on the NASDAQ Stock
| | 00:44 | Exchange, and I'm going to be
gathering some descriptive statistics about a
| | 00:49 | few of these variables.
| | 00:50 | The information includes the LastSale--
that's how much shares went for at the time
| | 00:54 | that I gathered this data--the market
capitalization of each company, as well as their sector.
| | 01:00 | And what I'm going to do is I'm going
to come up to Analyze, to Descriptive
| | 01:05 | Statistics, to Frequencies, the very first one.
| | 01:08 | Now the Frequencies command is
associated for a lot of people with just
| | 01:11 | categorical variables, because it
gives frequency tables, how common each
| | 01:16 | particular answer is, and
it's well suited to this,
| | 01:19 | but it is also very well suited
to dealing with scale variables.
| | 01:23 | I'm going to begin with a categorical
variable, because that's the most familiar for people.
| | 01:28 | The variable that I'm going to use in
this case is called Sector Code, so I'm
| | 01:31 | just going to come down here to
SectorCode, select that, and move it over to the
| | 01:36 | Variable list on the right.
| | 01:38 | Now by default it's going to give me a
Frequency table, but I can ask it for
| | 01:42 | a few other things.
| | 01:44 | With a categorical variable like SectorCode,
the most important would be a bar chart.
| | 01:49 | And if I come right over here to Charts,
I can ask it to make a bar chart and
| | 01:54 | just press Continue, and then I press OK.
| | 01:58 | And what I have here is it tells me
that it's gotten statistics for 2,820 cases.
| | 02:04 | There's no missing data, and this first
one is the frequency table that comes by
| | 02:08 | default, and what it has is the name of
each of the categories under Sector, from
| | 02:13 | Basic Industries through Transportation.
| | 02:16 | Then it has the frequency,
| | 02:18 | that is, the number of companies
that fall into each of those categories.
| | 02:21 | For instance, 133 of these had no
SectorCode listed, but under Healthcare, 234
| | 02:28 | companies were listed.
| | 02:30 | The next one is the Percent,
| | 02:31 | that is, what percentage of all
the cases fall into each one.
| | 02:35 | So Capital Goods, which had a
Frequency of 204, that accounts for 7.2% of the
| | 02:42 | companies in the NASDAQ Index.
| | 02:44 | Now the next one, Valid Percent, is
the same because we have no missing data,
| | 02:50 | but say for instance, that half
of the companies were missing data.
| | 02:53 | There was no response at all under SectorCode.
| | 02:57 | Then instead of Basic Industries being
2.8%, it would be 5.6%, because the valid
| | 03:04 | percent excludes the missing cases,
or the cases that are missing on that
| | 03:09 | particular variable.
| | 03:11 | The Cumulative Percent simply takes the
Valid Percent and adds it on as it goes.
| | 03:16 | So it finishes with 100% by the time
it gets to the last valid category.
| | 03:21 | So that's the frequency table.
| | 03:23 | The next thing is I asked
it to produce a bar chart.
| | 03:28 | Now this is a bar chart that is
produced as a sort of supplementary feature of
| | 03:33 | the Frequencies command, and I would
probably want to go through and edit it to
| | 03:38 | sort them from the most
common sector to the least common.
| | 03:42 | So Finance would be first and it looks
like Transportation would be the last.
| | 03:47 | I might flip it sideways so
it would be easier to read,
| | 03:50 | but those are the things that we
covered in the section on creating bar charts
| | 03:54 | as univariate charts.
| | 03:55 | But this is a very simple way to
get a lot of good information about a
| | 03:59 | categorical variable.
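For reference, the syntax equivalent of this categorical run is short; a sketch using the SectorCode variable from the movie.

  * Frequency table plus a bar chart of counts for the sector variable.
  FREQUENCIES VARIABLES=SectorCode
    /BARCHART FREQ
    /ORDER=ANALYSIS.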
| | 04:00 | Next, what I'm going to show you is
how to use the Frequencies command to
| | 04:04 | get information about a scaled
variable, something that people don't use that
| | 04:08 | often for that purpose.
| | 04:10 | I come back up to Analyze and I come
to Descriptive Statistics, again to
| | 04:15 | Frequencies, except this time I'm
going to reset it, and I'm going to pick
| | 04:20 | two scaled variables.
| | 04:22 | I'm going to pick LastSale--
| | 04:24 | that's the price of the individual
stock shares the day before I gathered the
| | 04:27 | data, and the market capitalization.
| | 04:30 | So I just double-click to move
both of those over, and then I can ask
| | 04:34 | for certain statistics.
| | 04:37 | There's a few that are really helpful.
| | 04:38 | Number one is the Mean, the average.
| | 04:41 | I also like to get the standard
deviation, which is an indication of how
| | 04:45 | spread out the scores are.
| | 04:47 | The mean and the standard deviation
are very common statistics, although they
| | 04:51 | both work best with bell curves, and
I happen to know that both of these variables
| | 04:56 | are very skewed, and that's one reason
why I also want to use what are called
| | 05:01 | percentile- or quartile-based measures,
| | 05:04 | that is, the minimum and the maximum
and then the 25th percentile, the median,
| | 05:09 | the 50th percentile and the 75th percentile,
also called quartiles, all the way through.
| | 05:15 | Now if I wanted to, I sometimes
could get information about skewness and
| | 05:19 | kurtosis, which are indications of how
closely the data fit a bell curve, or
| | 05:25 | normal distribution, but I'm
not going to do that right now.
| | 05:27 | So all I'm going to do now is
I'm going to click Continue.
| | 05:30 | Now because I have scaled variables, it
can also be nice to get a histogram.
| | 05:36 | And so I go up to Charts and I click Histogram.
| | 05:40 | I could show the normal
curve, what it should look like--
| | 05:42 | I'll undo that, that being more
for humor here--and click Continue.
| | 05:47 | Now there's one more thing I want to do here.
| | 05:50 | When I come back to this list you
see that the Display frequency tables,
| | 05:54 | which is below the Variable list,
is checked. That's by default in the
| | 05:58 | Frequencies command.
| | 05:59 | However, because all 2,800 companies have
different market capitalization values,
| | 06:04 | this will give me a list of 2,800
different values. I don't want that.
| | 06:09 | I'm using summary statistics to avoid that,
| | 06:12 | so what I'm going to do is
I'm going to uncheck that.
| | 06:14 | When I'm using the scale variable,
I usually don't want the Frequency table.
| | 06:18 | And now I can click OK.
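The scale-variable version of the command, as configured above, looks roughly like this sketch; the quartiles could equally be requested with /NTILES=4.

  * Summary statistics, quartiles, and histograms; suppress the long frequency table.
  FREQUENCIES VARIABLES=LastSale MarketCap
    /FORMAT=NOTABLE
    /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
    /PERCENTILES=25 50 75
    /HISTOGRAM
    /ORDER=ANALYSIS.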
| | 06:21 | And what I get here are a
couple of different things.
| | 06:23 | First off, I get a table of statistics
that lists each variable as a column.
| | 06:29 | So the first column is LastSale, the
second column is Market Capitalization, and
| | 06:34 | then each row is the various statistics
that it gathered, from the valid and how
| | 06:38 | many cases have values for that
particular statistic, to the mean and standard
| | 06:43 | deviation, then to these quartile-
based statistics. And then from these for
| | 06:47 | instance, I can see that the average
value of a share on the NASDAQ was $18.72.
| | 06:55 | I can also see that the minimum is
$0.01, at which point I think they
| | 06:59 | drop off the market.
| | 07:02 | Below those tables I have histograms.
| | 07:05 | This is the value of a share in a
particular stock, and what you can see is
| | 07:11 | everything is bunched up really low.
| | 07:13 | Most stocks have prices that
are, for instance, below $50.
| | 07:18 | And in fact, if I go back up to the table,
I can see that 75% of the stocks have
| | 07:24 | values that are less than $23.61,
but some of them, the maximum, get huge.
| | 07:31 | The maximum price for a stock on the NASDAQ
is $1,132, which is why when we come down here
| | 07:40 | we see that the scale
goes all the way up to $1200.
| | 07:41 | There is one very high
outlier sticking out up there.
| | 07:46 | I also have a histogram for market
capitalization, and again, we know from
| | 07:50 | before that this goes up to $300
billion and so most of the companies are stuck
| | 07:55 | right there in the very first bar, at a
very low level of market capitalization,
| | 08:00 | but there are a few that go up very, very high.
| | 08:02 | What these histograms do is they
give me an indication that we have some
| | 08:06 | extraordinary outliers, and together
with the table they also give me
| | 08:12 | an idea of how I can describe those outliers.
| | 08:15 | And so I think this demonstration shows
how flexible the Frequencies command is
| | 08:20 | and why it's one of my favorite
procedures, especially because it works with
| | 08:24 | both categorical and scale variables.
| | 08:27 | It gives percentile statistics.
| | 08:29 | It can do frequency tables.
| | 08:31 | It can do charts at the same time.
| | 08:33 | This makes it my first stop when
getting the fundamental statistics for my
| | 08:37 | data, and I'm sure you'll find it
especially useful for your data and your
| | 08:41 | analyses too.
| | Collapse this transcript |
| Calculating descriptives| 00:00 | One of the first steps in any data
analysis is to thoroughly investigate each of
| | 00:05 | your variables one at a time,
| | 00:07 | that is, to get univariate analyses.
| | 00:10 | I've already described one procedure for
getting univariate information with the
| | 00:14 | frequencies procedure, and that works
for both categorical and scale variables.
| | 00:19 | Another important option for univariate
statistics is the Descriptives command.
| | 00:24 | This command and the Frequencies command do
a lot of the same things but there are
| | 00:28 | some important differences. The most
significant is that Frequencies can work
| | 00:33 | with categorical variables and scale
variables, but Descriptives works only
| | 00:38 | with scale variables.
| | 00:39 | In this movie, I will highlight the
similarities as well as point out some of
| | 00:43 | the unique advantages of
the Descriptives command.
| | 00:46 | For this example I will be using the
same data set that I used in the last one.
| | 00:50 | That's information about the
stocks in the NASDAQ index, NASDAQ.sav.
| | 00:55 | To get the descriptives, I go up to Analyze,
to Descriptive Statistics, to Descriptives.
| | 01:02 | From here, I select the variables that I want.
| | 01:04 | You will notice it doesn't
list all of the variables.
| | 01:07 | It only lists the ones that are numeric.
| | 01:10 | The symbol and the name variables, as
well as industry, are text variables and
| | 01:16 | they are categorical and it
simply doesn't list them here.
| | 01:19 | So I am going to take the two that I
used in the last example. That's LastSale--
| | 01:25 | so I am just going to click to move
that over to the right--and MarketCap--
| | 01:29 | I am also going to move that over to the right.
| | 01:31 | Then what I can do is I can get options
where I select the statistics that I want.
| | 01:39 | Now, by default, the descriptives gives
me the mean, the standard deviation, the
| | 01:44 | minimum and the maximum,
and these are a good list.
| | 01:47 | I can also get Kurtosis and Skewness if I want.
| | 01:50 | What's important though is I cannot
get the quartiles. I can't get the 1st
| | 01:55 | quartile, or 25th percentile score,
I can't get the 3rd quartile, or 75th
| | 02:00 | percentile score, and I can't get the
median, and for a skewed distribution those
| | 02:04 | are important statistics.
| | 02:06 | So that is one reason to sometimes
use the Frequencies command over the
| | 02:11 | descriptives, is if you need
the median and the quartiles.
| | 02:14 | But I will just click Continue.
| | 02:16 | Now, I have another option here.
You will see at the bottom-left
| | 02:19 | it says Save standardized values as variables.
| | 02:22 | This is one of the big perks
of the descriptives command.
| | 02:26 | If you want to take a variable that
is in some metric like dollars or an
| | 02:30 | arbitrary metric that may be a
foreign currency you're not familiar with,
| | 02:35 | sometimes you want to save things as
standardized variables. That makes it so
| | 02:38 | that the mean is 0 and the standard
deviation is 1, and the individual cases
| | 02:43 | get scores that indicate how many standard
deviations above or below the mean they are.
| | 02:48 | These are also called Z scores.
| | 02:50 | I have seen people
demonstrate how to do this manually by
| | 02:54 | calculating everything.
| | 02:55 | That's very tedious.
| | 02:57 | The descriptives gives you a one-stop way
of doing this: you simply click the box
| | 03:01 | and it will add standardized values
for these things. And we can see how many
| | 03:06 | standard deviations above or below the
mean some of the companies are on these
| | 03:11 | items for last sale and market capitalization.
| | 03:13 | Now, there is also an option here for Bootstrap.
| | 03:16 | I am not going to get into that one
because the Bootstrap is an add-in feature
| | 03:20 | that you pay extra for in SPSS.
| | 03:22 | I am just going to deal with
the ones that come standard.
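As a syntax sketch, the Descriptives run with the standardized-values box checked corresponds to something like the following.

  * Default statistics for the two scale variables; /SAVE adds Z-score versions such as ZLastSale.
  DESCRIPTIVES VARIABLES=LastSale MarketCap
    /SAVE
    /STATISTICS=MEAN STDDEV MIN MAX.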
| | 03:25 | So now I can press OK and what I get
is a small table. In the Frequencies
| | 03:31 | command, the variables were listed as
columns across the top and the statistics
| | 03:35 | were listed as rows down the side,
but in Descriptives it's flipped around.
| | 03:39 | But what I have here is the number
of cases that I have information on.
| | 03:44 | So for last sale I have 2817
companies with information on that.
| | 03:48 | The minimum value is $0.01,
the maximum value is $1,132, the mean is
| | 03:58 | $18.7, and the standard deviation is $34.65,
and then you have similar statistics for
| | 04:04 | market capitalization.
| | 04:06 | Now, an interesting trick is if we go
back to the data set, and you see that we
| | 04:10 | have two new columns here at the end,
| | 04:13 | ZLastSale for the Z score or
standardized value, where you can see that most of
| | 04:19 | the scores are close to 0 or 1.
We do have a major outlier at 9.99.
| | 04:24 | That's nearly 10 standard
deviations above the mean.
| | 04:28 | That's Apple Computer, whose stock
costs about 10 times as much as most others.
| | 04:33 | And then we have Z market capitalization.
| | 04:36 | That's, again, a Z score, how many standard
deviations above or below, and then Apple
| | 04:41 | is again 32 standard deviations
above the mean on this particular one.
| | 04:47 | Hopefully, from all of this you can
see that the Descriptives command is a really
| | 04:51 | useful way of getting a variety of
univariate statistics for your data. Like the
| | 04:56 | Frequencies command,
| | 04:57 | it can give the mean, the standard
deviation, minimum, maximum, and other statistics.
| | 05:02 | It can give you the standardized scores,
which the Frequencies command can't do.
| | 05:06 | On the other hand, Frequencies can
give the percentile statistics like the
| | 05:10 | quartiles and the median.
| | 05:12 | It can give the mode, it can give
frequency tables and charts, and it can work
| | 05:16 | with string variables and categorical variables.
| | 05:19 | Now, for these reasons I generally
prefer to use the Frequencies command, but
| | 05:23 | either one will get you a very long
way towards a sound understanding of your
| | 05:27 | data and a solid
foundation for further analysis.
| | Collapse this transcript |
| Using the Explore command| 00:00 | SPSS has a number of really wonderful
tools for helping you to get an in-depth
| | 00:05 | understanding of your data.
| | 00:07 | We've already looked at the Frequencies
and Descriptives commands, which can give
| | 00:11 | you nearly everything you
need under normal circumstances.
| | 00:15 | However, there are times when you need
to look at things even more closely and
| | 00:19 | this is where SPSS's Explore command
comes in, with more ways to look at
| | 00:24 | univariate statistics than you can
shake a stick at, and let's look at some of
| | 00:28 | those possibilities.
| | 00:30 | To get to the Explore command you go up to
the Analyze menu, to Descriptives, to Explore.
| | 00:37 | What you have here is a list of all the
variables, both categorical and scale on
| | 00:42 | the side, and a number of options here.
| | 00:45 | What we are going to do is take the
variables that we want and put them in
| | 00:48 | the Dependent list.
| | 00:49 | Now the term Dependent here means
dependent variable, or an outcome variable, or
| | 00:55 | the variables that you want statistics on.
| | 00:58 | In this case, I'll use the same
ones that I used in the last movies.
| | 01:02 | I'll use LastSale and I will use MarketCap.
| | 01:08 | Now Factor List is in case I
want to break down the list.
| | 01:12 | For instance, if I wanted to do
LastSale and MarketCap by different sectors.
| | 01:17 | I could do that, but there are 12
different sectors and at the moment I
| | 01:21 | don't feel a need for it.
| | 01:23 | I can also label the cases, and this can
be handy because this will give me some
| | 01:27 | charts that show outliers, and in fact
I'm going to do that by coming up and
| | 01:32 | getting a stock symbol and putting
that down there. Then I want to go through
| | 01:37 | some of the options over here.
| | 01:39 | I can choose what
statistics Explore gives to me.
| | 01:44 | I click on Statistics and by default
it's going to give me the mean and a
| | 01:48 | confidence interval for the mean.
| | 01:51 | That's an indication of how spread out
things might be around the mean and also, given our
| | 01:56 | sample, what we think the true
population value might be.
| | 02:00 | We also have what are called
M-estimators. That's a whole family of what are
| | 02:04 | called robust estimators that work
well when things are skewed or there are
| | 02:09 | outliers, but it's rather advanced.
| | 02:11 | We are not going to deal with that.
| | 02:13 | I can also get information about outliers, which
might label them individually. I could do that.
| | 02:19 | I don't think we need to.
| | 02:20 | I could also get percentiles, where for
instance it gives me the values for the
| | 02:24 | 5th, 10th, 25th, 50th, 75th,
90th, and 95th percentiles.
| | 02:30 | You can do it manually in the
Frequencies command, but it's nice to have it as
| | 02:33 | a one-click option.
| | 02:34 | However, I usually don't need that,
so I am going to skip it right here.
| | 02:38 | I'm just going to click Continue.
So I am leaving the statistics at the default.
| | 02:41 | It has given me a ton.
| | 02:43 | Next, I am going to look at Plots or the graphs.
| | 02:47 | Now the first thing you can do is
give me box plots, and we've done those
| | 02:50 | separately in the univariate charts.
And it's going to show factor levels
| | 02:54 | together, which is fine,
because I'm not splitting up the factors.
| | 02:58 | It can also give me something called a
stem-and-leaf plot, which is something
| | 03:01 | that's normally drawn by hand, but
I will show you that in a moment.
| | 03:06 | I can get a histogram if I wanted.
| | 03:07 | I've done those before, but I
can get them additionally here.
| | 03:11 | The next one is normality plots with tests.
| | 03:13 | This is a series of plots that are
designed to see how well your data fit a
| | 03:19 | symmetrical normal distribution--
| | 03:22 | that's a mathematical
definition of a bell curve.
| | 03:25 | Normality is the term for it, and
that's important for a lot of statistics, but
| | 03:30 | the normality plots can be a little
tricky to read, and usually you can eyeball
| | 03:35 | the data and see if they seem to be behaving
well enough to work well with
| | 03:39 | a lot of other statistics.
| | 03:40 | So I am going to skip both of those.
| | 03:42 | I'll just click Continue, and
let's take a quick look at options.
| | 03:46 | Now this is one where it asks what
to do with missing values in case I'm
| | 03:50 | looking at more than one variable
in my Dependent List, which I am.
| | 03:53 | The question is whether I want to
exclude cases listwise or pairwise.
| | 03:58 | And this is something that comes up
in a number of other procedures, and
| | 04:01 | it's worth pointing out.
| | 04:01 | When you exclude cases listwise, what
that means is you only include the case if
| | 04:07 | it has information on every
variable that you're including.
| | 04:12 | So let's say I had ten
variables in the Dependent list.
| | 04:15 | If a case was missing information on
one of those, it would not be included.
| | 04:19 | On the other hand, pairwise says
include them whenever they have variables
| | 04:25 | with some information.
| | 04:27 | So it makes maximum use of the
information, but you can end up with very
| | 04:31 | different sample sizes, and there are
procedures where it's very important to keep
| | 04:36 | the sample sizes consistent going across.
| | 04:38 | For Explore, that's a judgment call.
| | 04:41 | You can do it either way.
| | 04:42 | You can do it both if you want,
one after the other.
| | 04:44 | But I am just going to keep it
listwise for now, the way it is.
| | 04:47 | Click Continue and then down here it
gives me the option to display just the
| | 04:52 | statistics, just the plots, or both.
| | 04:56 | I will leave it at both, which is the default.
I click OK and I get a lot of output.
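Before walking through that output, here is roughly the syntax the Explore dialog pastes with these settings (a sketch; Symbol is the stock-symbol variable used to label cases).

  * Descriptive statistics, stem-and-leaf plots, and boxplots, labeling cases by Symbol.
  EXAMINE VARIABLES=LastSale MarketCap
    /ID=Symbol
    /PLOT BOXPLOT STEMLEAF
    /COMPARE GROUPS
    /STATISTICS DESCRIPTIVES
    /CINTERVAL 95
    /MISSING LISTWISE
    /NOTOTAL.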
| | 05:02 | The first one tells me how many cases
there are and whether they have valid
| | 05:05 | data, how many are missing.
| | 05:06 | There are 2,816 cases with valid data,
and for each variable I have four that are
| | 05:13 | missing information on LastSale and MarketCap.
| | 05:15 | That's just 1/10th of 1%.
| | 05:18 | Then I have a table called Descriptives.
| | 05:20 | I scroll down and I have the mean.
| | 05:22 | The mean for LastSale is $18.7, and
I've seen these statistics elsewhere, but
| | 05:28 | this one gives me a confidence
interval for the mean, which is an inferential
| | 05:31 | statistic, and we will see more
about those in the next section.
| | 05:34 | We also have something
called a 5% Trimmed Mean.
| | 05:37 | It shaves away the highest and
lowest few percent of the data and
| | 05:41 | gives a slightly more stable estimate.
| | 05:44 | We have the median and the indicators
of spread, the variance and the
| | 05:48 | standard deviation, and then we have
several other statistics: the interquartile range and
| | 05:52 | the skewness and kurtosis.
| | 05:54 | So this is a lot of
statistics that it gives all at once.
| | 05:58 | You don't need all of them, but the nice
thing is that they are available there.
| | 06:03 | The second column, by the way, gives
what are called standard error estimates
| | 06:06 | for a few of the statistics, for the
mean, the skewness, and the kurtosis.
| | 06:11 | These are sometimes used as inferential
statistics, but we don't need to worry
| | 06:15 | about them right now.
| | 06:16 | Then it repeats the table for the
second variable, market capitalization.
| | 06:21 | Then we have what are
called the stem-and-leaf plots.
| | 06:25 | These are ones that are usually drawn
by hand, and what it does is it takes the
| | 06:29 | values and splits them up into
two-digit numbers, where the first digit is
| | 06:34 | what's called the stem, and it
forms the line here on the side.
| | 06:38 | The second number is the leaf, and
the neat thing about this is this can be
| | 06:43 | read as a histogram.
| | 06:44 | It's sort of a sideways histogram.
| | 06:46 | But it also maintains the
actual numerical values.
| | 06:49 | So it's both a literal display of the
data and a chart of a histogram, and then
| | 06:55 | it marks some extreme
cases separately at the bottom.
| | 06:58 | Then here's a box plot.
| | 06:59 | This is labeling the cases by their
stock prices, and then we do a similar thing
| | 07:04 | for market capitalization.
| | 07:06 | So the biggest impression you might get
might be that the Explore procedure is
| | 07:11 | good for producing enormous amounts of output.
| | 07:13 | It can be overwhelming, but if you
really want to get the best picture,
| | 07:18 | meaning the most comprehensive, not
necessarily the most interpretable or
| | 07:22 | useful picture, then the Explore
command is the procedure of choice.
| | 07:27 | It can give you stem-and-leaf plots.
| | 07:29 | It can give you confidence
intervals and trimmed means.
| | 07:31 | It can give you robust estimators.
| | 07:33 | It can give you normality plots,
among other things, if you ask for them,
| | 07:37 | all of which recommend its
use in particular circumstances.
| | 07:40 | On the other hand, the slightly simpler
procedures of Frequencies and Descriptives
| | 07:45 | can still give you nearly all of what
you need without deluging you with output.
| | 07:50 | Nevertheless, if there's one thing
SPSS is good at, it's providing you
| | 07:53 | with options, and the Explore command
is one with especially rich options
| | 07:58 | and analytical value.
| | Collapse this transcript |
|
|
6. Inferential Statistics for One VariableCalculating inferential statistics for a single proportion| 00:00 | For many people, when they think of
statistics, they think of inferential
| | 00:04 | statistics, and not always fondly.
| | 00:07 | Of course, there is much more to
statistics and data analysis than the
| | 00:10 | calculation of probability values, and
this should be evident by the amount of
| | 00:14 | time we spent so far on
graphics and descriptive statistics.
| | 00:17 | However, the ability to go beyond the
data at hand and make inferences about a
| | 00:22 | larger group of people--hence the name
inferential statistics--is one of the great
| | 00:26 | beauties of analysis.
| | 00:28 | In this set of movies, I want to start
with the simplest kinds of inferential
| | 00:31 | statistics, those for one variable at a time.
| | 00:34 | There are a few different procedures that
we'll cover, such as confidence intervals
| | 00:38 | and hypothesis tests, for scale
variables and proportions, as well as the
| | 00:42 | distribution of a single categorical variable.
| | 00:45 | But let's start with what is probably the
simplest and most familiar, the confidence
| | 00:49 | interval and hypothesis
test for a single proportion.
| | 00:52 | For this example, I'm going to be
using the GSS.sav data set. That stands for
| | 00:57 | General Social Survey. And it has one variable
on the end here that I think is interesting.
| | 01:02 | If I scroll to the end, I have a
variable here that's called ReadBook, and what
| | 01:06 | it means is whether the person says that
they've read a novel, a poem, or a play in the last year.
| | 01:11 | We might be interested in the percentage
of people who say that they have read one,
| | 01:16 | whether that is significantly higher
than, for example, 50%, and what the
| | 01:21 | confidence interval for that might be,
like you would get from a political poll
| | 01:24 | where they say 73% of respondents,
plus or minus 3%, are in favor of a
| | 01:29 | particular candidate.
| | 01:31 | To do this, I'm going to use one of
SPSS's more interesting features. It's
| | 01:35 | called nonparametric tests, and I get to
it by going to the Analyze menu, down to
| | 01:40 | Nonparametric Tests.
| | 01:42 | It's called nonparametric because
we're not using parameters like means and
| | 01:45 | standard deviations.
| | 01:47 | Then I come over to One Sample.
| | 01:49 | And here it will do a lot of things
automatically, but I'm going to be a little
| | 01:53 | bit selective and customize
it to actually make things simpler for right now.
| | 01:57 | The first thing I'm going to do is I'm
going to come here to Fields, and that
| | 02:01 | really means variables.
| | 02:02 | And right now it's
putting in nearly every variable.
| | 02:05 | It would test for equality of
distribution on categorical variables, and it
| | 02:10 | would also test for scale variables, whether
they are normally distributed like a bell curve.
| | 02:15 | I don't want to do all of that,
| | 02:16 | so what I'm going to do is I'm
going to take all of these variables,
| | 02:19 | I'm going to put them
back into the original field.
| | 02:23 | The only test variable that I want is this one:
| | 02:27 | Read Novel, Poem, or Play.
| | 02:29 | So I'll double-click to move that over.
| | 02:30 | Then I go to the Settings tab to choose
exactly what test it is that I want to do.
| | 02:34 | Now I'm going to do Customized tests
here, and I'm going to choose Compare the
| | 02:39 | observed binary probability--binary
means two answers: yes or no--to the
| | 02:44 | hypothesized value with
what's called the binomial test.
| | 02:47 | And click on Options, and what it's
going to do is it's going to do a hypothesis
| | 02:51 | test to see if the proportion of people
who say they've read a novel, poem, or
| | 02:55 | play in the last year is
statistically significantly different from a
| | 02:59 | hypothesized proportion, which
right now I'll leave at 50%.
| | 03:03 | I can also get what's called the
confidence interval. That's like the plus or
| | 03:06 | minus 3% in a political poll.
| | 03:08 | Now sometimes you can use conventional
statistics, but right here SPSS is doing
| | 03:12 | a very nice thing and it's letting me
use what's called an exact statistic.
| | 03:17 | In this case, it's called the Clopper-
Pearson for the confidence interval.
| | 03:20 | We don't need to go into any details
except to say this would be a good choice.
| | 03:24 | So I'm just going to click on that and
I'm going to come down and press OK, and
| | 03:28 | then I'm going to press Run.
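If you prefer syntax, the long-standing NPAR TESTS command runs the same binomial hypothesis test against 0.50; this sketch does not produce the Clopper-Pearson confidence interval, which comes from the newer one-sample nonparametric procedure used here.

  * Binomial test of the proportion of ReadBook responses against a test value of 0.50.
  NPAR TESTS
    /BINOMIAL (0.50)=ReadBook
    /MISSING ANALYSIS.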
| | 03:30 | Now the output for this looks a little
different from what we've had so far,
| | 03:33 | because it's a table with
colors and shading in it.
| | 03:37 | Also, it's not showing me everything right now.
| | 03:39 | This is actually what's called a model viewer.
| | 03:41 | Now right now, all it's telling me is
that the proportion of people who say
| | 03:45 | they've read a novel, poem, or play
in the last year is significantly
| | 03:48 | different from 50%.
| | 03:50 | It's not telling me what the actual
proportion was or how far away it is, but
| | 03:54 | I can get that by
going into the Model Viewer.
| | 03:57 | I'll double-click here and
it brings up the Model Viewer.
| | 04:00 | I'll maximize that window. And what I have
here is the output that I saw on the other page.
| | 04:05 | It tells me that the proportion of people
who say they've read one of these is not 50%.
| | 04:09 | It's significantly different from 50%.
| | 04:12 | In fact, what I can do is I can come
over here and the hypothesized, that 50%, is
| | 04:16 | this blue bar right here.
| | 04:18 | But what I really have is an observed
71% of the people say that they've read a
| | 04:22 | novel, poem, or play in the last year.
| | 04:25 | That's out of 349 people, and this tells me
that that is significantly different from 0.
| | 04:31 | To get the confidence interval,
I need to do one other thing.
| | 04:34 | I come back over to this left pane
and I go down to where it says View.
| | 04:38 | Right now we're looking
at the Hypothesis Summary.
| | 04:41 | If I click on that, I can get
the Confidence Interval Summary.
| | 04:45 | It's a slightly different table here,
and it tells me how it calculated the
| | 04:49 | confidence interval by
using the Clopper-Pearson.
| | 04:51 | It tells me what the Parameter was, the
probability that a person read a novel, a
| | 04:55 | poem, or play in the last year.
| | 04:57 | It tells me that the proportion of
people who said yes, because they put ones
| | 05:01 | instead of zeros, is 71%.
That corresponds to what I have over here.
| | 05:06 | The yes is the 71%.
| | 05:08 | The confidence interval at the 95%
confidence level, which is the most
| | 05:12 | common, is from 66% to 76%.
| | 05:16 | And what this means is that while in my
sample of 349 people 71% may have said
| | 05:22 | they've read these, in the population
that those 349 people came from, the true
| | 05:26 | value could be somewhere between 66% and 76%.
| | 05:30 | This is like the plus or minus 5%
that you would get from a political poll.
| | 05:35 | So the new nonparametric tests procedure in SPSS
is actually a very flexible one
| | 05:40 | that can perform an entire
range of tests all on its own.
| | 05:43 | It's also the easiest way to get
confidence intervals and hypothesis tests for
| | 05:48 | a single proportion.
| | 05:49 | We'll come back to this procedure in
another movie on testing nominal variables
| | 05:53 | with multiple categories, but for now
this should give you a good start on
| | 05:57 | dealing with inferential statistics
for dichotomous variables in SPSS.
| | 06:01 | In the next movie, we'll look at
common tests for scale variables.
| | Collapse this transcript |
| Calculating inferential statistics for a single mean| 00:00 | SPSS makes it very easy for you to
go beyond your sample data and make
| | 00:05 | inferences about the
population that those data came from,
| | 00:08 | that is, you can
calculate inferential statistics.
| | 00:11 | In the last movie, we looked at how
to work with proportions for a single
| | 00:14 | dichotomous variable--
| | 00:16 | that's a yes/no, 0/1 variable--
to get a hypothesis test and a
| | 00:20 | confidence interval.
| | 00:21 | In this movie, we will do the same
procedure for a scale variable, something
| | 00:25 | that could be measured in set units, like
time to complete a project or bids from vendors.
| | 00:29 | I am going to use the same data set
for this one, the GSS, or General Social
| | 00:33 | Survey.sav, data set, and this time I'll be
looking at the one variable here that's
| | 00:39 | called FamilyIncome that measures
the total family income in dollars.
| | 00:43 | Now I should point out that these are
actually the midpoints for categories,
| | 00:47 | which is why they seem to be very
precise amounts, and you will see them repeated,
| | 00:52 | like here's 115,841, and
here's the same number again.
| | 00:57 | Nevertheless, these are scale variables
because the dollars move in set amounts.
| | 01:02 | So I am going to be doing a hypothesis
test and a confidence interval for the
| | 01:06 | family income for the 349
people in this particular sample.
| | 01:10 | Now there's two ways to do this,
and both of them go in the Analyze menu.
| | 01:15 | For the first one, I am going to come up
to Analyze and I am going to go Compare
| | 01:19 | Means and I am going to use
what's called the One-Sample T-Test.
| | 01:23 | And all I need to do here is I
need to pick the variable that I want.
| | 01:27 | In that case, it's FamilyIncome.
| | 01:28 | So I just double-click on
that and it moves it over.
| | 01:32 | Let's look at some of the options.
| | 01:34 | I can get a confidence interval, and I
can change it from 95% to some other
| | 01:38 | values, sometimes 90% or 80% is
appropriate, but 95% is the most common.
| | 01:44 | So I am going to leave it right there.
| | 01:45 | So I'll click Continue.
| | 01:47 | I'm going to ignore the bootstrap,
because that's there because of an extra
| | 01:51 | add-in that's installed in this version
of SPSS that normally you have to pay for.
| | 01:56 | Below the test variables box, I have
another box that says test value, and this
| | 02:01 | is the value that SPSS is going to
compare the mean family income to, to find
| | 02:06 | out if it's significantly different from it.
| | 02:08 | Now I can guarantee you that the
mean family income is not going to be 0,
| | 02:12 | so I am going to pick
another number to put there.
| | 02:15 | Let's say, for instance, I want to
compare it to $45,000 for family income.
| | 02:20 | This is how I can do it to find out
whether this average value is higher or
| | 02:24 | lower than that significantly.
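The syntax equivalent of this one-sample test is compact; a sketch with the $45,000 test value.

  * One-sample t-test comparing the mean of FamilyIncome to a test value of 45000.
  T-TEST
    /TESTVAL=45000
    /MISSING=ANALYSIS
    /VARIABLES=FamilyIncome
    /CRITERIA=CI(.95).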
| | 02:25 | So now I click OK, and what I
have is the One-Sample Statistics table.
| | 02:30 | It tells me that I have 349 people, that
the mean family income is $32,781 with a
| | 02:37 | standard deviation of 29,000.
| | 02:40 | The last one, the standard error, is
used in calculating the hypothesis test and
| | 02:45 | the confidence intervals.
| | 02:46 | Below that I have what's called a One-
Sample Test where SPSS is taking the average
| | 02:52 | value, the mean of 32,781, and comparing
it to a hypothesized value of $45,000.
| | 03:00 | The first column has what's called the
t statistic, and that's an inferential
| | 03:03 | statistic, and it doesn't
necessarily mean a lot on its own.
| | 03:07 | The second one is the degrees of
freedom, which has to do with the sample size.
| | 03:10 | It's the third one in particular
that we want to look at. It says Sig. (2-tailed).
| | 03:15 | That's the significance value, or the
probability value for the hypothesis test.
| | 03:19 | And in this case that number is .000.
| | 03:22 | Now it's not literally 0.
| | 03:23 | It's just that it's less than .001,
| | 03:26 | so it shows up truncated here.
| | 03:28 | What this tells me is that the observed
average value of $32,781 per year for a
| | 03:35 | family is significantly different
from my hypothesized value of 45,000.
| | 03:40 | I was optimistic in my hypothesis.
| | 03:43 | Now these last two columns have
what's called a confidence interval for the
difference between the mean and the test value.
| | 03:47 | You see that the mean difference that's in
the third column from the end is -12,000.
| | 03:51 | That's because the observed
value is about $12,000 less than my
| | 03:56 | hypothesized value.
| | 03:58 | These last two columns give me the
confidence interval for that difference.
| | 04:02 | Now an interesting thing here is had the
hypothesized value been 0, these would
| | 04:06 | have been an actual confidence
interval for the mean, but because I felt that
| | 04:11 | having 0 would be a silly test value,
I put something else in. The confidence
| | 04:16 | interval is for the difference.
| | 04:17 | Now if I want a regular confidence
interval, a better way to get that, instead of
| | 04:22 | from the T-Test, is to go back to a
procedure we looked at in the last set of
| | 04:26 | videos, the Explore command.
| | 04:28 | I just go back up to Analyze >
Descriptive Statistics > Explore.
| | 04:34 | I take the one variable that I want out
of this list, which is FamilyIncome, and
| | 04:38 | I put it into the Dependent List, that
means outcome variables, or the ones we
| | 04:41 | are trying to chart.
All I want here is a list of statistics.
| | 04:45 | I am going to come down to Display
and click on Statistics and press OK.
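In syntax form, that statistics-only Explore request looks roughly like this sketch.

  * Descriptive statistics and the 95% confidence interval for the mean, with no plots.
  EXAMINE VARIABLES=FamilyIncome
    /PLOT NONE
    /STATISTICS DESCRIPTIVES
    /CINTERVAL 95
    /MISSING LISTWISE
    /NOTOTAL.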
| | 04:50 | I'm going to get a big table here, but
the only one I really want to look at is
| | 04:54 | this one that says 95% confidence
interval for the mean, with the lower bound
| | 04:57 | and the upper bound.
| | 04:58 | There are actually several ways of
interpreting a confidence interval, but one
| | 05:02 | sort of colloquial way is to say that
the population value is between 29,692 and
| | 05:09 | 35,871, so between 30,000 and 36,000.
| | 05:13 | There is about a 95% chance that the true
population mean is between those two values.
| | 05:19 | Anyhow, SPSS makes it simple to
perform two of the most basic and two of the
| | 05:24 | most useful inferential
statistics for a single scale variable:
| | 05:28 | the One-Sample T-Test and
the simple confidence interval.
| | 05:31 | In the next movie, we will look at
something slightly more complicated as we
| | 05:34 | look at the distribution of cases across
a nominal variable with several groups.
| | Collapse this transcript |
| Calculating inferential statistics for a single categorical variable| 00:00 | In the last two movies we've looked at
the most basic inferential statistics,
| | 00:05 | the ones where we
analyzed one variable at a time.
| | 00:08 | We looked at the proportion for a
nominal variable, with only two outcomes, that
| | 00:12 | is, a dichotomous variable, and we
looked at the mean for a scale variable.
| | 00:16 | In both cases, we looked at both null
hypothesis tests and confidence intervals.
| | 00:21 | In this movie, we will expand
things slightly by looking at how to do a
| | 00:25 | hypothesis test for a nominal variable,
or a categorical variable, that has more
| | 00:30 | than two categories,
something like occupation or a favorite sport.
| | 00:34 | Although it's possible to do
confidence intervals for the number of people in
| | 00:37 | each category, it's a complicated
procedure, and it's not particularly
| | 00:41 | helpful for most purposes.
| | 00:43 | Instead, we'll just do a hypothesis test
that looks at whether people are evenly
| | 00:47 | distributed across all the
categories in the variable.
| | 00:50 | The test statistic that we'll use is
called the One Sample Chi-Square Test in SPSS.
| | 00:56 | It's also known as the Goodness-of-
fit Test, and with SPSS's new automatic
| | 01:01 | features, this is very
easy to create and interpret.
| | 01:05 | I am going to be using the same
data set as before, GSS.sav from the General
| | 01:09 | Social Survey, and I thought it might
be interesting to look at the variable
| | 01:13 | that is second from the last,
about people feeling happy.
| | 01:16 | Specifically the question
is self-rated happiness.
| | 01:19 | Well, we have three possible answers:
| | 01:21 | Not Too Happy, Pretty Happy, and Very
Happy. And we can use this test to see if
| | 01:27 | people fall evenly into
those three different categories.
| | 01:30 | To do this, we'd go to the Analyze menu,
and then down to Nonparametric Tests, and
| | 01:35 | again to One Sample.
| | 01:37 | This is the same one that we
used for the single proportion.
| | 01:40 | We're just going to be doing it a
little bit differently this time.
| | 01:44 | I need to go to the Fields tab, and then
I have all of the variables that it can
| | 01:49 | test in the Test Fields box.
I don't want all of them there.
| | 01:51 | It will be too much output.
| | 01:53 | So what I am going to do is I am going
to select all of these and put all of
| | 01:57 | them back, and then I'll bring back over
the only one that I want, which is near
| | 02:01 | the bottom of the list, and
it's Self-Rated Happiness.
| | 02:04 | I can double-click on that to move it over.
| | 02:07 | Then I can go with the default test.
All I need to do now is press Run, and I
| | 02:12 | get the same kind of table I got before.
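The long-standing syntax equivalent of this goodness-of-fit test is the NPAR TESTS chi-square; a sketch, assuming the happiness variable is named Happy (the actual name in GSS.sav may differ).

  * One-sample chi-square test of equal expected frequencies across the categories.
  NPAR TESTS
    /CHISQUARE=Happy
    /EXPECTED=EQUAL
    /MISSING ANALYSIS.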
| | 02:15 | It lets me know the null hypothesis,
which is that the categories of Self-Rated
| | 02:19 | Happiness would occur with equal probabilities.
| | 02:22 | That is that we would have the same
percentage of people who said that they were
| | 02:25 | Not Too Happy and Pretty Happy and Very Happy.
| | 02:29 | All I can tell from this one is that
those three are not evenly distributed.
| | 02:33 | But this is an interactive model viewer,
| | 02:35 | so I double-click on it and
I will maximize that window.
| | 02:39 | And what I see is that the hypothesized
values are the green bars over here, and
| | 02:45 | you see all three
of them are the same size.
| | 02:47 | The blue is how many I actually have.
The green is how many I would have
| | 02:50 | expected if things were distributed
evenly. And it tells me that I have an
| | 02:54 | observed 43 people who
said they were not too happy.
| | 02:57 | That's this blue bar right
here, that's the Observed.
| | 03:00 | The hypothesized was 116.
| | 03:03 | So the difference between
the two, the residual, is 73.
| | 03:06 | In fact, what you can see is that in
the first set, the Not Too Happy, I have
| | 03:11 | fewer people than I would expect
if people were evenly distributed.
| | 03:14 | On the other hand, I have a lot more
people in the middle set, Pretty Happy,
| | 03:18 | than I would expect.
| | 03:20 | The Very Happy is actually
right around one third of the group.
| | 03:24 | Down below that, I have a table that
gives me the total sample size, 349.
| | 03:28 | The Test Statistic there is called the
Chi-Square Test, and it's got a value of 88.06.
| | 03:34 | It has what are called 2 degrees of
freedom, and a probability value, that's the
| | 03:38 | asymptotic significance for the
2-sided test, of less than .001.
| | 03:42 | Again, it's not exactly 0, but
it's going to be a small number.
| | 03:47 | Anyhow, this is the easiest possible
hypothesis test for a categorical variable
| | 03:53 | that has several categories in it.
| | 03:56 | The One Sample Chi-Square Test, it's
a quick and easy way to tell if your
| | 03:59 | observations are distributed evenly
across categories, or you can also specify
| | 04:05 | some other expected distribution.
| | 04:06 | It shows how important it can be to
check whether the variation you see could be
| | 04:10 | reasonably attributed to random, meaningless
chance, or whether you might start
| | 04:14 | to see something important
that deserves further analysis.
| | Collapse this transcript |
|
|
7. Charts for Two VariablesCreating clustered bar charts| 00:00 | The last several sections of movies
have dealt with methods for examining
| | 00:04 | one variable at a time with graphs,
descriptive statistics, and inferential procedures.
| | 00:10 | These kinds of univariate analyses can
be very interesting in their own right,
| | 00:14 | such as the number of people to vote
for a particular political candidate or
| | 00:18 | the amount of money spent on chewing
gum in the US each year, which I've heard
| | 00:21 | once is $500 million per year. And they form a
truly essential part of any further analysis.
| | 00:28 | That is, they are foundational,
essential background pieces of an analysis.
| | 00:33 | So before you look at any combinations
of variables you need to understand each
| | 00:37 | variable on its own. But with that
said, it's the associations between
| | 00:43 | variables that are often of
the most interest to people.
| | 00:46 | For example, I am also told that people chew
gum more often during times of social unrest.
| | 00:51 | Now, you can make of that what you
will, but it gets at the heart of the
| | 00:54 | great majority of real world data
analysis. How can you predict or explain one
| | 01:00 | thing based on another?
| | 01:01 | And as a first step to
understanding associations, like we did with
| | 01:06 | univariates, we're going to start
where you should always start in an
| | 01:09 | analysis: with a picture.
| | 01:11 | One of the easiest kinds of charts for
showing associations is the clustered bar chart,
| | 01:15 | which is particularly well suited
for showing the relationship between two
| | 01:19 | categorical variables.
| | 01:21 | For instance, Nominal or Ordinal variables.
| | 01:24 | We covered simple bar charts earlier
when we looked at univariate charts and
| | 01:28 | they can be just as useful here.
| | 01:30 | In fact, the only real difference is
that we will now cluster variables by
| | 01:35 | grouping them on the axis across the bottom.
| | 01:38 | While the difference may seem small,
it really opens up a lot of analytical
| | 01:42 | possibilities in SPSS.
| | 01:44 | Now, to demonstrate this, I am going to
be using the data set Searches.sav, about
| | 01:50 | Google searches, and how
they vary from state to state.
| | 01:53 | In this particular example I am going
to look at two variables that are near
| | 01:56 | the end on the right.
| | 01:57 | What I am going to look at is
whether a state has an outline for a high
| | 02:02 | school statistics class and I am
going to compare that to the region of the
| | 02:06 | country that they are in.
| | 02:07 | There are four regions.
| | 02:08 | So that's a categorical variable with
four categories and statistics education
| | 02:13 | is a dichotomous yes/no.
| | 02:15 | And I am going to look and see if
the proportion of states with statistics
| | 02:20 | curriculum varies from one region to another.
| | 02:25 | Now, to do that, I am going to go up to
Graphs, to the Chart Builder, and I am
| | 02:30 | going to come down to Bar chart
and choose clustered bar charts.
| | 02:35 | I am going to drag that up to the
canvas and then I need to take one variable
| | 02:40 | and put it in the X-axis and the other
variable to set the colors of the bars.
| | 02:45 | What I am going to do is I am going to
put the region on the X-axis, for no
| | 02:48 | other reason than that I have four regions and I
don't want to have four different colors
| | 02:52 | in my chart, but also you're going to
see how this allows me to make a yes/no
| | 02:56 | comparison more easily between each group.
| | 02:59 | What I am going to do is I am going to
get the region variable, which is near the
| | 03:02 | bottom of the dataset.
| | 03:03 | That's this one right here,
the Census Bureau Region.
| | 03:06 | I am going to drag that down to X-axis
and then for this one on the top-right
| | 03:11 | that says Cluster on X: set color,
| | 03:13 | I am going to take whether they
have an outline for high school statistics.
| | 03:17 | That's this variable right here.
| | 03:19 | So I am going to drag that over to cluster,
and I think that's all I really need right here.
| | 03:25 | So I am going to come down and click OK.
| | 03:28 | When we first get the
output, we get a lot of text.
| | 03:30 | This is the command that you
could write to produce this chart.
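In version 19 the pasted code is GGRAPH syntax with GPL, which runs fairly long; as a hedged shorthand, a legacy-syntax equivalent of the same clustered bar chart of counts would look something like this, where the variable names region and hs_stats are assumptions standing in for Census Bureau Region and the statistics-outline variable.
* Clustered bar chart of counts: one cluster per region, one bar per yes/no value.
* The percentage-within-region version shown later is set in Chart Builder's Element Properties.
GRAPH
  /BAR(GROUPED)=COUNT BY region BY hs_stats.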
| | 03:34 | Beneath that is the chart itself.
| | 03:36 | It's just blue and green bars, and what
it has is a pair of bars for each Census
| | 03:41 | Bureau Region from the Northeast, and
the Midwest, the South, to the West, and
| | 03:46 | the blue bar means that the state
does not have an outline for high school
| | 03:50 | statistics class, but a
green bar means that it does.
| | 03:53 | There are a couple of things that jump
out immediately. First, is that in the
| | 03:57 | Northeast not a single state has an
outline for a high school statistics class.
| | 04:02 | The Midwest has just one, and the
West has just three, but the Southern
| | 04:07 | region, there are more states that
have outlines for high school statistics,
| | 04:11 | than there are without them.
| | 04:13 | That's extraordinarily unusual.
| | 04:15 | That's a very different pattern.
| | 04:17 | Now there is one challenge with this
particular chart and that is that there is
| | 04:23 | not the same number of states in each
region, and so it can make it a little
| | 04:26 | difficult to compare from one to the other.
| | 04:29 | Fortunately, the Bar Chart command
lets us do something significant here.
| | 04:34 | What I am charting right
now on the side is the counts.
| | 04:38 | That's the number of states that do
or do not have an outline for a high
| | 04:42 | school statistics class.
| | 04:43 | I am going to change that though to be a
percentage and here's how we're going to work.
| | 04:48 | I am going to go back to Graphs, to
the Chart Builder, and I am going to pick
| | 04:54 | up where I left off, except right here
it says Count on the side, and if I go
| | 04:59 | over to the Element Properties window
where it says Bar, right here under
| | 05:04 | statistics it says Count.
| | 05:05 | If I click on that, I actually
have a huge number of options.
| | 05:11 | I can specify a tremendous number of things.
| | 05:13 | What I am going to do is I
am going to click Percentage.
| | 05:16 | Now the reason that has a
question mark in parentheses
| | 05:19 | is because I need to set the
parameters for the percentage.
| | 05:22 | It's asking me a percentage of what?
| | 05:25 | I click on that. I don't want the grand total.
| | 05:28 | What I do want is each X-axis
category, that is, each region.
| | 05:33 | I want to know what percentage of the
states in each region do or do not have
| | 05:39 | a high school statistics curriculum.
| | 05:41 | So I am going to click on that one and
press Continue, then I come down to the
| | 05:45 | bottom of the Elements window and
press Apply, then back over to the main
| | 05:48 | window and press OK.
| | 05:50 | We get the text output and then I
scroll down and I have another chart.
| | 05:55 | And you can see this one looks
slightly different and it's because it's
| | 05:58 | adjusting it for the
differences in the sizes of the regions.
| | 06:01 | We still see that in the Northeast
none of the states have an outline for a
| | 06:06 | high school statistics class.
| | 06:07 | That's why the blue line, the No,
goes all the way up to 100%.
| | 06:11 | In the Midwest, only 10% of the
states have one; in the South, over 50% have a
| | 06:17 | curriculum; and in the West, it's
just over 20%, and that's another way of
| | 06:23 | adjusting for differences to make it a
little easier to interpret. You usually want
| | 06:27 | to compensate for the differences
in the sample sizes and look at the
| | 06:31 | percentages or the rates in a
particular area, and that's one of the beautiful
| | 06:35 | things about SPSS, is how easy it
makes that particular procedure.
| | 06:39 | So the first kind of association chart
that we've covered, the clustered bar chart,
| | 06:44 | is a small variation on a
univariate bar chart, and it's a great way of
| | 06:48 | showing the association
between two categorical variables.
| | 06:52 | This command makes a very clean, simple,
and easy to interpret chart, which is
| | 06:57 | the real goal of data
visualization, is statistical graphics.
| | 07:01 | In the next movie, we will look at
using scatter plots to show the associations
| | 07:06 | between two scale variables.
| | Collapse this transcript |
| Creating scatterplots| 00:00 | In the last movie we talked about how
to chart the relationship between two
| | 00:04 | categorical variables with clustered bar charts.
| | 00:07 | On the other hand, if you have
two scale variables, also called
| | 00:10 | quantitative variables or measured
variables, then your best choice is
| | 00:13 | almost always a scatter plot.
| | 00:15 | Scatter plots are familiar to most people.
| | 00:18 | There's an x axis across the bottom and
a y axis up this side, and each person
| | 00:22 | or case gets a dot to show the
combination of their two scores, like height and
| | 00:27 | weight or high school and college GPA.
| | 00:30 | In general you want to put your
predictor variable on the bottom, on the x axis,
| | 00:33 | and your outcome variable or the thing
you're trying to predict on the y axis,
| | 00:37 | and SPSS makes the whole process very simple.
| | 00:41 | You can create a scatter plot with
the Chart Builder in just a few steps.
| | 00:45 | And for this example I'm going to
be using the same Google searches
| | 00:48 | information in Searches.sav.
| | 00:50 | I am going to come up to Graphs, to
Chart Builder and then in the Gallery I
| | 00:55 | will choose Scatter, and
just use a Simple Scatter plot.
| | 00:59 | I will drag that up to the canvas.
| | 01:01 | And then in this particular example,
I'm going to take the relative interest in
| | 01:06 | SPSS as a search term and
put it on the x axis, and then I am going
| | 01:11 | to take one that may seem a little
peculiar, but the search term, Totally Lost,
| | 01:16 | and put that on the y axis.
| | 01:18 | I'm also going to make it possible
for me to identify points by clicking on
| | 01:22 | the Point ID label.
| | 01:24 | That brings up a box in the canvas.
| | 01:26 | and I can come up here and I can take
the state code and drag that in and that
| | 01:31 | should be enough for right now.
| | 01:33 | I'll click OK and
here's my general scatter plot.
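If you prefer working from syntax, a rough legacy-dialog equivalent of this starting chart, with spss and totally_lost as assumed variable names for the two search terms, would be:
* Simple scatterplot; the point labels, colors, and fit line are added afterward in the Chart Editor.
GRAPH
  /SCATTERPLOT(BIVAR)=spss WITH totally_lost
  /MISSING=LISTWISE.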
| | 01:37 | And what you see is first off a lot of fuzz,
because I have dots and I have the state labels.
| | 01:42 | I am going to take care
of those in just a second.
| | 01:44 | But it's clear that there's a very
strong linear uphill trend, that places that
| | 01:50 | show greater relative interest in
SPSS as a search term in Google also for
| | 01:55 | reasons that may not be totally clear
show greater use of the search term
| | 02:00 | Totally Lost as they go through.
| | 02:03 | Now, I am going to clean
up this chart in a few ways.
| | 02:06 | I am going to try to go through it
relatively quickly and give you an idea
| | 02:09 | of what's possible.
| | 02:10 | To edit the chart you need to double-
click on it, and what I am going to do is
| | 02:14 | I am going to turn off all of the state
labels by going to Elements and Hide Data Labels.
| | 02:19 | I will bring back just one or
two of them for illustration later.
| | 02:23 | There's a few things I want
to show you how to clean up.
| | 02:25 | For instance, you can change
almost anything by clicking on it.
| | 02:29 | I have selected the data points here
and I can make them instead of black
| | 02:33 | circles, I can make them red dots by clicking
red for the Border and then red for the Fill.
| | 02:40 | If I want to change the colors
of lines, I can do that as well.
| | 02:43 | I can also change the axis down here
from 3 decimal places by clicking on Number
| | 02:48 | Format and changing that to 0,
clicking Apply, and doing the same thing over
| | 02:54 | here, changing that to 0 and clicking Apply.
| | 02:57 | Now what I am going to do is I am
going to add a linear regression line.
| | 03:00 | This is also the basis of an
inferential procedure, linear regression, that
| | 03:04 | we'll be coming to a little bit later, but
right now it's a very simple thing to do.
| | 03:08 | I just come up to the Button bar and
click on this one that says Add a Fit Line
| | 03:13 | at Total, and that's a regression
line that goes all the way through.
| | 03:17 | It also adds a little bit of information
right here that I don't need right now,
| | 03:20 | so I am going to select that and press Delete.
| | 03:22 | And then I've got a very clear,
strong, upward trend, higher relative
| | 03:27 | interest in SPSS as a search term, also higher
use of the word Totally Lost as a search term.
| | 03:33 | The one last thing I'm going to do is
I'm going to add an identifier to the
| | 03:37 | point that's in the top right.
| | 03:38 | We saw what it was earlier, but I am
going to add an identifier for just it.
| | 03:43 | By coming over to the left of this
button bar, clicking on the little target,
| | 03:47 | which is the Data Label Mode, I click
on that, and then I come back over and
| | 03:51 | click on that data point I want to
identify, and we see there that it's
| | 03:54 | Washington D.C., and that's
probably enough for this particular chart.
| | 03:58 | I want you to be aware that
there are many other options.
| | 04:01 | For instance, I can add vertical
and horizontal reference lines.
| | 04:06 | I can also change the kind of
regression line I have through.
| | 04:10 | For instance, this is called a linear
regression line, but if you're interested
| | 04:14 | in growth, like changes in stock
prices over time, you might want to use a
| | 04:18 | Quadratic or something called a Cubic.
| | 04:20 | If you want to see if it's a straight
line at all, you can find what's called a
| | 04:24 | Smoother, in this case it's called
Loess Smoother through the regression line,
| | 04:28 | and I encourage you to try these
alternatives, and it's actually possible to
| | 04:33 | overlay one on top of the other. But
for now I am going to leave this with a
straight regression line as it
shows the linear patterns most clearly.
| | 04:41 | So I am going to close that and close that.
| | 04:44 | So the Scatter plot can give really
good insight into the relationship
| | 04:48 | between two scale variables and the
options that SPSS gives for lines through
| | 04:52 | the data can help you explore how
well your data match the assumptions of
| | 04:56 | standard linear regression.
| | 04:58 | In the next movie we'll look at a
special kind of scatter plot called the
| | 05:02 | Time Series Plot or Time Plot, where
the variable on the bottom is, not surprisingly, time.
| | Collapse this transcript |
| Creating time series| 00:00 | In the last movie, we looked at how to
create scatter plots for two quantitative
| | 00:04 | variables or scale variables in SPSS.
| | 00:08 | Now scatter plots are extremely useful,
for exploring new data, and they're
| | 00:12 | also extremely flexible.
| | 00:14 | One variation on this scatter
plot though deserves special mention.
| | 00:18 | The time series scatter plot or time plot.
| | 00:21 | As you might guess, the major difference
in this case is that the variable that
| | 00:25 | goes across the bottom on the
x-axis is some measure of time.
| | 00:29 | Another difference is that time plots
often have only one measurement for each
| | 00:33 | time period whereas scatter plots can
have, for example, lots of people who are
| | 00:37 | all at the same point on the x-axis.
| | 00:41 | Because time plots usually have
only one observation at each point in time,
| | 00:45 | you can also connect the points,
which makes it more like a line chart.
| | 00:49 | And here's how it works in SPSS.
| | 00:52 | For this example, I'm going to be
using the data set that's called NDAQ.sav.
| | 00:57 | And this is the price for
shares in the NASDAQ Exchange itself,
| | 01:02 | from 2002 through 2011.
| | 01:05 | It only has two variables.
| | 01:07 | It has the first market day of each
month and it has the closing price on
| | 01:12 | that day for each month.
| | 01:14 | Let's go up to Graphs and then to Chart
Builder and then down to Scatter and
| | 01:21 | choose the Simple Scatter, the top
left one, and drag it into the canvas.
| | 01:25 | The Date will go on the bottom and the
closing price for the NASDAQ stocks will
| | 01:31 | go on the left, and that's all we need
to do right here. I am just going
| | 01:35 | to click OK, and what you see is a lot of dots.
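As a rough syntax sketch, with date and close as assumed names for the two variables, the same starting chart could be produced with:
* Time plot starting point: date on the x-axis, closing price on the y-axis.
* The interpolation line described below is added afterward in the Chart Editor.
GRAPH
  /SCATTERPLOT(BIVAR)=date WITH close.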
| | 01:39 | Now, you can see the pattern.
| | 01:41 | It starts relatively low in '04 or '05,
shoots way up high in '06 and '08, comes
| | 01:48 | back down to earth in 2010, and
then starts to go back up again.
| | 01:53 | But there's a way to make this chart much clearer.
| | 01:56 | We just need to edit it,
and do a few different things.
| | 01:59 | So to edit it, like every other chart
first we double click on it to open
| | 02:03 | up the editing window.
| | 02:05 | And for this one, what we want to do is we
want to click on the button in the menu bar here.
| | 02:09 | It's called Add Interpolation Line.
| | 02:13 | And what this does is it draws a line
that connects every dot across the bottom.
| | 02:19 | This is the standard line plot you would
expect. Now, if we stop right there,
| | 02:24 | it's not bad.
| | 02:25 | However, at this point the dots
actually get in the way, and so what we can do,
| | 02:30 | is we can click carefully on the dots.
| | 02:33 | So they are all selected and just hit
Delete, and we are left with the line
| | 02:37 | plot that shows the pattern more
clearly than the dots themselves, of things
| | 02:42 | starting slowly, skyrocketing and then coming
back down at the end of the dotcom bubble.
| | 02:47 | And that is a special case where the
predictor variable is time and you can
| | 02:52 | adapt the standard scatter plot to
show how a variable changes, in which case
| | 02:57 | it's now called the Time
Series Scatter Plot or Time Plot.
| | 03:00 | This is a good example of how SPSS
helps you customize your charts to make
| | 03:05 | them easier to read and
more useful in interpretation.
| | 03:08 | Up to this point, we have looked
at charts for the association of two
| | 03:12 | categorical variables and two scale variables.
| | 03:15 | In the next few movies, we will look at
the combination of the two kinds:
| | 03:18 | charts that show the association of one
categorical variable and one scale variable.
| | Collapse this transcript |
| Creating simple bar charts of group means| 00:00 | In this section on charts for the
associations between variables, we've looked
| | 00:05 | at how we can depict the
association between two categorical variables,
| | 00:09 | for example, with clustered bar charts, and
the association between two scale variables,
| | 00:14 | for example, scatter plots.
| | 00:16 | At this point, we'll move on to charts
that show the association between two
kinds of variables. That is, charts
| | 00:25 | that look at one categorical variable and
how it's connected with a scale variable.
| | 00:27 | Whereas the other combinations of
variables had clear preferences for the charts,
| | 00:32 | there are actually several useful
options for charting associations for
| | 00:36 | categorical and scale variables in combination.
| | 00:39 | The first of these is a simple
variation on the bar chart, adapted to show the
| | 00:43 | mean score for each group.
| | 00:45 | In this example, I am going to use the
GSS dataset and I'm going to show family
| | 00:50 | income as a function of the highest
level of education of the respondent.
| | 00:55 | To do that, I first go up to
Graphs and click on the Chart Builder.
| | 00:59 | From there, I come down to Bar in the
Gallery and I simply drag this simple
| | 01:04 | bar into the canvas.
| | 01:06 | On the X-axis, I am going to put my
categorical predictor variable, which is the
| | 01:10 | highest degree of education.
| | 01:11 | That's called highest degree,
and I drag that down to X-axis.
| | 01:15 | Now on the left of that,
on the Y-axis it says Count.
| | 01:19 | However, if I come to the variable list
and I get family income and I drag that over,
| | 01:24 | it changes from Count to Mean.
| | 01:27 | That's because it's a scale variable.
| | 01:29 | Now if I wanted to, I
could get other statistics.
| | 01:32 | I could get the Median, the Group Median,
the Mode, and truthfully, a very large
| | 01:37 | range of statistics, but I am
going to leave it with the Mean.
| | 01:40 | I am going to do one small variation, however.
| | 01:42 | I am going to ask it to
put on what are called error bars,
| | 01:44 | showing confidence intervals.
| | 01:46 | These give some sort of indication
of what the difference might be in the
| | 01:49 | general population, as opposed to just a sample.
| | 01:52 | Once I check that, then I need to come
down and click Apply and then I come
| | 01:56 | over to the box and I click OK.
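For reference, a minimal legacy-syntax sketch of this kind of chart, assuming the variables are named degree and income, would be:
* Bar chart of group means; the confidence-interval error bars
* are requested through Chart Builder's Element Properties rather than here.
GRAPH
  /BAR(SIMPLE)=MEAN(income) BY degree.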
| | 01:59 | And here we see five bars that show
different levels of education, from Did Not
| | 02:03 | Finish High School, which has an
average family income of about $20,000 a year
| | 02:08 | in this particular data set, off
through Bachelor's Degree and Graduate Degree,
| | 02:13 | which have averages of about $50,000
a year in this particular data set.
| | 02:17 | Now I do feel it's important to clean
this chart up a little bit, so like the
| | 02:21 | others what I'm going to do is I am
going to double-click on it and I am
| | 02:25 | going to make a few clarifications,
because you want to reduce the amount of
| | 02:29 | clutter in the chart.
| | 02:30 | So what I am going to do first, so I am
going to click on this thing that says
| | 02:33 | Error Bars and just delete that.
| | 02:36 | Then I am going to change the error
bars, because I find the end to them
| | 02:39 | distracting. I come up to Bar Options
and change them to just Whiskers here
| | 02:44 | under Boxplot and Error Bar Styles.
| | 02:47 | Click OK. I am going to change
the color of the bars. I find that
| | 02:50 | an unattractive color.
| | 02:52 | Maybe I will make it a light green and
then I might want to make the text here
| | 02:59 | a little bit larger.
| | 03:00 | Now I could do something
interesting when I do that. There we go.
| | 03:04 | It just changes the space a little
bit and I find this to be a much clearer
| | 03:08 | diagram of the relationship between the two.
| | 03:11 | So I am going to close this now.
| | 03:12 | I'll close there and then I'll come up
to the editing window and click the red X
| | 03:16 | and there you have it.
| | 03:17 | A bar chart that shows the association
between income and level of education.
| | 03:23 | So bar charts are a great way to
show the association between categorical
| | 03:27 | variables and scale variables in general.
| | 03:29 | They are very clean and very easy to interpret.
| | 03:32 | As a note, one of the nice things
about SPSS is that it keeps things clean.
| | 03:36 | So while it's possible to edit the
bars and give them shadows or a false third
| | 03:40 | dimension, those options are hidden,
which is good, because they are
| | 03:44 | almost always bad ideas.
| | 03:46 | Those sorts of effects are often
called chart junk and most spreadsheets
| | 03:50 | and presentation packages make
it way too easy to engage in these
| | 03:53 | unfortunate practices.
| | 03:55 | SPSS on the other hand keeps things
simple, keeps them clean, and keeps them easy
| | 03:59 | to interpret, which is the
entire purpose of data graphics.
| | 04:02 | Anyhow, with that in mind, we'll move
from bar charts to a fancier kind of
| | 04:07 | display for the association between
a dichotomous variable, that is one
| | 04:11 | which has two categories and a
scale variable, using something called a
| | 04:15 | population pyramid.
| | Collapse this transcript |
| Creating population pyramids| 00:00 | In the last movie we looked at how you
can create bar charts to show the mean or
| | 00:04 | maybe the median, for each
group on a categorical variable.
| | 00:08 | However sometimes, it can be more
helpful to see not just a single summary
| | 00:12 | statistic, but the entire
distribution of scores for each group.
| | 00:16 | One way to do this, provided your
categorical variable is a dichotomy, that is it
| | 00:20 | has just two values, is a variation
on the histogram or bell curve that we
| | 00:24 | looked at back in the
section on univariate charts.
| | 00:28 | In this case what we are going to
create is a pair of back-to-back histograms,
| | 00:32 | what SPSS calls a population pyramid.
| | 00:35 | For this example, I'm going to be
using the Searches.sav data file, and I am
| | 00:40 | going to be looking at relative interest
in NBA, as a search term, and compare
| | 00:46 | that with whether a
state has an NBA team or not.
| | 00:49 | Now I am going to do this by going up
to Graphs, to Chart Builder, and from
| | 00:54 | there, I come down to Histogram, because the
pyramid plot is a variation on the Histogram.
| | 01:00 | This one on the far right, Population
Pyramid, I drag that up to the canvas,
| | 01:05 | and then what I'm going to do is I am
going to come on this variable list and
| | 01:09 | scroll down until I find the results
for NBA as a Google search term, and I
| | 01:15 | take that over to the distribution variable.
We are trying to find out how common that is.
| | 01:19 | Then I am going to split it by
whether the state has an NBA team.
| | 01:24 | That's this variable right here and I
take that up to the split variable, and
| | 01:28 | from there I can just press OK.
| | 01:31 | And what we find in this one is that
the states that have an NBA team, the
| | 01:36 | ones on the right side in the green,
tend to have the higher scores on the
| | 01:41 | relative interest in NBA as a search
term in Google, as opposed to the states
| | 01:45 | that don't have NBA teams.
| | 01:47 | For instance, on the right we see that
there are two states that have relative
| | 01:52 | interest in NBA, right around three
standard deviations above the mean.
| | 01:56 | On the other hand we see of the states
that don't have NBA teams, a lot of
| | 02:01 | them are below zero,
around negative one.
| | 02:04 | And so this is a way of looking at
things back to back in Histogram and making
| | 02:08 | the differences between
the two sets really obvious.
| | 02:11 | Now if you want to, you can double-
click on this chart and you can change the
colors on each side. You can change the bins.
| | 02:18 | You can change the number of decimal
places on the side, the same way that we've
| | 02:23 | edited nearly everything else.
| | 02:25 | But this one is probably clear enough as it is.
| | 02:28 | So a population pyramid, that is, a back-
to-back histogram, this can be a new way
| | 02:34 | to compare the distribution of a
scale variable across two different groups.
| | 02:39 | Like a regular Univariate Histogram,
it lets you examine the shape of the
distribution, lets you check visually
for outliers, and lets you identify any
| | 02:46 | possible quirks in the data that
might throw off later analyses.
| | 02:50 | In the next movie, we will look at
one final display for showing the
| | 02:53 | association between the categorical
variable and a scale variable, what's
| | 02:58 | called grouped boxplots.
| | Collapse this transcript |
| Creating simple boxplots for groups| 00:00 | In this movie, on graphing the
association between two variables, we will
| | 00:04 | look at what SPSS calls simple
boxplots, which is a series of boxplots for
| | 00:09 | a single scale variable, broken down by
the groups in a single categorical variable.
| | 00:15 | One of the main benefits of this
particular chart is that it allows you to check
| | 00:19 | for outliers separately for each group.
| | 00:22 | This is important because a variable
may not have any outliers, when all of
| | 00:26 | the cases are considered together, but can
have an outlier when groups are separated.
| | 00:31 | For example, enough people in the
sample might be 6'4" tall that it might not
| | 00:36 | be considered an outlier overall, but
it almost certainly would be an
| | 00:39 | outlier, if you looked at the
heights of men and women separately.
| | 00:42 | So, here's how to break
boxplots down by various categories.
| | 00:47 | For this example, I am going to be
using the Searches database again from
Google, Searches.sav, except in
this case I am going to be looking at the
relative interest in searching for this
one variable, Modern Dance as a search
| | 01:00 | term and break it down by region.
| | 01:02 | To do this, I am going to go up to
Graphs, to Chart Builder, and I am going to
| | 01:07 | come down to Boxplot, and I am going
to take this first one which is called
| | 01:11 | the Simple Boxplot and drag it up to
this canvas, and from there I'm going to
| | 01:16 | get the Region variable, that's this one, Census
Bureau region, and drag that down to the X axis.
| | 01:22 | Then I'm going to get the variable
that shows the relative interest in Modern
| | 01:27 | Dance as a search term. From there I'm
going to add group and point IDs. This is
| | 01:32 | helpful when you're labeling outliers,
which often show up in boxplots.
| | 01:37 | So I'm going to come down and click on
Point ID label, and then I am going to
| | 01:42 | get the State Code from the variable
list, and drag that over, and that's all I
| | 01:47 | need for right now. So I am
going to come down and press OK.
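A hedged syntax sketch of the same grouped boxplot, assuming the variables are named modern_dance, region, and state_code, looks like this:
* Boxplots of one scale variable broken down by a categorical variable,
* with outliers labeled by the ID variable.
EXAMINE VARIABLES=modern_dance BY region
  /ID=state_code
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.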
| | 01:51 | And what you find rather surprisingly
is that Utah is an extraordinarily
| | 01:57 | high outlier on the far right, being
four-and-a-half Standard Deviations above
| | 02:02 | the national average in the relative
mind share, or interest, in Modern Dance as
| | 02:07 | a Google search term.
| | 02:08 | You might associate Modern Dance with
a city like New York in the Northeast,
| | 02:14 | and you do see that New York is an
outlier on the left side, but still it's at
| | 02:18 | only about a value of one
standard deviation above the mean.
| | 02:22 | And you can see that there are others
at a much lower interest, and the Midwest
| | 02:25 | is generally below 0,
that is, they are negative.
| | 02:29 | And so, this is a good way of
looking at the relative differences in
| | 02:33 | distributions especially in
outliers of one group across another.
| | 02:38 | The Simple Boxplot is a great way to
compare the distributions of a single
| | 02:42 | scale variable, for the different
groups in the categorical variable, and again
| | 02:47 | because it's especially important to
identify outliers because they can wreak
| | 02:52 | havoc with the statistical procedures,
| | 02:54 | it's an important consideration before
going on to further analysis, like the
| | 02:58 | inferential statistics for
associations that we will cover in the next several movies.
| | Collapse this transcript |
| Creating side-by-side boxplots| 00:00 | In the last movie on graphing, we
looked at how SPSS could create boxplots for
| | 00:05 | a single scale variable broken down by the
groups in a single categorical variable.
| | 00:10 | Another variation on boxplots that can
be handy is to show boxplots for several
| | 00:15 | different variables side-by-side, and
while this isn't technically a chart of
| | 00:20 | the association between variables,
| | 00:21 | it's a very useful chart that
addresses multiple variables.
| | 00:25 | These side-by-side boxplots work well as
a shortcut method for checking outliers
| | 00:31 | on several variables at once.
| | 00:32 | They are a great presentation graphic
for showing the distribution of several
| | 00:37 | variables and that way they could be
considered a much more compact alternative
| | 00:42 | to showing multiple histograms.
| | 00:44 | The only real catch is that your
variables need to be on the same scale, for
| | 00:48 | instance they could all be opinion
questions on a 1 to 5 strongly disagree to
| | 00:53 | strongly agree scale, or they could all
be dollar values in thousands of dollars.
| | 00:58 | The other trick is that this feature
was not included in SPSS's otherwise
| | 01:02 | remarkable and comprehensive Chart
Builder. Instead we will need to use what
| | 01:06 | SPSS calls a legacy dialog
and here is how it works.
| | 01:11 | For this example I am going to be using
the Google Searches information because
| | 01:14 | I have multiple interesting
variables on the same scale.
| | 01:17 | I am going to go to Graphs, down to
Legacy Dialogs, and from there I go down near
| | 01:25 | the bottom to Boxplots.
| | 01:27 | Now I have a choice here of Simple
which means without breaking things down by
| | 01:32 | group or Clustered where I am
breaking things down by groups.
| | 01:35 | In this particular case I want to
choose this option that says Summaries of
| | 01:39 | separate variables, and I click Define.
| | 01:42 | All I need to do is pick the
variables that I want to put in.
| | 01:45 | Just to show what you are able to do,
I am going to take all of the Google
| | 01:48 | Search terms from SPSS down through
FIFA and put them into Boxes Represent.
| | 01:56 | Also, because when you are looking for
outliers you often want to know who they are,
| | 02:00 | I am going to take the State Code
variable, right here, and put that in here to
| | 02:05 | Label Cases by, and that's all I need to do.
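This legacy dialog pastes EXAMINE syntax; a trimmed sketch, with hypothetical names standing in for the real variable list, would be roughly:
* Side-by-side boxplots of several variables measured on the same scale.
* spss TO fifa assumes the search-term variables sit next to each other in the file.
EXAMINE VARIABLES=spss TO fifa
  /ID=state_code
  /COMPARE=VARIABLES
  /PLOT=BOXPLOT
  /STATISTICS=NONE.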
| | 02:09 | Now, I click OK and what we get is the
syntax pasted at the top and then we have
| | 02:15 | what's called a Case Processing Summary.
| | 02:17 | It's simply SPSS telling me how many
cases it used, that we had valid data on all
| | 02:23 | 51 cases, which is convenient.
| | 02:25 | And then below that is the actual chart.
| | 02:27 | Now this is a very busy chart and I am
going to show you there is a couple of
| | 02:31 | ways that we can clean this up and
make it even easier to deal with.
| | 02:34 | I am going to double-click on it and
the first thing I am going to do is I am
| | 02:38 | going to transpose the chart and turn
it sideways by going to the upper-right
| | 02:43 | and clicking on this button that
says Transpose chart coordinate system.
| | 02:47 | From there, I can change
various elements of the chart.
| | 02:50 | I am going to change the colors by
double-clicking on those and I will just
| | 02:55 | change them to something else.
| | 02:58 | Also, I am going to change the markers
for the outliers and I will make them a
| | 03:03 | little smaller and I will put them
in the same fill and apply those.
| | 03:09 | I will do the same thing for Utah over
here, except that's nearly invisible now.
| | 03:18 | I will use a darker one. There we go!
| | 03:24 | Okay, then I'll make the text over here
slightly larger and what I can see from
| | 03:32 | here is that each of these variables
was designed by Google to be centered
| | 03:38 | around 0 because that's the national average.
| | 03:41 | What it's showing us is states that
are above or below the national average.
| | 03:45 | We see for instance that Washington D.C.
is an outlier on several of them, for
| | 03:51 | Totally Lost, for Data Visualization,
and for Statistically Significant as well
| | 03:55 | as Regression and SPSS.
| | 03:59 | We can see that there is only one low outlier
anywhere, and that's Arkansas on American Idol.
| | 04:05 | Finally, the furthest outlier we have
on anything is on Modern Dance and it's
| | 04:12 | Utah, which is over 5 standard
deviations above the national average which is
| | 04:16 | pretty extraordinary.
| | 04:17 | Anyhow, you can see that a side-by-side
boxplot gives a quick and a compact way
| | 04:23 | to look at the distributions of
several scale variables at once.
| | 04:27 | You can check for outliers. You can
also use them as presentation graphics.
| | 04:31 | It's a handy alternative to multiple
histograms and you should always consider
| | 04:35 | the side-by-side boxplots when you
have several scale variables that you want
| | 04:39 | to analyze together.
| | Collapse this transcript |
|
|
8. Descriptive and Inferential Statistics for Two Variables | Calculating correlations| 00:00 | Whenever you explore your data
you'll find that each step can build on
| | 00:04 | the others before it.
| | 00:06 | In this course for example we
started by looking at individual variables
| | 00:10 | before looking at pairs of variables and
that comes before looking at sets of variables.
| | 00:15 | When we looked at individual
variables we started by creating graphic
| | 00:19 | displays for each variable.
| | 00:21 | Then by computing descriptive
statistics for each and finished with
| | 00:24 | inferential statistics.
| | 00:25 | There is a logical progression to this
and it's one that we will follow here
| | 00:30 | with the associations for pairs of
variables and later for sets of variables.
| | 00:35 | The first procedure that we are
going to look at, correlations, is the most
| | 00:39 | general measure of
association between pairs of variables.
| | 00:42 | Let's look at how to do correlations in
SPSS and how to interpret the results.
| | 00:46 | For this example, I'm going to be using
the same dataset I've used in the last few.
| | 00:51 | It's about the Google Searches,
Searches.sav, and to get correlations we need to
| | 00:57 | go up to Analyze and then we come down
to Correlate, and what we are going to be
| | 01:02 | doing is the basic version called
Bivariate or two variable correlations.
| | 01:07 | All you need to do here is take all
the variables that you want to correlate
| | 01:10 | with each other and put them in
the variable list on the right.
| | 01:15 | Now if there is one variable in
particular that can serve as an outcome
| | 01:18 | variable, it's helpful to put that one in
first so it shows up at the very top of the list.
| | 01:24 | In this particular example I thought
it might be interesting to look at the
| | 01:27 | relative interest in searching for Facebook.
| | 01:30 | So I am going to put that in first,
and then I'll see how that compares with
| | 01:34 | other search terms by selecting all
of these, and I might as well put in
| | 01:38 | nearly everything here.
| | 01:41 | I am going to come down to Median Age, because
all of these are either scale or dichotomous.
| | 01:48 | Now I am not going to put in Census
Bureau Region because that has four
| | 01:52 | categories and Census Bureau
Division because it has even more.
| | 01:56 | However, you can use indicator
variables and what I've done is I've created
| | 02:00 | three indicator variables.
| | 02:02 | One for whether a state is in the
Northeast, another for the Midwest, and a
| | 02:06 | third for the South, and what that does is
it leaves the West implied in all of these.
| | 02:12 | So I am going to add the three
of those and put them over here.
| | 02:17 | Now I have a few options with correlation.
| | 02:19 | I can get three different kinds of correlations.
| | 02:22 | There is the Pearson Product-Moment
Correlation coefficient which is the
| | 02:25 | standard correlation, also sometimes
known by its symbol R. There's Kendall's
| | 02:30 | Tau-b and there is the Spearman
rank order correlation coefficient.
| | 02:34 | Truthfully, I've never had to use
anything other than the Pearson and I
| | 02:38 | recommend that you stick with that one.
| | 02:39 | There's also Test of Significance.
| | 02:43 | You can do what's called a one-
tailed test or a two-tailed test.
| | 02:47 | Now this has to do with calculating
false positive rates and I recommend that
| | 02:52 | you always stay with a two-tailed test
unless you have some super-compelling
| | 02:56 | reason to go with the one-tailed.
| | 02:59 | Also, we have the option of flagging
statistically significant correlations.
| | 03:02 | That's very helpful and I'd leave that on
there, and let's come over here and take
| | 03:06 | a quick look at the other options.
| | 03:09 | You can also get means and standard
deviations for each variable, but we don't
| | 03:13 | need that at this point, because
we should have done that already.
| | 03:16 | You can get what are called cross-
product deviations and covariances and that's
| | 03:19 | a little technical and we don't need that.
| | 03:22 | The other question is whether you want
to exclude cases pairwise or listwise.
| | 03:26 | I've mentioned these before.
| | 03:28 | Pairwise means that you might have a
different sample size for each set of
| | 03:32 | correlations. If for instance everybody
has data on two particular variables, but
| | 03:38 | you're missing a lot of information on
another variable, you would end up with
| | 03:41 | different sample sizes.
| | 03:43 | This isn't necessarily a problem
and I usually leave it at pairwise.
| | 03:46 | However, there may be times when you
only want to deal with cases with complete
| | 03:51 | information, in which case
you would choose listwise.
| | 03:53 | But I am going to leave it
at the default for right now.
| | 03:55 | So I'll press Continue and I'll press OK.
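The dialog pastes a CORRELATIONS command; a shortened sketch with a few hypothetical variable names would look like this:
* Pearson correlations, two-tailed tests, pairwise deletion of missing data;
* NOSIG flags statistically significant correlations with asterisks.
CORRELATIONS
  /VARIABLES=facebook spss regression american_idol nba median_age
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.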
| | 03:58 | Now I asked for a lot of variables and
so what I get here is a very large table.
| | 04:03 | You can see that it goes down a
long way and it goes across a long way.
| | 04:08 | You can also tell that the labels aren't
there and when we scroll down it's hard to see.
| | 04:12 | But that's okay, and what you see here is
that every variable is listed down this side.
| | 04:19 | We have Facebook to SPSS to Regression
as Google Searches, and we have the same
| | 04:23 | variables listed across the top:
Facebook, SPSS, Regression, and so on.
| | 04:28 | Then what you have is a cell
that gives information about the
| | 04:31 | association between each pair.
| | 04:33 | In each cell the top number
is the Pearson correlation.
| | 04:37 | That's the actual correlation coefficient.
| | 04:39 | It goes from 0 to 1 in size, where 0 means no
linear relationship and 1 indicates a perfect
| | 04:46 | linear relationship.
| | 04:47 | It can be positive or negative.
| | 04:50 | The positive or negative has nothing to
do with the strength of the relationship.
| | 04:53 | It only indicates whether it's an
uphill or downhill relationship.
| | 04:57 | The second number is the Sig. (2-tailed).
| | 05:00 | This is the probability value that's
associated with the significance test for
| | 05:04 | the correlation, and the third one is
the N or the number of cases that go into
| | 05:10 | calculating that particular correlation.
| | 05:13 | This dataset has complete data for all 51 cases.
| | 05:16 | That's the 50 states and Washington, D.C.
| | 05:19 | Additionally, you see that down the
diagonal we have a series of 1s and blanks and 51s.
| | 05:26 | That's because it's each variable
correlated with itself which will always be a
| | 05:30 | perfect positive correlation, and
truthfully some programs just don't put
| | 05:34 | anything there at all.
| | 05:35 | But let's say I'm interested in the
relative interest in each state in
| | 05:41 | searching for Facebook.
| | 05:43 | Then what I want to do is I
want to go down this first column.
| | 05:46 | It says Facebook at the top and I want
to scroll down and I want to look for
| | 05:49 | statistically significant correlations.
| | 05:52 | Now SPSS makes this easy, because
they will put asterisks next to
| | 05:56 | statistically significant correlations.
| | 05:58 | So you see for instance the top
is Facebook correlated with itself.
| | 06:02 | That doesn't really mean anything.
| | 06:03 | Facebook and SPSS have a correlation of -.184.
| | 06:08 | It's not a very strong correlation.
| | 06:10 | It's closer to 0 than it is to
+ or -1 and you can tell that its
| | 06:14 | probability value is .196.
| | 06:15 | It's nowhere close to
statistically significant.
| | 06:19 | However, we do see that in the next
few we have statistically significant
| | 06:24 | negative correlations.
| | 06:25 | The higher a state's interest in
Facebook the lower its interest in searching on
| | 06:31 | Google for regression or statistically
significant or business intelligence.
| | 06:35 | We can scroll down and see some more.
| | 06:37 | Similarly, lower interest in data
visualization, they're also less likely to use
| | 06:42 | the term totally lost.
| | 06:44 | On the other hand, states that show a
relatively high interest in Facebook also
| | 06:48 | show a relatively high interest
in searching for American Idol.
| | 06:52 | That's a correlation of .516, and
that probability value of .000 is not
| | 06:58 | actually a 0; it means that
it rounds off to less than .001.
| | 07:03 | As we scroll down we see that
Modern Dance goes with it, and so does NBA.
| | 07:07 | Interestingly, NFL does not
correlate, but the NBA and FIFA do.
| | 07:13 | Also, as we scroll down we can see
that states that have an NFL team show a
| | 07:18 | lower interest in Facebook,
similarly for an NBA and MLS.
| | 07:22 | It's just a whole series of
correlations that show things that can be used to
| | 07:26 | predict the level of
interest in a particular item.
| | 07:30 | Now the most important thing probably
to remember here is that correlations are
| | 07:35 | simply associations.
| | 07:36 | They don't explain why the
variables are associated.
| | 07:39 | It's simply a predictor.
| | 07:41 | The matter of explaining why they are
correlated is a whole different issue
| | 07:45 | about causation and something
that we need to be careful about.
| | 07:49 | So in summary, correlations are a great way
to look at the strength of associations
| | 07:53 | between two variables.
| | 07:55 | Correlations are general purpose:
they can be used with scale variables,
| | 07:58 | ordinal variables or dichotomous
variables, and they can give a good way to
| | 08:02 | compare associations
across a number of procedures.
| | 08:05 | For that reason it's a good idea to
always include correlations in your analyses.
| | 08:10 | However, there are also some more
specialized procedures that are helpful to use
| | 08:14 | and we will turn to those next.
| | Collapse this transcript |
| Computing a bivariate regression| 00:00 | In the last movie, we use correlations
to look at the strength of association
| | 00:05 | between two variables.
| | 00:06 | However, correlations are standardized measures.
| | 00:10 | That is, they don't
involve a unit of measurement.
| | 00:12 | It's not a correlation of
0.78 meters or anything.
| | 00:16 | It's just a correlation of 0.78.
| | 00:19 | And while that can be really handy,
because it makes it easier to compare
| | 00:22 | associations across different kinds of
variables, it can also be really nice
| | 00:26 | to put the association
back into the original metric.
| | 00:30 | To do that we'll look at another
procedure that's very closely related to
| | 00:33 | correlation and that has many of its
advantages, but that also uses the original
| | 00:38 | units of measurement.
| | 00:39 | That is bivariate linear regression.
| | 00:42 | As a note SPSS has a wonderful new
procedure called Automatic Linear Modeling
| | 00:47 | that also performs linear regression
which we'll cover a little bit later.
| | 00:51 | For now though, it makes more sense to
stick to the standard linear regression,
| | 00:54 | because we're only using one predictor
variable and automatic linear modeling
seems a little like overkill for that.
| | 01:01 | And second, automatic linear modeling
does an awful lot of work behind the
| | 01:05 | curtains and it's kind of nice to
keep things visible for right now.
| | 01:08 | With that in mind, here's how to do a
bivariate linear regression in SPSS.
| | 01:14 | For this example, we'll be using the
Google Search data again, Searches.sav,
| | 01:17 | where we will be using the
percentage of people in a state with bachelor's
| | 01:22 | degrees or higher as a way of
predicting the relative level of interest in
| | 01:27 | Facebook as a Google Search topic.
| | 01:30 | To do this we go first to Analyze and
then we come down to Regression and we go
| | 01:36 | to the second one down, Linear.
| | 01:39 | We need to take our outcome variable,
that is the thing we're trying to predict,
| | 01:43 | and put it in the Dependent box.
| | 01:45 | This means dependent variable or the
variable that depends on other variables.
| | 01:49 | In this case, that's going to be
Facebook, that is Facebook as a relative
| | 01:55 | interest in Google searches.
| | 01:57 | Independent is the variables that
we're going to use as predictors, in this
| | 02:01 | particular case I'm going to be
using the Percent of Population with a
| | 02:05 | bachelor's degree or higher.
| | 02:08 | Now the linear regression command is
actually tremendously sophisticated and
| | 02:12 | gives tons of options.
| | 02:14 | None of which I'm going to use at this
particular moment. I'm doing the simplest
| | 02:18 | possible version here of simply using
the Percent of Population with a bachelor's
| | 02:24 | degree or higher to predict
Facebook interest on Google Searches.
| | 02:27 | And I'm going to do nothing else at this
moment. All I'm going to do now is press OK.
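For the record, the pasted syntax for this simplest case looks roughly like the sketch below, where bachelors_pct is an assumed name for the education variable:
* Bivariate linear regression: percent with a bachelor's degree predicting Facebook interest.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT facebook
  /METHOD=ENTER bachelors_pct.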
| | 02:33 | And I get a table that tells me the
percent of population with a bachelor's
| | 02:37 | degree or higher and that is using
Facebook interests as a dependent variable.
| | 02:42 | The next table down gives me an
indication of the association. We have a
| | 02:47 | correlation here of 0.644.
| | 02:49 | That's the R. Now it's a capital R here,
because that actually stands for multiple
| | 02:53 | correlation which means you can use several
variables to correlate with a single outcome.
| | 02:58 | Although in this case we only have
two variables so it's still bivariate.
| | 03:01 | And then you have another one here
that's called R Square, and the
| | 03:05 | 0.415 is the square of the
number next to it, the 0.644.
| | 03:10 | And the reason you do this is because
you can't really compare correlation
| | 03:15 | coefficients. They are not linear.
| | 03:17 | A correlation of 0.4 is not twice as
strong as a correlation of 0.2, even though
| | 03:22 | the number is twice as big.
| | 03:24 | Instead, if you square them then you
get numbers that are directly comparable
| | 03:29 | and a correlation of 0.4 squared
becomes 0.16 and a correlation of 0.2
| | 03:33 | squared becomes 0.04.
| | 03:36 | And so the other correlation is
actually four times as strong.
| | 03:39 | You also have something
called Adjusted R Squared.
| | 03:41 | Sometimes people report R Squared,
sometimes they report Adjusted R Squared.
| | 03:45 | An Adjusted R Squared changes the
number according to the ratio of
| | 03:50 | observations to predictors.
| | 03:52 | We also have the Standard Error
of the Estimate that goes into the
| | 03:55 | probability values.
| | 03:58 | And the next table is the
ANOVA table. That's short for analysis of
| | 04:02 | variance and it's an indication of the
statistical significance of the model as a whole.
| | 04:07 | If we had more than one predictor then
this would be an important thing, but
| | 04:11 | because we have only one predictor and
we know it's statistically significant it
| | 04:14 | doesn't really tell us anything extra right now.
| | 04:17 | The next one down from that is
coefficients, and what we see here is the slope
| | 04:23 | and the intercept that we are
familiar with from charting relationships.
| | 04:28 | The Unstandardized Coefficients are the
slope in the intercept in original units.
| | 04:33 | And so what we see is if we're
trying to predict the level of interest in
| | 04:37 | Facebook on a state-by-state basis
we have an intercept here of 3.240.
| | 04:44 | That says start everybody at a score
of about 3.2 standard deviations above the
| | 04:48 | mean, but then for every percentage point
of the population that has a bachelor's
| | 04:53 | degree or higher, subtract a tenth of
a point from that. That's the -0.119.
| | 05:00 | And that means it's a downhill.
| | 05:02 | The higher the level of education,
the lower the interest in Facebook as a
| | 05:06 | Google search term.
| | 05:07 | This will become clearer if I quickly
make a scatterplot of the association
| | 05:11 | between the two variables.
| | 05:12 | I've already shown how to make
scatterplot, so I'm going to go through this
| | 05:15 | a little bit quickly.
| | 05:16 | I come to Graphs to Chart Builder to
Scatter, where I'm going to put level
| | 05:24 | of education here in the X, and I'm
going to put Facebook here in the Y and
| | 05:30 | I'll just click OK. And it's clear.
| | 05:34 | It's a very strong negative association.
| | 05:37 | The higher the percentage of the
population with a bachelors degree, the lower
| | 05:41 | the relative interest in
Facebook as a search term.
| | 05:45 | So the similarities between bivariate
correlation and bivariate regression, which
| | 05:50 | we just did, are pretty
easy to see in this example.
| | 05:53 | They both give the same
standardized effects and the same P values.
| | 05:57 | The difference is that the regression
model also gives the intercept and slope
| | 06:01 | for the model which is a
nice piece of information.
| | 06:04 | Also in a later section we'll see
how this procedure can be very easily
| | 06:09 | adapted to having several predictor
variables, in which case it's called
| | 06:12 | Multiple Regression.
| | 06:14 | And while it's possible to use
categorical predictors in linear regression,
| | 06:18 | the basic approach doesn't work well when
the outcome variable is categorical.
| | 06:22 | Instead, it's more common to use cross
tabulations, which we'll turn to next.
| | Collapse this transcript |
| Creating crosstabs for categorical variables| 00:00 | In the last two movies we looked at ways to
assess the relationships between two variables.
| | 00:05 | We looked at correlations, which work for
pretty much any kind of variable, and we
| | 00:10 | looked at bivariate linear regression, a
closely related procedure, but one that
| | 00:14 | doesn't work with categorical outcome variables.
| | 00:16 | If you do have a categorical outcome
variable and a categorical predictor, you
| | 00:21 | can still use correlations as long
as those variables are coded as 0/1
| | 00:25 | indicator variables.
| | 00:27 | But it's more common to use what's called
a crosstabulation or crosstab for short.
| | 00:31 | This is simply a table with rows and
columns that crosses, hence the name
| | 00:36 | crosstabulation, the combinations
of categories in the two variables.
| | 00:41 | Each box or cell in the table simply
indicates how many people have that
| | 00:44 | particular combination of the two categories.
| | 00:48 | To do this example, I'm going to use
the GSS dataset and I'm going to show the
| | 00:53 | relationship between marital status
in this particular dataset and overall
| | 00:58 | levels of happiness.
| | 01:00 | To do this, I first come up to
Analyze, to Descriptive Statistics.
| | 01:04 | Now this one right here, Tables,
refers to Custom Tables, which is a separate
| | 01:08 | add-in that you pay for in SPSS.
| | 01:10 | But the one that comes standard in
everything is right here under Descriptive
| | 01:14 | Statistics, to Crosstabs.
| | 01:16 | That's the one I'm going to use in this example.
| | 01:18 | All I need to do is specify the variables
that I want to depict the rows and the columns.
| | 01:24 | In this particular example, I'm going
to use Married to separate the rows, so
| | 01:31 | those will be the ones going across.
| | 01:33 | The columns, which I'll use for my
outcome variable, is going to be the indicator
| | 01:37 | of happiness, and that is
near the bottom of the dataset.
| | 01:40 | It's this one called Self-rated Happiness.
| | 01:44 | I'm going to drag that up to the columns.
| | 01:48 | Now if I do this, it will simply give me the
number of people who fall into each category.
| | 01:52 | There are generally a
couple of things I want to add.
| | 01:56 | The first one is under Statistics.
| | 01:59 | I want to add a measure of association
for this with something called a Chi-square.
| | 02:04 | I click on that.
| | 02:06 | That's a statistic that
shows changes in distribution
across categorical variables. Press Continue.
| | 02:13 | The next one is what numbers I
actually want to have in the cells.
| | 02:16 | Now sometimes the two groups, like for
instance Married and Not Married, can be
| | 02:21 | very different sizes in which case
it's hard to compare the raw frequencies.
| | 02:25 | Instead what I might want to do is
break down the percentages so I know what
| | 02:29 | percentage of people who say they're
married, say they're not too happy, or
| | 02:34 | pretty happy or very happy.
| | 02:36 | And the easiest way to do that is with
what's called a Row Percentage, because I
| | 02:40 | want to get the percentage of people
going across who fall into each column.
| | 02:45 | Now if I have my data organized
differently, I might want column percentages,
| | 02:48 | where I look at the percentage of people in
each column who fall into particular rows.
| | 02:53 | Either way. In this one I
just want to use a row percentage.
| | 02:56 | So I'm going to press Continue
now and then I'll just press OK.
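The equivalent syntax, as a hedged sketch with married and happy as assumed variable names, is:
* Crosstabulation with observed counts, row percentages, and a chi-square test.
CROSSTABS
  /TABLES=married BY happy
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW.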
| | 03:02 | And what I have here first
is the Case Processing Summary.
| | 03:06 | This tells me that we had
complete data from 349 people.
| | 03:10 | Now I actually have complete
data on these particular variables.
| | 03:13 | If any of my cases were missing a value
on one or the other of these variables,
| | 03:18 | they wouldn't be included.
| | 03:19 | So crosstabs only work with complete data.
| | 03:22 | This next table is the crosstabulation
itself, and what we have on the left is
| | 03:27 | whether people reported that
they were married or not married, so it's
| | 03:31 | married: yes and no.
| | 03:33 | Across the top we have self-rated
happiness with not too happy, pretty
| | 03:36 | happy, and very happy.
| | 03:38 | And what we see at the end of that is
the totals, so there are 170 people who
| | 03:43 | were married and 179 who were not married.
| | 03:46 | It's coincidental that we have
very close numbers on these ones.
| | 03:50 | And what you can see as we go across
is the percentage of people who were
| | 03:54 | married, who said for instance
they were very happy, was 44.7%.
| | 03:59 | That's 76 people out of 170.
| | 04:02 | On the other hand of the people in this
dataset who were not married, 44 of them
| | 04:06 | said that they were very happy, which
is 24.6%, so it's a lower percentage.
| | 04:11 | The percentages of people who said
they were pretty happy are close to each
| | 04:14 | other for the two groups, 51.2% for those who
are married, and 55.3% for those who weren't.
| | 04:21 | And the percentage of people who
are not too happy changes also.
| | 04:25 | We have 4.1% of the people who are
married saying they weren't too happy and 20.1%
| | 04:30 | of the people who weren't
married saying they weren't too happy.
| | 04:33 | The last table is called Chi-Square Tests.
| | 04:36 | That's the inferential statistic here
and we're looking at the top one that says
| | 04:40 | Pearson Chi-Square. The actual
value of the test statistic is 28.653.
| | 04:47 | The next number is what's called the
degrees of freedom, which goes into
| | 04:50 | the calculation of the probability level.
| | 04:53 | It has 2 degrees of freedom in this case.
| | 04:55 | And this third number is the
asymptotic significance level (2-sided).
| | 04:59 | That's the probability level
that goes into the hypothesis test.
| | 05:03 | In this case, it shows up as .000.
| | 05:06 | It's not actually 0 all the way through,
but it's a number that is smaller than .001.
| | 05:11 | And what this shows us is that the
distribution of self-rated happiness is
| | 05:15 | different for the two groups
on the marital status variable.
| | 05:19 | It's important to remember again,
this is simply showing a correlation of
| | 05:23 | self-reported variables.
| | 05:25 | And why there might be an apparent
association between these two is a whole
| | 05:29 | different issue, but that's
true of any measure of association.
| | 05:33 | And so a crosstabulation is a great
way to show the relationship between two
| | 05:37 | categorical variables.
| | 05:39 | By selecting the row or column
percentages, you can make it easier to
| | 05:42 | compare the groups.
| | 05:44 | And the chi-square inferential test
lets you know whether any differences you
| | 05:48 | see are large enough to
become statistically significant.
| | 05:51 | And again, it's worth remembering that
if your categories are dichotomies with
| | 05:54 | only two groups, like yes/no or
male/female, and if the variables are coded as 0/1
| | 05:59 | indicator variables, then you can also
get a correlation coefficient for the
| | 06:03 | association that will have the
same result on the significance test.
| | 06:07 | That is, it'll have the same
probability value and the same result in terms of
| | 06:11 | rejecting or retaining the null hypothesis.
| | 06:14 | However, the row and column
percentages are a nice perk of the crosstabs
| | 06:18 | procedure and in any case, if your
variables have more than two categories, then
| | 06:22 | you would want to do the
crosstab and Chi-square anyhow.
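To illustrate that equivalence, here is a hedged sketch in syntax; married01 and happy01 are hypothetical 0/1 recodes, not variables that actually exist in the GSS file:

* Sketch only: correlating two hypothetical 0/1 indicators.
* For a 2x2 table, the significance matches the chi-square from CROSSTABS.
CORRELATIONS
  /VARIABLES=married01 happy01
  /PRINT=TWOTAIL NOSIG.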
| | 06:25 | And with that in mind, the next several
movies will address ways to investigate
| | 06:30 | the mean scores on scale
variables for different groups.
| | Collapse this transcript |
| Comparing means with the Means procedure| 00:00 | In the last few movies we've
discussed a few different ways to look at the
| | 00:04 | association between pairs of variables.
| | 00:06 | We looked at the correlation
coefficient, which is an excellent general purpose tool,
| | 00:10 | and we looked at bivariate
regression which works really well when your
| | 00:13 | outcome variable is a scale variable.
| | 00:16 | We also looked at cross tabulations for
when you have two categorical variables.
| | 00:20 | But another very common situation is
when you want to compare the means of two
| | 00:23 | or more groups, or one group
at more than one point in time.
| | 00:27 | Although it's possible to do this with
correlations and regression, if you code
| | 00:30 | group membership as 0/1 indicator
variables, it's often easier to use specialized
| | 00:35 | procedures for comparing
group means for a few reasons.
| | 00:39 | First, they generally give you the
group means along with the inferential tests
| | 00:43 | and maybe even charts of the means.
| | 00:44 | So you can get more done with a single command.
| | 00:47 | Second, these procedures often provide
explicit tests for the assumptions behind
| | 00:51 | the tests, such as the groups
having equal spread in their scores.
| | 00:55 | Third, the test statistics that they
give, often the t-test or an analysis of
| | 01:00 | variance, depending on which procedure
you use, are the most common statistics
| | 01:04 | for group comparisons, and so they
may be more familiar to more people.
| | 01:09 | Now one of the recent additions to
SPSS is the flexible means procedure.
| | 01:13 | What's nice about this is that
previously you had to choose different tests
| | 01:17 | if you were comparing two groups or if you were
comparing the means of more than two groups.
| | 01:23 | And we will in fact cover these
procedures in the next few movies.
| | 01:26 | The means procedure on the other hand can
handle either situation, so let's see how it works.
| | 01:32 | For this example, I'm going to be using
the GSS dataset, General Social Survey
| | 01:36 | that I've used before.
| | 01:37 | And to compare means, I need to come
up to Analyze, to Compare Means, then I
| | 01:43 | choose the first one, Means.
| | 01:45 | And from here I need to choose the
variable that I want to look at as a
| | 01:49 | dependent or the outcome
variable, the thing that I think group
| | 01:52 | membership affects.
| | 01:54 | In this particular case, I'm
going to use Family Income.
| | 01:56 | So I can click that and I can drag it up there.
| | 02:00 | Then I need to look at the Independent list.
| | 02:03 | Those are the variables that I think
will be associated and produce changes or
| | 02:08 | simply be associated with family income.
| | 02:11 | In this particular one, I'm going to
choose a cultural variable, I'm going to
| | 02:15 | scroll down here, and I'm going to
choose whether a person attended a dance
| | 02:20 | performance in the last year.
| | 02:21 | I'll click and move that
into the Independent list.
| | 02:23 | Then I'm going to come up to Options
and I have the possibility here of getting
| | 02:29 | a huge number of statistics, including
some relatively esoteric things like the
| | 02:34 | harmonic mean and the geometric mean.
| | 02:36 | The mean, the number of cases, and the
standard deviation, on the other hand, are
| | 02:39 | good defaults, though I'd like to
have them in a slightly different order.
| | 02:42 | So what I'm going to do is I'm
going to click to get these out, just
| | 02:46 | double-clicking, and then I'll bring
them back in with a number of cases first
| | 02:50 | and then the mean and
then the standard deviation.
| | 02:53 | Also I'm going to come down to the
bottom here where it says Statistics for
| | 02:57 | First Layer and check the
first box for ANOVA table and eta.
| | 03:02 | ANOVA is short for Analysis of Variance
and it will give me an inferential test
| | 03:07 | about whether the means for the groups differ.
| | 03:10 | And eta is similar to the correlation
coefficient except it can be used when
| | 03:14 | there are more than two groups.
| | 03:16 | So I'm going to select that one and
I'm going to press Continue and then
| | 03:20 | I'll press OK again.
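For reference, those dialog choices correspond to syntax roughly like the following sketch; income and dance are assumed stand-ins for the actual GSS variable names:

* Sketch of the Means procedure above; variable names are assumptions.
MEANS TABLES=income BY dance
  /CELLS=COUNT MEAN STDDEV
  /STATISTICS=ANOVA.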
| | 03:23 | And what I get is several tables that show up.
| | 03:26 | The first table is the Case Processing
Summary and it lets me know that I had
| | 03:30 | complete data for all 349 cases
in the dataset, so that's good.
| | 03:34 | The second table labeled Report gives
me the actual statistics, the descriptive
| | 03:39 | statistics for my two groups on family income.
| | 03:42 | So for instance, we see that there were
273 people who had not attended a dance
| | 03:48 | performance in the previous year and
their average family income was about
| | 03:52 | $29,000 with a standard
deviation of almost $26,000.
| | 03:57 | On the other hand, there were 76 people
who had attended the dance performance
| | 04:01 | in the last year and their average
income for the family was nearly $47,000,
| | 04:05 | so that's much higher.
| | 04:08 | And they had a standard
deviation of about $36,000.
| | 04:12 | So you can see there is a very
substantial difference there in the means,
| | 04:16 | although the standard
deviations are also rather large.
| | 04:19 | The next table, the ANOVA table,
short for Analysis of Variance, is
| | 04:25 | the inferential test to let us know
whether these two means differ statistically
| | 04:28 | significantly from each other.
| | 04:30 | The important number here is in
the very last column under Sig.
| | 04:34 | That's the probability level or the
significance level of this particular
| | 04:38 | result and it says .000.
| | 04:40 | It's not literally 0.
| | 04:42 | It simply is less than .001.
| | 04:45 | And this tells us that there is a
statistically significant difference
| | 04:49 | between these two means.
| | 04:50 | On the other hand, there's also the
question of how big is the effect and
| | 04:54 | that's what we get from the fourth
table that says Measures of Association.
| | 04:58 | It looks at the association and
gives us a statistic called eta.
| | 05:02 | And that is a version of the
correlation, or analogous to the correlation, that
| | 05:06 | can be used even when there are
more than two groups.
| | 05:09 | Now our value here is .252.
| | 05:10 | Eta, like the correlation
coefficient, goes from 0 to 1.
| | 05:16 | And here we see that it's not terribly
high, but it is clearly above zero, and the Eta
| | 05:21 | Squared is an indication of how much of
the variance in the family income can be
| | 05:26 | explained by group membership, by
knowing whether a person attended a
| | 05:30 | dance performance in the last year or not.
| | 05:32 | And here we see it's .064.
| | 05:34 | That can be read as a proportion, about 6%.
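For reference, the eta squared reported here is the between-groups sum of squares from the ANOVA table divided by the total sum of squares, and eta is its square root:

\[
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}} \approx 0.064,
\qquad
\eta = \sqrt{\eta^2} \approx 0.252
\]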
| | 05:37 | So what we see is that there is a
statistically significant difference in the
| | 05:41 | means between the two groups.
| | 05:43 | It's not huge because the standard
deviations are large, but it does let us know
| | 05:47 | that there is an association, that
people who saw dance performances generally
| | 05:52 | had higher family incomes than people
who had not attended dance performances
| | 05:57 | for whatever reason that might be.
| | 05:59 | So the means procedure is a handy way
to compare the means of any number of
| | 06:03 | groups on any number of variables.
| | 06:05 | Not only does it give the descriptive
statistics and an inferential test,
| | 06:09 | it also gives a measure of association.
| | 06:11 | This makes the means procedure a
flexible and easy way to get a lot of
| | 06:15 | tests done quickly.
| | 06:17 | In the next two movies, we'll look at
the specialized procedures for comparing
| | 06:20 | the means of two groups or two or more
groups, each of which may provide some
| | 06:25 | information and options that
aren't available in the means procedure.
| | 06:28 | So they may be more useful for
you as you explore your own data.
| | Collapse this transcript |
| Comparing means with the t-test| 00:00 | In the last movie we looked at the
general purpose means procedure, which is a
| | 00:04 | recent addition to SPSS's
bag of analytic tricks.
| | 00:08 | That procedure allowed us to compare
for example, the means of two groups
| | 00:11 | on scale variables.
| | 00:13 | However, SPSS also has a
specialized procedure for this comparison
| | 00:17 | that's been around since version 1.0,
back in the mainframe and punch card days. That is
| | 00:21 | the Independent Groups T-Test.
| | 00:23 | It's called the Independent Groups
because it's comparing the means of two
| | 00:27 | different groups as opposed to for
example, the means of the same group on two
| | 00:31 | different variables or two different
points in time which we will cover later.
| | 00:35 | Because this procedure gives a few
more pieces of information than the means
| | 00:38 | procedure does, we will
take a close look at it too.
| | 00:41 | For this example, I am going to use the
same GSS data set and the same variables
| | 00:46 | that I did in the last one, when we
| | 00:46 | looked at the means procedure, so you can
| | 00:49 | compare the results of the two of them directly.
| | 00:52 | To compare the means with the
Independent Means T-Test I go to Analyze, come
| | 00:57 | down to Compare Means, and I go to the third
choice which is the Independent Samples T-Test.
| | 01:04 | From there, I need to pick the Test
Variable, those are the ones that I am
| | 01:08 | looking at for outcomes.
| | 01:10 | In this particular case, I am
going to look at Family Income.
| | 01:13 | You can see, however, that I
can do a lot more at once.
| | 01:18 | Then I have the Grouping Variable,
sometimes called the independent variable or
| | 01:21 | the predictive variable.
| | 01:23 | It's the groups with the different means.
| | 01:25 | In this case, I am going to use
the Dance Performance question.
| | 01:28 | So I can click that and I
move it into Grouping Variable.
| | 01:30 | However, with this procedure, I need to
explicitly tell SPSS what the codes are
| | 01:36 | for the two different groups.
| | 01:37 | So I click on Define Groups and I
tell it that I am using a 0 and a 1.
| | 01:43 | Now the interesting thing about this
is that it means that if you have more
| | 01:46 | than two groups, you could select
two at a time to compare them here.
| | 01:51 | Also, if you are using a scale variable as
your predictor, you can select a cut point.
| | 01:56 | For instance, people above 7 on a 0 to 10 scale.
| | 02:00 | But I am just going to put in that this is
a 0/1 indicator variable and I will Continue.
| | 02:05 | Under Options, it asks me what
Confidence Interval Percentage I want to use.
| | 02:10 | 95% is the default and it's used when
you have at least reasonably large samples.
| | 02:16 | It may be that if you have a small sample,
you would want to use a smaller number
| | 02:20 | like 90% or maybe even 80%
but generally we stay with 95%.
| | 02:25 | Also there's a question about whether
I want to exclude cases analysis by
| | 02:29 | analysis. That matters if I had several
variables I was looking for group differences
| | 02:33 | on: if a case were missing a value on,
for instance, the first
| | 02:37 | variable, it wouldn't be included
there but it would be included in the other
| | 02:40 | analyses. The alternative is to exclude cases
listwise, which means if they're missing
| | 02:44 | the score on any of the
variables, they get left out entirely.
| | 02:48 | That gives you consistent
sample size across tests.
| | 02:51 | Now I'm only doing one outcome variable.
| | 02:53 | So it would give the exact same thing
anyhow. I am just going to leave it as default.
| | 02:57 | I will press Continue, then I will press OK.
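For reference, those choices paste to syntax roughly like this sketch; income and dance are assumed stand-ins for the actual GSS variable names, with dance coded 0/1 as described above:

* Sketch of the independent-samples t-test above; variable names are assumptions.
T-TEST GROUPS=dance(0 1)
  /VARIABLES=income
  /CRITERIA=CI(.95).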
| | 03:02 | And what I have here are a couple of tables.
| | 03:04 | The first one is the Group Statistics.
| | 03:07 | Now this is the same as what
we saw in the Means procedure.
| | 03:10 | This tells us that 76 people said
they saw a dance performance in the last year
| | 03:13 | and that their average income, their
Mean, was about $47,000 with a Standard
| | 03:19 | Deviation of 36,000.
| | 03:21 | On the other hand, we have a new column
here called the Standard Error
| | 03:25 | of the Mean, and that actually is the
standard deviation divided by the square
| | 03:29 | root of the sample size. But it's
something that's used as part of the
| | 03:33 | inferential procedure.
| | 03:34 | So we usually don't need to
deal with that one directly.
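In symbols, with s the group's standard deviation and n the group's sample size, the standard error of the mean is:

\[
SE_{\bar{x}} = \frac{s}{\sqrt{n}}
\]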
| | 03:37 | The second table, it says Independent
Samples Test and this is where we have the
| | 03:41 | inferential procedure. What's
interesting though about doing this command in
| | 03:45 | SPSS is that it actually gives us two
procedures. The first one, in the columns,
| | 03:50 | is Levene's Test
for Equality of Variances.
| | 03:53 | This is a specific test for an
assumption for a valid t-test and the idea here
is that the groups shouldn't be too
different from each other in how spread
| | 04:02 | out their scores are.
| | 04:04 | And what we see here in the top
table is that the one group had a
| | 04:08 | standard deviation of 36,000 and the other
group had a standard deviation of about 26,000.
| | 04:14 | And what the Levene's test tells us is
that these two groups do not have equal
| | 04:18 | variances, which are related
to the standard deviation.
| | 04:21 | As such I really shouldn't use a
standard t-test, which is the one across the top;
| | 04:27 | and instead I should use one that has
something called fractional degrees of
| | 04:31 | freedom, and that's the one on the second row.
| | 04:33 | On the other hand, they give
functionally the same output.
| | 04:38 | Let's look at this test.
| | 04:40 | It says t-test for Equality of Means, and we
have three numbers. We have the t, that's
| | 04:44 | the actual value of the test statistic,
then we have the Degrees of Freedom
| | 04:48 | which is used in calculating the
probability value. The third one, Sig (2-tailed),
is the actual probability value
and the result of the inferential test.
| | 04:57 | In both cases, it comes out as .000.
Again, it's not literally zero.
| | 05:02 | It's just less than .001.
| | 05:04 | So, regardless of which test we use, we
find that there is a highly significant
| | 05:09 | difference in the means between these
two groups. And if I scroll over to the
| | 05:13 | right a little bit, I can see the rest
of this table and what it does is it's
| | 05:17 | giving me a 95% confidence interval of
the difference between the two groups.
| | 05:23 | And you can see, it's slightly different
for these two versions of the t-test, but in
| | 05:27 | either case, we have a
large difference in the means.
| | 05:30 | It's about $18,000, and the confidence
interval runs somewhere between $9,000 and
| | 05:36 | $27,000 for the difference between those who
say they have seen a dance performance in the
| | 05:41 | last year and those who haven't.
| | 05:43 | So the specialized procedure for
comparing the means of the two different
| | 05:47 | groups, the independent samples
t-test, is a convenient test.
| | 05:51 | It provides a few extra options over
the general purpose means procedure and if
| | 05:56 | you have more than two groups you
may want to look at another specialized
| | 06:00 | procedure called the one-way analysis
of variance, which we will turn to next.
| | Collapse this transcript |
| Comparing means with a one-way ANOVA| 00:00 | In the last movie we looked at a
procedure to compare the means of two different
| | 00:04 | groups on a scale variable using what's
called the Independent Samples T-Test.
| | 00:09 | On the other hand, if you want to
compare the means of more than two groups, you
| | 00:12 | would want to use something called
the Analysis of Variance or ANOVA.
| | 00:17 | And although you can use ANOVA with
two group comparisons, and there's a simple
| | 00:21 | conversion formula between the ANOVA
results and the T-Test, it's more common to
| | 00:25 | reserve it for times when
you have three or more groups.
| | 00:28 | What the Analysis of Variance does is
look for any kind of difference between
| | 00:32 | the means of the various groups.
| | 00:34 | That might mean that Group A is
different from Group B is different from Group
| | 00:37 | C, or it might mean that A and B
together are different from Group C or any of
| | 00:43 | several other possible combinations.
| | 00:45 | For this reason you'll want to do a couple
of things when you do an Analysis of Variance.
| | 00:49 | First, you'll want to look at the group
means, such as with a bar chart of the
| | 00:53 | means to see if any natural groupings emerge.
| | 00:56 | Second, you'll want to do
something called a Post Hoc Test.
| | 00:59 | That's for after the fact.
| | 01:01 | That can tell you where the
differences specifically are.
| | 01:04 | We will look at both of these in this example.
| | 01:07 | For this demonstration I am going to
use the Google Searches information in
| | 01:10 | Searches.sav, and to get the Analysis of
Variance what we need to do is go up to
| | 01:16 | Analyze, to Compare Means, to what's
called the One-Way Analysis of Variance.
| | 01:22 | It's called One-Way because we're going
to use a single categorical variable or
| | 01:26 | factor to differentiate between the groups.
| | 01:28 | This is because there are other
versions of the Analysis of Variance where you
| | 01:31 | can have more than one categorical variable.
| | 01:33 | We have just one, so this is
the One-Way Analysis of Variance.
| | 01:37 | You can check more than one variable at a
time by putting it into the Dependent List.
| | 01:40 | These are the outcome variables
where you're looking for differences.
| | 01:44 | In this particular case I'm just
going to use one and I'm going to use the
| | 01:47 | relative interest in searching for the
NFL in Google, and I am going to look
| | 01:52 | for regional differences on that.
So I find the regions of the U.S.,
| | 01:57 | that's Census Bureau Regions,
and I put that under Factor.
| | 02:00 | In the Analysis of Variance the
categorical variable is called a factor and the
| | 02:04 | categories within that
variable are called levels.
| | 02:07 | So we have four groups within the
Census Bureau Region, so we will have four
| | 02:11 | levels in the factor of region.
| | 02:13 | Now we come up and we check a few other things.
| | 02:15 | The first possibility is Contrasts.
| | 02:19 | Now, this is something that we can
ignore, because it's for specialized
| | 02:23 | comparisons, like changes over time
or mathematical combinations of groups,
| | 02:27 | something called planned contrasts, and
we're not doing any of that so we can
| | 02:30 | just ignore this one for right now.
| | 02:31 | I will press Cancel.
| | 02:33 | The second one that we want to look at is
called Post Hoc, again for after the fact.
| | 02:38 | Now, we have a lot of choices here.
| | 02:41 | The most common choices are what are
called the Bonferroni and the Scheffe Tests.
| | 02:46 | They're common, but
statistically speaking, they're not perfect.
| | 02:49 | They tend to be a little
overconservative and their output can be a little
| | 02:53 | complicated in SPSS.
| | 02:55 | For that reason, I prefer to
use a test called the Tukey test.
| | 02:59 | It's named after John Tukey,
the statistician, and its full name is actually the
| | 03:02 | Tukey Honestly Significant Difference Test or
HSD Test, which is what you'll see in the output.
| | 03:07 | So I am going to click on the Tukey Test.
| | 03:10 | Then I will just come down and hit Continue.
| | 03:12 | Now let's take a quick
look at the other Options.
| | 03:16 | I click on the Options and I can get
Descriptive Statistics, which are helpful
| | 03:20 | for this kind of analysis.
| | 03:22 | I can also get a Means Plot.
| | 03:24 | It's a simple line plot, but it's
still helpful for looking at a graphical
| | 03:28 | representation of the
differences between the means.
| | 03:31 | So I am going to click on Means
Plot and then I will click Continue.
| | 03:34 | Now we're back in the main
dialog and I will click OK.
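For reference, those choices correspond to syntax roughly like this sketch; nfl and region are assumed stand-ins for the actual variable names in Searches.sav:

* Sketch of the one-way ANOVA above; variable names are assumptions.
ONEWAY nfl BY region
  /STATISTICS=DESCRIPTIVES
  /PLOT=MEANS
  /POSTHOC=TUKEY ALPHA(0.05).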
| | 03:39 | Here we have several tables that show up.
| | 03:42 | The first one is the Descriptive Statistics.
| | 03:44 | It gives me the mean for each of
the four groups in this Factor.
| | 03:48 | It tells me, for instance, that the
relative interest in searching for the NFL
| | 03:52 | in the Northeast is below average. It's -.36.
| | 03:55 | That means it's about one-third of a
standard deviation below the national average
| | 04:00 | for states in relative
interest in searches for the NFL.
| | 04:04 | The Midwest, on the other hand, is much higher.
| | 04:06 | It's three quarters of a standard
deviation above the mean, with a mean of 0.75.
| | 04:12 | The South is slightly below 0 at -.07.
| | 04:16 | And the West is, again, about a third
of a standard deviation below 0, at -.33.
| | 04:22 | The next column over is the Standard
Deviations and they go from about .8 to
| | 04:26 | 1.1, and they're not hugely different,
and they feed into the Standard Error,
| | 04:30 | which is used for the inferential tests.
| | 04:32 | But otherwise we can ignore these.
| | 04:35 | Now, this is the Analysis of Variance
table or ANOVA table and what it does is
| | 04:39 | on the top corner it tells me that it's
looking at the variable NFL and you see
| | 04:44 | that it's statistically significant.
In the last column under Sig it has .020.
| | 04:48 | That's the probability value for these,
and the general guideline is if it's
| | 04:52 | under .05, it's statistically significant.
| | 04:55 | Beneath that are the results
for the Tukey Post Hoc Test.
| | 04:59 | Now, this first table of Multiple Comparisons
is kind of complicated and we can ignore it.
| | 05:03 | Let's go to the one beneath it.
| | 05:06 | This one is called Homogeneous Subsets
and what this does is it places the
groups like with like, and this
tells us that the Northeast and the West
| | 05:16 | and the South are all relatively
similar to each other in terms of their
| | 05:20 | searching for the NFL on Google.
| | 05:22 | You can see they all have negative means.
| | 05:25 | On the other hand, the second
group is kind of interesting.
| | 05:28 | Midwest is much higher, so that makes sense.
| | 05:30 | The South is still with it and the
reason for that is even though the South
| | 05:34 | and the Midwest have different means, they
still have some overlap
| | 05:38 | given the standard deviations.
| | 05:39 | So they are not significantly different
from each other and this becomes clear
| | 05:42 | if we go down one more
and look at the Means Plot.
| | 05:45 | Here you can see that the Midwest is
much higher, and the South, while it's down
| | 05:49 | lower, is still above
the West and the Northeast.
| | 05:51 | So the Northeast, the South, and the
West all form a group, but the Midwest and
| | 05:56 | the South actually combine as well.
| | 05:58 | But the point here is we are able to
do a lot of comparisons and get a lot of
| | 06:02 | information from this one test.
| | 06:04 | The Analysis of Variance is a very
flexible and useful procedure for comparing
| | 06:08 | the means of several different groups.
| | 06:10 | In combination with a graphical
analysis and Post Hoc Tests, you can get a lot
| | 06:14 | of insight in a little bit of time.
| | 06:16 | In the next movie, however, we'll
backtrack just a little to look at a variation
| | 06:20 | on the T-Test, one in which you can look
at changes over time for a single group
| | 06:24 | of people or look at differences
between two different variables using what's
| | 06:28 | called the Paired T-Test.
| | Collapse this transcript |
| Comparing paired means| 00:00 | In the last few movies, we have
looked at procedures that can compare the
| | 00:03 | average score of two or more
groups on a single variable.
| | 00:07 | However, there may be times when you
are more interested in comparing the same
| | 00:11 | group on two variables, either the same
idea measured at two points in time or on
| | 00:16 | two related variables
that are on the same scale.
| | 00:20 | In that case, you will want to use
something called a paired t-test also known
| | 00:24 | as a within subjects t-test
or repeated measures t-test.
| | 00:28 | The nice thing about this test is that
each person serves as their own little
| | 00:31 | comparison or control group
which makes it much more precise.
| | 00:35 | In fact, what's really going on with
this test is that you are getting the
difference between the two variables for
each person and you're looking at that
| | 00:43 | change between the two and then
simply doing a one-sample t-test on those
| | 00:48 | difference scores, just like
we did in an earlier section.
| | 00:51 | For this example, I am going
to be using a new dataset that's
| | 00:54 | called Success.sav.
| | 00:55 | This is from a survey of adults in the
Midwest on how much money they felt a
| | 01:01 | person needed to earn annually to be
considered successful as the first variable,
| | 01:06 | and then also how much money they felt a
person needed to earn annually in order
| | 01:10 | to be happy and we are looking at
whether there's a difference in the means
| | 01:14 | between these two variables.
| | 01:15 | To do this, we come up to Analyze, to
Compare Means, to the Paired Samples
| | 01:21 | T-test, and what you need to do
is select two variables at a time.
| | 01:25 | This is easy because we only have two variables.
| | 01:28 | So I select the both of them over here,
I am Shift+clicking, and then you move
| | 01:32 | them over to the right as a paired variable.
| | 01:34 | Now let's take a quick look at the Options.
| | 01:37 | You get a Confidence Interval of
the difference as 95% by default.
| | 01:41 | You can change it to 90 or
something if you have a really small sample.
| | 01:45 | Also, you can talk about how you exclude
cases either by analysis-by-analysis or
| | 01:49 | listwise, but since we are only making
one comparison, these will be the same.
| | 01:53 | So I am just going to press
Continue and then here I will press OK.
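For reference, this pastes to syntax roughly like the following sketch; success and happy are assumed stand-ins for the two variable names in Success.sav:

* Sketch of the paired-samples t-test above; variable names are assumptions.
T-TEST PAIRS=success WITH happy (PAIRED)
  /CRITERIA=CI(.95).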
| | 01:58 | We get a few tables of
output from these procedures.
| | 02:00 | The first one gives the
Descriptive Statistics for the two variables.
| | 02:04 | So for instance, we see that for this
particular sample, the average amount of
| | 02:08 | money that people felt a person
needed to make annually to be considered
| | 02:12 | successful was $64,000. That had a
standard deviation of about $35,000.
| | 02:18 | On the other hand, the amount of money
that people thought a person needed in
| | 02:21 | order to be happy was lower at
about $42,000 a year, with a standard
| | 02:26 | deviation of about $33,000.
| | 02:28 | That table also has the standard
error of the mean at the end. That simply
| | 02:31 | goes into calculating the inferential
statistics and we don't need to deal with it directly.
| | 02:35 | The second table is the Paired Samples
Correlation, because these are the scores
| | 02:39 | for the same group of people each
person answered the both of them, you can
| | 02:43 | calculate a correlation and we see here
that we have a statistically significant
| | 02:48 | positive correlation.
| | 02:49 | What that means is people who put a
high answer for one question, for instance,
| | 02:53 | how much you needed to be successful,
are also more likely to put a high answer
| | 02:57 | for how much you needed
to be happy and vice versa.
| | 03:00 | People who put a low answer would
generally put a lower answer for the both of them.
| | 03:04 | But the important question about
whether people put different answers for the
| | 03:07 | two of these is answered in the next one.
| | 03:10 | We see that if we take each person's
response to the question how much money
| | 03:13 | you need to be successful, and
subtract the amount of money you need to be
| | 03:17 | happy, the difference between those
is about $22,000 a year with a standard
| | 03:22 | deviation of $30,602.
| | 03:26 | The standard error for that difference
is next, but we can ignore that and then
| | 03:30 | we have a Confidence Interval for the
difference and this lets us know that
| | 03:33 | while the difference in this
particular sample was about $22,000 a year.
| | 03:38 | In the larger population the difference
between the amount of money you need to
| | 03:42 | be successful and to be happy could
be anywhere between $16,700 and $27,600.
| | 03:49 | The next column says T.
That's the actual inferential test.
| | 03:52 | That's the one sample t-test.
| | 03:54 | We have a value of 8.079.
| | 03:57 | The next column, the degrees of freedom, is
related to how many people there are in the sample.
| | 04:02 | The last one here of interest is
the significance level, the
| | 04:05 | probability value for the hypothesis test.
| | 04:08 | In this case, it says .000 and that
means it is actually less than .001.
| | 04:13 | It's a very small probability value and
this means that this is a statistically
| | 04:16 | significant difference.
| | 04:18 | On the other hand, looking up at the
top table where the first mean was $64,000
| | 04:22 | and the second mean was $42,000,
we can see there is a big difference of
| | 04:26 | $22,000 between what people believe you
need to make to be successful and what
| | 04:31 | you need to be happy.
| | 04:33 | So this example shows another variation
on the procedure that SPSS gives you to
| | 04:38 | compare means, only this time it
compares means on two different variables for
| | 04:42 | a single group of people.
| | 04:44 | I should mention it's also possible
to look at changes across several points in
| | 04:47 | time or differences in the
evaluations of several different products or
| | 04:50 | variables, but those procedures become
rather complicated and we won't address
| | 04:54 | them in this course.
| | 04:56 | We will, however, start looking at ways
to explore the relationships of three or
| | 04:59 | more variables at a time,
starting with the next movie.
| | Collapse this transcript |
|
|
9. Charts for Three or More VariablesCreating clustered bar charts for frequencies| 00:00 | Up to this point, we've covered methods
for looking at one variable at a time as
| | 00:05 | well as methods for looking at the
associations between pairs of variables.
| | 00:09 | In each case and consistent with good
analytical practices, we started with charts
| | 00:13 | because data is usually much
easier to understand visually.
| | 00:17 | Then we've done numerical descriptions
of the variables and associations, and
| | 00:22 | finally, we've done inferential
statistics to generalize beyond the given data.
| | 00:27 | In these last few sections, we'll take
that pattern one more step by looking at
| | 00:31 | methods for exploring the relationships
of three or more variables, first with
| | 00:35 | graphs and then with numbers.
| | 00:37 | A quick word about terminology is in
order: when you look at one variable at a
| | 00:41 | time it's called a univariate analysis.
| | 00:44 | When you look at the associations
between pairs of variables, it's called
| | 00:48 | a bivariate analysis.
| | 00:49 | Therefore it would make sense that when
you're looking at multiple variables, it
| | 00:54 | would be called a multivariate analysis.
| | 00:56 | However, that term multivariate is
typically reserved for situations where you
| | 01:01 | specifically have more
than one outcome variable.
| | 01:05 | Those kinds of statistics are much,
much more complicated than what we're
| | 01:08 | going to be doing, which is using
more than one predictor variable with a
| | 01:13 | single outcome variable.
| | 01:15 | So I will generally avoid the term
multivariate and instead just talk
| | 01:19 | about multiple variables.
| | 01:21 | With that in mind, let's look at our
first chart for multiple variables.
| | 01:25 | And just like when we did charts for
one variable or pairs of variables, we'll
| | 01:29 | begin with bar chart for categorical variables.
| | 01:32 | Just this time, we'll have
three categorical variables.
| | 01:35 | To demonstrate this, I'll use the
General Social Survey dataset in GSS.sav.
| | 01:41 | What we need to do is begin by going
up to Graphs in the menu bar and we
| | 01:46 | come to Chart Builder.
| | 01:47 | Then we come down to Bar, except
instead of Simple, we're going to use
| | 01:51 | Clustered this time.
| | 01:53 | So I drag the Clustered
bar chart up to the canvas.
| | 01:56 | What we're going to look at as an
outcome variable in this particular example is
| | 02:02 | a person's self-rated happiness.
| | 02:04 | Sometimes the easiest way to look at
your outcome variable is to make it so that
| | 02:07 | the colors of the bars go there.
| | 02:09 | So I'm going to take self-rated
happiness and I'm going to drag it over to
| | 02:12 | Cluster on X: set color.
| | 02:15 | Then we need a
categorical variable on the X-axis.
| | 02:19 | I thought it would be interesting to
see whether a person had attended a live
| | 02:22 | drama in the last year.
| | 02:24 | I'll put that on the X-axis.
| | 02:27 | So that's two categorical variables
for using attendance at a live drama to
| | 02:31 | predict self-rated happiness,
but that's just two variables.
| | 02:34 | We need a third one and to do that, we
have to come down to this tab that says
| | 02:38 | Groups and Point ID.
| | 02:40 | I click on that, then I come down to
either adding a Rows panel variable or a
| | 02:45 | Columns panel variable.
| | 02:47 | And all that influences is whether the
charts show up one above the other or
| | 02:50 | one next to the other.
| | 02:52 | In order to keep it compact, I'm
going to do a Rows panel variable.
| | 02:56 | Then I need to add one more
variable that creates pairs of charts.
| | 03:01 | And I'm going to use gender.
| | 03:02 | I'm just going to come right up here to this
one that says Male and drag that over here.
| | 03:10 | And so you see what I'll end up
with is four groups of three bars.
| | 03:14 | Now I just come down to OK
and I can make the chart.
| | 03:18 | There is a lot of code that goes into that,
and we can save that for future reference.
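Chart Builder pastes fairly long GGRAPH/GPL code for this chart. As a rough, simplified alternative, the legacy GRAPH command can produce a similar clustered, paneled bar chart of counts; treat this as a sketch, since drama, happy, and sex are assumed variable names and the /PANEL subcommand is quoted from memory of the legacy dialogs:

* Rough sketch with the legacy GRAPH command; variable names are assumptions.
GRAPH
  /BAR(GROUPED)=COUNT BY drama BY happy
  /PANEL ROWVAR=sex ROWOP=CROSS.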
| | 03:23 | And then what we have here is bar charts.
| | 03:26 | On the left, we have whether people
attended a live drama in the last year.
| | 03:29 | More people have not. It's about 3:1.
| | 03:32 | And then on the right are people
who say they have attended one.
| | 03:36 | The top two are for women.
| | 03:38 | The bottom two are for men.
| | 03:40 | The blue bars are not too happy, the
green bars are pretty happy, and the beige
| | 03:45 | bars are very happy.
| | 03:47 | We do have one small problem with
this chart, and that is that a much smaller
| | 03:52 | number of people have seen
a live drama in the last year.
| | 03:55 | That's because we're charting counts here.
| | 03:57 | A really handy feature in SPSS is the
ability to chart percentages as well.
| | 04:02 | So I'm going to show you
how to go back and do that.
| | 04:04 | I'm going to come back up to our
most recent command, to Graphs, to Chart
Builder, and then here in Element
| | 04:09 | Properties, I have Statistics and it says Count.
| | 04:15 | That's how many people are in each category.
| | 04:17 | I'm going to click on that and instead
I'm going to go to Percentage and that
| | 04:21 | has a question mark
because I have to set a parameter over here.
| | 04:25 | I find the most helpful
one is each X-axis category.
| | 04:29 | So what this'll do, it'll make things
add up to 100% for those who have and for
| | 04:33 | those who have not seen a drama.
| | 04:36 | So I select that. I click Continue.
| | 04:38 | I have to come down here and press Apply
and then I come over here and press OK.
| | 04:45 | And what you'll see now is that the
chart will look slightly different.
| | 04:48 | The biggest difference is
that the bars on the right side,
| | 04:51 | for those who have seen live drama in
the last year, are much larger than they
| | 04:55 | were before because using percentages
has equalized the two groups and it makes
| | 05:02 | it much easier to see the pattern.
| | 05:03 | For instance, we see that,
interestingly, for men who have
| | 05:08 | seen a live drama in the last year, the percentage who
| | 05:14 | are very happy is smaller than the
percentage who were pretty happy.
| | 05:18 | On the other hand, for women, the
percentage of people who were very happy is
| | 05:23 | slightly higher than the percentage of
people who were pretty happy for those
| | 05:26 | who have seen a drama in the last year.
| | 05:28 | On the other hand, for those who have
not seen a drama, the patterns are nearly
identical for men and for women, where
most people are pretty happy, the next
| | 05:36 | group is very happy, and the
least common is not too happy.
| | 05:40 | A clustered bar chart could be a handy
way to depict the relationships of these
| | 05:44 | three categorical variables.
| | 05:46 | However, you'll probably want to chart
percentages instead of counts, but your
| | 05:50 | choice of denominator can make a big
difference on how the final chart looks.
| | 05:54 | This gets back to a point that data
analysis is probably best thought of as a
| | 05:59 | form of storytelling and you want to
choose displays that help you tell your
| | 06:03 | story well or that help the data tell
you something interesting and unexpected.
| | 06:09 | It's worth noting that if your
outcome variable is a dichotomous indicator
| | 06:13 | variable, that's a 0/1, yes/no
variable, then you can sometimes make things
| | 06:17 | easier by charting the mean of the
outcome, which for a 0/1 indicator variable will
| | 06:22 | be the proportion of people who got 1s,
for example, the proportion who are
| | 06:26 | returning customers as
opposed to first-time customers.
| | 06:29 | And this leads us to the next chart
we'll cover, the clustered bar chart for means.
| | Collapse this transcript |
| Creating clustered bar charts for means| 00:00 | In the last movie, we looked at how you
can make a clustered bar chart to show
| | 00:05 | the association between three
different categorical variables.
| | 00:09 | In this movie, we'll look at how
to show the associations between two
| | 00:13 | categorical predictor variables and a single
outcome variable that is scaled or quantitative.
| | 00:18 | For example, you may want to show the
average purchase price of items bought by
| | 00:22 | men and women in two
different retail categories.
| | 00:25 | Surprisingly, this kind of chart is
even simpler than the categorical version
| | 00:29 | we just covered, because that one required
that we use panels to show all three variables.
| | 00:34 | With the scaled outcome though, we
can use just a single panel like this.
| | 00:39 | In this example, I am going to be using the
General Social Survey data. GSS.sav again.
| | 00:45 | To make the chart, let's go
up to Graphs to Chart Builder.
| | 00:50 | From there, we are going to come down
to the Gallery to Bar Charts and choose a
| | 00:54 | clustered bar chart.
| | 00:56 | We'll drag that up here and what we
are going to do is get our two predictor
| | 01:00 | variables: place one on the X-axis, one
on Cluster on X: set color,
| | 01:07 | and the third one, the Y-axis,
will be our outcome variable.
| | 01:10 | In this case, I'm going to try to predict
family income. That will be my outcome variable.
| | 01:16 | So I'll just grab family income and take
it over to the Y-axis and I am going to
| | 01:20 | use two variables to predict that.
| | 01:21 | One is whether a person is a male or female.
| | 01:26 | I am going to drag that down here to
the X-axis. And another one is whether a
| | 01:30 | person has children or not.
| | 01:32 | I'll bring that over here.
| | 01:35 | I think it'd also be helpful to put
on error bars and I'll click Apply.
| | 01:43 | Then I'll come back over here and click OK.
| | 01:47 | There is a lot of code that goes into this
so we can save and reuse later if we want.
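Again, Chart Builder pastes GGRAPH/GPL for this; a roughly comparable chart of means with confidence-interval error bars can come from the legacy GRAPH command, sketched here with income, sex, and kids as assumed variable names:

* Rough sketch with the legacy GRAPH command; variable names are assumptions.
GRAPH
  /BAR(GROUPED)=MEAN(income) BY sex BY kids
  /INTERVAL CI(95.0).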
| | 01:53 | But here's the actual chart.
| | 01:55 | So what we have here is women
on the left and men on the right.
| | 02:00 | People who do not have children are in
blue and people who do have children are
| | 02:04 | in green, and what's charted on the Y-axis is
the mean family income that people reported.
| | 02:10 | What's interesting about this is we have
an interaction and that is, for women,
| | 02:16 | those who do not have children reported
a slightly higher average family income
| | 02:21 | than those who do have children, although
the standard deviation, the spread on
| | 02:26 | these, is pretty big.
| | 02:28 | On the other hand, for men,
the exact opposite is observed.
| | 02:32 | That men who have children report a
substantially higher family income than
| | 02:37 | those who do not have children.
| | 02:38 | That's about $25,000 to $40,000.
| | 02:42 | Now again, all this chart is showing us
there is an association between the variables.
| | 02:46 | It doesn't explain why
those differences are there.
| | 02:48 | There are a lot of reasons that go into
that and it could actually require some
| | 02:51 | pretty nuanced investigation.
| | 02:54 | Nevertheless, this is a very simple
chart that shows how two predictor
| | 02:58 | variables, male/female as one
category, and having children, yes or no, as
| | 03:02 | another, can be used to predict
scores on a third quantitative or scale
| | 03:07 | variable, in this case, family income.
| | 03:10 | So clustered bar charts for means are an
easy and informative way to show how two
| | 03:14 | categorical predictors are associated
with a scaled outcome or an indicator
| | 03:19 | outcome if you are using 0/1 coding.
| | 03:21 | They also give a good idea of what the
results of the inferential test would be.
| | 03:25 | This kind of clustered bar chart can be
one of the most effective tools that you
| | 03:29 | have in exploring, analyzing,
and presenting your own data.
| | 03:33 | In the next movie, we'll look at
another simple variation on a chart for when
| | 03:37 | you have just one categorical
variable and two scaled variables.
| | 03:41 | In this case, the
scatter plot with group markers.
| | Collapse this transcript |
| Creating scatterplots by group| 00:00 | In the last pair of movies we've looked
at the variations on the bar chart that
| | 00:04 | let you use two categorical variables
to predict scores on a third categorical
| | 00:08 | variable or on a scale variable.
| | 00:11 | In this movie, we'll change the balance
a little by looking at a chart for times
| | 00:15 | when you have two scaled
variables and one category.
| | 00:18 | This calls for a simple variation on
the scatter plot that we covered in this
| | 00:22 | section on bivariate graphs.
| | 00:24 | The only big difference is that
we'll be adding group markers for the
| | 00:27 | categorical variable.
| | 00:29 | In this example, I'll use the Google
Searches information from Searches.sav.
| | 00:34 | To get this, I need to go
over to Graphs to Chart Builder.
| | 00:39 | From there, I go down to the bottom-
left of the gallery and I go to Scatter.
| | 00:42 | Now I want to use the second one on
the top, which is called a Grouped Scatter,
| | 00:48 | and I drag that out to the canvas.
| | 00:50 | From there, I need to get my
predictor variables, one a scaled
| | 00:55 | variable and one a category, and my
outcome variable, which is a scaled variable.
| | 01:00 | For this example, I am going to use
interest in the NBA as a search term.
| | 01:04 | So I am going to come over here and get
NBA as a Google search term. I am going
| | 01:11 | to drag that over to the Y-axis.
| | 01:13 | Then I am going to use two predictors.
| | 01:15 | One is I am going to use the median age of
| | 01:24 | people who live in the state. That's median age.
| | 01:24 | That's a scaled variable.
| | 01:24 | So I am going to put it on the X-axis
and then it makes sense to me that
| | 01:29 | interest in the NBA would be related
to whether a state has an NBA team.
| | 01:33 | So I am going to get Has NBA, which is a 0/1
indicator variable, and drag that over to set color.
| | 01:40 | Finally, in a scatter plot, you can
sometimes find unusual points and you want
| | 01:43 | to see who they are.
| | 01:44 | So I am going to come down to
the tab for Groups and Point ID.
| | 01:48 | There I am going to click on
point ID label at the bottom.
| | 01:52 | Back on the canvas this adds a box for
the Point Label Variable, and I am going
to use the state code.
| | 01:58 | So I'll just drag that
over and now I am ready to go.
| | 02:03 | Press OK and I get a slightly
complicated chart because of all the data names.
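For reference, a roughly equivalent grouped scatterplot with case labels can be produced with the legacy GRAPH command; this is a sketch, and medage, nba, hasnba, and statecode are assumed stand-ins for the variable names in Searches.sav:

* Rough sketch with the legacy GRAPH command; variable names are assumptions.
GRAPH
  /SCATTERPLOT(BIVAR)=medage WITH nba BY hasnba BY statecode (NAME)
  /MISSING=LISTWISE.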
| | 02:08 | I am going to edit those out for a
moment, but because I've used a variable,
| | 02:12 | I'll be able to bring
some of them back if I want.
| | 02:14 | So I'll double click on it.
| | 02:17 | I can just select the names
and I hit Delete for right now.
| | 02:20 | So what I have is a bunch of blue
circles and a bunch of green circles.
| | 02:25 | The blue circles are for
states that do not have NBA teams.
| | 02:29 | The green circles are for states that do.
| | 02:31 | To make these a little bit easier to see, I am
going to modify them and make them solid.
| | 02:35 | I'll just click on one.
| | 02:36 | It looks like I better click again to
get just the green ones and click on
| | 02:41 | Fill and make that the same shade,
green, and that actually has an effect of
| | 02:47 | making all of them solid.
| | 02:48 | Now what I can do is add
regression lines.
| | 02:52 | Up on the menu bar here, the
second option is called Add Fit Line at
| | 02:57 | Subgroups, which adds a regression
line separately for each group.
| | 03:01 | I can click on that and I get two lines.
| | 03:03 | One in green for the states that do
have NBA teams and one in blue for
| | 03:08 | the states that don't.
| | 03:10 | I also see that we have an outlier and
what I am going to do is I am going to
| | 03:14 | come over to the left of this
bar to the little target thing.
| | 03:17 | There's the data label mode.
| | 03:19 | I can click on that and now because
earlier I said that I was going to use the
| | 03:22 | state abbreviations as data labels, I
can come right down here, click on this
| | 03:27 | outlier, and I can see that it's Utah.
| | 03:30 | Now there's something there.
| | 03:32 | The Utah Jazz seems to elicit
unusual levels of fan support.
| | 03:37 | Also people in Utah tend to
be rather young on average.
| | 03:40 | I am going to close this chart because
I am done editing it, and now I can see
| | 03:45 | that there is an association between
age and whether a state has an NBA team
| | 03:51 | that can predict their level of
interest in NBA as a search term.
| | 03:55 | Just as we saw with bivariate graphs,
scatter plots are a great way to show the
| | 03:59 | relationship between two scaled
variables, and then by simply changing the
| | 04:04 | markers, you can add a third categorical
variable and you can even see how that
| | 04:09 | new variable changed the
relationship between the other two.
| | Collapse this transcript |
| Creating 3-D scatterplots| 00:00 | If you have three scale variables that
you want to graph, then one interesting
| | 00:05 | option in SPSS is a 3D scatterplot
where you have variables on three different
| | 00:10 | axes, the X and the Y and the Z axis.
| | 00:13 | In theory it's a straightforward
variation on the regular 2D scatterplot.
| | 00:17 | In practice though, it can get a
little confusing and this will become clear
| | 00:21 | after we look at one.
| | 00:23 | For this example I am going to
stay with the Google Searches data in
| | 00:26 | Searches.sav and I am going to chart
the relationship between three particular
| | 00:31 | search terms, between searches for
SPSS, for business intelligence, and for
| | 00:36 | the term "totally lost."
| | 00:38 | To do this, I go up to Graphs in the
menu bar and I click on Chart Builder.
| | 00:44 | I come down in the gallery to scatter,
then I am going to choose the third one
| | 00:49 | here which is a 3D scatterplot.
| | 00:52 | Interestingly, there is an option here
of adding a categorical variable on top
| | 00:56 | of it all which actually makes it four
variables depicted at once, but I am not
| | 01:00 | going to work with that one right now.
| | 01:02 | I am just going to show you what's
called the simple 3D scatter.
| | 01:05 | I'll click on that, and
drag it up to the canvas.
| | 01:09 | Then I need to pick my three variables,
the X and the Y and the Z, and what I am
| | 01:14 | going to choose is SPSS as our Y axis,
Business Intelligence as our X axis, and
| | 01:24 | Totally Lost as the Z axis, and
from there I can simply click OK.
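For reference, a rough legacy-syntax equivalent of this 3D scatterplot is sketched below; spss, bi, and lost are assumed variable names, and the axis assignment may differ from the Chart Builder version:

* Rough sketch with the legacy GRAPH command; variable names are assumptions.
GRAPH
  /SCATTERPLOT(XYZ)=bi WITH spss WITH lost
  /MISSING=LISTWISE.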
| | 01:31 | When we first get the chart it's a
rather chunky looking orthographic
| | 01:35 | projection of a bunch of circles
floating in what appears to be a 3D space.
| | 01:40 | Unfortunately it's hard to read and
there are two ways of getting some sense of
| | 01:44 | depth in this. One doesn't work very
well and the other one works slightly
| | 01:48 | better. I will show you both.
| | 01:49 | First, we're going to need to edit the
chart by double-clicking on it and then I
| | 01:53 | am going to clean things up a little bit
by getting rid of the decimal places on
| | 01:57 | the axes, click on those, then I come up
to Number Format and I am just going to
| | 02:02 | put 0 and press Apply.
| | 02:05 | I will do it for the other ones
and then there is the last one.
| | 02:14 | Okay, now to try to get a sense of
depth here, one choice is to click on these
| | 02:20 | then come over to Spikes and choose
Floor and click Apply, and that I think you
| | 02:26 | can tell is not helpful.
| | 02:28 | We have a bunch of pinpoints here but it
just seems to make things much more complicated.
| | 02:32 | So I am going to click on those again,
go back to Spikes, and deselect them.
| | 02:37 | On the other hand, we do have another option.
| | 02:40 | Now, I am going to first take these markers.
| | 02:42 | I am just going to make them solid so
they are a little easier to see as we
take care of things, and what I can do
| | 02:47 | with a 3D chart is actually
add motion. You can make this a dynamic chart.
| | 02:54 | If I come over to the chart and I right
-click on it, the second choice is this
| | 02:59 | one, this says 3D Rotation.
| | 03:02 | And what I can do now is see how the
cursor is turned into a hand? I can click
| | 03:05 | on that and I can start moving things around.
| | 03:08 | Now, it's kind of fun. I can see there
is an outlier there of some kind, right
| | 03:16 | over here, and I believe from past
experience that is Washington, D.C. I can
| | 03:22 | get state labels and confirm that.
| | 03:24 | But right now what I am going to do is
I am just moving this around and when
| | 03:27 | it's moving you can get a sense of a
three-dimensional cloud of data, and it's
| | 03:33 | kind of a neat way to do it.
| | 03:34 | The problem of course is
it's really hard to read.
| | 03:38 | I don't really know what's what except
there seems to be an outlier there and
| | 03:42 | there appears to be some kind of
association between the variables.
| | 03:46 | I can see that there is an
association in 3D, but it's hard to read.
| | 03:51 | A rotating three-dimensional
interactive scatterplot can be a lot of fun.
| | 03:55 | You can even add a fourth
variable with colored markers, and it helps you
| | 03:59 | to identify cases that are multivariate
outliers, that is, cases that have unusual
| | 04:04 | combinations of scores.
| | 04:05 | On the other hand, the problem is once
the 3D chart stops rotating, it becomes
| | 04:10 | just another flat 2D chart
that's very hard to read.
| | 04:14 | And for this reason, a better option
might be to employ what are called multiple
| | 04:19 | static 2D charts in a scatterplot
matrix which is what I will show you next.
| | Collapse this transcript |
| Creating scatterplot matrices| 00:00 | In the last movie we looked at a way of
showing three scaled variables and maybe
| | 00:05 | even a fourth categorical
variable on top using the 3D scatterplot.
| | 00:10 | Well, that seems like an intuitive
approach and while they certainly are a lot
| | 00:14 | of fun to play with while rotating
the display, they can get confusing and
| | 00:18 | also once they stop rotating, they're just
another static 2D display that's poorly labeled.
| | 00:24 | Nevertheless, it's important to be
able to see the relationships between
| | 00:28 | groups of variables.
| | 00:29 | Fortunately, a slightly lower tech, but
more effective solution is available by
| | 00:34 | taking advantage of what the data
visualization people call small multiples.
| | 00:38 | That is, we can make an entire
collection of 2D scatterplots that are connected
| | 00:43 | to each other in a matrix, which makes
it easier to see the relationships
| | 00:47 | between about as many
variables as you have screen space for.
| | 00:50 | Let's see how this works.
| | 00:53 | I'm going to again use the
Google Search data in Searches.sav.
| | 00:57 | I need to go up to Graphs, then to
Chart Builder. From there I come down on
| | 01:02 | the gallery on the left to Scatter,
and the third on the bottom is called
| | 01:09 | Scatterplot Matrix.
| | 01:10 | I am going to click that
and drag it up to the canvas.
| | 01:13 | Now it looks a little funny here and on
the bottom it just says Scatter Matrix.
| | 01:17 | You'll see there's only
one place to add variables.
| | 01:20 | That's because I can add more
than one variable to that list.
| | 01:23 | In this particular case what I am
going to do is I am going to choose let's
| | 01:27 | say five variables.
| | 01:28 | I am going to take SPSS.
| | 01:30 | I am going to take Business
Intelligence and I just drag it down.
| | 01:36 | You see how it turns into a red plus there.
| | 01:39 | I'll get Totally Lost.
| | 01:41 | I will also get Facebook.
| | 01:47 | And finally, I think I'll give an
indication of level of education.
| | 01:54 | So what I've done is I've dragged five
variables into this box at the bottom.
| | 01:59 | Just in case I need it I'm going to
come to Groups and Point ID and I am going
| | 02:05 | to add a point ID label.
| | 02:06 | I will use the state code and drag
that here to the Point Label variable and
| | 02:13 | then I can click OK.
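By the way, if you prefer working from syntax, the legacy matrix scatterplot command produces a comparable chart. This is only a sketch, since the Chart Builder itself pastes longer GGRAPH/GPL code, and the variable names below are assumed placeholders rather than the actual names in Searches.sav:

    GRAPH
      /SCATTERPLOT(MATRIX)=spss business_intelligence totally_lost facebook degree
      /MISSING=LISTWISE.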
| | 02:16 | I get an extremely complicated
looking chart, but this can be fixed.
| | 02:21 | We need to edit it a little bit.
| | 02:22 | I am going to double-click on it.
| | 02:26 | The first thing I am going to do is
I am going to remove these data labels.
| | 02:29 | I may need those later, but for
right now I can take them out.
| | 02:34 | Then the next thing I am going to do
is I am going to make the chart bigger.
| | 02:38 | Right now the chart size is 375x468.
| | 02:40 | I am just going to make it, say,
500, and that gets it to 625.
| | 02:50 | When I do that and I maximize this window,
I can actually read all of the labels.
| | 02:56 | I can see things more clearly.
| | 02:57 | Next I am going to make these
dots smaller. I'll click on those.
| | 03:01 | Let's go to 3 point and I will make them solid.
| | 03:07 | And now it's a little easier to
see them distinguished from each other.
| | 03:12 | The next thing I will do is I am going
to add a regression line and have it go
| | 03:17 | through all of them.
| | 03:18 | Let me click on this and there we have it.
| | 03:22 | I can close this all now.
| | 03:24 | Now I'm going to change the
color of that regression line.
| | 03:26 | I will make it a dark red instead of
red so it doesn't jump out quite so much.
| | 03:34 | What you have is each variable
paired with the others by going across.
| | 03:38 | So for instance on the top row where
it says SPSS on the side, this is the
| | 03:43 | relative importance of
SPSS as a Google Search term.
| | 03:47 | That's SPSS on the Y axis
for all of the other ones.
| | 03:51 | So, for instance, on the top row
in the second column that's Business
| | 03:55 | Intelligence across the
bottom and SPSS up the side.
| | 03:59 | The one next to it is Totally Lost across the
bottom of the X axis and SPSS on the Y axis.
| | 04:05 | What you can see is that when the
regression lines are sloped, the variables
| | 04:10 | are associated.
| | 04:12 | So for instance there's a very
strong association in the top row between
| | 04:16 | SPSS and Totally Lost.
| | 04:18 | That's the one in the middle on the top.
| | 04:21 | On the other hand there's a little bit
less of an association between SPSS and
| | 04:25 | Facebook, the one right next to it.
| | 04:27 | That line is relatively flat.
| | 04:29 | On the other hand we do have outliers
showing on some of these and it might be
| | 04:33 | interesting to see who that is.
| | 04:35 | So I am going to double-click on the chart.
| | 04:36 | I am going to turn on the Data Label
mode by clicking in the menu bar here.
| | 04:42 | I am going to find our little outlier
here and just click on it and it will
| | 04:47 | label it in all of the charts.
| | 04:50 | And as is frequently the
case it's Washington D.C.
| | 04:52 | So we can see Washington D.C. is an
outlier in most of these charts.
| | 05:00 | A scatterplot matrix in SPSS is a
great way to see the connections between
| | 05:05 | multiple variables all at once.
| | 05:08 | It's easier to read than a 3D
scatterplot and it lets you include more variables
| | 05:12 | than you might otherwise be able to do.
| | 05:14 | It's also a great tool to get a lot of
visual detail from your data all at once,
| | 05:18 | which is after all the purpose of data graphics.
| | 05:20 | Now that we've covered several
different combinations of variables and chart
| | 05:25 | we will turn next to the descriptive and
inferential statistics that can be used
| | 05:29 | when looking at the
associations of three or more variables.
| | Collapse this transcript |
|
|
10. Descriptive Statistics for Three or More VariablesUsing Automatic Linear Models| 00:00 | In the last section, we looked at ways
to chart the relationship of three or
| | 00:05 | more variables at a time.
| | 00:07 | In this section, we'll look at ways to
give precise numerical descriptions to
| | 00:11 | those relationships as well as inferential
tests to check the reliability of our numbers.
| | 00:17 | The very first procedure that we're
going to cover here is one of the most
| | 00:20 | impressive features that
SPSS has added for version 19.
| | 00:24 | It's called Automatic Linear Modeling.
| | 00:27 | It's a huge step towards making data
analysis a little easier, a little more
| | 00:31 | accurate, and a lot more
interpretable for a lot more people.
| | 00:34 | Don't worry if you have an earlier
version of SPSS. I'll also show you how to
| | 00:39 | accomplish the same goals using
procedures that are available in every version
| | 00:43 | of SPSS in the next video.
| | 00:46 | The goal of SPSS's Automatic Linear
Modeling function, and linear regression in
| | 00:50 | general, is to take an entire
group of predictor variables.
| | 00:55 | These can be scale variables, or ordinal,
or dichotomous indicator variables.
| | 00:59 | That's the 0/1 variables.
| | 01:01 | You can even use multiple group
categories if you break them down into a series
| | 01:05 | of dichotomous variables.
| | 01:07 | But the goal of linear regression is to
take these predictors and find the best
| | 01:11 | way to combine them to predict values
on a single scaled outcome variable.
| | 01:16 | While the mathematics behind this can
get very involved and there are plenty of
| | 01:20 | decisions that can be made, the
Automatic Linear Modeling procedure has been
| | 01:24 | developed to keep most of that in
the background and to let you focus on
| | 01:27 | interpreting your data.
| | 01:29 | This is how it works.
| | 01:31 | To get to the Automatic Linear
Modeling, we first go to Analyze, then down
| | 01:36 | to Regression, and then over to Automatic
Linear Modeling, which is the first choice.
| | 01:42 | From this, SPSS takes the information
that we gave it about the variables:
| | 01:46 | whether they were predictors,
| | 01:49 | that is, input variables,
or whether they were targets, or
| | 01:51 | whether they were both.
| | 01:53 | So this is a situation where the role
that we gave a variable in the dataset
| | 01:57 | makes a difference in how things work out.
| | 01:59 | The first thing we need to do
is pick our target variable.
| | 02:02 | I'm going to use searches for the term SPSS.
| | 02:05 | That will be my target variable.
| | 02:08 | Now, it's going to ask me what I
want my predictor variables to be.
| | 02:12 | I'm going to add a bunch of these
ones about other searches in Google.
| | 02:17 | I can put those in here.
| | 02:20 | I can leave those in with the other
indicators about whether they have an NFL
| | 02:24 | team, or an NBA team, or a
Major League Soccer team.
| | 02:27 | I can have this information
about Census Bureau Region.
| | 02:31 | I'm going to remove these four about
Census Bureau Division, because those are just
| | 02:35 | subcategories of the region.
| | 02:37 | So I'm going to remove that.
| | 02:38 | Then these three, Northeast, Midwest,
and South, are indicator variables that
| | 02:43 | I use for the region.
| | 02:44 | However, the nice thing about
Automatic Linear Modeling is you can put
| | 02:48 | categorical variables with several
categories in them and it will break them up
| | 02:52 | in a way that makes best sense for the data.
| | 02:55 | So you can leave categorical
variables in there as they are.
| | 02:58 | I don't need these dichotomous ones as a backup.
| | 03:01 | So this is the list of potential
variables that I can use as predictors, to try
| | 03:07 | to get the relative importance by a
state of SPSS as a search term in Google.
| | 03:14 | I'm then going to come up here to Build Options.
| | 03:17 | Under Objectives, the default is to
create a standard model.
| | 03:21 | That's what we're going to do.
| | 03:22 | The other ones that are called Boosting,
and Bagging, and the Large Datasets,
| | 03:27 | those are technical things
that we don't need to worry about.
| | 03:29 | However, I am going to come to Basics,
and this is asking me whether I want it
| | 03:34 | to automatically prepare data and
truthfully, this is a wonderful thing.
| | 03:37 | It's a great way to deal with
outliers and to transform variables and to
| | 03:41 | make substitutions and it's one of the big
perks of the Automatic Linear Modeling approach.
| | 03:46 | The next thing I'm going
to go to is Model Selection.
| | 03:49 | This is where things can get
very complicated in regression.
| | 03:54 | It's asking the Model Selection Method.
| | 03:56 | That is, how it decides which
variables to put into the regression model.
| | 04:01 | I have several options: Forward Stepwise,
| | 04:03 | one that says just put them
all in and leave them there, and another
| | 04:07 | one called Best Subsets.
| | 04:09 | Now, when we get to the Linear
Regression Command that's separate from this one,
| | 04:13 | you'll see that we have some different options.
| | 04:15 | I'm just going to leave this at
Forward Stepwise, because it can make life
| | 04:19 | a little bit simpler.
| | 04:20 | There is also an issue here
about what criterion it wants to use.
| | 04:24 | There are several choices here.
| | 04:26 | There is the AICc, there is also the
F statistic, and adjusted R-squared.
| | 04:31 | Let's not worry about that.
| | 04:32 | Let's just use the Information Criterion.
| | 04:34 | Then we can ignore these other options,
and these ones about Ensembles
| | 04:39 | and about Advanced we can just ignore as well.
| | 04:41 | So the last thing I need to do is
go to Model Options, and we don't
| | 04:46 | need to worry about these options.
We can just leave the defaults here.
| | 04:49 | So now we can come down to the bottom and
we can press Run to see what it gives us.
| | 04:54 | Automatic Linear Modeling produces
this one small chart and it doesn't look
| | 04:58 | like a huge amount, but this is a Model Viewer.
| | 05:01 | When you click on it, it's
interactive and it does a lot of other things.
| | 05:05 | So I'm going to double-click on this to
open up what's called the Model Viewer
| | 05:10 | window. Maximize that.
| | 05:13 | What you see here is first it says
what's the target variable, the thing that
| | 05:17 | we're trying to predict, and that is SPSS
and its relative importance as a search
| | 05:21 | term in Google on a state-by-state basis.
| | 05:24 | The Model Summary also tells us that
it's using automatic data preparation and
| | 05:28 | it's using a Forward Stepwise model
selection method for deciding which
| | 05:32 | variables go into the model.
| | 05:34 | Now, the bottom one, the
information criterion, has a number.
| | 05:37 | That's not really inherently meaningful in
and of itself, but the lower the number,
| | 05:41 | the better the prediction; here we have negative
numbers, so the greater the absolute value of the
| | 05:44 | negative number, the better.
| | 05:47 | Beneath that, it shows that we're able
to predict with about 79% accuracy in this model.
| | 05:53 | So that's good.
| | 05:55 | What I'm going to do now is I'm
going to come over to the little list of
| | 05:58 | thumbnails on the left and start
going through these one at a time.
| | 06:01 | That's the one we're at right now.
| | 06:05 | The second one shows what the Automatic
Data Preparation did, and what it shows is
| | 06:09 | that we have a lot of outliers and what
it's done is it's trimmed the outliers.
| | 06:13 | Actually, it didn't really trim them,
because trimming means throwing away that data.
| | 06:17 | Instead, technically what SPSS did is
something called Winsorizing, where it
| | 06:22 | takes the outlier scores and simply
replaces them with the highest or lowest
| | 06:26 | non-outlier scores.
| | 06:27 | So it brings them in.
| | 06:28 | This is not an uncommon practice in
business settings, so it's a nice way to do it.
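If you ever want to apply the same capping idea by hand, a minimal sketch in syntax would be a pair of IF statements. The variable name spss_z and the cutoff of plus or minus 3 are assumptions for illustration, not the rule Automatic Linear Modeling actually uses:

    * Hedged sketch: cap an assumed standardized score at assumed cutoffs of 3.
    IF (spss_z > 3) spss_z = 3.
    IF (spss_z < -3) spss_z = -3.
    EXECUTE.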
| | 06:34 | Also, when we have categorical
variables like the Region, SPSS is able to merge
| | 06:39 | categories in a way that
maximizes their predictability.
| | 06:43 | So that's a nice thing.
| | 06:45 | So that's what the
Automatic Data Preparation has done.
| | 06:47 | The third window shows us
what's called Predictor Importance.
| | 06:53 | Predictor Importance is actually a
rather sophisticated statistical calculation.
| | 06:58 | There are a number of things that go into it.
| | 06:59 | It's not just a matter of probability values.
| | 07:02 | It's not just a matter of correlations with the
outcome. There is much more to it than that.
| | 07:08 | But the relative importance is
a very easy thing to understand.
| | 07:12 | What this is telling us is that
there are three variables that have a lot
| | 07:16 | of importance in explaining the
levels of relative interest in SPSS as a
| | 07:21 | Google search term.
| | 07:22 | The first is the use of
Regression as a search term.
| | 07:26 | That's not surprising, because that's
a major thing that SPSS is used for.
| | 07:30 | The second one amazingly is Totally Lost,
which seems to show up a lot with SPSS.
| | 07:36 | The third one is the percent of
population with a Bachelor's degree or higher.
| | 07:40 | So these are the three major variables.
| | 07:42 | We're going to have more about those.
| | 07:44 | The next chart is the Diagnostic Plot.
| | 07:47 | It lets us know the observed value of
SPSS interest for each of the 51 cases, the 50 states
| | 07:54 | and Washington, D.C., along with its predicted value.
| | 07:57 | The idea here is that they should stay
close together, that the observed and the
| | 08:00 | predicted should be pretty close,
and here they are, so we don't need to worry about this.
| | 08:05 | This is a histogram of Residuals.
| | 08:07 | That's how far off the predictions were.
| | 08:09 | Again, if we had a thing that looked
really unusual here like a big spike at
| | 08:13 | one end or the other, we might have a problem,
but we're not going to worry about this one.
| | 08:17 | I'm going to scroll down a little
and I'll go to the next little page.
| | 08:22 | This is a list of particular outliers
and it tells us what their score was.
| | 08:26 | For instance we had one place that had a
score on SPSS of 3.364 and what that means
| | 08:33 | is that state showed a relative
interest in SPSS as a Google search term that
| | 08:37 | was 3.364 standard
deviations above the national average.
| | 08:42 | There is another measure that's related
called Cook's Distance and this doesn't
necessarily mean that these were
outliers in an absolute sense, but they are
| | 08:50 | the most extreme cases.
| | 08:52 | The next one down is a graph of the
effects of various predictor variables.
| | 08:58 | We have Regression as a search term,
but transformed because the
| | 09:02 | outliers were removed, then Totally Lost, and then
Degree, which was also transformed by removing outliers.
| | 09:09 | This is a Diagram View.
| | 09:11 | You can also get a Table View and you can
even expand this to see the various terms.
| | 09:19 | If you need an analysis of variance
table for whatever purpose, here it is.
| | 09:23 | I'm going to skip over to the next
box and here we have coefficients.
| | 09:28 | The coefficients are the actual
numbers that you use to multiply things by.
| | 09:32 | The Intercept is in there and then we have
Regression, and Totally Lost, and Degree.
| | 09:37 | Please note the Degree 1 is a
different color because it's a
| | 09:39 | negative coefficient.
| | 09:41 | This would become clearer if we come
down and instead of having the diagram
| | 09:45 | we look at the table.
| | 09:47 | Here, we can now see the coefficients.
| | 09:49 | The Intercept, that is the standard
value that we give to everybody, is 0.87.
| | 09:54 | So we assume that a state is 0.87
standard deviations above the mean in
| | 09:59 | their interest in SPSS.
| | 10:01 | Then for every standard deviation
above the mean on Regression, we add another half
| | 10:08 | of a standard deviation.
| | 10:09 | For every standard deviation above on
Totally Lost, we add a little over a half, 0.58.
| | 10:15 | On the other hand, for every
percentage point of the population that has a
| | 10:20 | Bachelor's degree or higher, we
subtract 0.03 standard deviations, and so this
| | 10:25 | is another way of looking at the
relative contribution of the variables.
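To make that concrete, here is a minimal sketch of the prediction equation those coefficients imply, written as COMPUTE syntax; the variable names are placeholders, and the coefficients are just the rounded values read off the table above:

    * Hedged sketch of the reported model: intercept 0.87, plus 0.50 per SD on
      Regression, plus 0.58 per SD on Totally Lost, minus 0.03 per percentage
      point of the population with a bachelor's degree; names are placeholders.
    COMPUTE pred_spss = 0.87 + 0.50*regression + 0.58*totally_lost - 0.03*degree.
    EXECUTE.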
| | 10:29 | I am going to scroll down a little further.
| | 10:31 | We have another one here that gives
estimated means charts and these are
straight lines, because these are
| | 10:37 | just the slopes of the lines given
by the coefficients.
| | 10:39 | I don't think there is anything terribly
important there, so I'll skip to the next one.
| | 10:43 | This is a table that shows us the
three variables that got included and then
| | 10:47 | across the top is the information criterion
and you can see that the number goes down.
| | 10:52 | It starts at -52 and when it
adds Totally Lost, it goes to -73.
| | 10:55 | Now, it adds Degree.
| | 10:56 | It goes down to -75, and that was the
criterion for deciding whether to include a
| | 11:03 | variable: whether it lowered
the value of the information criterion.
| | 11:08 | The very last thing is just a quick summary.
| | 11:12 | You can click on it to see what got
included and what the options were.
| | 11:19 | Just a quick written
summary of the entire model.
| | 11:22 | So the Automatic Linear Modeling
function in SPSS is a fabulous option for those
| | 11:27 | who want to make a sophisticated
analysis and have thorough reporting options
| | 11:32 | without having to make a
million decisions on their own.
| | 11:35 | It makes it much, much easier to sift
through a large dataset and see what
| | 11:40 | useful patterns might emerge.
| | 11:42 | I encourage you to spend some time
checking out all of its options, because there
| | 11:45 | is more than I've covered here, and to
explore how it might be able to help you in
| | 11:50 | understanding your own data.
| | Collapse this transcript |
| Calculating multiple regression| 00:00 | In the last movie we covered SPSS's
new Automatic Linear Modeling function,
| | 00:06 | which takes a lot of the
stress out of statistical analysis.
| | 00:09 | It can also let you control almost
everything manually should you so desire.
| | 00:12 | On the other hand, you may be using an
older version of SPSS that doesn't have
| | 00:16 | Automatic Linear Modeling, because
that's something that's new with version 19,
| | 00:21 | or you may want to include some
options in your analysis that it doesn't have,
| | 00:25 | such as something like
Hierarchical Blocking, which I use frequently.
| | 00:29 | In that case, you'll want to turn to
SPSS's Standard Linear Regression function,
| | 00:34 | which is what we'll discuss in this movie.
| | 00:36 | The goal of regression is pretty simple.
| | 00:39 | Take a collection of predictor
variables, multiply all of them by certain
| | 00:43 | weights called regression coefficients,
which are related to the impact that
| | 00:47 | each variable has on the outcome.
| | 00:49 | Add them all up and predict scores
on a single scaled outcome variable.
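In formula terms, that is the usual linear regression equation, written here in standard textbook notation rather than anything specific to the SPSS output:

    \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k

where the b values are the regression coefficients and the x values are the predictor variables.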
| | 00:53 | The actual work involved in this
process can of course get much more
| | 00:57 | complicated, but the
general concepts remain the same.
| | 01:01 | Now in this particular movie, we're
going to look at the most basic form of
| | 01:04 | multiple regression where all of the variables
are entered at the same time in the equation.
| | 01:08 | It is after all the variable selection
and entry that causes most of the fuss in
| | 01:12 | statistics, and here's how it works.
| | 01:15 | I'm going to be using the same Google
Search data set; this is similar to what
| | 01:19 | marketing research people would be
trying to do in terms of determining
| | 01:22 | the mind share of
particular ideas in Google searches.
| | 01:27 | What we need to do is go up to Analyze
and then down to Regression, and we're
| | 01:32 | going to go to the second choice here, Linear.
| | 01:35 | Linear means straight line.
| | 01:36 | It's going to try to put straight
lines through the data, and what we need to
| | 01:40 | do is get our one dependent or outcome
variable, the thing that we're trying to predict.
| | 01:44 | I'll use interest in SPSS as a search
term in Google, and then we pick the
| | 01:50 | independent variables, those things
that will be used to predict the levels.
| | 01:55 | I'm going to use a bunch of other search
terms from the Regression down through FIFA.
| | 02:00 | I'm also going to use
some dichotomous variables.
| | 02:03 | Whether they have an NFL team, an NBA team,
or a Major League Soccer team. Put those in.
| | 02:09 | Scroll down a little bit.
| | 02:10 | The Percentage of the Population with a
bachelor's degree or higher, whether they
| | 02:13 | have an outline for high
school statistics, the Median Age.
| | 02:18 | Now in the Automatic Linear Modeling I
was able to simply include a categorical
| | 02:22 | variable of the Census Bureau region.
| | 02:24 | It has four regions and that
procedure, Automatic Linear Modeling, was able to
| | 02:29 | compensate for the fact that we had four
different categories of no particular order.
| | 02:34 | In the Standard Linear
Regression we can't do that.
| | 02:37 | The predictors need to either be
scale variables (they can't be ordinal
| | 02:40 | variables), or they need to be
dichotomous 0/1 indicator variables.
| | 02:45 | Now when you have a categorical
variable, you don't need the same number of
| | 02:50 | indicator variables as you have categories.
| | 02:53 | The same way, for instance, to
indicate gender as either male or female we
| | 02:57 | only need one indicator.
| | 02:58 | If we want to indicate four different
regions in the United States, we only need
| | 03:02 | three indicator variables, because if
it's zero on all three of them, then the
| | 03:06 | fourth category is implied.
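If your file only had the four-category region variable and you needed to build those indicators yourself, a minimal sketch in syntax would look like this; the variable name region and its 1-to-4 coding are assumptions for illustration, not something taken from Searches.sav:

    * Hedged sketch: build 0/1 region indicators from an assumed variable
      coded 1=Northeast, 2=Midwest, 3=South, 4=West.
    RECODE region (1=1)(ELSE=0) INTO northeast.
    RECODE region (2=1)(ELSE=0) INTO midwest.
    RECODE region (3=1)(ELSE=0) INTO south.
    EXECUTE.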
| | 03:09 | So I'm going to use these
three indicator variables.
| | 03:12 | Northeast, Midwest and South.
| | 03:14 | I'm going to add those as well.
| | 03:16 | Now let's come over for just a moment to
Statistics and see if there is anything
| | 03:21 | in here that we need for right now, and
there isn't. There are times when having
| | 03:25 | the R squared change can be a very
handy statistic, but we're using what's
| | 03:29 | called Simultaneous Entry, where we put
everything in the model at once, so there
| | 03:33 | isn't a possibility of a change.
| | 03:35 | I'm going to hit Cancel.
| | 03:37 | These are some diagnostic
plots that we could get.
| | 03:40 | I don't think we need any of those.
| | 03:42 | If we wanted to save the predicted
scores or other diagnostic statistics, we
| | 03:47 | could do those with the Save menu.
| | 03:50 | We don't need any of these for right now.
| | 03:52 | Let's look at the other options.
| | 03:54 | Now these are criteria that are used
for entering and removing variables.
| | 03:59 | Now we're not using an automatic procedure.
We're simply entering everything at once.
| | 04:04 | If we wanted to replicate the
procedure that was used in Automatic Linear
| | 04:07 | Modeling, we would use a Forward
Stepwise Regression and then these criteria
| | 04:12 | for entry would matter.
| | 04:14 | But now we're not going to worry about them.
| | 04:16 | I'll just press Cancel now.
| | 04:17 | And so really we're just using the defaults.
| | 04:20 | I picked my one dependent variable,
which needs to be a scale variable, and then
| | 04:23 | I put in a whole collection of
independent variables, and now I'll press OK.
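For reference, clicking Paste instead of OK would give you syntax roughly like the sketch below (the real pasted output includes a few more default subcommands); the variable names are assumed placeholders for the actual fields in Searches.sav:

    * Hedged sketch of a simultaneous-entry regression; names are placeholders.
    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA
      /DEPENDENT spss
      /METHOD=ENTER regression totally_lost facebook has_nfl has_nba has_mls
          degree median_age northeast midwest south.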
| | 04:28 | And we get a bunch of tables out of this one.
| | 04:31 | The first table, which indicates
variables entered and removed, is not helpful.
| | 04:34 | You can just ignore that.
| | 04:36 | The second table, called Model
Summary, gives what's called the Multiple
| | 04:40 | Correlation. The capital R in the
second column tells you what the correlation
| | 04:44 | is between all of the variables together.
| | 04:46 | It's an analog of the individual
correlation, which is usually lowercase r.
| | 04:51 | This is 0.937, which is a huge
correlation, considering it goes from 0 to 1.
| | 04:56 | The R squared is often a better
indicator, because you can read it as the
| | 05:00 | proportion of the variance in the
outcome that can be predicted by the
| | 05:05 | predictor variables, and 88% is enormous.
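As a quick arithmetic check, R squared is simply the multiple correlation squared, so the two numbers reported here agree:

    R^2 = 0.937^2 \approx 0.88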
| | 05:08 | The next one, the Adjusted R
squared, is also sometimes reported.
| | 05:11 | You'll see that it's smaller.
| | 05:13 | This has to do with the ratio of
predictor variables to the number of cases.
| | 05:17 | Now truthfully, I've probably used
more predictor variables than I should,
| | 05:20 | because really I only have 51 cases,
the 50 states and Washington, DC, but it
| | 05:26 | still works for my purposes.
| | 05:27 | The next table is the Analysis of
Variance Table and that provides a
| | 05:30 | statistical hypothesis test for whether the
entire model as a whole predicts better than chance.
| | 05:38 | And the answer of course is yes.
| | 05:40 | I'm looking at the number that's on the
far right under Sig, where it says .000.
| | 05:45 | If that number is less than .05,
and this one isn't literally 0,
| | 05:48 | it's just less than .001, then the model
is statistically significant as a whole.
| | 05:54 | The table below that gives the
actual regression coefficients.
| | 05:58 | You have what are called Unstandardized
Coefficients, which were in the original metric.
| | 06:03 | So for instance, if it were years, that
says for every year add this much more to
| | 06:08 | your predicted value.
| | 06:10 | If it were dollars, then for every dollar,
add this much to the predicted value.
| | 06:14 | Now the Google Search terms, which are
in quotes, are already standardized
| | 06:19 | ones, but look down at Has
an NFL team or Has an NBA team.
| | 06:23 | The coefficient for Has an NFL team is .068,
and what that says is that for a state that
| | 06:29 | has an NFL team, add .068 standard
deviations to the prediction of their interest
| | 06:37 | in SPSS relative to other
terms in Google searches.
| | 06:41 | Next to those is the standard error,
which is an indication of how spread out
| | 06:44 | the variation is; if you take the
B weight or the regression weight and
| | 06:48 | divide it by the standard error, you
get the t statistic shown a couple of columns over.
| | 06:51 | Next come the standardized
coefficients, or beta weights.
| | 06:53 | And those are actually really nice,
because those are similar to correlations.
| | 06:57 | They usually fall between -1 and +1.
| | 06:58 | They can be positive or negative
and they indicate the degree of a
| | 07:02 | linear relationship.
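For reference, the standard definitions behind those two columns, stated here as a general clarification rather than something printed in the output, are:

    t_j = \frac{b_j}{SE(b_j)} \qquad \beta_j = b_j \cdot \frac{s_{x_j}}{s_y}

where s_{x_j} and s_y are the standard deviations of predictor j and of the outcome variable.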
| | 07:04 | Next to those are the T-tests.
| | 07:06 | Those are individual inferential
statistics for each one of the regression
| | 07:11 | coefficients, and next to
those is their significance level.
| | 07:14 | So we can go down to that column at the
end, the Significance levels, and look
for ones that are less than .05.
| | 07:20 | We see for instance that Regression is
a statistically significant predictor of
| | 07:25 | interest in SPSS as a search
term, and so is Totally Lost.
| | 07:29 | And if we scroll down, we see that
those are really the only two in that
collection that are.
| | 07:34 | Now you may recall in Automatic
Linear Modeling we had three or four that
| | 07:39 | mattered, but that's because it
used a different procedure where it was
| | 07:43 | selective about what it entered and it
also had a different criterion and we are
| | 07:47 | seeing the overall changes
in the information criteria.
| | 07:50 | This time we're just using
probability values for individual
| | 07:53 | regression coefficients.
| | 07:55 | Now a really important thing: I said
the beta coefficients are like
| | 07:59 | correlation coefficients.
| | 08:00 | That's true to a certain point, but
the big difference is that correlation
| | 08:04 | coefficients are only valid on their own.
| | 08:06 | Each correlation coefficient is
calculated separately with the outcome.
| | 08:10 | These, however, are only valid taken as a
group; each one of these influences the other.
| | 08:15 | So this can be very different from the
correlation coefficients and it can be
| | 08:20 | helpful to compare the two of them.
| | 08:22 | This is the most basic
version of multiple regression.
| | 08:26 | It doesn't have to be an impossibly
complicated rocket science affair.
| | 08:30 | Instead, it can provide quick
insight into what could be a large and very
| | 08:35 | complicated data set.
| | 08:36 | It can give you some real clarity to start with.
| | 08:39 | The Automatic Linear Modeling
function can do a lot of this and a lot more
| | 08:43 | without too much direction from you,
but there are situations where you
| | 08:46 | would want to use the legacy command,
and I especially find the standardized
| | 08:50 | coefficients to be priceless, so I can
compare them with correlation coefficients.
| | 08:55 | I recommend that you take a little
time and see how SPSS's linear regression
| | 08:59 | feature can help you deal with
the complexities of your own data.
| | Collapse this transcript |
| Comparing means with a two-factor ANOVA| 00:00 | The last inferential test that we'll
look at in this course is a variation on
| | 00:04 | the Analysis of Variance, or ANOVA.
| | 00:08 | As we discussed in the sections on
associations, the Analysis of Variance is
| | 00:12 | a very flexible and powerful
procedure and there are probably dozens of
| | 00:16 | permutations on it.
| | 00:18 | In this movie we're going to talk
about the version that is designed for
| | 00:21 | situations where two categorical
variables are used jointly to predict scores on
| | 00:27 | a scaled or quantitative outcome variable.
| | 00:30 | Because categorical variables are
generally referred to as factors in the
| | 00:34 | Analysis of Variance and the
categories that make them up are called levels,
| | 00:39 | this version of the Analysis of
Variance is usually called the Factorial ANOVA,
| | 00:44 | or more colloquially, a Two-Factor ANOVA.
| | 00:47 | An important thing to note is that
when you have two separate factors like
| | 00:50 | gender and educational category and
you're looking at levels of discretionary
| | 00:54 | spending, an Analysis of Variance
will give you three different results.
| | 00:59 | The first result will let you know
whether spending differs by gender,
| | 01:02 | ignoring educational level.
| | 01:04 | The second result will let you know
whether spending differs by educational
| | 01:08 | level ignoring gender.
| | 01:10 | These are both known as the main
effects, where effect has to do with the
| | 01:14 | statistical association and main
because each factor has an effect on its own.
| | 01:19 | However, an Analysis of Variance also
gives you one more important result.
| | 01:24 | It lets you know whether
the two factors interact.
| | 01:27 | That is, it lets you know if for example,
women with college degrees spend more
| | 01:32 | than women without college degrees, but
for men, their spending is the same with
| | 01:35 | and without a degree.
| | 01:37 | By the way, I'm just making that up.
I don't really know what the association
| | 01:40 | between those variables is, but I'm
sure that some of you actually do.
| | 01:44 | In some domains, the interactions are
particularly interesting and can take
| | 01:48 | precedence over the main effects.
| | 01:50 | However, it all comes down to
interpretability and applicability and that will
| | 01:55 | depend on what you are
trying to do with your data.
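In model form, a two-factor ANOVA decomposes each score into a grand mean, the two main effects, their interaction, and error; this is the standard textbook notation, not anything SPSS displays:

    y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}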
| | 01:58 | With that in mind, let's see how a
Two-Factor ANOVA can work in SPSS.
| | 02:03 | To do the Analysis of Variance, we
need to go to Analyze and down to
| | 02:07 | General Linear Model.
| | 02:09 | Now that actually is an interesting term,
and the idea here is that all of the
| | 02:12 | procedures that we've done, T-Tests and
Regression and Multiple Regression, are
| | 02:17 | all variations on what's called the
General Linear Model, a way of predicting
| | 02:21 | scores on a single outcome.
| | 02:23 | Let's do this one over here, Univariate.
| | 02:27 | Now what we need to do is
pick our main dependent variable.
| | 02:31 | That's the outcome variable, the
thing that we're trying to predict.
| | 02:35 | In this particular example, I thought
I might use interest in NBA as a search
| | 02:40 | term, so I'll put that up
in the dependent variable.
| | 02:43 | And then I'm going to use two
categorical variables as predictors of interest
| | 02:48 | in searching for NBA.
| | 02:50 | The first one that makes a lot of sense
to me is whether a state has an NBA team.
| | 02:55 | So I'll put that here under Fixed Factor(s).
| | 02:58 | When you have categories that are
fixed, like yes or no on whether they have an NBA
| | 03:02 | team, then it's a fixed factor.
| | 03:04 | You can also have what are called
random factors in the Analysis of Variance,
| | 03:08 | but in many situations, those are
unusual and I've never used them.
| | 03:12 | A covariate is for when you want
to throw in another quantitative or
| | 03:16 | scaled variable, though putting
covariates into the analysis can complicate the
| | 03:20 | results dramatically.
| | 03:22 | The last one is if you want to weight
cases, and we're not going to deal with that.
| | 03:25 | I'm just going to go back and find my
second predictor category and that's
| | 03:29 | going to be region of the United States.
| | 03:32 | And I can just click that
one and put it in there.
| | 03:34 | Now it's okay that there are four
levels in this category. The Analysis of
| | 03:37 | Variance is able to deal with that just fine.
| | 03:39 | Let's take a quick look at
some of the options here.
| | 03:42 | Under Model, I can specify
whether I want something called a full
| | 03:46 | factorial model or custom.
| | 03:47 | We don't need to worry
about that. We can Cancel.
| | 03:50 | Under Contrasts, I can try to decide if
there's special ways I want to compare
| | 03:55 | the results, and I don't
need to worry about that.
| | 03:58 | Under Plots, I could get Profile Plots,
but these can get a little complicated,
| | 04:03 | so I'm going to cancel that.
| | 04:04 | Post Hoc lets me look at the
differences more effectively. I'm not going to
| | 04:09 | do that on this one.
| | 04:11 | If I want to save the predicted
values or if I want to save some other
| | 04:15 | statistics for diagnostics, I could do
that, but I'm going to skip it for now.
| | 04:19 | And finally under Options, there
are some here that I might want to do.
| | 04:22 | I might want to get what are called
descriptive statistics and estimates of effect size.
| | 04:27 | I think those two are really helpful.
| | 04:28 | Then I'm going to press Continue.
| | 04:31 | And I've got it set up the way
I need, so I'll just click OK.
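If you click Paste rather than OK, the syntax comes out roughly like the sketch below; the variable names (nba, has_nba, region) are assumed placeholders for the actual field names:

    * Hedged sketch of the two-factor ANOVA; variable names are placeholders.
    UNIANOVA nba BY has_nba region
      /PRINT=DESCRIPTIVE ETASQ
      /DESIGN=has_nba region has_nba*region.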
| | 04:35 | And so here are my results.
| | 04:37 | The first thing is I get an
indication of what are called the
| | 04:40 | Between-Subject Factors.
| | 04:42 | These are the things that
separate one group from another.
| | 04:44 | One factor is whether a state has an
NBA team and you can see that 23 of them
| | 04:49 | do and 28 of them don't.
| | 04:52 | The second thing is the Census Bureau region.
| | 04:55 | You see that I have nine states in the
Northeast, 12 in the Midwest, and so on.
| | 04:59 | Below that, I have the actual descriptive
statistics for the search interest in NBA.
| | 05:07 | Well, it's breaking it down by
whether they have an NBA team and by the
| | 05:11 | Census Bureau region.
| | 05:12 | So the states in the Northeast that do
not have NBA teams have a mean of
| | 05:17 | minus .42. That means that they are
about half a standard deviation below the
| | 05:22 | rest of the country in relative
interest in searching for NBA teams.
| | 05:26 | On the other hand, if you go to the
Northeast states that do have NBA teams, you
| | 05:31 | see that they have a score of +.39.
| | 05:35 | That means they're about four-tenths of
a standard deviation above the national
| | 05:39 | average in relative interest
in searching for NBA on Google.
| | 05:43 | And then you can run through and
see the various combinations there.
| | 05:47 | The next table is the actual analysis
of variance table, and what it has is
| | 05:52 | several different results here.
| | 05:53 | The first one that says Corrected Model
simply tells me how well the model as a
| | 05:59 | whole works and it predicts rather nicely.
| | 06:01 | You can see that it has a
Significance level in the first row of .000.
| | 06:06 | And it also has something
called a Partial Eta Squared.
| | 06:09 | Again, it's like a correlation
that's squared and it's .492.
| | 06:13 | In fact, if you look at the footnote at
the bottom of that table, you'll see it
| | 06:16 | says R Squared = .492.
| | 06:19 | And what it means is that if we know
the region of the country that a state is
| | 06:22 | in and whether that state has an NBA
basketball team, then we can accurately
| | 06:27 | predict about 50% of the variance in
interest in NBA as a Google search term.
| | 06:34 | So that's the entire model.
| | 06:35 | The next step down on that table is
Intercept and that just means that the
| | 06:40 | starting score is not 0 and that's not
terribly interesting in and of itself.
| | 06:45 | What's funny here is that
it actually is close to 0.
| | 06:47 | The next one is whether a state has an
NBA team, has_nba, and you can see there
| | 06:53 | that it's highly significant.
| | 06:55 | The probability value is .000
and the Partial Eta Squared is .412.
| | 07:01 | And what this lets us know is that most
of the interest in NBA as a search term
| | 07:06 | has to do with whether a state has an NBA team.
| | 07:10 | So that's a major predictor.
| | 07:12 | The next one is region.
| | 07:14 | Is there region by region interest?
| | 07:16 | The significance level is .079 and
that's above the standard cutoff of .05, so
| | 07:22 | we would say that on the whole, no, the
region that a state is in does not make
| | 07:26 | a big difference in terms of their
interest. On the other hand, whether they
| | 07:30 | had an NBA team did.
| | 07:31 | Those are the two main effects
that an Analysis of Variance gives us.
| | 07:36 | There is however the third
thing that I talked about: the statistical interaction.
| | 07:41 | And that is whether the region interacts
with whether a state has an NBA team to
| | 07:45 | predict overall interest.
| | 07:46 | And you see that on this one, the
significance level on the second to last
| | 07:50 | column, the last entry is .049, which
is just barely beneath the .05 cutoff, and
| | 07:57 | that's enough to be
considered statistically significant.
| | 08:00 | Now what we're going to need to do is
very quickly make a chart to show what
| | 08:04 | these differences look like.
| | 08:05 | I'm going to do that
really quickly in the graph.
| | 08:09 | Go to Graphs, to Chart Builder,
I'll get a Clustered Bar Chart.
| | 08:15 | And from there I'll take NBA as an
interesting search term and I'll take
| | 08:21 | whether they have an NBA team, I'll make that
cluster and I'll put the Region on the X axis.
| | 08:29 | And when I do that, you
see what's going on here.
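A legacy-syntax sketch of that clustered bar chart of means would look roughly like this; again the variable names are assumed placeholders:

    * Hedged sketch of a clustered bar chart of means; names are placeholders.
    GRAPH
      /BAR(GROUPED)=MEAN(nba) BY region BY has_nba.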
| | 08:34 | The bars in green are for states that
have an NBA team, and you see that in every region,
| | 08:40 | states with NBA teams show
above-average interest in searching for
| | 08:43 | NBA, and it makes sense.
| | 08:45 | The states that don't have NBA teams are
in blue and they have below-average
| | 08:50 | interest, regardless of the region,
except you do see an interesting thing.
| | 08:54 | In the South, the states that have NBA
teams, and there are several, are barely
| | 08:59 | above the national average in terms of interest.
| | 09:03 | But in the West, the states that
have NBA teams have huge amounts of
| | 09:08 | interest, much higher.
| | 09:09 | And so you can see that the effect of
having an NBA team varies according to region.
| | 09:15 | And that's the idea of a
statistical interaction.
| | 09:18 | It's one of the benefits
of an Analysis of Variance.
| | 09:21 | And so, for our final inferential test,
the Factorial Analysis of Variance, you
| | 09:26 | see this is an excellent way of
looking at the association between two
| | 09:30 | categorical predictor variables and
a single scaled outcome variable.
| | 09:35 | It lets you look at the statistical
effect of each of the categorical variables
| | 09:39 | on its own, as well as the
interaction of the two, which can often be more
| | 09:43 | interesting and more important.
| | 09:45 | And with that, we'll conclude our last
section on statistical graphing and testing.
| | 09:50 | In the next and last section, we'll
wrap things up a little and talk about how
| | 09:54 | you can get all of your results out of
SPSS and format them, so they'll be as
| | 09:58 | clear and as communicative as possible.
| | Collapse this transcript |
|
|
11. Formatting and Exporting Tables and ChartsFormatting descriptive statistics| 00:00 | In the last several dozen movies,
we have talked about ways that you
| | 00:03 | could explore your data with
graphics and descriptive statistics and
| | 00:07 | inferential procedures.
| | 00:09 | And while that's a great way for
you as the analyst to get a thorough
| | 00:12 | understanding of your data, if you
really want your analysis to accomplish
| | 00:16 | something useful you will
have to communicate it to others.
| | 00:19 | Now we've already discussed ways to
modify charts as we went along;
| | 00:22 | however, tables can also be an
important part of communicating information.
| | 00:27 | In fact, when I'm writing a research
report I try to put all of the results in
| | 00:31 | graphs and tables and then use the text
to simply describe the patterns without
| | 00:35 | including the numbers there.
| | 00:36 | In this movie, we will look at a way to
format your tables to make them easier
| | 00:40 | to follow and easier to communicate to others.
| | 00:43 | In the next one, we will talk about
ways to show correlation matrices and the
| | 00:48 | results from regression analysis, and
then finally we will have a movie that
| | 00:51 | talks about how to export tables for
use in other programs like word processors
| | 00:56 | and spreadsheets and
presentation software and webpages.
| | 00:59 | For this example, I am going to be
using the Google searches information,
| | 01:03 | searches.sav that I have used in several others.
| | 01:06 | I am going to start by getting
some descriptive statistics here.
| | 01:09 | I am just going to come up to Analyze,
to Descriptive Statistics to Frequencies,
| | 01:16 | and what I am going to do is I am
going to get the information about several
| | 01:19 | variables that I could use to, for
instance, try to predict people's interest in
| | 01:23 | SPSS as a search term in Google.
| | 01:26 | I find it helpful to begin
with the outcome variables.
| | 01:29 | We will take SPSS and move that over.
| | 01:31 | I might want to include Business
Intelligence and Data Visualization.
| | 01:35 | I might also want to include my
Education Variable, the Percentage of each
| | 01:39 | State's Population with a Bachelor's
Degree or Higher, and then I might want
| | 01:42 | to include the Age.
| | 01:43 | Now you see these are all scale
variables. We got a little measuring stick
| | 01:47 | right next to each one.
| | 01:48 | I would also want to use these three
region variables, but because those are
| | 01:54 | dichotomous indicator variables I don't
need the same kinds of statistics for them.
| | 01:58 | So I am going to skip them for right now.
| | 02:00 | Then what I am going to do is I am
going to choose the statistics that I want,
| | 02:04 | I want the Mean and the Standard
Deviation and then I want what's called the
| | 02:07 | Five number summary.
| | 02:09 | That's the five quartile scores,
the Minimum, the Maximum, the first
| | 02:13 | quartile, the second quartile,
which is also the median or the 50th
| | 02:16 | percentile, and the third quartile.
| | 02:19 | I get those by clicking on the Minimum,
the Maximum and Quartiles, and now I am
| | 02:23 | ready. I can press Continue.
| | 02:25 | And I don't want the frequency tables
and I don't need any charts right now, so
| | 02:29 | I am just going to press OK.
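The pasted syntax for this request would look roughly like the sketch below; the variable names are assumed placeholders for the fields in searches.sav:

    * Hedged sketch of the descriptives request; variable names are placeholders.
    FREQUENCIES VARIABLES=spss business_intelligence data_visualization degree median_age
      /FORMAT=NOTABLE
      /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
      /NTILES=4.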
| | 02:31 | And there we go. I have a short table.
| | 02:33 | This is pretty easy to follow; however,
there are too many decimal places and
| | 02:37 | some of the statistics are out of order
and I don't like the way the labels are.
| | 02:41 | The easiest way to handle this is to
simply right-click on the table and copy it,
| | 02:46 | and once that's copied I can go into
Microsoft Excel and I'm going to go to the
| | 02:51 | second column and I'll paste the table there.
| | 02:54 | The reason I used the second column is
because I find it very helpful to have
| | 02:58 | one column that can maintain
the original order of things.
| | 03:02 | I just type in a couple of numbers and
then I can drag down and propagate the list.
| | 03:06 | Then I can start deleting
information that I don't need.
| | 03:09 | I don't need this title; it just says Statistics.
| | 03:13 | I do want to rearrange and use
different names for the statistics that are in
| | 03:17 | columns B and C. However, you'll see
that SPSS has merged some of the cells
| | 03:22 | which makes it harder to deal with.
| | 03:23 | So what I'm going to do is I'm going to
insert a new column and I will just call
| | 03:28 | it Statistics, and then I'll
put the names of the statistics.
| | 03:34 | You may want to call them different things.
| | 03:35 | I have a particular set of
abbreviations I frequently use.
| | 03:38 | N is common for the sample size, and
Missing I'm going to delete in a moment so
| | 03:42 | I am not even going to add that.
| | 03:43 | M for Mean, SD for Standard Deviation.
| | 03:47 | Then the next five numbers are quartiles.
| | 03:49 | Now I have a personal preference. This is
not a common way of doing it but I like it.
| | 03:54 | I refer to them as Q0 through Q4.
| | 03:57 | So the Minimum is Q0 because it's the
0th quartile. There's nothing below it.
| | 04:02 | The Maximum is Q4 because everybody is below
it, and the other ones are Q1, Q2, and Q3.
| | 04:10 | And once I have got those, I can
actually take these two columns right here
| | 04:14 | and I can delete them.
| | 04:16 | Now the only problem is that
these statistics are out of order.
| | 04:19 | We have 2, 3, 4, 5, 6, 7, but then
these ones need to be slightly different.
| | 04:25 | I can get that if I just change this
one to a 12, and then I select this column
| | 04:30 | and sort, and now the Q4 goes to the bottom.
| | 04:33 | I don't need this column
anymore. I can delete it.
| | 04:36 | Then I can delete the outlines around here.
| | 04:40 | I can center everything.
| | 04:42 | I can make these columns slightly
wider and now I am going to deal with the
| | 04:49 | issue of decimal places.
| | 04:51 | I don't need this many decimal places
for the Percentage of the Population
| | 04:55 | with the Bachelor's Degree or Higher
and the Median Age. I think it's okay to
| | 04:58 | have these two statistics, the Mean and
Standard Deviation, go down to two decimal places.
| | 05:03 | That's usually adequate for most purposes.
| | 05:05 | And then for quartile statistics I
actually prefer to take them down to no
| | 05:10 | decimal places, and then over here for
the three Google search terms we do have
| | 05:15 | a separate issue, in that
| | 05:16 | these are numbers that
inherently have a lot of decimal places.
| | 05:19 | So what I'm going to do is I'm
going to bring all of these down to two
| | 05:23 | decimal places as well.
| | 05:24 | I am going to delete this
column for the missing variables.
| | 05:30 | You can arrange things slightly
differently, but what I want you to see is that by
| | 05:33 | copying and pasting from SPSS into Excel,
I get a lot more flexibility in
| | 05:38 | terms of rearranging things, changing
the decimal places, renaming, and I can
| | 05:43 | take the information and put it
manually into a form that I feel is going to be
| | 05:47 | easier to communicate to others.
| | 05:48 | Now in the next video, I am going to show
you how to deal with the table of results
| | 05:53 | from a correlation and then from a
regression, and you can combine these to make an
| | 05:58 | overall presentation of your data.
| | Collapse this transcript |
| Formatting correlations| 00:00 | In the last movie, we looked at how to
take a table of descriptive statistics
| | 00:04 | in SPSS and then copy and paste it
into a spreadsheet, and then in that
| | 00:09 | spreadsheet to rearrange, delete, and
modify the values in there to make them
| | 00:14 | easier to communicate.
| | 00:16 | In this movie, I want to show you how
to take one particular kind of table, a
| | 00:20 | correlation matrix, and work with that
in a spreadsheet to clean it up and make
| | 00:25 | it much easier to deal with, where you
can go from potentially thousands of
| | 00:29 | numbers to a small handful and
present them in a way that makes them much,
| | 00:33 | much easier to follow.
| | 00:34 | For this example, I'm going to be using
the same dataset and the same variables
| | 00:38 | I did in the last one, the Google
searches information and searches.sav.
| | 00:42 | And the first thing I need to do is get
a correlation matrix, so I'll come up to
| | 00:47 | Analyze, to Correlate,
to Bivariate Correlations.
| | 00:52 | Now I find it helpful to take the
outcome variable and put that in first so it
| | 00:56 | shows up in the left column.
| | 00:58 | In this case, that's the relative
interest in SPSS as a Google search term.
| | 01:03 | The other terms that I used were
Business Intelligence and Data Visualization.
| | 01:08 | I also used an indication of
education with the percentage of the state's
| | 01:13 | population with a bachelor's degree or higher.
| | 01:15 | I used the Median Age and then I used
three indicator variables for the region
| | 01:21 | of the United States.
| | 01:22 | Now even though there are four regions
with indicator variables, you only need
| | 01:27 | one less indicator than
the number of categories.
| | 01:30 | So for instance, when we have the two
categories of gender, we only need a
| | 01:33 | single indicator variable
to indicate one or the other.
| | 01:36 | With four categories, we only need
three because the fourth category is implied
| | 01:41 | by zeros on the three variables.
| | 01:43 | But I can highlight the three of those
and move them over and now I just click OK.
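The equivalent pasted syntax is short; as before, the variable names below are assumed placeholders for the actual fields:

    * Hedged sketch of the correlation matrix request; names are placeholders.
    CORRELATIONS
      /VARIABLES=spss business_intelligence data_visualization degree median_age
          northeast midwest south
      /PRINT=TWOTAIL NOSIG
      /MISSING=PAIRWISE.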
| | 01:49 | Now I have a correlation matrix here and as
far as correlation matrices go it's not huge.
| | 01:53 | I've had ones with hundreds
of variables on each side.
| | 01:56 | But you see that we have the
variables listed down this side and the same
| | 01:59 | variables across the top, and we have
several statistics in the cells for each one.
| | 02:04 | Please note that at this point the
statistically significant correlations have
| | 02:08 | Asterisks next to them.
| | 02:11 | What I'm going to do is I'm going to
right-click on this table and copy it.
| | 02:14 | Then I'm going to go to a spreadsheet.
| | 02:17 | I'm using Excel in this particular case
and I'm going to paste this not into
| | 02:23 | cell A1 but into B1.
| | 02:29 | And the reason I'm going to do that is
I find it very helpful to have an index
| | 02:33 | column at the beginning that allows
me to restore the order of things.
| | 02:37 | So I have 1, 2. I can select those and drag
down and propagate the ordered list. Great!
| | 02:45 | And now what I can do is I can
start deleting and reformatting.
| | 02:48 | So for instance, you see in row one,
the word Correlations is a single merged cell.
| | 02:52 | That's going to make
it difficult to sort things.
| | 02:55 | So I'm going to simply delete that.
| | 02:59 | Then you can see that in column B, the
search terms are merged cells across three rows.
| | 03:04 | This also causes problems.
| | 03:06 | The way to deal with that is
to simply delete the column.
| | 03:12 | So I've lost the names of the variables
but I can get those back because I have
| | 03:16 | the same variables listed across the top.
| | 03:18 | However, I don't need the Pearson correlation
and the probability level and the sample size.
| | 03:23 | All I really want is the correlation,
so I'm going to get rid of the other two.
| | 03:28 | Simply click on a cell in that row
and then I can sort the entire table.
| | 03:33 | Now I have the Ns. They're all 51,
so I don't need those in my table.
| | 03:38 | Then I have the Pearson Correlations,
then I have the Sig. (2-Tailed).
| | 03:42 | Those are the probability levels.
| | 03:44 | I don't need those.
| | 03:45 | I will need to indicate significance in another way, and
I'm going to delete them for right now.
| | 03:50 | So now all I have are the
correlation coefficients themselves.
| | 03:54 | I'm going to sort this again to
try to get the titles on the top.
| | 04:00 | I'm going to cut this and then
insert it back beneath the titles.
| | 04:05 | Then in order to get the variable list
back on the side where it says Pearson
| | 04:09 | Correlation, I highlight the list here,
I copy that, I come back to this first
| | 04:16 | one and right-click, and I do
Paste Special and Transpose.
| | 04:21 | And that switches it
from horizontal to vertical.
| | 04:25 | And so you see now I have
the variables listed again.
| | 04:27 | Now I'm going to do something else.
| | 04:29 | I don't need all of these variables here.
| | 04:32 | I'm mostly interested in just
predicting SPSS, so I can highlight all of those
| | 04:37 | and I can delete them.
| | 04:39 | Also I don't need the
SPSS correlated with itself.
| | 04:43 | Now I can remove the borders.
I can get this one flush left.
| | 04:48 | I'm going to stretch this out a little
bit, but this one is too long so I'm
| | 04:53 | just going to call it Degree, and I'll
make these other two a little shorter
| | 04:59 | and center this one.
| | 05:02 | I don't need three
decimal places. Two is plenty.
| | 05:06 | But now I need to indicate which
ones are statistically significant.
| | 05:09 | I'm going to delete this column also.
| | 05:14 | Unfortunately, we had asterisks in the
SPSS table to indicate which correlations
| | 05:19 | were statistically significant, but we
lost them when we pasted into Excel.
| | 05:24 | That's not a big problem though.
| | 05:25 | We could go back and manually check,
but I know another way of doing this.
| | 05:30 | I've provided a spreadsheet called
Correlation-Probability-Formulas and what
| | 05:36 | you can do with this one is
you simply enter the sample size.
| | 05:39 | In this particular case, we have 51, and
it will tell you what absolute value of
| | 05:44 | correlation is statistically significant.
| | 05:46 | In this case, it's .276.
| | 05:49 | So anything with an absolute value greater than
.276, whether it's negative going past that
| | 05:54 | or positive going past
it, is statistically significant.
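If the Correlation-Probability-Formulas spreadsheet isn't at hand, the same cutoff can be derived directly; the sketch below assumes only SciPy and uses the t distribution with n - 2 degrees of freedom, which is where that .276 comes from.

    # A minimal sketch of the same lookup (not the course's spreadsheet): the
    # smallest |r| that is significant two-tailed at alpha = .05 for a sample
    # of size n, derived from the t distribution with n - 2 degrees of freedom.
    import math
    from scipy import stats

    def critical_r(n, alpha=0.05):
        df = n - 2
        t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-tailed critical t
        return t_crit / math.sqrt(t_crit ** 2 + df)  # invert t = r * sqrt(df) / sqrt(1 - r**2)

    print(round(critical_r(51), 3))  # 0.276 when the sample size is 51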
| | 05:58 | So I can go back to my table here and I
can do a quick conditional formatting.
| | 06:04 | Now it's a little silly when I
only have seven numbers here.
| | 06:07 | But the point is this works just
as well with thousands of numbers.
| | 06:10 | I highlight the numbers, I come over to
Conditional Formatting, I click on that,
| | 06:15 | and I'm going to create a new rule.
| | 06:19 | And I want to format only cells that
contain values that are not between
| | 06:24 | -0.276 and positive 0.276.
| | 06:30 | So the values have to be more extreme than that.
| | 06:33 | Then I go to Format and I can choose Fill and
maybe I'll make them yellow and I press OK.
| | 06:42 | And when I do that, I see that the top
three correlations are all statistically
| | 06:46 | significant because they have
absolute values greater than 0.276.
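For anyone who would rather script the highlighting than click through Excel, here is a rough pandas equivalent of that conditional-formatting rule; it shows only the two correlations quoted in the next movie (.49 for Business Intelligence and .60 for Data Visualization) rather than the full table.

    # A rough pandas equivalent of the Excel rule (a sketch, not the course's
    # workbook): fill any cell yellow whose absolute value exceeds the .276
    # cutoff for a sample of 51, then write the styled table out as HTML.
    import pandas as pd

    corrs = pd.DataFrame(
        {"r with SPSS": [0.49, 0.60]},   # the two r values quoted in the next movie
        index=["Business Intelligence", "Data Visualization"],
    )

    def flag(value, cutoff=0.276):
        return "background-color: yellow" if abs(value) > cutoff else ""

    # Styler.applymap (renamed Styler.map in newer pandas) applies the rule cell by cell.
    styled = corrs.style.applymap(flag).format("{:.2f}")   # two decimals, as in the movie
    styled.to_html("highlighted-correlations.html")        # open in a browser or paste elsewhere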
| | 06:51 | Now it's also helpful to create a
legend and highlight it in the same color, so
| | 06:59 | it's clear that that color means something.
| | 07:02 | If I want to, I can put a border around this.
| | 07:05 | Many of you will have training in
designing graphics and you'll find ways to
| | 07:10 | make this even clearer.
| | 07:11 | But what I've done here is
I've taken-- let's look back at the
| | 07:14 | original correlation matrix.
| | 07:17 | It's huge. There's hundreds of numbers here.
| | 07:21 | And I've boiled it down to seven numbers
and even then I've highlighted the ones
| | 07:25 | that are statistically
significant to make it easier to find.
| | 07:28 | So this is one way to take the output
of SPSS and to transform it into a way
| | 07:34 | that makes it easier to
communicate and easier to understand.
| | 07:38 | In the next video, I'm going to show
you how to integrate the results of a
| | 07:42 | regression analysis with this table, compare them,
and try to make the patterns clear across
| | 07:46 | the two ways of analyzing the data.
| Formatting regression| 00:00 | In the last two movies, we've looked
at ways to take output from SPSS and
| | 00:05 | reformat it by pasting it into a
spreadsheet and working with it to get it so
| | 00:10 | it's clear, simpler, and easier to communicate.
| | 00:13 | In the first movie, we looked at
formatting a table of descriptive statistics.
| | 00:17 | In the second one, we looked at how
to deal with a correlation matrix.
| | 00:22 | In this third one, I want to show you
how to take the results of a multiple
| | 00:25 | regression and compare them with the
results of correlation coefficients, as a
| | 00:31 | way of communicating the different
perspectives that these analyses can give
| | 00:35 | you and to make it clearer how to
interpret them in a meaningful way.
| | 00:40 | To do this, I'm going to be using the
same data set, Google searches, and the
| | 00:44 | same variables that I used
in the last two examples.
| | 00:47 | I need to get a linear regression output.
| | 00:49 | To do this, I come up to Analyze
and go to Regression, to Linear.
| | 00:55 | I need to take my dependent variable.
| | 00:57 | That's my outcome variable or
the thing I'm trying to predict.
| | 00:59 | That's SPSS and I put that into Dependent.
| | 01:03 | Then I take all the variables that I
want to use as my predictors, the things
| | 01:07 | that I think will explain interest in SPSS.
| | 01:11 | And in this case, I'm going to be using
the same ones that were used before: searches
| | 01:14 | for Business Intelligence,
searches for Data Visualization.
| | 01:18 | And then I'm going to come down to the
degree, Percentage of a state population
| | 01:25 | with Bachelors Degree or more, the
Median Age, and then my three dichotomous
| | 01:30 | indicators for Region.
| | 01:33 | Now I've mentioned before that Region
has four categories and the reason we
| | 01:38 | used three indicator variables for
this is because the fourth category, which
| | 01:44 | would be West, is implied by 0s in all of these.
| | 01:47 | In the other analyses, it's okay to
have a fourth indicator for West, but in
| | 01:51 | linear regression it's not.
| | 01:53 | That introduces something called
multicollinearity, and it can really wreak
| | 01:57 | havoc with the results if you have variables
that are entirely correlated with each other.
| | 02:02 | So that's why we don't do that.
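The same dummy-coding logic is easy to see in code; the sketch below uses a hypothetical region column rather than the course's data file.

    # A minimal sketch (hypothetical column, not the course's file) of why only
    # three indicators are used for four regions: West becomes the baseline
    # implied by zeros, while keeping all four would make the indicators sum to
    # 1 in every row -- perfectly collinear with the regression intercept.
    import pandas as pd

    states = pd.DataFrame({"region": ["Northeast", "Midwest", "South", "West"]})
    all_four = pd.get_dummies(states["region"], prefix="region")

    three_indicators = all_four.drop(columns=["region_West"])  # drop one level
    print(three_indicators)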
| | 02:04 | Now to make this one simple,
I'll leave it as Enter.
| | 02:07 | That means it's going to give me a
regression coefficient for all of these at once.
| | 02:11 | I just leave everything at
the default and I press OK.
| | 02:16 | And I have a number of statistics here.
The one I'm going to go to right now is
| | 02:20 | this one that says Coefficients.
| | 02:22 | Really there is one column
here that's of most interest.
| | 02:25 | It's the one that says
Standardized Coefficients Beta.
| | 02:28 | It's third from the right.
| | 02:29 | There's an inferential statistic next
to it, the T-Test, and then there's a
| | 02:33 | Sig. value next to that.
| | 02:35 | What I really want is the Beta
Coefficients, because those are the ones that are
| | 02:39 | most comparable to correlation coefficients.
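For readers who want to reproduce beta weights outside SPSS, a rough statsmodels sketch is below; the column names are placeholders rather than the course's variables, and the idea is simply that z-scoring the outcome and the predictors turns ordinary regression slopes into standardized coefficients.

    # A rough sketch (placeholder column names) of where standardized Beta
    # weights come from: z-score the outcome and every predictor, then fit
    # ordinary least squares on the standardized columns.
    import pandas as pd
    import statsmodels.api as sm

    def standardized_betas(df, outcome, predictors):
        cols = [outcome] + predictors
        z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=1)    # z-score each column
        X = sm.add_constant(z[predictors])
        fit = sm.OLS(z[outcome], X).fit()
        return fit.params.drop("const"), fit.pvalues.drop("const") # Betas and their p values

    # Hypothetical usage with made-up column names:
    # betas, pvals = standardized_betas(searches, "spss", ["business_intelligence", "data_visualization"])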
| | 02:43 | And then I'm going to indicate the
statistical significance by highlighting the
| | 02:46 | ones that are significant.
| | 02:48 | I'm also going to use some of the
information from the two tables above that,
| | 02:53 | the Model Summary and the ANOVA.
| | 02:54 | I'll show you those in a moment.
| | 02:56 | So what I'm going to do is I'm going
to right-click on my Coefficients table,
| | 03:00 | copy it, and I'm going to go to the
same Excel spreadsheet that I used for
| | 03:04 | modifying the correlation coefficients,
except for this moment I'm going to
| | 03:08 | start with the second sheet.
| | 03:10 | I'll go to B1 and paste the results in.
| | 03:14 | Again, because that allows me to
put in a column, so I can reconstitute
| | 03:19 | the order if I need to.
| | 03:21 | And then I'm going to start
getting rid of some information.
| | 03:23 | I don't need this merged cell
that says Coefficients on the top.
| | 03:27 | I don't need this giant merged
cell that says Model here on the side.
| | 03:31 | And then I don't need this one
that says t and I don't need the
| | 03:40 | Unstandardized Coefficients.
| | 03:41 | So these are the ones in the original
metric, but I'm just going to leave those
| | 03:44 | out for right now, because the
standardized coefficients, which are also called
| | 03:48 | the Beta Weights, are the ones that
are most easily compared with the
| | 03:53 | correlation coefficients.
| | 03:54 | Now the Constant, the Intercept
term, doesn't have a standardized
| | 03:58 | regression Beta Weight.
| | 03:59 | That's fine, so we can just leave that out.
| | 04:01 | And in fact, what I'm going
to do is I'm going to put here
| | 04:05 | Predictor, Beta, and then I'm going to put p
right here, and I don't need one for the Intercept.
| | 04:13 | That way I can delete these merged cells
up here and I have just these ones left.
| | 04:20 | I don't need to worry too much about
the formatting of the labels here, because
| | 04:27 | I'm going to use the ones on the other page.
| | 04:30 | In the last one, I highlighted everything
that was statistically significant at the .05 level.
| | 04:35 | I'm also going to highlight the ones
here that are statistically significant.
| | 04:40 | An easy way to do that is to come
in here to the p values and sort.
| | 04:45 | And so now all the small p values,
the ones that are statistically
| | 04:48 | significant, are right here.
| | 04:50 | And then I can highlight those and then
if all goes well, I can sort them again.
| | 04:57 | Now I can delete the p values. All I
need are these ones, and I'm going to
| | 05:04 | copy those and I'm going to go to the first
page where I have my correlation coefficients.
| | 05:13 | And I just want to make sure that
everything is in the same order. It is.
| | 05:18 | These I need to say are
correlations and these are beta coefficients.
| | 05:25 | A beta coefficient is a standardized
regression coefficient, and then here I've
| | 05:31 | got Predicting SPSS.
| | 05:35 | And so now what I have, I'm going to
remove the borders that I actually put in
| | 05:39 | earlier, and I'll get those all centered.
| | 05:45 | Here's an interesting thing.
| | 05:46 | The correlations and the beta
coefficients, I'm going to change the decimal
| | 05:50 | places here, are approximately the same thing.
| | 05:54 | Now what's interesting about putting
the correlation coefficients in one column
| | 05:58 | and the beta coefficients next to them
is you can see actually that there's a
| | 06:01 | huge contrast between the two of these.
| | 06:04 | In the correlations, we had three
variables that individually had high
| | 06:08 | correlations with the relative
interest in SPSS as a Google search term.
| | 06:13 | They were Business Intelligence, Data
Visualization, and the proportion of a
| | 06:17 | state's population that had degrees.
| | 06:19 | All three of those are significantly
and positively correlated, and the age and
| | 06:24 | the region variables were not.
| | 06:26 | However, when we go over to the regression
results, we get a very different pattern.
| | 06:31 | For one thing, Business
Intelligence is no longer significant;
| | 06:34 | its coefficient has gone
negative, but it's not significant, so we'll
| | 06:37 | treat it as functionally 0.
| | 06:39 | Degree has also gone negative,
but it's not significant.
| | 06:42 | Data visualization on the other hand is
still statistically significant and it
| | 06:47 | has actually gone much, much higher.
| | 06:49 | Beta coefficients are like correlations
in that they generally run from 0 to 1
| | 06:53 | in absolute value and can be positive or negative.
| | 06:54 | This is almost as strong as it can be.
| | 06:57 | Data Visualization becomes a huge predictor.
| | 06:59 | And then what's really shocking is
that these three region variables, which
| | 07:04 | individually had no correlation with
interest in SPSS, have all three
| | 07:08 | become statistically
significant in the regression.
| | 07:12 | What this lets us know is that region as
a whole does matter, and mostly because
| | 07:17 | these three are contrasting with
the West, we would want to look at the
| | 07:21 | relative interest in SPSS across the four regions.
| | 07:24 | The other thing to keep in mind is
that the correlation coefficients are
| | 07:28 | valid individually.
| | 07:30 | The correlation of Business Intelligence
to SPSS of .49 is calculated on its own.
| | 07:35 | The next one down between Data
Visualization and SPSS, where we have
| | 07:38 | a correlation of .60,
| | 07:40 | that's calculated on its own.
| | 07:42 | However, for the regression
the seven beta coefficients are
| | 07:46 | calculated simultaneously.
| | 07:49 | If we removed any one of these,
all of the others would change.
| | 07:53 | They're taken as a combination and their
values and their probability values are
| | 07:58 | only valid when taken as a group.
| | 08:01 | And so that's one of the reasons why you
can get very different patterns when you
| | 08:05 | compare a linear regression
result with a correlation.
| | 08:09 | Now there's one other thing I want
to add for the linear regression.
| | 08:13 | And that is this thing up here, under
Model Summary where it gives the R Squared.
| | 08:19 | And that is an indication of the
proportion of variance in the outcome
| | 08:23 | variable, which is SPSS searches,
that can accurately be predicted by the
| | 08:27 | combination of the other variables.
| | 08:29 | And what we have here is an R Squared of
.589, and what that means is that nearly
| | 08:34 | 60% of the variance in SPSS searches
can be predicted by these other seven
| | 08:40 | variables collectively.
| | 08:42 | So I'm going to take that .589, I'm
just going to insert a row, and I'll label
| | 08:47 | it R Squared, and I'm going to put down
the .589. I'll just round it off right
| | 08:52 | now and you can actually
put that down as a percentage.
| | 08:55 | And I'm going to leave it highlighted,
I'll change that one to a percentage,
| | 09:00 | and I'm going to leave it
highlighted in yellow, because it is
| | 09:04 | statistically significant.
| | 09:05 | What that means is it's different from
0, and the way I can tell that is by
| | 09:09 | the result in the next table, the
Analysis of Variance table, where the model
| | 09:13 | as a whole has a significance value
shown as .000 here, meaning less than .001.
| | 09:18 | And so I know that that R Squared
value of .589 is statistically significant.
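That link between the Model Summary and the ANOVA table can also be checked directly; the small SciPy sketch below uses only the numbers already on screen: 51 cases, 7 predictors, and an R Squared of .589.

    # A small check (not part of the course files): with n cases, k predictors,
    # and the reported R squared, the overall F statistic and its p value
    # follow directly, matching the ANOVA table's conclusion.
    from scipy import stats

    n, k, r_squared = 51, 7, 0.589
    f_stat = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
    p_value = stats.f.sf(f_stat, k, n - k - 1)
    print(f"F({k}, {n - k - 1}) = {f_stat:.2f}, p = {p_value:.2g}")  # p is far below .001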
| | 09:23 | What I have here is a result that says
that those seven variables collectively
| | 09:28 | predict a lot of the interest
in SPSS as a Google search term.
| | 09:33 | What's funny about it is that the
pattern from the individual correlations to
| | 09:37 | the combined regression
coefficients changes dramatically.
| | 09:41 | And it's not the case that one of these
is accurate and the other is inaccurate.
| | 09:45 | They are both accurate; they are just
very different perspectives on the issue,
| | 09:49 | the individual versus the group predicting.
| | 09:52 | Anyhow, this can be one step in trying
to tell an analytic story about your data.
| | 09:57 | It can get complicated.
| | 09:58 | It can require some insight and some
judgment in how best to interpret it.
| | 10:02 | But this is a way of taking a huge
amount of numbers and a huge number of
| | 10:07 | tables and boiling them down to a
very small concise way of presenting the
| | 10:11 | results, which I think makes it much
easier for you to articulate your story,
| | 10:16 | your vision of your data analysis.
| Exporting charts and tables| 00:00 | In this final video, I want to show you
how to take the charts that you create
| | 00:04 | in SPSS and export them as HTML and as
image files as either JPEG or PNG or some
| | 00:10 | other format that you can then use
to integrate into your word processor
| | 00:14 | documents, into your presentations, or
into your web pages, as a way of sharing
| | 00:19 | the results of your analysis.
| | 00:20 | For this example, I use the same data
set, Searches.sav, and what I am going
| | 00:25 | to do is I will just make two or
three sample charts very quickly and then
| | 00:28 | show how to export them.
| | 00:30 | In this particular case, I'll make a bar chart.
| | 00:32 | I go into Graphs, and then the Chart
Builder, then I am going to make a bar
| | 00:37 | chart of regional variation and
interest in SPSS, because that showed up in
| | 00:42 | our regression results.
| | 00:43 | So I am going to come down and get the
Census Bureau Region, put that in the
| | 00:47 | x-axis, and get SPSS and make that
the variable as being charted here.
| | 00:52 | I put error bars on it and click OK.
| | 00:58 | And what I see is that the West has
much, much lower interest in SPSS as a
| | 01:04 | relative search term than the other
three regions, which would explain the
| | 01:07 | curious results of our
output in the linear regression.
| | 01:11 | I am going to change these just
for a moment, just a small amount.
| | 01:15 | Really I think all I am
going to do is change the colors.
| | 01:19 | You can change them however you want.
| | 01:21 | You can make individual
bars of different colors.
| | 01:23 | I will just press Close and close that.
| | 01:26 | So there's one chart.
| | 01:27 | Next thing, I am going to make a scatter plot.
| | 01:30 | Go to Graphs, back to the Chart Builder.
| | 01:33 | This time I will choose Scatter, and I
will bring that up, and I am going to
| | 01:37 | look at the association between
Business Intelligence and interest in SPSS.
| | 01:41 | Now I'll hit OK and I
have got a scatter plot there.
| | 01:46 | I am going to clean it up slightly.
I don't need all those decimal places.
| | 01:50 | So I am going to number
format and change those to zeros.
| | 01:53 | I will do the same thing over here, and
then what I am going to do is I am going
| | 01:59 | to change those to solid red circles.
| | 02:02 | Then I am going to add two lines because I can.
| | 02:06 | There is a regression line, but what I
am going to do with that regression line
| | 02:09 | is actually change it to what's
called a Smoother, which follows the pattern a
| | 02:13 | little more closely.
| | 02:14 | Then I am going to change
the color of that to Grey.
| | 02:18 | And then I will also add
a linear regression line.
| | 02:23 | It's added a Quadratic. That's okay.
| | 02:24 | I just change it to Linear.
| | 02:27 | I can delete that, and I am going to
change the color of the linear regression line.
| | 02:31 | I will make it grey also, perhaps a
darker grey, and there's my scatter plot.
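For anyone who wants to build a comparable chart in code rather than in the Chart Builder, here is a rough matplotlib stand-in; the data are simulated placeholders, not the values in Searches.sav.

    # A rough matplotlib stand-in for the Chart Builder scatter plot (simulated
    # placeholder data, not Searches.sav): red markers with a grey fitted line.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    business_intelligence = rng.normal(50, 10, 51)
    spss = 0.6 * business_intelligence + rng.normal(0, 8, 51)

    slope, intercept = np.polyfit(business_intelligence, spss, 1)  # simple linear fit
    grid = np.linspace(business_intelligence.min(), business_intelligence.max(), 100)

    plt.scatter(business_intelligence, spss, color="red")
    plt.plot(grid, slope * grid + intercept, color="grey")
    plt.xlabel("Business Intelligence")
    plt.ylabel("SPSS")
    plt.savefig("scatterplot.png", dpi=150)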
| | 02:37 | And now what I can do is I can
take my charts and I can export them.
| | 02:41 | Now it's easiest to just simply
export everything in the output.
| | 02:45 | However, you may not want to have all of
the text and all the other information.
| | 02:49 | So for instance, this right here is
the Log. You see when I click on that Log, it's
| | 02:53 | highlighted. I can delete that if I want.
| | 02:56 | This is the title of the chart.
| | 02:57 | We have something called
Notes that doesn't show up.
| | 03:00 | It's there, but it's hidden.
| | 03:01 | I find it convenient sometimes to
just come over here, and get everything I
| | 03:05 | don't want and delete it.
| | 03:07 | You can do that or you can leave it in.
I will delete them for one and leave
| | 03:11 | them in for the other.
| | 03:13 | But what I am going to do now is I am
going to save my output and then come
| | 03:17 | to File, to Export.
| | 03:19 | And what you have is a lot of options here.
| | 03:21 | You can export them as a
Microsoft Word document.
| | 03:24 | You can export them as an Excel
file, as a PDF, or straight into PowerPoint.
| | 03:30 | Now I personally find the easiest way of
dealing with these is to export them as
| | 03:34 | an HTML file because what that does is
it exports the entire output as a single
| | 03:39 | HTML file, but it also downloads
all the graphics as individual files.
| | 03:45 | You can do JPEGs if you want. On the
other hand, if you're going to be putting
| | 03:48 | this up on the web, PNG
files can be more helpful.
| | 03:51 | The entire output is a single HTML file
and each chart is a separate PNG file.
| | 03:56 | All I need to do now is tell
it where I want to save things.
| | 03:59 | I click on Browse and I created a folder
already called SPSS Output in HTML and PNG.
| | 04:06 | So I am going to double-click on that
and then I'll just call it Exported
| | 04:11 | Output, press Save, and I will press OK.
| | 04:14 | We have Exporting progress.
| | 04:16 | There are times when that can take
quite a while. This is a very short output.
| | 04:20 | Now I'll show you. If I go to the folder
that I have created, SPSS output in HTML
| | 04:25 | and PNG, I can double-click on that.
| | 04:27 | Then you see we have an HTML
file here. I double-click on that.
| | 04:32 | This has the entire results.
| | 04:35 | These, for instance, are the Notes
that don't show up in the Viewer but are
| | 04:39 | there, and it has its graphics also.
| | 04:43 | On the other hand, I also have each
chart as a separate PNG file right here and
| | 04:49 | I can open it with the Windows
Photo Viewer and there it is.
| | 04:53 | In that way, I can take these graphics
and put them into whatever program I want,
| | 04:57 | however I feel will best present them.
| | 04:59 | That ends the final presentation on how
to take the results of your analysis and
| | 05:04 | find a way to present them to others
that will make it easier for you to tell
| | 05:08 | your analytic narrative, to make sense
out of your results, and to find, hopefully,
| | 05:12 | surprises and insights that will give
you an advantage in conducting your own
| | 05:16 | work, and make it easier to
sell your points to others.
ConclusionWhat's next| 00:00 | So that ends our course on SPSS
Statistics Essential Training.
| | 00:05 | Thanks for joining me.
| | 00:06 | I hope that this course has
been insightful and enjoyable.
| | 00:09 | I also hope that you've been able to
expand your analytical abilities so that
| | 00:13 | you're better able to work with critical
data in your research and professional work.
| | 00:17 | Now, here are some
recommendations for further development.
| | 00:20 | Your first stop should be the excellent
help applications that are included in SPSS.
| | 00:25 | These are more than just help files.
| | 00:27 | SPSS also offers presentations to
walk you through advanced procedures and
| | 00:31 | provides illustrated case studies.
| | 00:33 | I strongly encourage you to explore
those resources and see how they can help
| | 00:37 | you find ways to make the
most of SPSS in your work.
| | 00:40 | With that, it's time to let your data
talk to you and for you to have some fun
| | 00:45 | telling your own analytic
narrative. Best of luck!
| | 00:48 | We look forward to seeing you again soon!