
SPSS Statistics Essential Training
with Barton Poulson

 


In this course, author Barton Poulson takes a practical, visual, and non-mathematical approach to the basics of statistical concepts and data analysis in SPSS, the statistical package for business, government, research, and academic organizations. From importing spreadsheets to creating regression models to exporting presentation graphics, this course covers all the basics, with an emphasis on clarity, interpretation, communicability, and application.
Topics include:
  • Importing and entering data
  • Creating descriptive charts
  • Modifying and selecting cases
  • Calculating descriptive and inferential statistics
  • Modeling associations with correlations, contingency tables, and multiple regression
  • Formatting and exporting tables and charts


Author: Barton Poulson
Subject: Business, Data Analysis
Software: SPSS 19
Level: Beginner
Duration: 5h 5m
Released: Aug 17, 2011




Introduction
Welcome
00:04Hi, I am Bart Poulson, and I would like to welcome you to SPSS
00:07Statistics Essential Training.
00:09SPSS is a statistics and data analysis program from IBM.
00:13It's very popular, it's very powerful, and it's a great way to work with your
00:17data for new insights.
00:19In this course, I'll demonstrate how to use charts, such as histograms, bar
00:24charts, scatter plots, and box plots to get the big picture of your data.
00:27I will show you how to use inferential statistics, like T-Tests, analysis of
00:32variance, and chi-square to help you determine the reliability of your results
00:37and how they can generalize to a broader population.
00:40I will also show you how to enter and read data in SPSS and how to check and clean data.
00:47If you're new to SPSS, I think you are going to be amazed with what you can do.
00:50If you're an experienced SPSS user, there will be many new tools and methods
00:55that can help you gain even more insight from your data, and with that in
00:59mind, if you're ready to get going, let's get started with SPSS Statistics
01:04Essential Training.
Using the exercise files
00:00If you are a Premium member of the lynda.com Online Training library, or if
00:05you're watching this tutorial on a DVD, you will have access to the exercise
00:08files used throughout this title.
00:11The exercise files are contained in a folder, and there's one SPSS project
00:15folder for each movie.
00:16Inside the SPSS project folder, you'll find a data file and any other files
00:21needed to follow along with the movie.
00:23In some cases, there are additional assets, like data files, syntax files, or
00:28exported images and HTML files.
00:30If you are a Monthly subscriber or an Annual subscriber of Lynda.com, you
00:34won't have access to the exercise files, but you can follow along from scratch
00:38with your own assets.
Using a different version of the software
00:00Before we get going, let me mention something about versioning in SPSS.
00:05SPSS has been around for over 40 years and has been revised frequently.
00:10It has even changed its name a few times, from Statistical Package for the
00:14Social Sciences--hence the initials SPSS--to just the letters SPSS, to
00:19Predictive Analytics Software, or PASW, and then since it was purchased by IBM a few
00:24years ago, it has been known as IBM SPSS Statistics.
00:28Now, the movies for this course were created with Version 19 of SPSS, and new
00:34versions roll along about once per year now.
00:37However, only one of the movies in this course relies on any features that are
00:41brand new to this version of SPSS-- that's the movie on automatic linear modeling
00:46by the way--and even then, I show how to do the same things using commands that
00:50have been in essentially every version of SPSS ever made.
00:54Everything else in this course relies on procedures that have been in SPSS for
00:58at least several years and several versions.
01:01So while this course was created using the current version of SPSS, it applies
01:06almost universally to previous versions of SPSS and no doubt to future
01:10versions of SPSS as well.
1. Getting Started
Taking a first look at the interface
00:00At first glance, SPSS resembles a spreadsheet. There are rows and columns of data where
00:06each column represents a variable, such as a customer ID number, a question on a
00:10survey, or a city's population, and each row typically represents a case, which could
00:16be a person, a company, an advertising campaign, or whatever.
00:19However, there's a lot more to SPSS than that.
00:23First off, SPSS has more than one window.
00:26It has two, or possibly three windows.
00:28The window we are looking at right now is the Data Editor window, or Data window,
00:33and I have a sample data set called searches.sav open.
00:37This is a data set that contains information about Google searches for specific
00:41terms, such as SPSS or regression, for each of the 50 states and Washington DC, and
00:47I will be using this data set frequently as a sample during this course.
00:51If you look at the tabs on the bottom left, this is what's called the Data view.
00:56The Data view is the one that looks like a spreadsheet.
00:58However, there is also one called a Variable view.
01:00If you click on that, you will see that it has information about the variables.
01:06The first column is the variable names.
01:09Variables in SPSS have to have single-word names. They can be up to
01:1364 characters, they can have underscores or dots, and they can be upper- or lowercase.
01:19Otherwise they need to be relatively short, and again they do need to be a single word.
01:24The next column is the type of the variable.
01:26A string variable for instance is a text variable, and the state codes like
01:30CA or NY are entered as text.
01:32Everything else in here is entered as numbers and they're numeric variables, even
01:36though several of them have words laid over on top of them, as I will show you in a moment.
01:40The third one is the width of the variable and the fourth one is the number of decimal places.
01:45The next one is what's called the Label.
01:47This means although the variables may have short names, like state_code, the
01:52label can be something that's a little easier to read, like
01:54State_code with capitalization. Or if you go further down to row 18, you see
01:59there is one called degree.
02:00That's the name of the variable, but the label is much longer.
02:03It is percent of population with bachelors degree or higher.
02:06So the label can be much more descriptive, and since the label is what's going to
02:10show up in a chart or in a table, you want to make that long enough that it's
02:13easy to tell what it is.
02:15The next column is Values, and I said that most of these variables are
02:19entered as numbers.
02:20Now some of them just are numbers.
02:22The Google search information is numbers.
02:25They tell you how high a particular search term rates, relatively speaking,
02:30compared to all others for a particular state.
02:32On the other hand, other variables such as 15, 16, and 17--has NFL, has NBA,
02:38and has MLS for Major League Soccer--
02:41those are Yes/No variables. Those are called indicator variables, and I enter
02:45them as 0 for no and 1 for yes.
02:47So the numbers are what's in the dataset, but you can see that I tell SPSS in
02:51values, if I come over it and click on that, that 0 equals No and 1 equals Yes, and
02:56you can add them and change them and remove them in this dialog box.
03:00The next column is whether you want to specify explicitly any particular value
03:04to indicate missing information.
03:07Say for instance a person forgets to answer a question. You may want to
03:09indicate that's an accidental omission. Perhaps you can give that a 999 to
03:14indicate that it's accidental. Or if you didn't ask a question because it wasn't
03:18relevant, you could give a different code like 888, or whatever you want. Just make
03:22sure it doesn't overlap with the valid information.
03:25The next column is simply how wide the column is in the data set, and I make them
03:2911 spaces by default.
03:31Let's scroll over a little bit here. Then there is alignment within the
03:35column: Left, Center, or Right.
03:37The last two are specific statistical things.
03:39This is what's called the Level of Measurement, and in SPSS a variable can be
03:44nominal, which means it simply indicates a different group. A string
03:49variable where you write words is nominal, but a 0/1 indicator variable is also
03:54nominal, and the region of the United States, which has 4--
03:571, 2, 3, 4--regions, can be nominal as well.
04:00A variable can also be ordinal. You can indicate, for instance, the client with the
04:04largest account, then the second largest, and the third largest.
04:09The other choice in SPSS is what's called a Scale Variable, and you see there
04:13is a little ruler next to it.
04:15These are variables that are measured as more or less in set units
04:18so you can actually calculate statistics like an average for them, whereas you
04:22can't with a nominal variable.
04:25The very last column is called the Role, and this is a relatively new feature in
04:29SPSS. And you specify, for instance, whether a particular variable is to be used
04:35as an input variable, that is, you're using it to predict values on other things.
04:39These are sometimes called independent variables or predictor variables.
04:43A variable can also be a target variable; that is, it's something
04:47that you're trying to explain, like for instance spending on particular
04:51products. Or a variable can be both, sometimes an input, sometimes a target and
04:56you see them marked as both.
04:58Finally, a variable can also be marked as none.
05:01That means it's not an input or a target variable;
05:03it's simply there, like the state code, as an identifier or indicator.
05:08And so those are the options in the Variable View window.
05:10Let me go back to the Data view now.
05:13The next thing to note is you can actually have a lot of variables in SPSS.
05:17It's limited only by its ability to address the variables.
05:21It can address over two billion variables and two billion cases, which you are unlikely
05:26to hit in most situations. But this is the Data window.
05:30Now, what makes this different, also, aside from the metadata and the Variable
05:34window, is that when you run a command in SPSS, unlike a spreadsheet, it doesn't
05:38show up on the same page.
05:39For instance, I am going to quickly make a chart. I'm going to make what's called
05:43a histogram for "interest in SPSS" as a search term.
05:46I go up to Graphs, and I click on something called the Chart Builder, which I will
05:51demonstrate more fully in a later movie.
05:53I am going to pick a histogram and drag it up into what's called the Canvas, take
05:59SPSS, and put it down here.
06:02Now what's interesting is I have a lot of options about how I set this up--and we'll save that--
06:05but I want to show you two things. One is I can click OK and go straight from
06:13that dialog box, not to the Data window but to an Output window, and in the
06:19Output Window I set it up so that it gives me the written code that can produce
06:23this chart over again.
06:24That's the information about the commands, and there's the chart.
06:28But you see this is a separate window. We had a Data window; now we have an Output window.
06:32I am going to go back to the command for just a moment and show you an
06:36optional third window.
06:39Right next to the OK button there is something called Paste, and if I click that,
06:43it opens up a window called a Syntax window, and this is just command-line code.
06:48By pasting it, it has taken the written commands for this particular chart and
06:52it's put them in a Syntax window and I can use it to either modify the commands
06:58or I can use it to recreate the command at a later time.
07:00It's a great way of sharing information with people.
07:04So watch, I can simply highlight all of this and I can come up and press the
07:07big green Run button, the Play button.
07:10If I hit that, you will see that it's done it all over again.
07:14It's a great way of replicating analyses.
07:16For instance, you can set up an analysis when you have only part of the data, or
07:21you can run it periodically as new data comes in.
07:23It's a wonderful feature.
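To give a concrete sense of what that pasted command-line code looks like, here is a rough sketch of the kind of GGRAPH/GPL syntax the Chart Builder pastes for a histogram like this one (the variable name spss and the axis labels are assumptions; the exact code your version generates may differ):
    * Sketch of pasted Chart Builder syntax; the variable name "spss" is assumed.
    GGRAPH
      /GRAPHDATASET NAME="graphdataset" VARIABLES=spss MISSING=LISTWISE REPORTMISSING=NO
      /GRAPHSPEC SOURCE=INLINE.
    BEGIN GPL
      SOURCE: s=userSource(id("graphdataset"))
      DATA: spss=col(source(s), name("spss"))
      GUIDE: axis(dim(1), label("Interest in SPSS"))
      GUIDE: axis(dim(2), label("Frequency"))
      ELEMENT: interval(position(summary.count(bin.rect(spss))), shape.interior(shape.square))
    END GPL.
Highlighting lines like these in the Syntax window and pressing Run reproduces the chart, which is exactly what makes the Paste workflow useful for repeating analyses.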
07:25Now let me show you a couple of other features here in SPSS.
07:29There are, for instance, the File menu, the Edit menu, and the View menu.
07:33These are common things.
07:34The Data menu allows you to do a number of procedures to modify the data--
07:38I'll show this in a few movies--and so does the Transform menu, which is for creating new variables.
07:43You can insert headings and titles in your output. Analyze is the actual
07:48statistical procedures menu, we will go through that.
07:51Now Direct Marketing here is a separate add-in.
07:54SPSS has a lot of add-ins that you can purchase separately to give increased
07:58functionality to SPSS, but I won't be demonstrating those.
08:01The techniques that I am going to be using in this particular course all involve
08:05the base procedures that are available in SPSS.
08:08The next command is to make graphs, and I have a whole series of movies about those.
08:13Utilities can be a way of getting more information about the variables or about
08:18creating scripts and production jobs, which are more advanced procedures
08:22which we won't be covering in this course.
08:24Add-ons gets into some of the other services that you can purchase that connect
08:29with SPSS, such as SPSS Modeler which is for data mining and SPSS Text Analytics
08:35for analyzing open-ended natural language, like customer comments on a webpage or
08:40twitter feeds--it's a great way to go.
08:42And then finally, the Help menu here gives you a huge amount of information.
08:47Let me open up, for example, the Tutorials, and this opens up in a web browser,
08:52although it's a locally stored file.
08:54And what you see here is an entire collection of presentations that SPSS will
08:59run through to teach you how to do any of a number of procedures, and they are
09:03very useful for learning how to use SPSS in even more depth.
09:08Back in SPSS, there is also what's called the Command Syntax Reference.
09:12This is a 2500-page searchable PDF file about the command-line syntax
09:18programming that you may be able to use at a later point for more
09:21advanced work.
09:23Now there are just a couple more things I want to show you in SPSS about how
09:27to set up the program.
09:28If I come back to Edit and go down to Options, there are a number of things you
09:33can do to customize the way SPSS works for you.
09:36There's a few in particular I want to point out.
09:38One is in this tab called Viewer. Down at the bottom, on the left, there's a
09:42checkbox for Display commands in the log, and that's the thing that makes it
09:46so that SPSS inserts the written code that produces each analysis, or each
09:51display, as you go through.
09:53I find it a very helpful thing to do, in addition to pasting the syntax into a
09:57syntax window to be saved separately.
10:00The other one that I think is important is under Output Labels, the second one
10:03from the right on the bottom.
10:05Output Labels lets you show things as either the labels that you give them--you
10:09may recall for instance we had the variable called Degree, which had a much
10:13longer label about percentage of population with a bachelors degree or higher.
10:18You could either have that long label show up in the output and in the tables and
10:23in the figures or you could have the short name, which is just degree, or you
10:28can have both of them.
10:30Similarly with the Value Labels: for instance, I had whether a state had an
10:35NFL team, with 0 as No and 1 as Yes.
10:39Labels means you can have the yes's and the no's show, but you can also do it
10:43as 0s and 1s, and you can also do it as both: 0, No; 1, Yes.
10:49And I use one or the other depending on the situation.
10:52It can be a good way to keep track of things.
10:54Using just the labels can also be a way of making things more presentation-ready.
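By the way, both of these Options have rough command-syntax equivalents through the SET command, so they can be scripted; a sketch, assuming the SPSS 19 keywords (check the Command Syntax Reference for your version):
    * Echo commands into the output log, and show names plus labels in output.
    SET PRINTBACK=ON.
    SET TVARS=BOTH TNUMBERS=BOTH.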
10:59And yes, those are the options, and I encourage you to search through some of
11:02those a little bit more to see what else is there.
11:05So, in the organization of SPSS, there is a superficial similarity to a
11:09spreadsheet, but you can see that it has been developed with an eye towards
11:13making statistical graphing and analysis much faster and more organized.
11:18Also, with the option to Paste command syntax into its own window and save it as
11:23part of the output with each procedure, that makes it much easier to keep track
11:27of what you do to share with others and to repeat analyses.
11:31Finally, SPSS's extensive help collection can make it easy for you to get
11:35directions and walkthroughs on nearly every procedure that SPSS does.
11:39In the next video, we will talk about one other setup process, and that is
11:44getting data from an external spreadsheet into SPSS.
Reading data from a spreadsheet
00:00While it's possible to enter data directly into SPSS or download it in the
00:04SPSS .sav format, data sets will often come to you in other formats, such as
00:11database files, text files, or frequently as spreadsheets, and there are
00:15actually advantages to this.
00:17Files in these other programs, such as spreadsheets, are usually easier to create
00:21and share than are SPSS files.
00:24Also, SPSS is well set up to import data from each of these formats.
00:29In this movie, I will show you how to work with spreadsheets in Microsoft's .xls
00:34and .xlsx format from Excel.
00:37At the end of the movie, I will point you to SPSS's excellent instructions and
00:41tutorials on importing data from other sources as well.
00:44I'm going to begin by using a data set that I downloaded from Yahoo Finance
00:49about the 2,800 stocks in the NASDAQ index. This is called NASDAQ.xls. And what we
00:56have here is the Symbol, the Name for each stock, as well as the LastSale Price
01:01before I downloaded,
01:03the company's Total Market Capitalization, the Year of its initial public
01:07offering, its Sector, and its Industry. And if we scroll to the right, you can
01:11also see a web link for a summary quote.
01:15Now to import this into SPSS, there are a few things I need to do.
01:19Number one is I am going to get rid of some information that I just don't want.
01:22The information about the summary quotes here, I don't need that, so I am just
01:26going to come up here and I am going to delete that column.
01:29That makes things a little bit simpler.
01:31The second thing is I can't have variables that mix numbers and letters in them
01:37or SPSS treats them entirely as String variables or Word variables.
01:42The most egregious example here is the IPOYear.
01:45You see it says 1999 at the top, and then we have several N/As for Not Available,
01:49and what I need to do is get rid of those N/As so SPSS will treat it
01:55strictly as a numerical variable.
01:57The easiest way to do that is to sort the column. I just click on a cell in
02:01there and come up to Sort, and I see we go from 1970 and I can just scroll down.
02:06There we go. I see I can select all of the N/As. I start there and come down
02:15to row 2821, I Shift+Click, and then I can just hit Clear Contents.
02:21Now I also need to check the other two dollar values, the LastSale and the
02:25MarketCap, just to double-check.
02:27I am going to click on LastSale and I will sort that. See, it goes down
02:33to 1 cent. What's up at the top?
02:36Okay, I have a few N/As in there too, and if I left those in there, those three
02:40values could turn the 2,818 others into String variables,
02:45so I don't want that. I'll press Clear Contents, and then I have a few here under MarketCap.
02:49I will clear those.
02:51I am going to sort MarketCap separately, just to double-check.
02:58And look, we have one more right there.
03:02Once we have done that, I believe we are ready to import this.
03:05It's okay that I have N/As in Sector because that's a text variable anyhow.
03:09I am just going to come back over to the first column, Symbol, column A, and sort
03:16that by the Symbol again from top to bottom.
03:20So we start at the Australia Acquisition Corp.
03:24I am going to save this data set, and then I need to close it because SPSS can't
03:29open it if it's open in Excel.
03:31So I am going to close the data set, minimize this, and here I am in SPSS now.
03:36If I just come over to File, to Open, to Data, and I simply navigate to the
03:43folder where I have this spreadsheet,
03:45now I need to tell SPSS that I am looking for a spreadsheet,
03:48because right now it's only looking for .sav files.
03:50I come down to spreadsheets, and now it shows up, and I can just double-click on it to open it.
03:56It gives me a suggested range of the data. If there's more than one worksheet in
04:02the spreadsheet, it automatically suggests the first one; but if you have others,
04:06you can navigate to them in this way.
04:08But I am going to use data--that's the Name of the worksheet--cells A1 to G2821.
04:15I will just press OK, and there we go.
04:19You see, for instance, that the variable names are listed across the top in
04:22the blue row and we have the Symbol, the Name, the LastSale, and the
04:28MarketCap, the IPO.
04:30Now in IPO I cleared out the N/As, and those were blank cells in Excel.
04:35Here they have dots.
04:36A dot is what goes into a blank numeric cell in SPSS.
04:40So actually, that still indicates that those are missing.
04:42I am going to scroll over to the right for a minute and see what else we have. We have
04:47Industry. I am going to make that a little skinnier by just dragging it over.
04:51I am going to come back, and I will take the Name, and I will make that
04:54skinnier so I can see more of the data.
04:57I do need to fix a couple of things. The LastSale and the MarketCap are both
05:01dollar values, and I need to turn them into dollar values and change the decimal
05:05places for both of them.
05:07So what I am going to do is I can either click on the Variable View tab at the
05:10bottom left or I can simply double-click on the name of the variable.
05:13I will do that. And I can go to Type and tell it it's a Dollar value.
05:19And I will click this one down to the bottom, just two decimal places, and that
05:26should do. The LastSale, the highest value is in the thousands, but I do need
05:32to have two decimal places because they do use the cents.
05:34On the other hand, MarketCap is huge numbers.
05:38It goes up to hundreds of billions, and I don't need decimal places.
05:43I am going to tell that one that it's a Dollar value as well.
05:45I will give it room for a lot of numbers, but no decimal places.
05:51I'm going to click OK, and now I can go back to the Data view and see what we got--and that
05:56looks like the correct format. And now I can simply save this data file as NASDAQ,
06:07and we are good to go.
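For the record, the same import and cleanup can be pasted and rerun as syntax; here is a sketch in which the file paths are hypothetical and the dollar formats mirror the Variable View changes made above:
    * Sketch of the Excel import in syntax; paths are hypothetical.
    GET DATA
      /TYPE=XLS
      /FILE='C:\ExerciseFiles\NASDAQ.xls'
      /SHEET=NAME 'data'
      /CELLRANGE=RANGE 'A1:G2821'
      /READNAMES=ON.
    FORMATS LastSale (DOLLAR12.2) MarketCap (DOLLAR15.0).
    SAVE OUTFILE='C:\ExerciseFiles\NASDAQ.sav'.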
06:09Now I want to show you that SPSS is able to import straight from databases or text files.
06:15In fact, if you go over to File, you will see here we have a command for opening
06:19from a database or reading text data.
06:22Now I am not going to go through those.
06:25Instead, right now, I am just going to point you over to the Help menu, to Tutorial.
06:31When you click on that, this will open up a web browser, even though it's a local
06:34file, and in the Tutorial, I want you to see this one: Reading Data. And in fact, if we
06:41open that up, you can see Reading from a Database.
06:45And SPSS has a tutorial that will walk you through every step that you need,
06:50using a similar procedure to get the data from a database and into SPSS.
06:56And so you see, with the proper preparation, it's a straightforward procedure to
07:01get data from one source--a spreadsheet, a text file, a database--into SPSS, so you
07:07can begin exploring your data and seeing what your numbers can tell you.
2. Charts for One Variable
Creating bar charts for categorical variables
00:00Once your data is in SPSS, one of the best ways to understand it is with charts,
00:04and the most basic kind of chart is a bar chart.
00:07This simply indicates how many people or cases fall into each
00:11particular category.
00:13One of the great developments in SPSS a few versions ago was something called
00:16the Chart Builder, which is a unified interface for nearly every kind of
00:20chart that SPSS can make.
00:22Now I'm going to show you how to use the Chart Builder to create a simple bar chart
00:26to show frequencies, or how common particular categories are.
00:31I'm using a data set right here, this is called Movies.sav.
00:35This is a data set that I and my research colleagues put together that included
00:39the top grossing movies from each of several years, as well as movies that won
00:43awards in several different categories, from the Academy Awards.
00:47What I'm going to do right here is I'm simply going to find out how many movies
00:50in this data set are in each different genre.
00:53Now this is a text variable, and we're going to make a bar chart to show the categories.
00:58I simply come up to Graphs, to Chart Builder, and then by default right here
01:03it offers bar charts,
01:05that's the first one, and I just want the simplest kind possible.
01:08As a general rule, data graphics are designed to communicate, and they need
01:12to communicate clearly, and you want to use the simplest possible kind of
01:16chart that you can make, and a bar chart is a great one.
01:19And all I'm going to do is I'm going to come over to Genre.
01:22Please note it's got the three little circles that indicate it's a nominal
01:25variable, and the A says that it's a text variable, as opposed to the year it was
01:29released, which is also being treated as a categorical variable but it's got a
01:32number underneath it.
01:33So I'm going to just take this out of the variable list and I'm going to drag it
01:37into the canvas, right here under the X axis.
01:41One of the nice things is that the canvas automatically changes the Y axis on
01:45the side to read Count because that's the most common thing I would want to do with a bar chart.
01:49Now I have lots of options here.
01:52One thing I can do, for instance, is I can just simply use the gallery to get lots
01:56of different kinds of charts.
01:57I'm using the basic one.
01:58Now if you can't find what you're looking for in the gallery, you can actually
02:02create a chart out of basic elements.
02:04It's a lot of work and we're not going to cover that one.
02:06There may be situations you want to be able to stick an identifier on a
02:10particular data point, and you can do that here. Or you can add titles and notes.
02:15So for instance, I'm going to put Title 1, "Frequency of Movie Genres in the
02:24Dataset." Easy enough, and I press Apply.
02:27I can make other categories and other titles as well, but I'm not going to worry
02:31about those right now.
02:32All I'm going to do now is come over and press OK,
02:35and when I do that, I get a large amount of output here that is the written
02:42record of a procedure that I just performed.
02:45I get this that says GGraph--
02:47that's the kind of graph we're making--the Source, the data set, and then
02:51here's the graph itself.
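For reference, that written record is a GGRAPH command plus a block of GPL; for a simple bar chart of Genre it looks roughly like this (a sketch; the GPL your version pastes may differ in its details):
    GGRAPH
      /GRAPHDATASET NAME="graphdataset" VARIABLES=Genre COUNT()[name="COUNT"]
        MISSING=LISTWISE REPORTMISSING=NO
      /GRAPHSPEC SOURCE=INLINE.
    BEGIN GPL
      SOURCE: s=userSource(id("graphdataset"))
      DATA: Genre=col(source(s), name("Genre"), unit.category())
      DATA: COUNT=col(source(s), name("COUNT"))
      GUIDE: axis(dim(1), label("Genre"))
      GUIDE: axis(dim(2), label("Count"))
      GUIDE: text.title(label("Frequency of Movie Genres in the Dataset"))
      ELEMENT: interval(position(Genre*COUNT), shape.interior(shape.square))
    END GPL.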
02:53And this shows, for instance, that in this data set, which is based on top grossing
02:56movies and award winners, dramas are more common than anything else, and that
03:00thrillers are the least common,
03:03mostly because a lot of these are drawn from award winners, and thrillers
03:06win those less frequently than others.
03:08Now what I want to show you is there are ways to clean up these charts and to
03:12modify them to make them work a little better.
03:14Aside from the simple fact that I think this is an ugly color,
03:17there are a lot of things that can be done to make this more communicative.
03:20To enter the chart in SPSS, all you've got to do is you come over it and you double-click.
03:26And this brings up the Chart Editor window.
03:29When you're doing your charts, you want to look at the order that the bars appear in.
03:32Now by default, SPSS puts them in alphabetical order, and there may be situations
03:37in which that's appropriate.
03:39However, it's usually easier to read charts
03:41if you sort the data by their values. In this case, I'd like to have the most
03:45common to the least common, and what I'm going to do here is I just click on the
03:49bars and I come over here to the Properties window and it says Categories, Sort
03:55By, at the moment it says Custom. I just want to put it as Statistic, and I'm
04:00going to make it Descending, and I press Apply.
04:04And now I see it goes from Drama, the most common, to Documentary, to Action, to
04:08Foreign, to Comedy, to Animated, to Thriller.
04:11If I want to change the colors of these-- these are still selected--I come over
04:15to Fill and Border, and I can change it to a color that I find a little nicer.
04:19Now personally, I like to use light colors because I feel that it's easier to
04:24see them, but it does not dominate the vision.
04:28And so I've changed these to blue with a blue border as well.
04:31Also if you want to make these ones down here larger, these words, you can
04:36simply click on them and come over to Text Size.
04:39The preferred size is 8 point, which is really small, especially since most of
04:43the time these charts are going to be used in presentations, like in PowerPoint,
04:47where people are going to be sitting 20, 30, 40 feet away.
04:49So you can change these to be 12 point, for instance.
04:54Now what's happened is that SPSS has automatically changed them to a staggered layout.
04:59That's because they'd run over each other, since Documentary is much longer,
05:02and Animated is much longer.
05:03One way to deal with this, and something that I do frequently, is when I have a chart like this,
05:09you can actually come up to the button that says Transpose the chart coordinate system.
05:15If I click on that, it switches the chart so that the labels are on the left and
05:21then the bars go off to the right.
05:23Now one thing that's happened to this is that the most common one is down by the
05:28bottom where the axis is.
05:29That's not helpful in this kind of chart, and so I'm going to click on the
05:33Categories, I'm going to click on the bars, go back to Categories, and instead of doing
05:36them Descending, I'll do them Ascending.
05:40And this puts the most common category on the top and the least common on the bottom.
05:44Also it may be that I don't really feel like I need this word Genre here in the title.
05:49What I can do is I can click on that and I can come over to Labels & Ticks in
05:53the Property window and simply uncheck Display the axis title.
05:58I click that, and the way it works in SPSS is that almost any time you're going to
06:02do anything, you then have to apply it.
06:04I apply it, and that disappears, and I find this to be a much cleaner chart.
06:09And as a bar chart, it displays the prevalence of each category very well,
06:15it puts the categories into a logical order from most common to least common, the
06:19labels are large enough that I can read them, and I've been able to work on this very nicely.
06:23Now once you've set up a chart in a way that you've modified it a fair amount,
06:28if you want to, you can come back to the Chart Editor and click on File and
06:34actually save this as a chart template.
06:36And it gives you the option of saving all of your settings, except
06:41I don't want to save all of the Text Content, so I will uncheck that, and I can say Continue.
06:47And I can simply save it as a Bar Chart Template Transposed, or whatever you
06:55think might be useful for you to find that template again in the future. I'll click Save,
07:00and now I can apply that template on other charts if I want to.
07:04But this is the most basic kind, and truthfully, one of the most informative kinds
07:08of charts, the bar chart, a simple bar chart, two-dimensional, that communicates
07:13the frequency of categories in a categorical variable.
Creating pie charts for categorical variables
00:00In the previous movie, I showed you how to use SPSS's Chart Builder, its unified
00:05interface, for nearly every chart the program can make.
00:07And with it, we made a bar chart.
00:10In this example, I want to show you how to make another kind of categorical
00:14chart, the pie chart,
00:15that's a common choice for categorical variables.
00:17The procedure is very similar to that of bar charts;
00:20however, there are a couple of important differences.
00:23These have to do with the demands that pie charts place on the nature of the data.
00:27These are that the data must be exhaustive and mutually exclusive. What that
00:31means is that exhaustive means all the categories need to cover all of the
00:35possibilities and add up to 100%.
00:38That may require that you create an Other response category or a No Response category.
00:42Mutually exclusive means that each person needs to fall into just one category.
00:47And while there are many situations where the condition of mutual exclusivity
00:51isn't a problem, for instance a person can be born in only one country,
00:56there are at least as many situations where it doesn't work--
00:58for instance, college attended, as many people have attended more than one.
01:02This can create a real limitation in the applicability of pie charts. Also, there
01:07is another issue in that bar charts are pretty easy to read, because you
01:10simply have to be able to judge the length or height of a bar.
01:14That's a linear measure.
01:16Pie charts generally require that a person be able to judge angles and areas,
01:21both of which are rather difficult.
01:23And so these are challenges for pie charts, the demands they place on the data
01:28of being exclusive and comprehensive, and also the interpretability.
01:33Nevertheless, they are very common choices, so I will show you how to do
01:36these quickly in SPSS.
01:38Like all of the other charts we are going to do, you want to start by going up
01:41to Graphs, to the Chart Builder.
01:44From there, on the Gallery list, come down to Pie. Click on that and just drag
01:49the pie up into the canvas. From there I'm going to pick Genre and put that
01:55down right there, and then I can press OK.
01:57Like in the basic bar chart, it's very colorful.
02:01You can see that the yellow slice is the largest of all--that's Drama--and that
02:06the purple is probably the next biggest, and the others are a little bit smaller.
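As an aside, if all you need is a quick frequency pie chart, the classic FREQUENCIES procedure can also draw one without the Chart Builder; a minimal sketch:
    * Legacy alternative: frequency table plus a pie chart of Genre.
    FREQUENCIES VARIABLES=Genre
      /PIECHART FREQ
      /ORDER=ANALYSIS.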
02:10Now there are ways to customize the pie chart in SPSS, but given that I
02:15think that pie charts are generally a little harder to read, I generally
02:19encourage you to try bar charts instead.
02:22But this is another common option for depicting the categorical variable in SPSS.
02:28So creating a pie chart in SPSS is a simple affair, and it still gives a lot of
02:32options to control how it looks.
02:34However, given the challenges of reading pie charts, and the restrictions they
02:38place on the data, you may want to consider using a bar chart instead.
02:42On the other hand, in your corporate culture, pie charts may be the lingua
02:45franca, they may be what's expected, and you may want to introduce some
02:49variety in your charts, so they can be a viable option in SPSS.
Creating histograms for quantitative variables
00:00In the last two movies, we looked at two different kinds of displays you can use
00:04for categorical variables.
00:05We looked at bar charts and we looked at pie charts.
00:08On the other hand, you may also have what SPSS calls a scale variable, also
00:13called a quantitative, or measured, variable,
00:15so for instance the percentage of critics who favorably endorse the movie, or
00:20the budget for the movie, or viewer evaluations, these are all measured as more
00:23or less quantities, and a bar chart and pie chart won't work for these.
00:27Instead, there are generally two kinds of charts that you want to make.
00:31The first one that we're going to do right now is called a histogram, and it's
00:35like a bell curve that shows the distribution of scores.
00:38Let's look at that one right now.
00:40Come to Graphs, to Chart Builder, and from here I come down to Histogram.
00:46There are a few variations, but the one that's most informative is the basic one.
00:50I grab it out of the gallery and drag it into the chart canvas, and from there I
00:55simply need to tell it what variable it is that I want to chart.
00:59In this case, I'm going to use Budget.
01:01I'm going to drag that down into the X axis.
01:04Now by the way, this is not the real data that SPSS is showing.
01:09When it uses a canvas it simply puts in some kind of random data to let you know
01:14that it's not producing a pie chart or something.
01:16Now I have some options here.
01:18One of them is whether I need IDs--I don't think I do--or Titles, and I'm going
01:23to put a title on this one.
01:26And I'm going to put "Budget for Movies in Movie.sav."
01:31And I'll press Apply, and then I'll press OK.
01:35And the Output window first shows the code that produces this one, and you can
01:40save that to rerun this later if you want to.
01:42It shows the name of the command in SPSS. It's GGraph.
01:46It shows the data set that was used to produce this.
01:49That's important, especially if you have more than one data set open at a time, and
01:52this is the chart as it's produced by default in SPSS.
01:56It's called a histogram.
01:57You can see we have a whole lot of movies in this data set that have very small budgets.
02:01This is $50 million, $100 million, up to a quarter billion there on the scale.
02:07And this tells us that there are about 23 movies with budgets in the lowest range.
02:12That makes sense when you consider these are a lot of award-winning movies,
02:15like animated shorts that people may not have seen and that don't require a huge budget.
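As an aside, if you just want a quick default histogram without the Chart Builder, the legacy FREQUENCIES procedure can draw one as well; a minimal sketch:
    * Legacy alternative: histogram of Budget, suppressing the frequency table.
    FREQUENCIES VARIABLES=Budget
      /FORMAT=NOTABLE
      /HISTOGRAM
      /ORDER=ANALYSIS.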
02:21On the other hand, this chart is not particularly attractive, and it's got some
02:25communication problems.
02:26So what I'm going to do is I'm going to double-click on the chart.
02:28Then I'm going to take this information right here with the Mean, the Standard
02:32Deviation, and Sample Size, and I don't need that in the chart.
02:34I may need that information elsewhere, but I don't need it here.
02:37So I'm going to click on it and then I hit Delete.
02:39Then I see over here I have frequencies with decimal points on them, and I
02:44don't need that there.
02:45That's kind of silly. So I can click on that and then come over here to Number
02:50Format and I can put it to zero decimal places.
02:55Then, here across the bottom, these are millions of dollars and truthfully these
02:59numbers are hard to read, because there are so many digits there.
03:03What I can do is I can click on that, and I can come to Number Format, and I can
03:07go to Scaling Factor here, and I put it as Millions, and I press Apply.
03:13And now it's much easier to read, but I need to change this one.
03:16It says, Budget. I just click on that and I'll say, Budget in Millions.
03:21Now there are two other things I want to do here.
03:23Number one is, I find this to be in a very unattractive color, so I'm going to
03:26click on it, and since it's money, I might as well use green for my charts.
03:31There is a little curiosity here about the fact that we have three bars
03:36for every $50 million.
03:38Now there are some general guidelines for the number of bins that you should have
03:43in a histogram. These are bins, how wide each bar is.
03:47And we've got some gaps here, which means we might need a few more bins to help
03:51smooth out the pattern.
03:53Again, the idea here is that every chart, including histograms, is meant to be a
03:58simplification, an abstraction of the data.
04:01It needs to be informative and accurate, but it is a simplification.
04:04So sometimes reducing the number of bins can make it easier to see the patterns
04:08without getting overwhelmed by the complexities of the real data.
04:12So what I'm going to do is I've got these selected already, I'm going to come
04:16over to the Properties window, click on Binning, and then I'm going to
04:19come down to Custom, Interval Width. And what I'm going to do is I'm going to
04:24make it so that there are two bars instead of three for each one of these,
04:26so they are each 25 million wide.
04:30I believe that's 25 million.
04:32And now we have just two bars per gap, and it smoothes things out a little bit.
04:36And what you can see is that most of the movies in this particular data set of
04:40award winners and top grossers have budgets between 0 and 25 million.
04:44There are some very low-budget movies.
04:46These, again, are the short movies and some animated movies, and then we have some very
04:51large summer blockbusters with budgets of $150 million or $200 million.
04:56It's a good way of seeing what the distribution is like.
04:59When I'm done modifying the chart, if I want to, I can come to File and I can
05:04save that template and I can use it again later.
05:07I'm not going to do that right now.
05:10And then when I'm done editing the chart, I can simply press the X and close the
05:14chart, and there's my finished chart that I can export later.
05:17And again, a histogram is the first of two charts that you should generally use
05:22when you're looking at scale data.
05:24The other one, which we'll cover in the next movie, is a box plot, which is
05:27ideal for looking at outliers in distributions, which we appear to have in
05:31this particular one.
05:32But both of these charts are a great way of getting a feel for the shape of
05:36a distribution of a scaled variable, and give you a better idea of how
05:40well you meet the statistical assumptions of tests that you're going to be
05:43performing later on them.
Creating box plots for quantitative variables
00:00When you're looking at what SPSS calls a scale variable--that's something that
00:04can be measured as more or less, like the percentage of critics who gave a
00:08favorable rating to a movie or the budget or the box office earnings for that
00:12movie--you should generally make two kinds of charts.
00:15The first one, which we did in the last movie, is called a histogram.
00:19It's like a bell curve, and it's a good way of getting a feel for the overall
00:22shape of a distribution.
00:24The second kind that you should generally make for a scale variable is called a
00:27box plot, and its primary purpose in this context is to check for outlying
00:32scores, because they can cause a lot of problems in later statistical analyses.
00:37So you need to be able to identify whether you have outliers and often
00:41what those outliers are.
00:43So what I'm going to do now is I'm going to create a box plot for budget, which
00:47we used in the last movie on histograms.
00:51Come up to Graphs, to the Chart Builder, and from there I come down to the list, to Boxplot.
00:56There are several different versions of box plots.
00:59I am going to choose the simplest one possible.
01:01That's this one over here, which is called a 1-D Boxplot.
01:04It's for charting all of the cases on a single variable.
01:08If I wanted to break down budgets by a genre of film, I could do that over here,
01:14under what's called a Simple Boxplot, but it's grouped, and I will show that in a later movie.
01:18But right now I'm simply going to drag the 1-D Boxplot up to the canvas, and then
01:24I'm going to bring in budget to the Y axis.
01:28This is the general format of a box plot.
01:30I will explain more when we look at the finished version.
01:34But I am going to do a couple of things.
01:37Number one is I may want to identify points.
01:40I click on Point ID Label, and then I can actually get the movie name and I can
01:45drag that into here,
01:46so if I have unusually high or low points, it will actually tell me what the movie is.
01:51It makes life easier.
01:53I can also put titles on.
01:54I will have a title, and I will put Boxplot of Movie Budgets.
01:59Then I will press Apply, and for both of these I can now press OK over here.
02:06And what comes up is this particular chart.
02:09This is the syntax that produces the chart.
02:13This is the name of the command,
02:14this is the data set, Movies.sav, and this is the Boxplot of Movie Budgets.
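For reference, a boxplot of a single scale variable can also be produced outside the Chart Builder with the EXAMINE (Explore) procedure, which flags outliers in the same way; a minimal sketch, assuming the movie-name variable is called Name:
    * Alternative route to a boxplot of Budget, labeling outliers by movie name.
    EXAMINE VARIABLES=Budget
      /ID=Name
      /PLOT=BOXPLOT
      /STATISTICS=NONE
      /NOTOTAL.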
02:19What you have here is budgets ranging from 0--
02:23there's actually nothing with 0-- up to $250 million for the movie.
02:27This is from a few years ago.
02:30And this box right here shows the quartiles of a distribution, and this is the
02:35minimum value of any movie in the data set.
02:40This right here is the highest non-outlying value, and I say non-outlying because
02:45we have two outlier movies.
02:47In this particular data set, Spiderman 2 and King Kong both had budgets of
02:53approximately $200 million.
02:56On the other hand, this box down here shows you the median, that 50% of the
03:01movies--there were 61 in this data set, so 30 of them--had budgets beneath this,
03:06which is around $25 or $30 million, and half of them were above.
03:11Now, I am going to show you a few ways to modify this chart that I think will
03:16make it a little easier to deal with.
03:17As with every chart in SPSS, you modify it by first double-clicking on it to activate it.
03:24That brings up the chart in a Chart Editor window and it brings up a Properties
03:28window to the right.
03:29Now, one thing that I personally like to do is I like to turn these charts
03:33sideways by coming up to the button bar and clicking on the button that says
03:37"Transpose the chart coordinate system."
03:40The reason I do this is because the other charts that we make up, like
03:43histograms and like the scatter plots that we will show later, they have these
03:47variables listed across the bottom, with the lowest value on the left,
03:51highest value on the right, and I find it helpful to be consistent in this particular way.
03:55I'd like to change the color of the chart.
03:58I click on the box, come over here to change the fill, and then the border
04:03I can change to another color if I want.
04:05I can change the way these bars work at the end.
04:08These are sometimes called whiskers.
04:10They go to the lowest and the highest non-outlying value.
04:14In case you're wondering, outliers are determined by being more than one and a half
04:19times this middle range above or below the box.
04:22What we're going to do is I'm going to change the way these whiskers are.
04:26This is just a preference issue.
04:27I click on that, and I come over here to Bar Options, and I am going to change it
04:32from a T-bar to what's called a Whisker.
04:34It's just a line at the end.
04:37And then here, if I want to, I can actually change the way that these look at the end.
04:43I have the movie labels there as well.
04:45Finally, if I want to change the Axis labels here on the bottom, like I did with
04:48the histogram where I changed these to millions of dollars, I click on the
04:51numbers, and I come over to the Properties window, to Number Format, and the
04:56Scaling Factor here, I'm going to put in millions.
04:59I am going to press Apply, and this now gives me millions of dollars.
05:04And I need to change this--
05:05it says Budget--to say Budget in Millions.
05:08I can close the chart, and now I have a good depiction showing that the overall
05:15distribution is on the low end, because this is a set of movies that included award
05:19winners, that half of the movies have budgets of 30 million or less, but they go
05:26up to about 150 million, and that in this particular data set we had two outlier
05:30movies--Spiderman 2 and King Kong--that had unusually large budgets, as is
05:34common among summer blockbusters.
05:37Anyhow, when you're looking at a scale variable like budget, like viewer
05:41evaluations, like time spent on tasks, like time spent viewing a web site, then
05:47you do want to look at both the overall shape of the distribution with the
05:50histogram and you want to check for outliers, and a box plot is an ideal way
05:55to do that.
3. Modifying Data
Recoding variables
00:00Many times your data won't come in exactly the form that you need it for analysis.
00:06For example, you may have groups that need to be combined, or you may have
00:09outcomes that need to be counted or scores that need to be reversed to be more
00:12interpretable in your results.
00:14All of these fall under the general rubric of recoding variables.
00:18There are several ways to do this in SPSS.
00:20The first way that I want to show you in this particular movie is what you might
00:24call a manual recode.
00:25And the way you do this is by coming up to the Transform menu and then you
00:31select either Recode into Same Variable or Recode into Different Variables.
00:35Now let me give you a quick warning here, when you recode into the same variable
00:40you're overwriting existing data, and while that may save some space,
00:45if you make a mistake in the recode, you will not be able to go back to what you
00:50had before. And for that reason, I recommend that you almost always recode into a
00:55different variable, which is what I am going to do in this particular case.
00:59By the way, the one I'm going to look at is this one here at the end.
01:03It's called In the Past 30 Days Have You Felt Worthless? and there are several
01:08responses that go from Never to Almost Every Day and what I am going to do in
01:14this particular one is I am going to recode it as people who have never
01:17versus at least sometimes.
01:19So I am going to be taking all of the answers above zero and making them into
01:24a single Yes code, meaning they have felt worthless at least at some point in the past few weeks.
01:30So what I do is I start by taking this variable.
01:32It goes in under Numeric Variable, and it's FeelWorthless, and I am going to create a new one.
01:36I call it EverFeltWorthless, because the other one asked about how often;
01:39this one is going to be "Have you ever?" I am going to put in the label for
01:42this one, and I am going to call it Has Ever Felt Worthless. And I click Change, and
01:48now it shows that FeelWorthless will be recoded into EverFeltWorthless.
01:52Then what I need to do is I need to specify the old and the new values for the recode.
01:57Well, what I am going to do in this one is I am going to take zero, and that's
02:01going to stay zero, so those are the people who said they never felt worthless--
02:05that's going to stay that way--but then what I am going to do is I am going to
02:08specify a range, and I am going to put anything from 1 through the highest value, so
02:15that's 1s, 2s, 3s, and 4s, and any of those will become a 1.
02:20Now this new one I am creating is going to be called an Indicator Variable.
02:23That's a 0/1, yes/no variable.
02:26It's a good way to do it because it allows you to also do certain numerical
02:31statistical procedures with it.
02:33Now if I wanted to set up a more detailed correspondence, I could.
02:36Say for instance, I had a variable in an opinion survey that was coded as 1
02:41strongly disagree, up to 5 strongly agree, but then it was reverse-coded
02:46so that, for instance, in this particular case, people are talking about what they did not like.
02:51In order to make things consistent, I may need to switch it around, and I may need to
02:54switch 1 to 5, 2 to 4, 3 stays the same, 4 to 2, and 5 to 1.
03:00I can do that by putting in each one of these manually, but because I have a
03:04pattern here where I am putting 0 stays 0 and everything else goes to a 1, I
03:08can do this particular method.
03:100 stays a 0, but everything else goes to a 1.
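Written out as syntax, a full reverse-scoring correspondence like the one just described would look something like this (Q1 and Q1_rev are hypothetical names, purely for illustration):
    * Hypothetical example: reverse-score a 1-to-5 agreement item.
    RECODE Q1 (1=5) (2=4) (3=3) (4=2) (5=1) INTO Q1_rev.
    EXECUTE.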
03:16Now that that's done, I can press OK and this is the syntax statement.
03:22The command is RECODE. It says FeelWorthless (0=0), and everything else equals 1, into
03:28the new variable, and then it has a label for the variable: the short name is
03:33EverFeltWorthless, and the label turns that into a sentence, or phrase.
03:37And then the EXECUTE means it actually ran the command.
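Spelled out, that pasted recode looks roughly like this (a sketch; the generated syntax in your output may word the label slightly differently):
    RECODE FeelWorthless (0=0) (1 THRU HIGHEST=1) INTO EverFeltWorthless.
    VARIABLE LABELS EverFeltWorthless 'Has ever felt worthless'.
    EXECUTE.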
03:40Now, you don't see anything else here because this doesn't produce a graph. It adds a column.
03:45It adds a variable to the data set.
03:46So if we go back to the data set and I go to the end, now you'll see a new
03:52variable here called EverFeltWorthless, and it's made out of 0s and 1s.
03:56Now I need to do a couple of things to clean this up here.
03:58Number one is it's got these decimal places that I don't need, because I don't
04:03have any 1.5s, I just have 1s and 0s.
04:04So I am going to come down to Variable view and I am going to change that to
04:10have 0 decimal places.
04:11Also, I want to indicate that the 0 means no and 1 means yes,
04:16so I am going to come over to Values, click on that, click on the little
04:20box here, and type in the values. I am going to put in a 0 and say that that means No.
04:26Click Add, and then I come back up to 1, and the Label is Yes.
04:31I click OK, and that adds the labels.
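Both of those cleanup steps also have simple syntax equivalents; a sketch:
    * Drop the decimals and attach value labels to the new indicator variable.
    FORMATS EverFeltWorthless (F1.0).
    VALUE LABELS EverFeltWorthless 0 'No' 1 'Yes'.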
04:35Now I go back here. I can see those on here now.
04:39So what I have done is I've taken an existing variable and if I click back here
04:44on the Value Labels button, you can see that I had values from 0 through 4 that have
04:50all become 0s and 1s.
04:53So I've gone from something that had a very small number of people on the
04:56high end to groups that are slightly larger and easier to work with:
05:00people who said they ever felt worthless and people who never did.
05:03This, by the way, is from the General Social Survey.
05:05It's a national survey of people across the country of all age ranges.
05:09And this is one way to do this recoding to get a variable that's more useful in
05:16particular analyses.
05:18Now in the next couple of videos I am going to show you how to use something
05:21called visual binning and then something called ranking, and those are two other
05:25methods of taking the information that you have and putting it into a system
05:30that would work better for the analyses that you are going to do.
Recoding with visual binning
00:00In our last video, we talked about one method of recoding variables, or taking
00:05the data in its existing format and changing it into another that may be more
00:11amenable to a particular graphic or a statistical analysis.
00:14In the last movie, we looked at what might be called Manual Recode by using the
00:19Transform command to recode into a different variable.
00:24In this movie, we are going to look at another one that's called Visual Binning.
00:27It's one of the more attractive features of SPSS.
00:32We do this by coming up to Transform, coming down into Visual Binning. And you
00:39take a variable that has a wide range of scores--in this particular one, I'll
00:43take Age and I'll put that into Variables to Bin and press Continue.
00:50And what this shows me is the age range of the people in this particular sample.
00:55It goes from a minimum of 18 to a maximum of 87 years old.
01:00This is a national sample of adults and so this isn't surprising.
01:05Now, there may be times when I want to break this down into groups.
01:09For instance, I have one particular procedure where you'd like to take variables
01:14like this and break them into five even groups that are called quintiles,
01:18even meaning it's the same number of people in each group.
01:22The Visual Binning is a perfect way to do this.
01:25Now I need to do something right here.
01:27We are going to be creating a new variable, and it already suggests the name Age
01:31(Binned) for the binned version.
01:33I am just going to call that, Age_Bin.
01:37And then what I do is I need to come down and have SPSS create cutpoints or
01:44different ways of separating the distribution.
01:46I come down here to Make Cutpoints, and I can tell it to make the intervals of
01:54even sizes, say for instance the 20 to 30 year olds, the 30 to 40 year olds, and so
01:59on, and that's one possibility.
02:00And maybe I would want to do that.
02:02I could say let's start the first one at 20 and then do it every 10.
02:06The one I'm thinking of is where I want to create five equal-size groups, so I
02:11need four cutpoints to create five groups.
02:14See, right here it says, "N cutpoints produce N+1 intervals."
02:18And so what I'm going to do is I am going to create four cutpoints, so that each
02:23group will have 20% of the sample, because there are five groups total, and that's 100%.
02:27I click OK, and what SPSS has done here is put in dividers so that each group
02:37has the same number of people.
02:39Now, some of these dividers will be closer together and some further apart,
02:43depending on how many people fall in that range of ages.
02:45So for instance you see in the 30 to 40 range, they're pretty close because
02:49there's a lot of people right there,
02:52similarly in the 40 to 47 group. But we have from 62 on up to get the
02:57same number of people.
02:59Now, these are automatically created.
03:01It may be, however, that I look at them and say that yes, these are exactly
03:06equal groups, with the same number of people in each one, but I may want them to
03:09be slightly different.
03:11Maybe I don't want to have the last group start at 61; I think that
03:14sounds a little silly.
03:15Maybe I'd want to change it to be exactly 60, and I'd want the other ones to
03:19change to be slightly different.
03:20So I can actually grab them and move them, ever so slightly, to be what I want them to be.
03:33Or I could try typing them in, to make sure they get exactly where I want them.
03:37I could change that to 40. I could leave the 47 where it is. I can double-click
03:45that one and change it to 60, and the last group is everything above that.
03:50And now I've got the cutpoints, and these are approximately equal groups; I
03:53changed them only slightly.
03:55Another neat thing is this is going to create a new variable called Age_Bin
03:58and these are the values, 1, 2, 3, 4 and 5, because I have created five different groups.
04:04I can also create labels automatically by clicking on Make Labels right here, and
04:09when I do that, it says that the first group is less than or equal to 30, then
04:1331 to 40, and so on, up to 61+. And all I need to do now is press OK, and it tells me
04:22that it has created one new variable in my data set.
04:28This is the syntax history of the command. If I were to write it out as code,
04:33this is what it would look like.
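As a hedged sketch, the pasted syntax for Visual Binning is essentially a RECODE with cumulative cutpoints; the exact form can vary by SPSS version, and the cutpoints and labels below are the ones chosen in this video (30, 40, 47, 60).

    * Each band is defined by LO THRU cutpoint; later bands catch what earlier ones did not.
    RECODE Age (MISSING=COPY) (LO THRU 30=1) (LO THRU 40=2) (LO THRU 47=3)
        (LO THRU 60=4) (LO THRU HI=5) INTO Age_Bin.
    VARIABLE LABELS Age_Bin 'Age (Binned)'.
    VALUE LABELS Age_Bin 1 '<= 30' 2 '31 - 40' 3 '41 - 47' 4 '48 - 60' 5 '61+'.
    FORMATS Age_Bin (F5.0).
    EXECUTE.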
04:34But if I go back to my data set, I come to the end, and I see that I have a new
04:43variable here called Age_Bin that has the numbers 1 through 5 in it. And if I go
04:48straight above here to the button bar and click on Value Labels, you can see the
04:52label that shows the age range for each group.
04:55And so the Visual Binning procedure is a wonderful feature of SPSS that allows
05:00you to create a new variable by grouping people on another scaled variable.
05:06This can save a lot of time when you're trying to create groups of particular
05:09sizes or split things up into particular intervals, like every 10 years.
05:14And so this is the second method of recoding variables that we're looking at.
05:19I am going to show you another one, called ranking variables, which
05:23works in a pretty predictable way. But between the three of those, you should be
05:27able to do a fair amount in terms of getting the data into the form that you
05:30need for your statistical analyses.
Recoding by ranking cases
00:00In the last two videos we looked at ways that SPSS offers for
00:04recoding variables.
00:06For instance you can take a variable that comes in one particular form, like the
00:10words male and female, and recode it into another variable that has zeros and
00:15ones, an indicator variable that's useful in a lot of other analyses.
00:19Or we could do something called Visual Binning where we take ages and we create
00:24groups of ages to get, in this particular example, categories with approximately
00:28the same number of people, five categories or quintiles.
00:31A third option that SPSS offers that I am going to talk about right now is a
00:35particularly popular one.
00:37It's called ranking, and all it does is rank people from first to last on a
00:42particular variable.
00:43So for instance in this example I'm going to take the Age variable again and I'm
00:47going to rank people from the youngest to the oldest.
00:51Now what this does is it numbers people from first to last.
00:56Theoretically, it could number people from one to 349, because that's how many cases I have.
01:01However, we do have tied values, people with the same age, and I'll show you how
01:06SPSS deals with that when doing a recode by ranking scores.
01:10What I am going to do is I am going to come up to Transform and come down to Rank Cases.
01:16Then I am going to pick the variable that I want, in this case it is Age,
01:21and move it over here.
01:22And you'll see you can do more than one at a time if you wanted.
01:26We could also get summary tables.
01:28You also get to decide whether you wanted the first place, the number one, to be
01:32the smallest or the largest value, and in this case I'm going to give the one to
01:37the youngest person, so I am going to leave it at the smallest value.
01:40However, there are several ways of dealing with rankings.
01:44The first one is just a straightforward normal Rank.
01:47So it would go from 1 to, for example, 349.
01:51On the other hand, we can also have something over here that's called a
01:54Fractional rank as a percentage, and this would be like percentiles.
01:58So if you've taken a test, you know that you can score in the 95th
02:02percentile. You might not even know what the highest score was, but you
02:06know where your score stands relative to others.
02:07We can do the same thing with Age here.
02:10This would give people percentile scores on their age. Are they the oldest,
02:13youngest, in the 80th percentile, or so on.
02:16Similarly, I have the option of creating Ntiles or quartiles or quintiles, like I
02:21did in the last one.
02:22I could have done this instead by telling it to create five equal groups.
02:26If I clicked on this one and put 5, it would do the five equal groups, which was
02:30sort of what I was doing in the last one.
02:33The Savage score and the Sum of case weights, as well as the Proportion estimates
02:37and Normal scores, are rather sophisticated options, and I don't think that we need
02:40to get involved in them here.
02:41I want to do the simplest form of ranking at this moment.
02:45So I am just going to leave it at the default, Rank, and press Continue, but I then
02:49need to decide what to do with tied scores.
02:52I've got a few options.
02:53Number one is to give them the mean.
02:55So if I have people tied for seventh, eighth, and ninth, it would give all of
02:59them a rank of eighth. Or I could have it give them all a rank of seventh or all a rank of ninth.
03:07And so there are a few different options.
03:09I think what I am going to do in this one is I am going to give ties the lowest rank.
03:14So tied ages will all get the same, lowest rank in this particular example.
03:18The Mean would make sense in other cases, but for Age, I think assigning ties to
03:22the lower rank is the better choice.
03:24So I am going to press Continue.
03:27Now I also have an option of breaking things down by some other category.
03:31For instance I could do Gender, where I have people ranked as oldest to youngest
03:35for men and similarly for women.
03:37I am not going to do that in this case, but that is an option.
03:40It would still create a single column of ranks. It's just I would need to
03:44separate them later by gender when I did them.
03:46So all I need to do now is press OK and it tells me that it has created a new
03:52variable from Age to Rank, and it's called RAge, R for Rank, Age, and it has a
03:58label on there. And if I go back to the data set, I can see it right here.
04:03If I hover over that, I could see that it's called a Rank of Age, and then
04:07here I see the ranks.
04:08I can scroll up and down.
04:10I see that I don't need the three decimal places.
04:13If I had used average (mean) ranks, I would probably need them.
04:16So what I am going to do is I am going to come back over to Variable view, go
04:20down to RAge, and just remove the three decimal places.
04:24And when I do that, I have everybody ranked from youngest to oldest.
04:29In fact, if I want to verify how this works, I can just right-click on this and
04:34I can say Sort this ascending,
04:37so the lowest scores will be at the top.
04:39And you see, for instance, that these people here all fall into the 30-and-
04:44under group, which makes sense because they should be the youngest.
04:46As the ranks go up, I get people in the 40s, then 61 and up, which is the highest
04:54group, and that confirms that the rank performed the way I had intended it to.
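As a rough sketch, the pasted syntax for this ranking step looks something like the following, assuming the defaults shown in the video (rank 1 goes to the smallest value, and ties are assigned the lowest rank):

    * Rank Age in ascending order; ties all get the lowest rank; the new variable defaults to RAge.
    RANK VARIABLES=Age (A)
      /RANK
      /TIES=LOW
      /PRINT=YES.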
04:59And so the ranking of cases is a third option for recoding, along with the manual
05:04recode that we did earlier, as well as the Visual Binning.
05:07And it can be a good way of making sure that your data both meet the assumptions
05:12of a statistical test and fall into a form that's easier to show in
05:17graphs and analyses.
05:19Ultimately, it makes the results easier to communicate with other people, which
05:23is the goal of a statistical analysis.
Computing new variables
00:00When you enter or import data into SPSS, you may want to know a person's average
00:05score on a series of variables, but it's usually a good idea to bring in the raw
00:09data and not a summarized version.
00:11That way you can recode or modify from the original information.
00:15Also, some procedures, such as calculating something called the internal
00:20reliability of a questionnaire, may require the complete raw data.
00:25Once you bring the data in though and recode it as necessary, you can then
00:28compute the average scores, or a maximum, or spread, or whatever interests you,
00:33using SPSS's extremely flexible Compute command.
00:37I am going to do this using the GSS data set, which asks people whether in
00:42the last year they had seen a classical music or opera performance,
00:47attended a live performance of pop music, attended a dance
00:51performance, seen a live drama, or even just read a novel or poem or
00:56a play, or seen art.
01:00And so we have here a series of sort of cultural indicators, and one thing we
01:04might want to do is add up how many of these things people say they've done to
01:09get a rough index of cultural involvement.
01:13One way to do this with Compute variable is to simply add these up, and I can do
01:17that even though it says the words yes and no here. If I come back up to the
01:22button Value Labels and click on it, you can see that I have zeros and ones
01:27underneath, and the nice thing about that, and this is why we prefer the
01:31indicator variables, is I can simply add them up.
01:34I can simply get a sum for these variables and find out how many of these things people have done.
01:40I'm going to first create a space for this variable. Now you normally
01:44don't need to do this.
01:45It would simply add the variable at the end.
01:47But I'd like the variable to be right here next to the other ones,
01:51so what I am going to do is I'm going to come to the end of that list and now at
01:56Happy, I am going to right click on it and insert a new variable. And I am going
02:01to double-click on that variable to edit it.
02:03That brings up Variable view; I click right here under Name, backspace, and
02:11change the name to ArtTotal.
02:15I can leave the width at 8.
02:17I'll change the decimals to 0, because these are all integer values, and I'll add
02:22a label, Art Forms Participated.
02:25I'll also change this over here to a Scale variable, and I will set its Role to
02:31Both, so I can use it as either an Input or a Target variable.
02:36Come back to the Data view, and I'll save the data. And now what I am going to do
02:42is I am going to create a command that will add up these 1, 2, 3, 4, 5, 6
02:49variables and create a score here,
02:50so it'll go from zero to six.
02:54Go to Transform, to Compute Variable, and it asks me for the Target Variable.
03:00Now I've already created it,
03:02so I can simply write here ArtTotal.
03:07And then it's going to ask for a numeric expression, that is, a formula.
03:11Now you can get very sophisticated formulas in SPSS.
03:14For instance, I can get an exponent, or I can do the modulus.
03:21In fact in a couple of videos from now, I am going to show you how to use the
03:25logarithmic function as a way of dealing with outliers.
03:28But all I really want right now is a very, very simple one.
03:31All I need is the sum.
03:32I'll go to Function group. Then I come down here in the Functions and Special
03:37Variables list till I find Sum, and if I double-click on that, it adds it to the
03:44numeric expression and then asks what it is that I'm going to be adding up.
03:48I can back up and remove those, and then I can select the variables that I want
03:52to be included in the sum.
03:54I want this variable, SawClassical, and I can add each one of these with a
04:00comma between them.
04:01I can go like this, and I can add another one. But because these variables are
04:07sequential in the data file, I can actually use a shortcut expression.
04:10I can just list the first one, then put a space, write the word "to", another space, and
04:16then the last one, "SawArt."
04:19And once I have that, it says to add up the scores on all these variables.
04:23Because they are 0/1 indicator variables, the sum will simply be how many of
04:27these did people say they've done in the last year.
04:30If I want to, I can make it so that it only calculates it for particular cases,
04:33for instance for just men or for people who are over a particular age. I don't
04:38need to do that, so I am going to leave it alone. I'll just press OK.
04:41And it asks me if I want to change the existing variable.
04:43Now, there's nothing there because I created a blank variable,
04:46so I can just click OK.
04:48It writes down that it did COMPUTE, that the new variable ArtTotal is equal to
04:53the sum of SawClassical to SawArt, and then execute to actually create that.
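Written out, that pasted syntax is simply the following (using the variable names from this video):

    * Sum the 0/1 indicators from SawClassical through SawArt into ArtTotal.
    COMPUTE ArtTotal=SUM(SawClassical TO SawArt).
    EXECUTE.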
04:58When I go to the data set, I see a new variable right here with scores from 0,
05:02there is a 5, I don't know if we have any 6s, I can check that out.
05:06But now I've created a new variable that combines the results of these various
05:12cultural indicators to give me a single variable that I can use in further
05:16analyses, a way of correlating with other variables and trying to get an idea of
05:22who might be more or less involved in arts and cultural activities.
05:26And so the Compute command is a very flexible one, a great way of reshaping
05:31the data to get it into the form that is most useful for your particular
05:35analyses.
Combining or excluding outliers
00:00When you start looking at your data, one of the problems you might have to
00:03deal with is outliers. These are extreme scores, like somebody who is 7 feet
00:08tall or somebody who has 26 children, or unusual categories, like being Nepali
00:14or a Latin poetry major.
00:16Now sometimes these unusual scores or categories are inherently interesting, like
00:20with world records or gifted and talented programs in schools.
00:24In other situations, however, they can wreak havoc with statistical procedures
00:28that might be designed to look at general patterns, or overall trends.
00:32In the latter case, where you may be interested more in common scores than
00:36in uncommon scores, you have a few choices on how to deal responsibly with the outliers.
00:42Now the first question is how to define outliers.
00:45Now we've already looked at one way of getting a graphical definition of
00:49outliers on a scale variable, and it's with a box plot.
00:52I am going to come up to Graphs, to Chart Builder, to Boxplot. I will drag in
00:59the 1D Boxplot, and let's look at Market Capitalization.
01:04Also, because we have convenient stock symbols over here, I am going to ask for a
01:11Point ID so I know who the outliers are. I will just drag that over here and
01:17press OK, and what we see is that the variable for Market Capitalization is
01:23extraordinarily skewed, and in fact this is often called pathologically skewed.
01:27We have Apple here with over $300 billion in market capitalization,
01:31Microsoft, Oracle, and Google, and it just goes down. And we have this huge
01:37number of companies that are stuck in a tiny level of market capitalization
01:41relatively speaking.
01:42In fact, we can't even tell where the median or the mean is, because the other
01:47scores all get squished together.
01:49There are about 2,800 companies in the NASDAQ listing, but these extreme
01:54outliers squish all the others so much
01:56that it is not possible to really see what's going on.
01:59So we know that we have outliers here on a scale variable.
02:02Now on a categorical variable, like for instance ethnicity, a working
02:08definition of a categorical outlier is any group that has, for instance,
02:13less than 10% of the overall sample.
02:18In that situation you have the choice of combining them with other categories and
02:22creating a sort of Other category, except that it tends to be a very heterogeneous
02:27group. That, or you simply don't analyze by that variable in the future.
02:31But let's talk about what to do with a scale variable.
02:35Now if you don't have very many outliers, or they're not very far
02:39away, you can leave them in. You could take them as legitimate values and you
02:44could proceed with that understanding, as long as you communicate it
02:48adequately with others.
02:50On the other hand, another choice is to exclude them.
02:54Now I don't necessarily mean delete them permanently from the data set, but you
02:58can create a selector. We've done this before.
03:00I should just mention right here, this is $100 billion, and we still have a
03:04huge number of companies right there.
03:05I am going to select a much smaller number.
03:07I am going to go to $100 million capitalization.
03:10So I am going to go to Data, to Select Cases.
03:14Select cases if the market capitalization is less than 100 million, and press Continue.
03:24Now I have the option of just filtering them out,
03:27which creates a new variable that temporarily excludes them, or of deleting them
03:30permanently, and I don't want to do that.
03:32I am just going to filter them out right now.
03:34So I am going to press OK, and it tells me that it has done that selection. And in
03:38fact, if I go back to the data set I will see that the large cases, for instance
03:43Apple, have been selected out. There is a variable here at the end now.
03:47There's a filter variable, and if I click on the value labels, I can see there
03:51are cases that are selected or not selected.
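As a minimal sketch, assuming the variable is named MarketCap and is stored in raw dollars, the filtering step pastes syntax along these lines:

    * Create a 0/1 filter and switch it on; unselected cases remain in the file.
    COMPUTE filter_$=(MarketCap < 100000000).
    FILTER BY filter_$.
    EXECUTE.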
03:54And now I am going to go back, and I am going to do my box plot all over again.
03:59All I have to do is press OK, but this time I don't have any outliers.
04:04In fact, this is a pretty normal-looking box plot.
04:07I can see that for these smaller NASDAQ companies, the median level of market
04:12capitalization is around $40 million.
04:15The first quartile, the lowest 25%, have $20 million or less, whereas
04:22three quarters have about $60 million or less.
04:25There are of course hundreds of outliers above these, but this gives a nice
04:29picture of what you might call the small-capitalization market.
04:33Anyhow, the ability to either combine groups or to temporarily exclude outliers
04:39is one good way of dealing with them, as long as you can justify your choices.
04:44Again, that gets back to a general statistical principle that you can do
04:48whatever you feel is most appropriate and that serves your purposes in
04:51telling an analytical narrative. You're telling a story about your data, and
04:56if temporarily excluding cases or combining them with other groups serves
04:59your purposes best, then go ahead and do that, as long as you can justify your
05:03decision to others.
05:05Now, in the next video I will look at another way that does not exclude the cases.
05:10It leaves them all in, but changes them by doing what's called a transformation,
05:14to let you use all of your data and see if you can still find a way of telling
05:18a coherent narrative that way.
Transforming outliers
00:00In the last video, we talked about a few relatively simple ways of dealing with outliers,
00:05that is, either leaving them in, if it can be justified; rolling them into other
00:10categories, but at the risk of a heterogeneous group; or deleting them or
00:14selecting them out temporarily of the analyses.
00:17Now while these approaches may make sense if you don't have too many outliers,
00:21say for instance no more than 2% or 3% as a rough estimate, they also do some
00:27damage to the data and can cause you to lose cases, and you may have worked very
00:31hard to get those data.
00:33So another alternative if you have a scale variable is to perform a mathematical
00:38transformation on the data.
00:40What this does is it modifies all the scores in the variable, generally
00:44creating a new variable in the process, using a set formula.
00:48Now people are very familiar with transformations, such as multiplying or adding
00:52or subtracting a certain amount, and that's taken as common practice.
00:56What we're going to be doing in this case, the most common approach for
01:00distributions that have a few extremely high scores, like the market
01:04capitalization one that we looked at in the last one, is to take the
01:07logarithm of the scores.
01:09Now you may remember logarithms from junior high.
01:12These have the effect of bringing in extremely high scores.
01:16So for instance, the base-10 logarithm of 10 is 1, the logarithm of 100 is 2, the
01:23logarithm of 1,000 is 3, and it brings in the scores in a predictable way.
01:29And this is a legitimate way of dealing with outliers, as long as you always
01:35specify that you were dealing with the logarithms from this point on.
01:39On the other hand, if you have unusual scores at the low end of the distribution,
01:44you might want to try squaring the scores, because what that does is it pushes
01:47all the scores up but pushes the higher ones even further.
01:51Now in both situations this assumes that you do not have zeros or negative
01:56scores, you have all positive scores.
01:58There are other ways of dealing with those. You can add a constant to them, but we
02:01don't need to deal with that right now.
02:04What I'm going to do is I'm going to look at the market capitalization data that
02:08we had in our last data set. Now I had filtered the data down to cases under $100 million in market capitalization.
02:14I'm going to undo that filter right now.
02:16I'm going to Data, to Select Cases, to say please use all of them.
02:24And so now it just tells me that the filter is off, and you can see that none of
02:29them are selected out anymore.
02:30And I'm going to come back here and let's take another quick look at the box plot
02:36for market capitalization that we did before.
02:42We have an extremely skewed distribution.
02:45Now let's try to find if doing a logarithm could help make this a little less skewed.
02:52What we do is we come to Transform, to Compute Variables, and I'm going to create
02:58a new variable called LogMarketCap, and that's pretty easy.
03:04It is going to be the logarithm of the market capitalization.
03:08Now we have two choices for logarithm. Lg10 is what's called the base-10 logarithm.
03:13It takes the number 10 and raises it to a particular exponent to get a number,
03:17and that exponent is the logarithm.
03:19There's also the natural logarithm, which is based on e, 2.71828 and so on,
03:24an irrational number.
03:28And while there are very pleasing aesthetic aspects to the natural logarithm,
03:32the base-10 logarithm is easier to interpret,
03:35so that's the one we usually use.
03:37So what I do is I double-click on that and it brings it up into the numeric
03:40expression. I just double-click on MarketCap and it fills in, so the expression
03:44reads LG10(MarketCap).
03:47Press OK and it tells me that it's created a new variable.
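As a short sketch, the pasted syntax for this transformation is along these lines, assuming the source variable is named MarketCap as in the video:

    * Base-10 log of market capitalization; LG10 is SPSS's base-10 log function.
    COMPUTE LogMarketCap=LG10(MarketCap).
    EXECUTE.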
03:52If I go to the data set, I can see it right here at the end.
03:57You see the numbers are much smaller, mostly single and double digits, but that's
04:01because we're dealing with very large numbers over here, and that logarithm has
04:04to do more with the number of zeros in the number.
04:07Now what I'm going to do is I'm going to go back and create another box plot,
04:11but instead of doing market capitalization this time, I'll do the log of the
04:16market capitalization.
04:17Just drag that in and leave everything else the same.
04:22And in this case, what's interesting about it is that we still have outliers, but
04:28this time they are symmetrically distributed:
04:30we have outliers on the high end, but we also have outliers on the low end.
04:35And in fact, the distribution is remarkably symmetrical.
04:40It looks like it's spread out almost exactly the same amount in each direction.
04:44And you can see also that Apple is still an outlier, but look how close it is now
04:48to Google, for instance, whereas in the original chart Apple was way up here and Google was way down here.
04:54So what we've done is we've taken an extremely asymmetrical, skewed distribution and
04:59by taking the logarithm, we've pulled it in and made it symmetrical.
05:04Now there are still outliers, but they are on both sides and they're not terribly
05:09far away like they were before.
05:11And so we've taken a variable that we might not have been able to deal with
05:15before, or would have had to cut an awful lot of scores to make work, but now we
05:20can actually leave all of the scores in, we can use the entire data set, and
05:24still come pretty close to meeting the assumptions of most statistical procedures.
05:29And so a logarithmic transformation in this case was a huge help in making our
05:34data meet the assumptions that we need to make it more manageable for analysis.
4. Working with the Data File
Selecting cases
00:00When you're doing an in-depth investigation of your data, there are times when
00:04you'll want to focus on just some of the cases,
00:07for example, all of the men over 50 who visited your website, or clients with
00:11outstanding payments, or people under 16 who have taken the SAT.
00:15Now, one way to deal with this is to sort the data and then delete all the cases
00:20that you don't want and save it as a new data file.
00:23This is an option, but it can get cumbersome, and you do run the risk of
00:26multiplying data files or losing track of what you've got.
00:29An easier way is to have SPSS select the cases of interest, and when this
00:34happens, the other cases are still in the data set, but are temporarily excluded
00:38from the procedures, and you can then switch to different selection criteria or
00:43you can return to the entire data set.
00:44It's a more flexible and efficient way of working with interesting subgroups in your data.
00:49For this example I am going to be using the data set Searches.sav, which is
00:54information about Google searches on a state-by-state basis.
00:57The first several searches all have to do with statistical topics, for instance
01:02the SPSS Google search term or regression, and then I have some social media
01:06ones, and then I have some sports ones.
01:09One that's interesting at the right end of the data set--so I am going to scroll
01:12over--is an indication of whether a state has an outline for a high school
01:17statistics class, and maybe I would want to restrict my analyses temporarily to
01:23states that have this to see, for instance, if that's associated with their
01:27Google search patterns for statistical topics.
01:30So the way that I am going to do this is I am going to select cases.
01:33I go up to the Data menu, and then I come down to the bottom to Select Cases.
01:38And the dialog box gives me several options.
01:40The first one is to simply include all the cases, which is what I have right now.
01:44The second one is If condition is satisfied, and the idea here is, say, if they
01:49have a score on this variable that is equal to a particular value; I can also
01:52use more than one variable in the condition.
01:54And this is what I am going to use.
01:56I am going to say whether they have the statistics education.
01:59That's going to be statistics_ed = 1.
02:02I will show that to you in just a second.
02:04I also have an option of using the random sample of cases.
02:07If I have a large data set, sometimes it's a good idea to try doing an analysis on a
02:11small part of it, let's say 20% or 30% or 40%, and then trying again with other
02:17parts of the data to see if the patterns I found hold there.
02:21You can also select based on a time or case range, for instance all the customers
02:24from 2009 or from 2007.
02:27And the last one, Use a filter variable, what happens is when I do a selection,
02:32SPSS automatically creates an indicator variable at the end of the data set.
02:36So if I have one already, this simply gives me the option of using that
02:39existing filter variable.
02:41The section below that, Output, is grayed out because I haven't done a selection
02:46yet, so I can't use those options.
02:48So what I am going to do right now is I am going to go to select If condition
02:51is satisfied, and then I click on the If box to say what my criteria are for the selection.
02:57What I want to use here is the variable about whether a state has a high school
03:03curriculum for statistics.
03:04That's near the bottom of the variable list on the left.
03:07I can simply double-click on that and it puts it up in the Selection box.
03:11Now, my selection in this case is very easy.
03:13This is a 0, 1 variable.
03:15It's called a dichotomous indicator variable.
03:17It only has two options. And I just want the 1s,
03:20so with statistics_ed already there, I am just going to
03:24add =1. Once I've got that, I can go to the bottom and click Continue, and that
03:29shows up in my If condition is satisfied in the selection box.
03:32Now, the options at the bottom in Output show up.
03:36The first one is to simply filter out the unselected cases. It's the default.
03:39It's what I am going to use here.
03:40But I do have two other options that allow me to change the data set. The second
03:44one, Copy selected cases to a new data set, does exactly that.
03:46It creates a second data set.
03:49I have to give a name for that data set.
03:51And then if I want to work with just that one, it can be easier. Or I can get
03:56rid of the cases that I didn't select.
03:58There may be situations in which I want to do that. You can call that
04:00destructive editing.
04:01I usually just filter out the unselected cases, but it's up to you.
04:06So now that I have got my criteria specified by what I am selecting and what I
04:10am going to do with the unselected cases, I simply press OK.
04:13Now the output file shows me the syntax statements that it has used to
04:16create the selection.
04:17It doesn't show any charts here, because we don't have them.
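Those pasted statements look roughly like the lines below; the destructive alternative mentioned above, which permanently deletes unselected cases, would be a SELECT IF instead of a filter. The exact filter label text is a guess.

    * Filter (non-destructive): keep all cases in the file, mark the selected ones.
    USE ALL.
    COMPUTE filter_$=(statistics_ed = 1).
    VARIABLE LABELS filter_$ 'statistics_ed = 1 (FILTER)'.
    VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
    FORMATS filter_$ (F1.0).
    FILTER BY filter_$.
    EXECUTE.

    * Destructive alternative: permanently drop cases that do not meet the condition.
    SELECT IF (statistics_ed = 1).
    EXECUTE.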
04:19But if I go to the data file, you can see that on the left the row numbers of a
04:24lot of the cases are selected out, because not too many states have a high
04:27school statistics curriculum.
04:29Also, on the right side you can see there's a new variable there, Filter_$, that
04:35says Selected or Not Selected.
04:37That's a 0/1 variable.
04:38If I turn off the variable labels with the button on the menu bar, you can see
04:42that those are 0s and 1s underneath, but I will turn the labels back on now by
04:46clicking on the Value Labels button.
04:49So anything I do is going to work only with the cases that I have
04:52selected, which in this case are states with a high school statistics
04:56education curriculum.
04:57I will make a box plot, for example, of their SPSS searches.
05:02I click on Graphs, to Chart Builder, and then in the gallery on the bottom I
05:08go to Boxplot, and I am simply going to drag the one-dimensional box plot up into the canvas.
05:14And from there, I drag in the variable from the list that I want.
05:18I am going to take SPSS and drag that into the X axis.
05:22Also, because I may have outliers here, it's nice to have an ID to know
05:27what states they are.
05:28I can go down to the Group/Point ID tab, I can select Point ID label on the
05:33bottom, and then I need to drag in the variable that provides the labels.
05:38In this one it's the state code.
05:40So I come up to the variable list and drag the state code over, and now I
05:45am ready. I click OK.
05:47I first get a bunch more code; that's the syntax for what I have done.
05:50There is the GGraph command that gives the data set, and then here is the box plot.
05:56This shows the distribution of Google search patterns in terms of how common
06:02that particular search is relative to others for several different locations,
06:06and you can see we have an outlier, it's Washington, D.C. up at the top, and
06:10they search for this term SPSS much more than other states do.
06:15So anyhow, what I have here is a selection criterion, the ability to temporarily
06:21or permanently select a subset of cases for a more thorough analysis, and this is
06:25a great feature of SPSS.
06:27It lets you really dive into your data and get the most out of it.
06:30In the next movie we'll look at a related procedure called Split File
06:34that also lets you work with subsets, but instead of reporting on just one
06:38subgroup at a time, it gives the results for all of them so you can make
06:41comparisons between the subgroups.
Using the Split File command
00:00In the last movie, we took a look at a really handy procedure for selecting
00:04subgroups of your data for a more focused analysis--that was the Select Cases
00:09or filter variable.
00:11In this movie, we will explore a related procedure called Split File that also
00:16breaks the data down by subgroups, but unlike the Select Cases command, it then
00:21gives you the results for all of the subgroups, and it'll let you make explicit
00:25comparisons between the groups, which can be a really handy feature.
00:29Now when we left the data set, I had some of the cases selected and some of them not.
00:35You can tell that this is the case because, obviously, over on the left a bunch
00:39of the rows are crossed out.
00:40Also, you see that on the right end of the data set, I have a variable called
00:45filter_$, and we have Not Selected and Selected.
00:49Also, at the very bottom right of the screen you see that it says Filter On.
00:54This is an indication that the filter, the selection criterion, is active.
00:59So before I go on to do a Split File, I need to turn off the selection.
01:05I go back up to the Data menu, to Select Cases, and then at the top of the box I
01:11simply click on All Cases.
01:14I don't have to erase the criterion.
01:16It's okay if it's still there and I press OK.
01:19And then in the output, it tells me that the filter is off and I'm now using all the cases.
01:23If I go back to the data, you can see that none of the cases are crossed out and
01:28that down here on the bottom-right the Filter On is not there anymore.
01:32The variable that created the filter is still there if I want use it later, but
01:37now I am going to create a Split File where I can compare several groups.
01:42To do this, I am going go back up to Data and I am going go down to the bottom
01:46to Split File, which is right next to Select Cases.
01:50In this dialog box, I have three options for Split File.
01:54The first one is Analyze all cases, do not create groups. That's what I have now.
01:59That's the default. The next two,
02:01Compare groups and Organize output by groups, determine how things will look if
02:06I request several procedures, or a procedure that has a lot of output.
02:10The first one, Compare groups, puts the results for each step right next to each other.
02:15So for instance, if I have tables and charts, the tables for group 1, then the
02:20tables for group 2, then the chart for group 1 and then the chart for group 2.
02:24On the other hand, Organize output by groups would do the tables and the charts
02:28for group 1, then the tables and the charts for group 2.
02:31I'm going to use Compare groups in this case. It's a personal preference. From
02:35time to time, I might use the other one, and it's up to your judgment.
02:39I click on Compare groups and then I choose the variable that I'm going to use
02:43to split the groups.
02:44In this one, I'm going to use the region of the United States.
02:49So I need to scroll down on my variable list and if I make the box wider, you can
02:53see, I have Census Bureau Region.
02:56That's the label. The variable name is Region.
02:58I will just double-click on that, and there it is, in the Groups Based on box.
03:04So I've got the criterion in there, and by the way, you can put more than one
03:08in there if you want to split it by two variables, but then things get rather complicated.
03:13So I'm just going to press OK now, and now in the Results it tells me that it
03:18has sorted the data file by its region and that it's now going to split
03:21things by the region.
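As a sketch, the syntax for turning the filter off and then splitting the file in Compare groups mode looks roughly like this (Organize output by groups would use SEPARATE BY instead of LAYERED BY):

    * Turn off the earlier filter and use every case again.
    FILTER OFF.
    USE ALL.
    EXECUTE.
    * Split File needs the data sorted by the grouping variable.
    SORT CASES BY Region.
    SPLIT FILE LAYERED BY Region.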
03:23If I go back to the data set, nothing is crossed out, because I'm using
03:27everything, but you can see that Region is sorted here in this column. And if
03:31you go to the very bottom right of this screen, you'll see that it says Split by region,
03:36so I know that it's going to do things separately for each group.
03:41So what I'm going to do now is I am going to request some information.
03:45I am just going do histograms.
03:47I go to Graphs and to Chart Builder.
03:51Now I am going to come down to Histogram.
03:52I am going to drag the basic histogram up into the canvas and then I select the variable.
03:59I will use the SPSS Google Search.
04:02So I click on that and drag it to X axis in the canvas and from there, I
04:07can simply Press OK.
04:09And then in my results what you see is I have several histograms.
04:13These are very large, chunky ones because there are not a lot of cases in them.
04:17The first one is for the Northeast region of the United
04:20States, and this one is for the Midwest; it has more bars because there are more cases.
04:26If we come back up, you see there's only nine states in the Northeast region, and
04:31this one has 12, and then we have the South and then the West.
04:37So what it's done is it's done a procedure but it's done it separately for each
04:42of these particular groups.
04:43I can run much more complicated procedures that we'll cover later in the
04:47course and break them down by region or by some other variable, or combination of variables.
04:53So the Split File command, along with the Select Cases command, is a great
04:58way to focus on subgroups and get a deeper understanding of your data, and by
05:03comparing the results for one group to the next, you can see whether the
05:06patterns you find hold across groups or whether you should dive even deeper
05:10into your data.
Merging files
00:00When you are getting ready to analyze your data, you may have the situation
00:04where your data lives in more than one file.
00:07Now, SPSS lets you have more than one file opened, but in a number of procedures the
00:12data needs to be in the exact same file.
00:14Fortunately, SPSS has a command that lets you combine data, either by adding new
00:20cases that have the same variables or by adding more variables for the existing
00:26cases, and in this movie I am going to show you how to do both of these.
00:31I am beginning with a data set that's called Search1.sav.
00:36This is simply the top-left quadrant of the data file that we used in the last two movies.
00:42I have information of a number of states about Google search patterns.
00:46What I am going to do though, is if you scroll down, you can see that I only have
00:51data through Montana.
00:53I have 27 cases here.
00:55I want to add the remaining states using the same variables, and what I have is
01:00another data file that has all the same variables in the same order but has
01:05the remaining states.
01:07To do that, I come up to Data and I come down about halfway to Merge Files and
01:14this is where it asks me if I want to add cases--that's more observations with
01:18the same variables--or whether I want to add variables for the same cases.
01:23I am going to do both, but on this one
01:24I am going to add cases.
01:27Now, you can do this with either a data file that's currently opened--that's the
01:32top one, an open data set, but that's grayed out because I don't have another
01:35data set opened right now--or you can use an external SPSS data file.
01:40I have that other data file.
01:42It's saved in the folder, and I am just going to open it up by clicking Browse.
01:46This one is just called Search2.
01:48I am going to double-click on that and then the full path shows up right here,
01:54and I am just going to click Continue, and so what it does now is it brings up a dialog box.
01:59It attempts to pair the variables by whether they come from the active
02:03data set or from the one that I am opening, but since I have the exact same
02:07variables in both of them,
02:08everything is paired up in the two of them.
02:10I can scroll down the list and you see that all the same variables occur.
02:15If I wanted to, I can select Indicate case source as a variable.
02:20That's at the bottom of the list.
02:22What this would do is it would add a new variable to the data set, and it would
02:28indicate whether the cases came from the first data set or the cases came
02:32from the second data set, and it's a way of keeping things straight.
02:36I don't need it in this case because there is no overlap and there will be no
02:38confusion between the two of them.
02:40I am just going to press OK, and I get the syntax and the results that say it is adding cases.
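That pasted syntax for adding cases is quite short; roughly as below, with the full file path shortened here to just the file name:

    * Combine the active data set (*) with the external file; both have the same variables.
    ADD FILES /FILE=* /FILE='Search2.sav'.
    EXECUTE.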
02:48I go back to the data set.
02:49Previously, I only went through Montana, and now you can see that I have added
02:53Nebraska all the way down to Wyoming.
02:57Now, I have the same variables in the same order. Now I just have more cases.
03:01On the other hand, maybe I have the cases I want but I want to add more
03:05variables, more information about them.
03:08What I have right now is just Google's search history.
03:11I can scroll through, and all of these end with _GS to indicate these are
03:15Google Search patterns.
03:16But I have other information about each state that would be useful in
03:20analyzing these patterns.
03:22So what I am going to do now is I am going to add new variables to the data set.
03:26I go back to where I was before, I go up to Data, come down again to Merge Files,
03:33except this time I select the second option, Add Variables.
03:38Again, I have the option of using an open data set, but the one I have isn't open,
03:44or an external data set.
03:46Mine are saved in an external data set, so I am going to click on Browse and I am
03:50going to use Search3.
03:52I will just double-click on that.
03:54There it is and I click Continue.
03:58Now, it's bringing up the data set. There is one variable that is excluded and it's state.
04:04Now, that's the key variable that I used in both of them as a way of lining things up.
04:09You can see for instance that it has state and then a plus sign in parentheses.
04:14That tells me that it's from the new data set that I am adding.
04:17So it would be redundant; we don't need it again.
04:21All I am going to do now is click OK and it tells me that it's adding a bunch of new variables.
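Adding variables pastes a MATCH FILES command; a rough, simplified sketch, assuming both files are already sorted by the key variable state (the actual pasted syntax may also rename and drop the duplicate key from the second file), is:

    * Merge new variables from the external file, lining cases up by state.
    MATCH FILES /FILE=* /FILE='Search3.sav' /BY state.
    EXECUTE.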
04:26I go back to the data set, and previously we stopped with the Google Searches, the _GS,
04:34but now you can see I have added several new variables--
04:37I am going to scroll through them-- from has_NFL, whether a state has an NFL
04:42team, through Division.
04:44And so what I have done is in the first example I added new cases to the
04:48data set, I added new states, and in the second example I added new variables.
04:53And what this does is it takes three separate data files and combines them into
04:57one, which lets me do more analyses-- compare the relationships between the
05:02variables--than I would be able to do otherwise.
05:04Now, the data may have been spread out across several sources, typically many
05:08different locally stored spreadsheets in an organization, and by merging the
05:13cases or the variables, you're able to get into a much more productive situation of
05:18having all of your data in one place.
05:21When you have that then it's much easier to break things down to compare the
05:25groups and to examine trends and outcomes.
05:28All of these can give you a much more powerful insight into your data.
Using the Multiple Response command
00:00It's usually a good idea to enter your data in its least processed and
00:05most disaggregated form,
00:08that is, put the raw data in and any processing you need to do, do in SPSS.
00:13That way you can combine things if you want. On the other hand, if you bring the
00:18data into SPSS in an aggregated or combined or summary form, then you can't
00:23break it down later.
00:25Now one way of dealing with data that you want to aggregate, as long as you
00:29are dealing with nominal or categorical variables, is with the Multiple Response function.
00:35It's one of the neat tricks in SPSS.
00:37This function combines the responses from several variables and allows you
00:42to create frequency tables and cross tabulations as though they were a single variable.
00:48In many circumstances, this can make life much easier.
00:51The first thing to say here is that you can organize the data in a couple of
00:55different ways, and Multiple Response can deal with either one of them.
00:59In this data set, Tickets.sav, I have hypothetical data about the purchase of
01:06season tickets to seven different kinds of events.
01:09I have Baseball and Basketball and Football as well as the Symphony, the Opera,
01:15the Theatre, and the Ballet.
01:17And the idea here is we might want to look at what kinds of season tickets
01:21people have, how many they have, and whether there is, for instance, a difference
01:25in the gender and the age and the overall preferences of the buyer. And again,
01:30this is hypothetical data.
01:32I have it set up first where I have each possible event, the three sports and
01:38the four cultural events, as indicator variables.
01:41So you see here for Baseball we have Yeses and Nos for whether a person has season
01:46tickets to Baseball, and then to Basketball and Football.
01:50Then I have a column that adds up how many sports events they have season tickets to.
01:54The first person has season tickets to two sporting events, Baseball and
01:58Football. The second person has none.
02:01And then I have four cultural events.
02:03I am going to scroll over a little bit, so you can see all of it, and I do a similar thing.
02:08I add up how many cultural tickets people have. Then I also have another one,
02:13combining both the sports and the cultural, how many season tickets they have
02:16all together. I am being a little optimistic, but this is how that works.
02:20So this is a series of what are called dichotomous indicator variables.
02:24Dichotomous means just two possible values, yes, no; male and female; and an
02:30indicator variable is a 0/1 variable, where 0 is no and 1 is yes.
02:35In fact, if I go up to the menu bar and click on this button for Value Labels,
02:42you'll see the 0s and the 1s that are underneath these.
02:44I put the Value Labels back on,
02:46you can see the Yeses and the Nos.
02:49So the indicator variables is one way
02:51I list every possible choice and I put down a Yes or No for each person.
02:56The other way of organizing multiple response data is by simply having a
03:01variable for the maximum number of choices that a person can have.
03:04Now in this hypothetical data set nobody had more than four sets of season
03:09tickets, and so what I have is Tix1, 2, 3, and 4 for whichever season tickets they have.
03:16There are seven options for each one of these, and I simply put down the first
03:20one, the second one, and if that's all they have, I put 0s for the rest.
03:24You can see actually I have some people who have no season tickets at all,
03:27down about case 16.
03:30This is a way that people often do coding, especially if it's open ended,
03:34write down all of your feelings or your responses to a particular question,
03:38but I'll let you know right now, this kind right here, the Tix1 through 4 where
03:43we can have any of the categories in any of the columns, this can get
03:47extremely cumbersome.
03:48In my experience the indicator variables, even though we have to have more of
03:52them, are more amenable to adding things up and to doing other analyses.
03:57Now with that in mind let me show you how to set up a Multiple Response format.
04:02The first thing you have to do is define what are called variable sets, the
04:06variables that should be treated as instances of a single category.
04:11You go up to Analyze and then you go down near the bottom to Multiple Response
04:16and define Variable Set.
04:17You'll see I have two other options beneath that, Frequencies and Crosstabs. They
04:22are not available yet, because I haven't defined any sets.
04:25I click on that, and I am going to do this twice.
04:28I am going to do it once with the indicator variables--that's the 0, 1, yes/no
04:32variables--and once with the multiple-choice ones, the four columns for
04:37up to four sets of tickets a person can have.
04:39So what I do is I first scroll down here and I'll pick the three sporting
04:45events and put those over here, and then I'll click the four cultural events,
04:51and I'll put those over.
04:52And then what it does is it asks me whether these are dichotomies--that's the
04:560, 1 for instance--or whether they are categories, where it's the 1 through 7.
05:01This part is the dichotomies.
05:03And it says, which one counts as a yes, because it might be 0, 1, but it might
05:08be 1, 2, or something else.
05:09I just have to indicate that it's the 1 that counts as a yes.
05:13And then I have to give a name to the Multiple Response set, and what I am going
05:18to call it here is TixDichotomies, Dichotomous Variables for ticket purchases.
05:29And then I click on Add over on the right.
05:33And so what this does is it creates a Multiple Response Set.
05:37It's $TixDichotomies. This won't show up in the data set because this is
05:42more like a metadata.
05:44It's information about the data set that the computer saves. So I have done this,
05:48and I can press Close now.
05:50You see the data set does not look different, but if I now come up to Analyze and
05:57back down to Multiple Response, I now have these two other options of
06:01Frequencies and Crosstabs available.
06:04What I can do for instance is I can click on Frequencies, and there is the
06:08Multiple Response Set that I just created.
06:10All I do is I move it over and I press OK.
06:16And I get a table that says, how many people had purchased each kind of ticket?
06:21Now this is the same thing as the 0, 1 indicator.
06:24It's simply telling me how many people had basketball tickets, how many
06:28people had opera tickets.
06:30So this is one way of doing it.
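For reference, the equivalent MULT RESPONSE syntax for the dichotomy set looks roughly like this; the seven indicator variable names are assumptions based on the columns shown in the video.

    * Define a multiple dichotomy group (counted value 1) and get a frequency table for it.
    MULT RESPONSE GROUPS=$TixDichotomies 'Ticket purchases'
        (Baseball Basketball Football Symphony Opera Theatre Ballet (1))
      /FREQUENCIES=$TixDichotomies.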
06:32I can also do cross-tabulations.
06:35If I go back to Analyze, to Multiple Response, to Crosstabs, I can say that I
06:40want to look, for instance, at whether there are gender differences in these.
06:45And I can put the Multiple Response Variable in the Column(s) and gender up here.
06:49However, I have to define the gender variable. I'll define the range and I
06:53simply tell it that I have 0s and 1s. Press Continue.
06:58Then I can click OK, and this is called a cross-tabulation.
07:02It lets me know the number of men and women who have season tickets of each kind.
07:05We'll go back to crosstabs in a later movie, but I just wanted you to see that
07:10there is an option with the Multiple Response Set.
07:13Now, I can also do multiple responses with the other kind where I have it open ended
07:18where people can put anything for the first set of tickets they have to second
07:22set. Let's look back at the data set.
07:24That's these four at the end.
07:25I only need four, because four is the most that anybody purchased.
07:29To do this one I come back to Analyze, back down to Multiple Response, and I am
07:34going to define a new variable set.
07:37This time I scroll down and I select these last four, First Season Ticket
07:42through Fourth Season Ticket, and then move those over to Variables in Set.
07:48In this case, they are not dichotomies; they are categories. And I need to tell it
07:52the range. There were seven possible choices, so I need to say it goes from 1 to
07:587. Then I need to give it a name.
08:01Now the last one was TixDichotomies. I might as well call this one
08:04TixCategories. Ticket Categories, this would be my label, and then I click Add.
08:13So that shows up as another response set.
08:16I click Close and I can do the frequencies and the crosstabs again using it this way.
08:22So I come back up to Analyze, to Multiple Response, to Frequencies.
08:29Now I used the dichotomies the last time. I'll just double-click and get that out of there.
08:34I'll use the Categories this time and hit OK, and you see I get the same kind of information.
08:41It's just the data was organized differently.
08:45I can also do the crosstabs the same way.
08:47Going up to Analyze, to Multiple Response, to Crosstabs, so this time I take out
08:53the Dichotomies and I put in the Categories.
08:58Now I get the same output either way, which makes it seem that these two
09:02methods of creating multiple response sets are equivalent; however, I'll let you
09:07know there is a trade-off.
09:08With the Multiple Response set that's created from the categories, that is, the
09:12multiple-choice columns where people could put any of the answers,
09:16about the only way to use those variables is with Multiple Response sets, and they
09:21are very limited in their application.
09:23On the other hand, if you do the indicator variables, which I had over to the
09:27left, these are much more flexible, and they can be used in other procedures like
09:33getting correlations and regression that we'll do later, which is why I almost
09:38always use the indicator variables, the 0, 1 variables for each choice.
09:42The only trouble is if you had, for instance, a lot of possible responses. You
09:47could end up with a huge number of indicator variables, whereas you could have a
09:52smaller number of these category columns.
09:55On the other hand, if you really have that many choices, you might be wise to
10:00collapse your categories and combine them.
10:02Anyhow, the Multiple Response function in SPSS can be a nice way of dealing
10:07with situations where people can choose or write in more than one answer to a question.
10:12The procedure is flexible because it can use dichotomous indicator
10:16variables, that's the 0, 1, for each possible choice, or a smaller number of
10:21categorical variables with several choices for each.
10:24However, the procedure does limit you to doing just frequencies or crosstabs with
10:29other nominal or ordinal variables. For these reasons I generally recommend that you
10:33use the dichotomous indicator variables.
10:35But for now the Multiple Response function is an important tool in your
10:39collection of data-analysis strategies.
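By the way, everything I just did through the dialogs can also be run from syntax; the Paste button in each dialog will capture the exact commands. A rough sketch of the MULT RESPONSE command looks like this, with placeholder variable names (SeasonA through SeasonG for the seven 0/1 indicators, Ticket1 through Ticket4 for the open-ended columns, and Gender coded 1 and 2), so substitute the names your data set actually uses.
* Dichotomy-style set: list the 0/1 indicator variables and the value that counts as a choice.
MULT RESPONSE GROUPS=$TixDichotomies 'Ticket Dichotomies' (SeasonA TO SeasonG (1))
  /FREQUENCIES=$TixDichotomies.
* Category-style set: list the open-ended columns and the range of valid codes, 1 through 7,
* then cross-tabulate the set against gender (the range of the gender codes is required).
MULT RESPONSE GROUPS=$TixCategories 'Ticket Categories' (Ticket1 TO Ticket4 (1,7))
  /VARIABLES=Gender(1,2)
  /TABLES=$TixCategories BY Gender.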
Collapse this transcript
5. Descriptive Statistics for One Variable
Calculating frequencies
00:00One of the most general commands for getting descriptive statistics in SPSS, and
00:05my personal favorite, is the Frequencies command in the Analyze menu.
00:09This is a great way to get all of the common descriptive statistics you might
00:13want, such as the mean, the standard deviation and the quartiles--that includes
00:18the minimum, the median and the maximum-- for several variables at once, and to
00:22get simple charts such as histograms or bar charts at the same time.
00:27I view it as SPSS's one-stop shopping center for basic statistics for almost
00:32any kind of variable.
00:34For this example, I'm going to be using the NASDAQ data set.
00:38This is information about all 2,800 stocks listed on the NASDAQ Stock
00:44Exchange, and I'm going to be gathering some descriptive statistics about a
00:49few of these variables.
00:50The information about the LastSale-- that's how much shares went for at the time
00:54that I gathered this data--the market capitalization of each company, as well as its sector.
01:00And what I'm going to do is I'm going to come up to Analyze, to Descriptive
01:05Statistics, to Frequencies, the very first one.
01:08Now the Frequencies command is associated for a lot of people with just
01:11categorical variables, because it gives frequency tables, how common each
01:16particular answer is, and it's well suited to this,
01:19but it is also very well suited to dealing with scale variables.
01:23I'm going to begin with a categorical variable, because that's the most familiar for people.
01:28The variable that I'm going to use in this case is called Sector Code, so I'm
01:31just going to come down here to SectorCode, select that, and move it over to the
01:36Variable list on the right.
01:38Now by default it's going to give me a Frequency table, but I can ask it for
01:42a few other things.
01:44With a categorical variable like SectorCode, the most important would be a bar chart.
01:49And if I come right over here to Charts, I can ask it to make a bar chart and
01:54just press Continue, and then I press OK.
01:58And what I have here is it tells me that it's gotten statistics for 2,820 cases.
02:04There's no missing data, and this first one is the frequency table that comes by
02:08default, and what it has is the name of each of the categories under Sector, from
02:13Basic Industries through Transportation.
02:16Then it has the frequency,
02:18that is, the number of companies that fall into each of those categories.
02:21For instance, 133 of these had no SectorCode listed, but under Healthcare, 234
02:28companies were listed.
02:30The next one is the Percent,
02:31that is, the percentage of all of the cases that fall into each one.
02:35So Capital Goods, which had a Frequency of 204, that accounts for 7.2% of the
02:42companies in the NASDAQ Index.
02:44Now the next one, Valid Percent, is the same because we have no missing data,
02:50but say for instance, that half of the companies were missing data.
02:53There was no response at all under SectorCode.
02:57Then instead of Basic Industries being 2.8%, it would be 5.6%, because the valid
03:04percent excludes the missing cases, or the cases that are missing on that
03:09particular variable.
03:11The Cumulative Percent simply takes the Valid Percent and adds it on as it goes.
03:16So it finishes with 100% by the time it gets to the last valid category.
03:21So that's the frequency table.
03:23The next thing is I asked it to produce a bar chart.
03:28Now this is a bar chart that is produced as a sort of supplementary feature of
03:33the Frequencies command, and I would probably want to go through and edit it to
03:38sort them from the most common sector to the least common.
03:42So Finance would be first and it looks like Transportation would be the last.
03:47I might flip it sideways so it would be easier to read,
03:50but those are the things that we covered in the section on creating bar charts
03:54as univariate charts.
03:55But this is a very simple way to get a lot of good information about a
03:59categorical variable.
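If you prefer typing commands, the same analysis takes only a few lines of syntax; this is roughly what the dialog pastes, assuming the variable really is named SectorCode in your copy of the file.
FREQUENCIES VARIABLES=SectorCode
  /BARCHART FREQ
  /ORDER=ANALYSIS.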
04:00Next, what I'm going to show you is how to use the Frequencies command to
04:04get information about a scaled variable, something that people don't use that
04:08often for that purpose.
04:10I come back up to Analyze and I come to Descriptive Statistics, again to
04:15Frequencies, except this time I'm going to reset it, and I'm going to pick
04:20two scaled variables.
04:22I'm going to pick LastSale--
04:24that's the price of the individual stock shares the day before I gathered the
04:27data, and the market capitalization.
04:30So I just double-click to move both of those over, and then I can ask
04:34for certain statistics.
04:37There's a few that are really helpful.
04:38Number one is the Mean, the average.
04:41I also like to get the standard deviation, which is an indication of how
04:45spread out the scores are.
04:47The mean and the standard deviation are very common statistics, although they
04:51both work well for bell curves, and I happen to know that both of these variables
04:56are very skewed, and that's one reason why I also want to use what are called
05:01percentile- or quartile-based measures,
05:04that is, the minimum and the maximum and then the 25th percentile, the median,
05:09the 50th percentile and the 75th percentile, also called quartiles, all the way through.
05:15Now if I wanted to, I sometimes could get information about skewness and
05:19kurtosis, which are indications of how closely the data fit a bell curve, or
05:25normal distribution, but I'm not going to do that right now.
05:27So all I'm going to do now is I'm going to click Continue.
05:30Now because I have scaled variables, it can also be nice to get a histogram.
05:36And so I go up to Charts and I click Histogram.
05:40I could show the normal curve, which is what the distribution would look like if it were normal--
05:42I'll undo that, since I don't really need it here--and click Continue.
05:47Now there's one more thing I want to do here.
05:50When I come back to this list you see that the Display frequency tables,
05:54which is below the Variable list, is checked. That's by default in the
05:58Frequencies command.
05:59However, because all 2,800 companies have different market capitalization values,
06:04this will give me a list of 2,800 different values. I don't want that.
06:09I'm using summary statistics to avoid that,
06:12so what I'm going to do is I'm going to uncheck that.
06:14When I'm using the scale variable, I usually don't want the Frequency table.
06:18And now I can click OK.
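For reference, the choices I just made--no frequency table, the mean and standard deviation, the quartiles, and histograms--paste as syntax along these lines, assuming the variables are named LastSale and MarketCap as they are here.
FREQUENCIES VARIABLES=LastSale MarketCap
  /FORMAT=NOTABLE
  /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
  /PERCENTILES=25 50 75
  /HISTOGRAM
  /ORDER=ANALYSIS.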
06:21And what I get here are a couple of different things.
06:23First off, I get a table of statistics that lists each variable as a column.
06:29So the first column is LastSale, the second column is Market Capitalization, and
06:34then each row is the various statistics that it gathered, from the valid and how
06:38many cases have values for that particular statistic, to the mean and standard
06:43deviation, then to these quartile-based statistics. And then from these, for
06:47instance, I can see that the average value of a share on the NASDAQ was $18.72.
06:55I can also see that the minimum is $0.01, at which point I think they
06:59drop off the market.
07:02Below those tables I have histograms.
07:05This is the value of a share in a particular stock, and what you can see is
07:11everything is bunched up really low.
07:13Most stocks have prices that are, for instance, below $50.
07:18And in fact, if I go back up to the table, I can see that 75% of the stocks have
07:24values that are less than $23.61, but some of them, the maximum, get huge.
07:31The maximum price for a stock on the NASDAQ is $1,132, which is why when we come down here
07:40we see that the scale goes all the way up to $1200.
07:41There is one very high outlier sticking out up there.
07:46I also have a histogram for market capitalization, and again, we know from
07:50before that this goes up to $300 billion and so most of the companies are stuck
07:55right there in the very first bar, at a very low level of market capitalization,
08:00but there are a few that go up very, very high.
08:02What these histograms do is give me an indication that we have some
08:06extraordinary outliers, and together with the table they give me
08:12an idea of how I can describe those outliers.
08:15And so I think this demonstration shows how flexible the Frequencies command is
08:20and why it's one of my favorite procedures, especially because it works with
08:24both categorical and scale variables.
08:27It gives percentile statistics.
08:29It can do frequency tables.
08:31It can do charts at the same time.
08:33This makes it my first stop when getting the fundamental statistics for my
08:37data, and I'm sure you'll find it especially useful for your data and your
08:41analyses too.
Collapse this transcript
Calculating descriptives
00:00One of the first steps in any data analysis is to thoroughly investigate each of
00:05your variables one at a time,
00:07that is, to get univariate analyses.
00:10I've already described one procedure for getting univariate information with the
00:14frequencies procedure, and that works for both categorical and scale variables.
00:19Another important option for univariate statistics is the Descriptives command.
00:24This command and the Frequencies command do a lot of the same things but there are
00:28some important differences. The most significant is that frequencies can work
00:33with Categorical variables and Scale variables but descriptives works only
00:38with Scale variables.
00:39In this movie, I will highlight the similarities as well as point out some of
00:43the unique advantages of the Descriptives command.
00:46For this example I will be using the same data set that I used in the last one.
00:50That's information about the stocks in the NASDAQ index, NASDAQ.sav.
00:55To get the descriptives, I go up to Analyze, to Descriptive Statistics, to Descriptives.
01:02From here, I select the variables that I want.
01:04You will notice it doesn't list all of the variables.
01:07It only lists the ones that are numeric.
01:10The symbol and the name variables, as well as industry, are text variables and
01:16they are categorical and it simply doesn't list them here.
01:19So I am going to take the two that I used in the last example. That's LastSale--
01:25so I am just going to click to move that over to the right--and MarketCap--
01:29I am also going to move that over to the right.
01:31Then what I can do is I can get options where I select the statistics that I want.
01:39Now, by default, the descriptives gives me the mean, the standard deviation, the
01:44minimum and the maximum, and these are a good list.
01:47I can also get Kurtosis and Skewness if I want.
01:50What's important though is I cannot get the quartiles. I can't get the 1st
01:55quartile, or 25th percentile score, I can't get the 3rd quartile, or 75th
02:00percentile score, and I can't get the median, and for a skewed distribution those
02:04are important statistics.
02:06So that is one reason to sometimes use the Frequencies command over the
02:11descriptives, is if you need the median and the quartiles.
02:14But I will just click Continue.
02:16Now, I have another option here. You will see at the bottom-left
02:19it says Save standardized values as variables.
02:22This is one of the big perks of the descriptives command.
02:26If you want to take a variable that is in some metric like dollars or an
02:30arbitrary metric that may be a foreign currency you're not familiar with,
02:35sometimes you want to save things as standardized variables. That makes it so
02:38that the mean is 0 and the standard deviation is 1, and the individual cases
02:43get scores that indicate how many standard deviations above or below the mean they are.
02:48These are also called Z scores.
02:50I have seen people demonstrate how to do this manually by
02:54calculating everything.
02:55That's very tedious.
02:57The Descriptives command gives you a one-stop way of doing this: you simply click the box
03:01and it will add standardized values for these things. And we can see how many
03:06standard deviations above or below the mean some of the companies are on these
03:11items for last sale and market capitalization.
03:13Now, there is also an option here for Bootstrap.
03:16I am not going to get into that one because the Bootstrap is an add-in feature
03:20that you pay extra for in SPSS.
03:22I am just going to deal with the ones that come standard.
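The equivalent syntax is short; the /SAVE subcommand is what creates the standardized variables, which SPSS names by putting a Z in front of each variable name. The names below assume the file uses LastSale and MarketCap.
DESCRIPTIVES VARIABLES=LastSale MarketCap
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX.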
03:25So now I can press OK and what I get is a small table. In the Frequencies
03:31command, the variables were listed as columns across the top and the statistics
03:35were listed as rows down the side, but with Descriptives it's flipped around.
03:39But what I have here is the number of cases that I have information on.
03:44So for last sale I have 2817 companies with information on that.
03:48The minimum value is $0.01, the maximum value is $1,132, the mean is
03:5818.7, and the standard deviation is 34.65 and then you have similar statistics for
04:04market capitalization.
04:06Now, an interesting trick is if we go back to the data set, and you see that we
04:10have two new columns here at the end,
04:13ZLastSale for the Z score or standardized value, where you can see that most of
04:19the scores are close to 0 or 1. We do have a major outlier at 9.99.
04:24That's nearly 10 standard deviations above the mean.
04:28That's Apple Computer, whose stock costs about 10 times as much as most others.
04:33And then we have Z market capitalization.
04:36That's, again, a Z score, how many standard deviations above or below, and then Apple
04:41is again 32 standard deviations above the mean on this particular one.
04:47Hopefully, from all of this you can see that the Descriptives command is a really
04:51useful way of getting a variety of univariate statistics for your data. Like the
04:56Frequencies command,
04:57it can give the mean, the standard deviation, minimum, maximum, and other statistics.
05:02It can give you the standardized scores, which the Frequencies command can't do.
05:06On the other hand, Frequencies can give the percentile statistics like the
05:10quartiles and the median.
05:12It can give the mode, it can give frequency tables and charts, and it can work
05:16with string variables and categorical variables.
05:19Now, for these reasons I generally prefer to use the Frequencies command, but
05:23either one will get you a very long way towards a sound understanding of your
05:27data and a solid foundation for further analysis.
Collapse this transcript
Using the Explore command
00:00SPSS has a number of really wonderful tools for helping you to get an in-depth
00:05understanding of your data.
00:07We've already looked at the Frequencies and Descriptives commands, which can give
00:11you nearly everything you need under normal circumstances.
00:15However, there are times when you need to look at things even more closely and
00:19this is where SPSS's Explore command comes in, with more ways to look at
00:24univariate statistics than you can shake a stick at, and let's look at some of
00:28those possibilities.
00:30To get to the Explore command you go up to the Analyze menu, to Descriptives, to Explore.
00:37What you have here is a list of all the variables, both categorical and scale on
00:42the side, and a number of options here.
00:45What we are going to do is take the variables that we want and put them in
00:48the Dependent list.
00:49Now the term Dependent here means dependent variable, or an outcome variable, or
00:55the variables that you want statistics on.
00:58In this case, I'll use the same ones that I used in the last movies.
01:02I'll use LastSale and I will use MarketCap.
01:08Now Factor List is in case I want to break down the analysis.
01:12For instance, if I wanted to do LastSale and MarketCap by different sectors.
01:17I could do that, but there are 12 different sectors and at the moment I
01:21don't feel a need for it.
01:23I can also label the cases, and this can be handy because this will give me some
01:27charts that show outliers, and in fact I'm going to do that by coming up and
01:32getting a stock symbol and putting that down there. Then I want to go through
01:37some of the options over here.
01:39I can choose what statistics Explore gives to me.
01:44I click on Statistics and by default it's going to give me the mean and a
01:48confidence interval for the mean.
01:51That's an indication of how precise our estimate of the mean is and, given our
01:56sample, what we think the true population value might be.
02:00We also have what are called M-estimators. That's a whole family of advanced,
02:04so-called robust estimators that work well when things are skewed or there are
02:09outliers, but it's rather advanced.
02:11We are not going to deal with that.
02:13I can also get information about outliers, which lists the most extreme cases individually. I could do that.
02:19I don't think we need to.
02:20I could also get percentiles, where for instance it gives me the values for the
02:245th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles.
02:30You can do it manually in the Frequencies command, but it's nice to have it as
02:33a one-click option.
02:34However, I usually don't need that, so I am going to skip it right here.
02:38I'm just going to click Continue. So I am leaving the statistics at the default.
02:41It has given me a ton.
02:43Next, I am going to look at Plots or the graphs.
02:47Now the first thing you can do is give me box plots, and we've done those
02:50separately in the univariate charts. And it's going to factor the levels
02:54together, which is fine, because I'm not splitting up the factors.
02:58It can also give me something called a stem-and-leaf plot, which is something
03:01that's normally drawn by hand, but I will show you that in a moment.
03:06I can get a histogram if I wanted.
03:07I've done those before, but I can get them additionally here.
03:11The next one is normality plots with tests.
03:13This is a series of plots that are designed to see how well your data fit a
03:19symmetrical normal distribution--
03:22that's a mathematical definition of a bell curve.
03:25Normality is the term for it, and that's important for a lot of statistics, but
03:30the normality plots can be a little tricky to read, and usually you can eyeball
03:35the data and see if they seem to be behaving well enough to work with
03:39a lot of other statistics.
03:40So I am going to skip both of those.
03:42I'll just click Continue, and let's take a quick look at options.
03:46Now this is one where it asks what to do with missing values in case I'm
03:50looking at more than one variable in my Dependent List, which I am.
03:53The question is whether I want to exclude cases listwise or pairwise.
03:58And this is something that comes up in a number of other procedures, and
04:01it's worth pointing out.
04:01When you exclude cases listwise, what that means is you only include the case if
04:07it has information on every variable that you're including.
04:12So let's say I had ten variables in the Dependent list.
04:15If a case was missing information on one of those, it would not be included.
04:19On the other hand, pairwise says include them whenever they have variables
04:25with some information.
04:27So it makes maximum use of the information, but you can end up with very
04:31different sample sizes, and there are procedures where it's very important to keep
04:36the sample sizes consistent going across.
04:38For Explore, that's a judgment call.
04:41You can do it either way.
04:42You can do it both if you want, one after the other.
04:44But I am just going to keep it listwise for now, the way it is.
04:47Click Continue and then down here it gives me the option to display just the
04:52statistics, just the plots, or both.
04:56I will leave it at both, which is the default. I click OK and I get a lot of output.
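The choices I made in these dialogs paste as an EXAMINE command, roughly like the sketch below; Symbol is assumed to be the name of the stock-symbol variable I used to label cases. The output is the same whether you run it from the dialog or from syntax.
EXAMINE VARIABLES=LastSale MarketCap
  /ID=Symbol
  /PLOT BOXPLOT STEMLEAF
  /COMPARE GROUPS
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE.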
05:02The first one tells me how many cases there are and whether they have valid
05:05data, how many are missing.
05:06There are 2,816 cases with valid data, and I have four cases that are
05:13missing information on LastSale and MarketCap.
05:15That's just 1/10th of 1%.
05:18Then I have a table called Descriptives.
05:20I scroll down and I have the mean.
05:22The mean for LastSale is $18.7, and I've seen these statistics elsewhere, but
05:28this one gives me a confidence interval for the mean, which is an inferential
05:31statistic, and we will see more about those in the next section.
05:34We also have something called a 5% Trimmed Mean.
05:37It trims away the highest and lowest few percentage points of the data and
05:41gives a slightly more stable estimate.
05:44We have the median and the indicators of spread, the variance and the
05:48standard deviation, and then we have several other statistics: the interquartile range and
05:52the skewness and kurtosis.
05:54So this is a lot of statistics that it gives all at once.
05:58You don't need all of them, but the nice thing is that they are available there.
06:03The second column, by the way, gives what are called standard error estimates
06:06for a few of the statistics, for the mean, the skewness, and the kurtosis.
06:11These are sometimes used as inferential statistics, but we don't need to worry
06:15about them right now.
06:16Then it repeats the table for the second variable, market capitalization.
06:21Then we have what are called the stem-and-leaf plots.
06:25These are ones that are usually drawn by hand, and what it does is it takes the
06:29values and splits them up into two-digit numbers, where the first digit is
06:34what's called the stem, and it forms the line here on the side.
06:38The second number is the leaf, and the neat thing about this is this can be
06:43read as a histogram.
06:44It's sort of a sideways histogram.
06:46But it also maintains the actual numerical values.
06:49So it's both a literal display of the data and a chart like a histogram, and then
06:55it marks some extreme cases separately at the bottom.
06:58Then here's a box plot.
06:59This is labeling the cases by their stock prices, and then we do a similar thing
07:04for market capitalization.
07:06So the biggest impression you get might be that the Explore procedure is
07:11good for producing enormous amounts of output.
07:13It can be overwhelming, but if you really want to get the best picture--
07:18meaning the most comprehensive, though not necessarily the most interpretable or
07:22useful picture--then the Explore command is the procedure of choice.
07:27It can give you stem-and-leaf plots.
07:29It can give you confidence intervals and trimmed means.
07:31It can give you robust estimators.
07:33It can give you normality plots, among other things, if you ask for them,
07:37all of which recommend its use in particular circumstances.
07:40On the other hand, the slightly simpler procedures of Frequencies and Descriptives
07:45can still give you nearly all of what you need without deluging you with output.
07:50Nevertheless, if there's one thing SPSS is good at, it's providing you
07:53with options, and the Explore command is one with especially rich options
07:58and analytical value.
Collapse this transcript
6. Inferential Statistics for One Variable
Calculating inferential statistics for a single proportion
00:00For many people, when they think of statistics, they think of inferential
00:04statistics, and not always fondly.
00:07Of course, there is much more to statistics and data analysis than the
00:10calculation of probability values, and this should be evident by the amount of
00:14time we spent so far on graphics and descriptive statistics.
00:17However, the ability to go beyond the data at hand and make inferences about a
00:22larger group of people--hence the name inferential statistics--is one of the great
00:26beauties of analysis.
00:28In this set of movies, I want to start with the simplest kinds of inferential
00:31statistics, those for one variable at a time.
00:34There are a few different procedures that we'll cover, such as confidence intervals
00:38and hypothesis tests, for scale variables and proportions, as well as the
00:42distribution of a single categorical variable.
00:45But let's start with what is probably the simplest and most familiar, the confidence
00:49interval and hypothesis test for a single proportion.
00:52For this example, I'm going to be using the GSS.sav data set. That stands for
00:57General Social Survey. And it has one variable on the end here that I think is interesting.
01:02If I scroll to the end, I have a variable here that's called ReadBook, and what
01:06it means is whether the person says that they've read a novel, a poem, or a play in the last year.
01:11We might be interested in the percentage of people who say that they have read one,
01:16whether that is significantly higher than, for example, 50%, and what the
01:21confidence interval for that might be, like the one you would get from a political poll
01:24where they say 73% of respondents, plus or minus 3%, are in favor of a
01:29particular candidate.
01:31To do this, I'm going to use one of SPSS's more interesting features. It's
01:35called nonparametric tests, and I get to it by going to the Analyze menu, down to
01:40Nonparametric Tests.
01:42It's called nonparametric because we're not using parameters like means and
01:45standard deviations.
01:47Then I come over to One Sample.
01:49And here it will do a lot of things automatically, but I'm going to be a little
01:53bit selective and customize it to actually make things simpler for right now.
01:57The first thing I'm going to do is I'm going to come here to Fields, and that
02:01really means variables.
02:02And right now it's putting in nearly every variable.
02:05It would test for equality of distribution on categorical variables, and it
02:10would also test for scale variables, whether they are normally distributed like a bell curve.
02:15I don't want to do all of that,
02:16so what I'm going to do is I'm going to take all of these variables,
02:19I'm going to put them back into the original field.
02:23The only test variable that I want is this one:
02:27Read Novel, Poem, or Play.
02:29So I'll double-click to move that over.
02:30Then I go to the Settings tab to choose exactly what test it is that I want to do.
02:34Now I'm going to do Customized tests here, and I'm going to choose Compare the
02:39observed binary probability--binary means two answers: yes or no--to the
02:44hypothesized value with what's called the binomial test.
02:47And click on Options, and what it's going to do is it's going to do a hypothesis
02:51test to see if the proportion of people who say they've read a novel, poem, or
02:55play in the last year is statistically significantly different from a
02:59hypothesized proportion, which right now I'll leave at 50%.
03:03I can also get what's called the confidence interval. That's like the plus or
03:06minus 3% in a political poll.
03:08Now sometimes you can use conventional statistics, but right here SPSS is doing
03:12a very nice thing and it's letting me use what's called an exact statistic.
03:17In this case, it's called the Clopper-Pearson for the confidence interval.
03:20We don't need to go into any details except to say this would be a good choice.
03:24So I'm just going to click on that and I'm going to come down and press OK, and
03:28then I'm going to press Run.
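The new Nonparametric Tests dialog pastes an NPTESTS command with its own option keywords, which I won't try to reproduce from memory. If you want a plain-syntax route, the legacy procedure gives the same kind of binomial hypothesis test, though not the Clopper-Pearson confidence interval; this assumes the variable is named ReadBook and is coded 0 and 1.
* Legacy one-sample binomial test against a hypothesized proportion of 0.50.
NPAR TESTS
  /BINOMIAL (0.50)=ReadBook
  /MISSING ANALYSIS.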
03:30Now the output for this looks a little different from what we've had so far,
03:33because it's a table with colors and shading in it.
03:37Also, it's not showing me everything right now.
03:39This is actually what's called a model viewer.
03:41Now right now, all it's telling me is that the proportion of people who say
03:45they've read a novel, poem, or play in the last year is significantly
03:48different from 50%.
03:50It's not telling me what the actual proportion was or how far away it is, but
03:54I can get that through going onto the Model Viewer.
03:57I'll double-click here and it brings up the Model Viewer.
04:00I'll maximize that window. And what I have here is the output that I saw on the other page.
04:05It tells me that the proportion of people who say they've read one of these is not 50%.
04:09It's significantly different from 50%.
04:12In fact, what I can do is I can come over here and the hypothesized, that 50%, is
04:16this blue bar right here.
04:18But what I really have is an observed 71% of the people say that they've read a
04:22novel, poem, or play in the last year.
04:25That's out of 349 people, and this tells me that that is significantly different from 50%.
04:31To get the confidence interval, I need to do one other thing.
04:34I come back over to this left pane and I go down to where it says View.
04:38Right now we're looking at the Hypothesis Summary.
04:41If I click on that, I can get the Confidence Interval Summary.
04:45It's a slightly different table here, and it tells me how it calculated the
04:49confidence interval by using the Clopper-Pearson.
04:51It tells me what the Parameter was, the probability that a person read a novel, a
04:55poem, or play in the last year.
04:57It tells me that the proportion of people who said yes, because they put ones
05:01instead of zeros, is 71%. That corresponds to what I have over here.
05:06The yes is the 71%.
05:08The confidence interval at the 95% confidence interval, which is the most
05:12common, is from 66% to 76%.
05:16And what this means is that while in my sample of 349 people 71% may have said
05:22they've read these, in the population of those 349 people came from, the true
05:26value could be somewhere between 66% and 76%.
05:30This is like the plus or minus 5% that you would get from a political poll.
05:35So the new Nonparametric Tests procedure in SPSS is actually a very flexible one
05:40that can perform an entire range of tests all on its own.
05:43It's also the easiest way to get confidence intervals and hypothesis tests for
05:48a single proportion.
05:49We'll come back to this procedure in another movie on testing nominal variables
05:53with multiple categories, but for now this should give you a good start on
05:57dealing with inferential statistics for dichotomous variables in SPSS.
06:01In the next movie, we'll look at common tests for scale variables.
Collapse this transcript
Calculating inferential statistics for a single mean
00:00SPSS makes it very easy for you to go beyond your sample data and make
00:05inferences about the population that those data came from,
00:08that is, you can calculate inferential statistics.
00:11In the last movie, we looked at how to work with proportions for a single
00:14dichotomous variable--
00:16that's a yes/no, 0/1 variable-- to get a hypothesis test and a
00:20confidence interval.
00:21In this movie, we will do the same procedure for a scale variable, something
00:25that could be measured in set units, like time to complete a project or bids from vendors.
00:29I am going to use the same data set for this one, the GSS, or General Social
00:33Survey.sav, data set, and this time I'll be looking at the one variable here that's
00:39called FamilyIncome that measures the total family income in dollars.
00:43Now I should point out that these are actually the midpoints for categories,
00:47which is why they seem to be very precise amounts, and you will see them repeated,
00:52like here's 115,841, and here's the same number again.
01:02Nevertheless, this is a scale variable, because the dollars move in set amounts.
01:02So I am going to be doing a hypothesis test and a confidence interval for the
01:06family income for the 349 people in this particular sample.
01:10Now there's two ways to do this, and both of them go in the Analyze menu.
01:15For the first one, I am going to come up to Analyze and I am going to go Compare
01:19Means and I am going to use what's called the One-Sample T-Test.
01:23And all I need to do here is I need to pick the variable that I want.
01:27In that case, it's FamilyIncome.
01:28So I just double-click on that and it moves it over.
01:32Let's look at some of the options.
01:34I can get a confidence interval, and I can change it from 95% to some other
01:38values, sometimes 90% or 80% is appropriate, but 95% is the most common.
01:44So I am going to leave it right there.
01:45So I'll click Continue.
01:47I'm going to ignore the bootstrap, because that's there because of an extra
01:51add-in that's installed in this version of SPSS that normally you have to pay for.
01:56Below the test variables box, I have another box that says test value, and this
02:01is the value that SPSS is going to compare the mean family income to, to find
02:06out if it's significantly different from it.
02:08Now I can guarantee you that the mean family income is not going to be 0,
02:12so I am going to pick another number to put there.
02:15Let's say, for instance, I want to compare it to $45,000 for family income.
02:20This is how I can do it to find out whether this average value is higher or
02:24lower than that significantly.
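The pasted syntax for this dialog is compact: the test value and the confidence level each get a subcommand. FamilyIncome is assumed to be the variable's actual name in GSS.sav.
* One-sample t test comparing the mean of FamilyIncome to a test value of 45000.
T-TEST
  /TESTVAL=45000
  /MISSING=ANALYSIS
  /VARIABLES=FamilyIncome
  /CRITERIA=CI(.95).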
02:25So now I click OK, and the first thing I get is a table of one-sample statistics.
02:30It tells me that I have 349 people, that the mean family income is $32,781 with a
02:37standard deviation of 29,000.
02:40The last one, the standard error, is used in calculating the hypothesis test and
02:45the confidence intervals.
02:46Below that I have what's called a One-Sample Test where SPSS is taking the average
02:52value, the mean of 32,781, and comparing it to a hypothesized value of $45,000.
03:00The first column has what's called the t statistic, and that's an inferential
03:03statistic, and it doesn't necessarily mean a lot on its own.
03:07The second one is the degrees of freedom, which has to do with the sample size.
03:10It's the third one in particular that we want to look at. It says Sig. (2-tailed).
03:15That's the significant value, or the probability value for the hypothesis test.
03:19And in this case that number is .000.
03:22Now it's not literally 0.
03:23It's just that it's less than .001,
03:26so it shows up truncated here.
03:28What this tells me is that the observed average value of $32,781 per year for a
03:35family is significantly different from my hypothesized value of 45,000.
03:40I was optimistic in my hypothesis.
03:43Now these last two columns have what's called confidence interval for the
03:46difference from the mean.
03:47You see that the mean difference that's in the third column from the end is -12,000.
03:51That's because the observed value is about $12,000 less than my
03:56hypothesized value.
03:58These last two columns give me the confidence interval for that difference.
04:02Now an interesting thing here is had the hypothesized value been 0, these would
04:06have been an actual confidence interval for the mean, but because I felt that
04:11having 0 would be a silly test value, I put something else in. The confidence
04:16interval is for the difference.
04:17Now if I want a regular confidence interval, a better way to get that, instead of
04:22from the T-Test, is to go back to a procedure we looked at in the last set of the
04:26videos, the Explore command.
04:28I just go back up to Analyze > Descriptive Statistics > Explore.
04:34I take the one variable that I want out of this list, which is FamilyIncome, and
04:38I put it into the Dependent List, that means outcome variables, or the ones we
04:41are trying to analyze. All I want here is a list of statistics.
04:45I am going to come down to Display and click on Statistics and press OK.
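In syntax, this statistics-only version of Explore looks roughly like the sketch below; the 95% confidence interval for the mean is part of the Descriptives table it produces.
EXAMINE VARIABLES=FamilyIncome
  /PLOT NONE
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE.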
04:50I'm going to get a big table here, but the only one I really want to look at is
04:54this one that says 95% confidence interval for the mean, with the lower bound
04:57and the upper bound.
04:58There's actually several ways of interpreting a confidence interval, but one
05:02sort of colloquial way is to say that the population value is between 29,692 and
05:0935,871, so between 30,000 and 36,000.
05:13There is about 95% chance that the true population mean is between those two values.
05:19Anyhow, SPSS makes it simple to perform two of the most basic and two of the
05:24most useful inferential statistics for a single scale variable:
05:28the One-Sample T-Test and the simple confidence interval.
05:31In the next movie, we will look at something slightly more complicated as we
05:34look at the distribution of cases across a nominal variable with several groups.
Collapse this transcript
Calculating inferential statistics for a single categorical variable
00:00In the last two movies we've looked at the most basic inferential statistics,
00:05the ones where we analyzed one variable at a time.
00:08We looked at the proportion for a nominal variable, with only two outcomes, that
00:12is, a dichotomous variable, and we looked at the mean for a scale variable.
00:16In both cases, we looked at both null hypothesis tests and confidence intervals.
00:21In this movie, we will expand things slightly by looking at how to do a
00:25hypothesis test for a nominal variable, or a categorical variable, that has more
00:30than two categories, something like occupation or a favorite sport.
00:34Although it's possible to do confidence intervals for the number of people in
00:37each category, it's a complicated procedure, and it's not particularly
00:41helpful for most purposes.
00:43Instead, we'll just do a hypothesis test that looks at whether people are evenly
00:47distributed across all the categories in the variable.
00:50The test statistic that we'll use is called the One Sample Chi-Square Test in SPSS.
00:56It's also known as the Goodness-of-Fit Test, and with SPSS's new automatic
01:01features, this is very easy to create and interpret.
01:05I am going to be using the same data set as before, GSS.sav from the General
01:09Social Survey, and I thought it might be interesting to look at the variable
01:13that is second from the last, about people feeling happy.
01:16Specifically the question is self-rated happiness.
01:19Well, we have three possible answers:
01:21Not Too Happy, Pretty Happy, and Very Happy. And we can use this test to see if
01:27people fall evenly into those three different categories.
01:30To do this, we'd go to the Analyze menu, and then down to Nonparametric Tests, and
01:35again to One Sample.
01:37This is the same one that we used for the single proportion.
01:40We're just going to be doing it a little bit differently this time.
01:44I need to go to the Fields tab, and then I have all of the variables that it can
01:49test in the Test Field thing. I don't want all of them there.
01:51It will be too much output.
01:53So what I am going to do is I am going to select all of these and put all of
01:57them back, and then I'll bring back over the only one that I want, which is near
02:01the bottom of the list, and it's Self-Rated Happiness.
02:04I can double-click on that to move it over.
02:07Then I can go with the default test. All I need to do now is press Run, and I
02:12get the same kind of table I got before.
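Again, the dialog pastes an NPTESTS command. If you prefer a plain-syntax alternative, the legacy one-sample chi-square gives the same goodness-of-fit test; the name Happy is just a placeholder for whatever the happiness variable is actually called in GSS.sav.
* Legacy chi-square goodness-of-fit test with equal expected frequencies in every category.
NPAR TESTS
  /CHISQUARE=Happy
  /EXPECTED=EQUAL
  /MISSING ANALYSIS.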
02:15It lets me know that the null hypothesis is that the categories of Self-Rated
02:19Happiness, would occur with equal probabilities.
02:22That is that we would have the same percentage of people who said that they were
02:25Not Too Happy and Pretty Happy and Very Happy.
02:29All I can tell from this one is that those three are not evenly distributed.
02:33But this is an interactive model viewer,
02:35so I double-click on it and I will maximize that window.
02:39And what I see is that the hypothesized values are the green bars over here, and
02:45you see all three of them are the same size.
02:47The blue is how many I actually have. The green is how many I would have
02:50expected if things were distributed evenly. And it tells me that I have an
02:54observed 43 people who said they were not too happy.
02:57That's this blue bar right here, that's the Observed.
03:00The hypothesized was 116.
03:03So the difference between the two, the residual, is 73.
03:06In fact, what you can see is that in the first set, the Not Too Happy, I have
03:11fewer people than I would expect if people were evenly distributed.
03:14On the other hand, I have a lot more people in the middle set, Pretty Happy,
03:18than I would expect.
03:20The Very Happy is actually right around one third of the group.
03:24Down below that, I have a table that gives me the total sample size, 349.
03:28The Test Statistic there is called the Chi-Square Test, and it's got a value of 88.06.
03:34It has what's called 2 Degrees of Freedom, and a Probability value, that's the
03:38Asymptotic Significance (2-sided test), of less than .001.
03:42Again, it's not exactly 0, but it's going to be a small number.
03:47Anyhow, this is the easiest possible hypothesis test for a categorical variable
03:53that has several categories in it.
03:56The One Sample Chi-Square Test, it's a quick and easy way to tell if your
03:59observations are distributed evenly across categories, or you can also specify
04:05some other expected distribution.
04:06It shows how important it can be to check whether the variation you see could be
04:10reasonably attributed to random, meaningless chance, or whether you might start
04:14to see something important that deserves further analysis.
Collapse this transcript
7. Charts for Two Variables
Creating clustered bar charts
00:00The last several sections of movies have dealt with methods for examining
00:04one variable at a time with graphs, descriptive statistics, and inferential procedures.
00:10These kinds of univariate analyses can be very interesting in their own right,
00:14such as the number of people to vote for a particular political candidate or
00:18the amount of money spent on chewing gum in the US each year, which I've heard
00:21once is $500 million per year. And they form a truly essential part of any further analysis.
00:28That is they are foundational essential background pieces of an analysis.
00:33So before you look at any combinations of variables you need to understand each
00:37variable on its own. But with that said, it's the associations between
00:43variables that are often of the most interest to people.
00:46For example, I am also told that people chew gum more often during times of social unrest.
00:51Now, you can make of that what you will, but it gets at the heart of the
00:54great majority of real world data analysis. How can you predict or explain one
01:00thing based on another?
01:01And as a first step to understanding associations, like we did with
01:06univariates, we're going to start where you should always start in an
01:09analysis: with a picture.
01:11One of the easiest kinds of charts for showing associations is the clustered bar chart,
01:15which is particularly well suited for showing the relationship between two
01:19categorical variables.
01:21For instance, nominal or ordinal variables.
01:24We covered simple bar charts earlier when we looked at univariate charts and
01:28they can be just as useful here.
01:30In fact, the only real difference is that we will now cluster variables by
01:35grouping them on the axis across the bottom.
01:38While the difference may seem small, it really opens up a lot of analytical
01:42possibilities in SPSS.
01:44Now, to demonstrate this, I am going to be using the data set Searches.sav, about
01:50Google searches, and how they vary from state to state.
01:53In this particular example I am going to look at two variables that are near
01:56the end on the right.
01:57What I am going to look at is whether a state has an outline for a high
02:02school statistics class and I am going to compare that to the region of the
02:06country that they are in.
02:07There are four regions.
02:08So that's a categorical variable with four categories and statistics education
02:13is a dichotomous yes/no.
02:15And I am going to look and see if the proportion of states with statistics
02:20curriculum varies from one region to another.
02:25Now, to do that, I am going to go up to Graphs, to the Chart Builder, and I am
02:30going to come down to Bar chart and choose clustered bar charts.
02:35I am going to drag that up to the canvas and then I need to take one variable
02:40and put it in the X-axis and the other variable to set the colors of the bars.
02:45What I am going to do is I am going to put the region on the X-axis, for no
02:48other reason than that I have four regions and I don't want to have four different colors
02:52in my chart, but also you're going to see how this allows me to make a yes/no
02:56comparison more easily between each group.
02:59What I am going to do is I am going to get the region variable, which is near the
03:02bottom of the dataset.
03:03That's this one right here, the Census Bureau Region.
03:06I am going to drag that down to X-axis and then for this one on the top-right
03:11that says Cluster on X: set color,
03:13I am going to take whether they have an outline for high school statistics.
03:17That's this variable right here.
03:19So I am going to drag that over to cluster, and I think that's all I really need right here.
03:25So I am going to come down and click OK.
03:28When we first get the output, we get a lot of text.
03:30This is the command that you could write to produce this chart.
03:34Beneath that is the chart itself.
03:36It's just blue and green bars, and what it has is a pair of bars for each Census
03:41Bureau Region, from the Northeast, the Midwest, and the South to the West, and
03:46the blue bar means that the state does not have an outline for high school
03:50statistics class, but a green bar means that it does.
03:53There are a couple of things that jump out immediately. First, is that in the
03:57Northeast not a single state has an outline for a high school statistics class.
04:02The Midwest has just one, and the West has just three, but the Southern
04:07region, there are more states that have outlines for high school statistics
04:11than there are without them.
04:13That's extraordinarily unusual.
04:15That's a very different pattern.
04:17Now there is one challenge with this particular chart and that is that there is
04:23not the same number of states in each region, and so it can make it a little
04:26difficult to compare from one to the other.
04:29Fortunately, the Bar Chart command lets us do something significant here.
04:34What I am charting right now on the side is the counts.
04:38That's the number of states that do or do not have an outline for a high
04:42school statistics class.
04:43I am going to change that though to be a percentage and here's how we're going to work.
04:48I am going to go back to Graphs, to the Chart Builder, and I am going to pick
04:54up where I left off, except right here it says Count on the side, and if I go
04:59over to the Element Properties window where it says Bar, right here under
05:04statistics it says Count.
05:05If I click on that, I actually have a huge number of options.
05:13I can specify a tremendous number of things.
05:13What I am going to do is I am going to click Percentage.
05:16Now the reason that has a question mark in parenthesis
05:19is because I need to set the parameters for the percentage.
05:22It's asking me a percentage of what?
05:25I click on that. I don't want the grand total.
05:28What I do want is each X-axis category, that is, each region.
05:33I want to know what percentage of the states in each region do or do not have
05:39a high school statistics curriculum.
05:41So I am going to click on that one and press Continue, then I come down to the
05:45bottom of the Elements window and press Apply, then back over to the main
05:48window and press OK.
05:50We get the text output and then I scroll down and I have another chart.
05:55And you can see this one looks slightly different and it's because it's
05:58adjusting it for the differences in the sizes of the regions.
06:01We still see that in the Northeast none of the states have an outline for a
06:06high school statistics class.
06:07That's why the blue bar, the No, goes all the way up to 100%.
06:11In the Midwest, only about 10% do; in the South, over 50% have a
06:17curriculum, and in the West, it's just over 20%, and that's another way of
06:23adjusting for differences to make it a little easier to interpret. You usually want
06:27to compensate for the differences in the sample sizes and look at the
06:31percentages or the rates in a particular area, and that's one of the beautiful
06:35things about SPSS, is how easy it makes that particular procedure.
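The Chart Builder pastes a fairly long GGRAPH block for this chart. If you just want a quick clustered bar chart of counts from syntax, the older GRAPH command is a simpler sketch; Region and StatsOutline are placeholder names for the two variables used here, and the percentage-within-region version is easiest to get through the Chart Builder as shown above.
GRAPH
  /BAR(GROUPED)=COUNT BY Region BY StatsOutline.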
06:39So the first kind of association chart that we've covered, the clustered bar chart,
06:44 is a small variation on a univariate bar chart, and it's a great way of
06:48showing the association between two categorical variables.
06:52This command makes a very clean, simple, and easy to interpret chart, which is
06:57the real goal of data visualization and statistical graphics.
07:01In the next movie, we will look at using scatter plots to show the associations
07:06between two scale variables.
Collapse this transcript
Creating scatterplots
00:00In the last movie we talked about how to chart the relationship between two
00:04categorical variables with clustered bar charts.
00:07On the other hand, if you have two scale variables, also called
00:10quantitative variables or measured variables, then your best choice is
00:13almost always a scatter plot.
00:15Scatter plots are familiar to most people.
00:18There's an x axis across the bottom and a y axis up this side, and each person
00:22or case gets a dot to show the combination of their two scores, like height and
00:27weight or high school and college GPA.
00:30In general you want to put your predictor variable on the bottom, on the x axis,
00:33and your outcome variable or the thing you're trying to predict on the y axis,
00:37and SPSS makes the whole process very simple.
00:41You can create a scatter plot with the Chart Builder in just a few steps.
00:45And for this example I'm going to be using the same Google searches
00:48information in Searches.sav.
00:50I am going to come up to Graphs, to Chart Builder and then in the Gallery I
00:55will choose Scatter, and just use a Simple Scatter plot.
00:59I will drag that up to the canvas.
01:01And then in this particular example, I'm going to take relative interest in SPSS
01:06as a search term and put it on the x axis, and then I am going
01:11to take one that may seem a little peculiar, but the search term, Totally Lost,
01:16and put that on the y axis.
01:18I'm also going to make it possible for me to identify points by clicking on
01:22the Point ID label.
01:24That brings up a box in the canvas.
01:26and I can come up here and I can take the state code and drag that in and that
01:31should be enough for right now.
01:33I'll click OK and here's my general scatter plot.
01:37And what you see is first off a lot of fuzz, because I have dots and I have the state labels.
01:42I am going to take care of those in just a second.
01:44But it's clear that there's a very strong linear uphill trend: places that
01:50show greater relative interest in SPSS as a search term in Google also, for
01:55reasons that may not be totally clear, show greater use of the search term
02:00Totally Lost.
02:03Now, I am going to clean up this chart in a few ways.
02:06I am going to try to go through it relatively quickly and give you an idea
02:09of what's possible.
02:10To edit the chart you need to double- click on it, and what I am going to do is
02:14I am going to turn off all of the state labels by going to Elements and Hide Data Labels.
02:19I will bring back just one or two of them for illustration later.
02:23There's a few things I want to show you how to clean up.
02:25For instance, you can change almost anything by clicking on it.
02:29I have selected the data points here and I can make them instead of black
02:33circles, I can make them red dots by clicking red for the Border and then red for the Fill.
02:40If I want to change the colors of lines, I can do that as well.
02:43I can also change the axis down here from 3 decimal places by clicking on Number
02:48Format and changing that to 0, clicking Apply, and doing the same thing over
02:54here, changing that to 0 and clicking Apply.
02:57Now what I am going to do is I am going to add a linear regression line.
03:00This is also the basis of an inferential procedure, linear regression, that
03:04we'll be coming to a little bit later, but right now it's a very simple thing to do.
03:08I just come up to the Button bar and click on this one that says Add a Fit Line
03:13at Total, and that's a regression line that goes all the way through.
03:17It also adds a little bit of information right here that I don't need right now,
03:20so I am going to select that and press Delete.
03:22And then I've got a very clear, strong, upward trend, higher relative
03:27interest in SPSS as a search term, also higher use of the word Totally Lost as a search term.
03:33The one last thing I'm going to do is I'm going to add an identifier to the
03:37point that's in the top right.
03:38We saw what it was earlier, but I am going to add an identifier for just it.
03:43By coming over to the left of this button bar, clicking on the little target,
03:47which is the Data Label Mode, I click on that, and then I come back over and
03:51click on that data point I want to identify, and we see there that it's
03:54Washington D.C., and that's probably enough for this particular chart.
03:58I want you to be aware that there are many other options.
04:01For instance, I can add vertical and horizontal reference lines.
04:06I can also change the kind of regression line I have through.
04:10For instance, this is called a linear regression line, but if you're interested
04:14in growth, like changes in stock prices over time, you might want to use a
04:18Quadratic or something called a Cubic.
04:20If you want to see if it's a straight line at all, you can fit what's called a
04:24smoother, in this case the Loess smoother, instead of the straight regression line,
04:28and I encourage you to try these alternatives, and it's actually possible to
04:33overlay one on top of the other. But for now I am going to leave this with a
04:37straight regression line, as it shows the linear pattern most clearly.
04:41So I am going to close that and close that.
04:44So the Scatter plot can give really good insight into the relationship
04:48between two scale variables, and the options that SPSS gives for lines through
04:52the data can help you explore how well your data matched the assumptions of
04:56standard linear regression.
04:58In the next movie we'll look at a special kind of scatter plot called the
05:02Time Series Plot or Time Plot, where the variable on the bottom is, not surprisingly, time.
Collapse this transcript
Creating time series
00:00In the last movie, we looked at how to create scatter plots for two quantitative
00:04variables or scale variables in SPSS.
00:08Now scatter plots are extremely useful for exploring new data, and they're
00:12also extremely flexible.
00:14One variation on this scatter plot though deserves special mention.
00:18The time series scatter plot or time plot.
00:21As you might guess, the major difference in this case is that the variable that
00:25goes across the bottom on the x-axis is some measure of time.
00:29Another difference is that time plots often have only one measurement for each
00:33time period whereas scatter plots can have, for example, lots of people who are
00:37all at the same point on the x-axis.
00:41Because time plots usually have only one observation at each point in time,
00:45you can also connect the points, which makes it more like a line chart.
00:49And here's how it works in SPSS.
00:52For this example, I'm going to be using the data set that's called NDAQ.sav.
00:57This is the price for shares in the NASDAQ Exchange itself,
01:02from 2002 through 2011.
01:05It only has two variables.
01:07It has the first market day of each month and it has the closing price on
01:12that day for each month.
01:14Let's go up to Graphs and then to Chart Builder and then down to Scatter and
01:21choose the Simple Scatter, the top left one, and drag it into the canvas.
01:25The Date will go on the bottom and the closing price for the NASDAQ stocks will
01:31go on the left, and that's all we need to do right here. I am just going
01:35to click OK, and what you see is a lot of dots.
01:39Now, you can see the pattern.
01:41It starts relatively low in '04 or '05, shoots way up high in '06 and '08, comes
01:48back down to earth in 2010, and then starts to go back up again.
01:53But there's a way to make this chart much clearer.
01:56We just need to edit it, and do a few different things.
01:59So to edit it, like every other chart first we double click on it to open
02:03up the editing window.
02:05And for this one, what we want to do is we want to click on the button in the menu bar here.
02:09It's called Add Interpolation Line.
02:13And what this does is it draws a line that connects every dot across the bottom.
02:19This is the standard line plot you would expect for time data. Now, if we stop right there,
02:24it's not bad.
02:25However, at this point the dots actually get in the way, and so what we can do,
02:30is we can click carefully on the dots.
02:33So they are all selected, and just hit Delete, and we are left with the line
02:37plot that shows the pattern more clearly than the dots themselves, of things
02:42starting slowly, skyrocketing and then coming back down at the end of the dotcom bubble.
02:47And that is a special case where the predictor variable is time and you can
02:52adapt the standard scatter plot to show how a variable changes, in which case
02:57it's now called the Time Series Scatter Plot or Time Plot.
03:00This is a good example of how SPSS helps you customize your charts to make
03:05them easier to read and more useful in interpretation.
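If you prefer to work from syntax rather than the Chart Builder, a time plot like this can be sketched with the legacy GRAPH command. The variable names below, date and close, are assumptions about what the two variables in NDAQ.sav are called, so treat this as a rough sketch rather than the exact pasted syntax.

* Sketch of a scatter of closing price against date (variable names assumed).
GRAPH
  /SCATTERPLOT(BIVAR)=date WITH close
  /MISSING=LISTWISE.
* The interpolation line is then added by hand in the Chart Editor, as described above.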
03:08Up to this point, we have looked at charts for the association of two
03:12categorical variables and two scale variables.
03:15In the next few movies, we will look at the combination of the two kinds:
03:18charts that show the association of one categorical variable and one scale variable.
Creating simple bar charts of group means
00:00In this section on charts for the associations between variables, we've looked
00:05at how we can depict the association between two categorical variables,
00:09for example, with clustered bar charts, and the association between two scale variables,
00:14for example, scatter plots.
00:16At this point, we'll move on to charts that show the association between two
00:21kinds of variables. That is, charts that look at one categorical variable and
00:25how it's connected with a scale variable.
00:27Whereas the other combinations of variables had clear preferences for the charts,
00:32there are actually several useful options for charting associations for
00:36categorical and scale variables in combination.
00:39The first of these is a simple variation on the bar chart, adapted to show the
00:43mean score for each group.
00:45In this example, I am going to use the GSS dataset and I'm going to show family
00:50income as a function of the highest level of education of the respondent.
00:55To do that, I first go up to Graphs and click on the Chart Builder.
00:59From there, I come down to Bar in the Gallery and I simply drag this simple
01:04bar into the canvas.
01:06On the X-axis, I am going to put my categorical predictor variable, which is the
01:10highest degree of education.
01:11That's called highest degree, and I drag that down to X-axis.
01:15Now on the left of that, on the Y-axis it says Count.
01:19However, if I come to the variable list and I get family income and I drag that over,
01:24it changes from Count to Mean.
01:27That's because it's a scale variable.
01:29Now if I wanted to, I could get other statistics.
01:32I could get the Median, the Group Median, the Mode, and truthfully, a very large
01:37range of statistics, but I am going to leave it with the Mean.
01:40I am going to do one small variation, however.
01:42I am going to ask it to put on what are called error bars,
01:44showing confidence intervals.
01:46These give some sort of indication of what the difference might be in the
01:49general population, as opposed to just a sample.
01:52Once I check that, then I need to come down and click Apply and then I come
01:56over to the box and I click OK.
01:59And here we see five bars that show different levels of education, from Did Not
02:03Finish High School, which has an average family income of about $20,000 a year
02:08in this particular data set, off through Bachelor's Degree and Graduate Degree,
02:13which have averages of about $50,000 a year in this particular data set.
02:17Now I do feel it's important to clean this chart up a little bit, so like the
02:21others what I'm going to do is I am going to double-click on it and I am
02:25going to make a few clarifications, because you want to reduce the amount of
02:29clutter in the chart.
02:30So the first thing I am going to do is click on this label that says
02:33Error Bars and just delete it.
02:36Then I am going to change the error bars, because I find the end to them
02:39distracting. I come up to Bar Options and change them to just Whiskers here
02:44under Boxplot and Error Bar Styles.
02:47Click OK. I am going to change the color of the bars. I find that
02:50an unattractive color.
02:52Maybe I will make it a light green and then I might want to make the text here
02:59a little bit larger.
03:00Now it does something interesting when I do that. There we go.
03:04It just changes the space a little bit and I find this to be a much clearer
03:08diagram of the relationship between the two.
03:11So I am going to close this now.
03:12I'll close there and then I'll come up to the editing window and click the red X
03:16and there you have it.
03:17A bar chart that shows the association between income and level of education.
03:23So bar charts are a great way to show the association between categorical
03:27variables and scale variables in general.
03:29They are very clean and very easy to interpret.
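If you would rather reproduce a chart like this from syntax than rebuild it in the Chart Builder each time, the legacy GRAPH command offers a compact sketch. The names income and degree below are placeholders for the actual GSS variable names, so adjust them before running.

* Sketch of a bar chart of mean family income by highest degree (variable names assumed).
GRAPH
  /BAR(SIMPLE)=MEAN(income) BY degree.
* Error bars, colors, and text sizes are then easier to adjust in the Chart Builder or the Chart Editor.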
03:32As a note, one of the nice things about SPSS is that it keeps things clean.
03:36So while it's possible to edit the bars and give them shadows or a false third
03:40dimension, those options are hidden, which is good, because they are
03:44almost always bad ideas.
03:46Those sorts of effects are often called chart junk and most spreadsheets
03:50and presentation packages make it way too easy to engage in these
03:53unfortunate practices.
03:55SPSS on the other hand keeps things simple, keeps them clean, and keeps them easy
03:59to interpret, which is the entire purpose of data graphics.
04:02Anyhow, with that in mind, we'll move from bar charts to a fancier kind of
04:07display for the association between a dichotomous variable, that is one
04:11which has two categories and a scale variable, using something called a
04:15population pyramid.
Creating population pyramids
00:00In the last movie we looked at how you can create bar charts to show the mean or
00:04maybe the median, for each group on a categorical variable.
00:08However sometimes, it can be more helpful to see not just a single summary
00:12statistic, but the entire distribution of scores for each group.
00:16One way to do this, provided your categorical variable is a dichotomy, that is it
00:20has just two values, is a variation on the histogram or bell curve that we
00:24looked at back in the section on univariate charts.
00:28In this case what we are going to create is a pair of back-to-back histograms,
00:32what SPSS calls a population pyramid.
00:35For this example, I'm going to be using the Searches.sav data file, and I am
00:40going to be looking at relative interest in NBA, as a search term, and compare
00:46that with whether a state has an NBA team or not.
00:49Now I am going to do this by going up to Graphs, to Chart Builder, and from
00:54there, I come down to Histogram, because the pyramid plot is a variation on the Histogram.
01:00This one on the far right, Population Pyramid, I drag that up to the canvas,
01:05and then what I'm going to do is I am going to come on this variable list and
01:09scroll down until I find the results for NBA as a Google search term, and I
01:15take that over to the distribution variable. We are trying to find out how common that is.
01:19Then I am going to split it by whether the state has an NBA team.
01:24That's this variable right here and I take that up to the split variable, and
01:28from there I can just press OK.
01:31And what we find in this one is that the states that have an NBA team, the
01:36ones on the right side in the green, tend to have the higher scores on the
01:41relative interest in NBA as a search term in Google, as opposed to the states
01:45that don't have NBA teams.
01:47For instance, on the right we see that there are two states that have relative
01:52interest in NBA, right around three standard deviations above the mean.
01:56On the other hand we see of the states that don't have NBA teams, a lot of
02:01them are below zero, around negative one.
02:04And so this is a way of looking at things in back-to-back histograms and making
02:08the differences between the two sets really obvious.
02:11Now if you want to, you can double-click on this chart and you can change the
02:16colors on each side. You can change the bins.
02:18You can change the number of decimal places on the side, the same way that we've
02:23edited nearly everything else.
02:25But this one is probably clear enough as it is.
02:28So a population pyramid, that is, a back-to-back histogram, can be a useful way
02:34to compare the distribution of a scale variable across two different groups.
02:39Like a regular Univariate Histogram, it lets you examine the shape of the
02:42distribution, lets you check visually for outliers, and lets you identify any
02:46possible quirks in the data that might throw off later analyses.
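If you only need to compare the two distributions and not the literal back-to-back layout, one rough alternative is to split the file by the grouping variable and request an ordinary histogram for each group. The names nba and nbateam below are assumed stand-ins for the Searches.sav variables.

* Approximate the comparison with one histogram per group (variable names assumed).
SORT CASES BY nbateam.
SPLIT FILE LAYERED BY nbateam.
GRAPH
  /HISTOGRAM=nba.
SPLIT FILE OFF.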
02:50In the next movie, we will look at one final display for showing the
02:53association between a categorical variable and a scale variable, what's
02:58called grouped boxplots.
Creating simple boxplots for groups
00:00In this movie, on graphing the association between two variables, we will
00:04look at what SPSS calls simple boxplots, which is a series of boxplots for
00:09a single scale variable, broken down by the groups in a single categorical variable.
00:15One of the main benefits of this particular chart is that it allows you to check
00:19for outliers separately for each group.
00:22This is important because a variable may not have any outliers, when all of
00:26the cases are considered together, but can have an outlier when groups are separated.
00:31For example, enough people in the sample might be 6'4" tall that such a height might not
00:36be considered an outlier overall, but it almost certainly would be an
00:39outlier if you looked at the heights of men and women separately.
00:42So, here's how to break boxplots down by various categories.
00:47For this example, I am going to be using the Searches database again from
00:51Google, Searches.sav, except in this case I am going to be looking at the
00:55relative interest in this one variable, Modern Dance as a search
01:00term, and break it down by region.
01:02To do this, I am going to go up to Graphs, to Chart Builder, and I am going to
01:07come down to Boxplot, and I am going to take this first one which is called
01:11the Simple Boxplot and drag it up to this canvas, and from there I'm going to
01:16get the Region variable, that's this one, Census Bureau region, and drag that down to the X axis.
01:22Then I'm going to get the variable that shows the relative interest in Modern
01:27Dance as a search term. From there I'm going to add group and point IDs. This is
01:32helpful when you're labeling outliers, which often show up in boxplots.
01:37So I'm going to come down and click on Point ID label, and then I am going to
01:42get the State Code from the variable list, and drag that over, and that's all I
01:47need for right now. So I am going to come down and press OK.
01:51And what you find rather surprisingly is that Utah is an extraordinarily
01:57high outlier on the far right, being four-and-a-half standard deviations above
02:02the national average in relative interest in Modern Dance as
02:07a Google search term.
02:08You might associate Modern Dance with a city like New York and the Northeast,
02:14and you do see that New York is an outlier on the left side, but still it's at
02:18only about a value of one standard deviation above the mean.
02:22And you can see that there are others at a much lower interest, and the Midwest
02:25is generally below 0, that is, negative.
02:29And so, this is a good way of looking at the relative differences in
02:33distributions especially in outliers of one group across another.
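For reference, grouped boxplots with labeled outliers can also be produced from syntax with the EXAMINE command. The names moderndance, region, and statecode are assumptions about how the variables are named in Searches.sav, so this is only a sketch.

* Sketch of boxplots of Modern Dance interest by region, labeling points by state code (names assumed).
EXAMINE VARIABLES=moderndance BY region
  /ID=statecode
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL.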
02:38The Simple Boxplot is a great way to compare the distributions of a single
02:42scale variable, for the different groups in the categorical variable, and again
02:47because it's especially important to identify outliers, since they can wreak
02:52havoc with the statistical procedures,
02:54it's an important consideration before going on to further analysis, like the
02:58inferential statistics for associations that we will cover in the next several movies.
Creating side-by-side boxplots
00:00In the last movie on graphing, we looked at how SPSS could create boxplots for
00:05a single scale variable broken down by the groups in a single categorical variable.
00:10Another variation on boxplots that can be handy is to show boxplots for several
00:15different variables side-by-side, and while this isn't technically a chart of
00:20the association between variables,
00:21it's a very useful chart that addresses multiple variables.
00:25These side-by-side boxplots work well as a shortcut method for checking outliers
00:31on several variables at once.
00:32They are a great presentation graphic for showing the distribution of several
00:37variables and that way they could be considered a much more compact alternative
00:42to showing multiple histograms.
00:44The only real catch is that your variables need to be on the same scale, for
00:48instance they could all be opinion questions on a 1 to 5 strongly disagree to
00:53strongly agree scale, or they could all be dollar values in thousands of dollars.
00:58The other trick is that this feature was not included in SPSS's otherwise
01:02remarkable and comprehensive Chart Builder. Instead we will need to use what
01:06SPSS calls a legacy dialog and here is how it works.
01:11For this example I am going to be using the Google Searches data because
01:14I have multiple interesting variables on the same scale.
01:17I am going to go to Graphs, down to Legacy Dialogs, and from there I go down near
01:25the bottom to Boxplots.
01:27Now I have a choice here of Simple which means without breaking things down by
01:32group or Clustered where I am breaking things down by groups.
01:35In this particular case I want to choose this option that says Summaries of
01:39separate variables, and I click Define.
01:42All I need to do is pick the variables that I want to put in.
01:45Just to show what you are able to do, I am going to take all of the Google
01:48Search terms from SPSS down through FIFA and put them into Boxes Represent.
01:56Also, because when you are looking for outliers you often want to know who they are,
02:00I am going to take the State Code variable, right here, and put that in here to
02:05Label Cases by, and that's all I need to do.
02:09Now, I click OK and what we get is the syntax pasted at the top and then we have
02:15what's called a Case Processing Summary.
02:17It's simply SPSS telling me how many cases it used, that we had valid data on all
02:2351 cases, which is convenient.
02:25And then below that is the actual chart.
02:27Now this is a very busy chart and I am going to show you there is a couple of
02:31ways that we can clean this up and make it even easier to deal with.
02:34I am going to double-click on it and the first thing I am going to do is I am
02:38going to transpose the chart and turn it sideways by going to the upper-right
02:43and clicking on this button that says Transpose chart coordinate system.
02:47From there, I can change various elements of the chart.
02:50I am going to change the colors by double-clicking on those and I will just
02:55change them to something else.
02:58Also, I am going to change the markers for the outliers and I will make them a
03:03little smaller and I will put them in the same fill and apply those.
03:09I will do the same thing for Utah over here, except that's nearly invisible now.
03:18I will use a darker one. There we go!
03:24Okay, then I'll make the text over here slightly larger and what I can see from
03:32here is that each of these variables was designed by Google to be centered
03:38around 0 because that's the national average.
03:41What it's showing us is states that are above or below the national average.
03:45We see for instance that Washington D.C. is an outlier on several of them, for
03:51Totally Lost, for Data Visualization, and for Statistically Significant as well
03:55as Regression and SPSS.
03:59We can see that there is only one low outlier anywhere, and that's Arkansas on American Idol.
04:05Finally, the furthest outlier we have on anything is on Modern Dance and it's
04:12Utah, which is over 5 standard deviations above the national average which is
04:16pretty extraordinary.
04:17Anyhow, you can see that a side-by-side boxplot gives a quick and a compact way
04:23to look at the distributions of several scale variables at once.
04:27You can check for outliers. You can also use them as presentation graphics.
04:31It's a handy alternative to multiple histograms and you should always consider
04:35the side-by-side boxplots when you have several scale variables that you want
04:39to analyze together.
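As a rough sketch, the syntax that this legacy dialog pastes is built on the EXAMINE command with a compare-variables option. The variable list below is abbreviated and the names are placeholders for the actual search-term variables, so adjust them to match the file.

* Sketch of side-by-side boxplots for several variables on the same scale (names assumed).
EXAMINE VARIABLES=spss regression datavis americanidol fifa
  /COMPARE=VARIABLES
  /ID=statecode
  /PLOT=BOXPLOT
  /STATISTICS=NONE
  /NOTOTAL
  /MISSING=LISTWISE.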
8. Descriptive and Inferential Statistics for Two Variables
Calculating correlations
00:00Whenever you explore your data you'll find that each step can build on
00:04the others before it.
00:06In this course for example we started by looking at individual variables
00:10before looking at pairs of variables and that comes before looking at sets of variables.
00:15When we looked at individual variables we started by creating graphic
00:19displays for each variable.
00:21Then by computing descriptive statistics for each and finished with
00:24inferential statistics.
00:25There is a logical progression to this and it's one that we will follow here
00:30with the associations for pairs of variables and later for sets of variables.
00:35The first procedure that we are going to look at, correlations, is the most
00:39general measure of association between pairs of variables.
00:42Let's look at how to do correlations in SPSS and how to interpret the results.
00:46For this example, I'm going to be using the same dataset I've used in the last few.
00:51It's about the Google Searches, Searches.sav, and to get correlations we need to
00:57go up to Analyze and then we come down to Correlate, and what we are going to be
01:02doing is the basic version called Bivariate or two variable correlations.
01:07All you need to do here is take all the variables that you want to correlate
01:10with each other and put them in the variable list on the right.
01:15Now if there is one variable in particular that can serve as an outcome
01:18variable, it's helpful to put that one in first so it shows up at the very top of the list.
01:24In this particular example I thought it might be interesting to look at the
01:27relative interest in searching for Facebook.
01:30So I am going to put that in first, and then I'll see how that compares with
01:34other search terms by selecting all of these, and I might as well put in
01:38nearly everything here.
01:41I am going to come down to Median Age, because all of these are either scale or dichotomous.
01:48Now I am not going to put in Census Bureau Region because that has four
01:52categories and Census Bureau Division because it has even more.
01:56However, you can use indicator variables and what I've done is I've created
02:00three indicator variables.
02:02One for whether a state is in the Northeast, another for the Midwest, and a
02:06third for the South, and what that does is it leaves implied in all of these is the West.
02:12So I am going to add the three of those and put them over here.
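If you need to build 01 indicator variables like these yourself, a minimal sketch uses COMPUTE with a logical expression. The coding assumed here, 1 for Northeast, 2 for Midwest, and 3 for South, is only an illustration, so check the value labels on Census Bureau Region first.

* Create 01 indicators from a multi-category region variable (coding and names assumed).
COMPUTE northeast = (region = 1).
COMPUTE midwest = (region = 2).
COMPUTE south = (region = 3).
EXECUTE.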
02:17Now I have a few options with correlation.
02:19I can get three different kinds of correlations.
02:22There is the Pearson Product-Moment Correlation coefficient which is the
02:25standard correlation, also sometimes known by its symbol R. There's Kendall's
02:30Tau-b and there is the Spearman rank order correlation coefficient.
02:34Truthfully, I've never had to use anything other than the Pearson, and I
02:38recommend that you stick with that one.
02:39There's also Test of Significance.
02:43You can do what's called a one-tailed test or a two-tailed test.
02:47Now this has to do with calculating false positive rates and I recommend that
02:52you always stay with a two-tailed test unless you have some super-compelling
02:56reason to go with the one-tailed.
02:59Also, we have the option of flagging statistically significant correlations.
03:02That's very helpful and I'd leave that on there, and let's come over here and take
03:06a quick look at the other options.
03:09You can also get means and standard deviations for each variable, but we don't
03:13need that at this point, because we should have done that already.
03:16You can get what are called cross-product deviations and covariances and that's
03:19a little technical and we don't need that.
03:22The other question is whether you want to exclude cases pairwise or listwise.
03:26I've mentioned these before.
03:28Pairwise means that you might have a different sample size for each set of
03:32correlations. If for instance everybody has data on two particular variables, but
03:38you're missing a lot of information on another variable, you would end up with
03:41different sample sizes.
03:43This isn't necessarily a problem and I usually leave it at pairwise.
03:46However, there may be times when you only want to deal with cases with complete
03:51information, in which case you would choose listwise.
03:53But I am going to leave it at the default for right now.
03:55So I'll press Continue and I'll press OK.
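For the record, the equivalent syntax for this dialog is short. The variable list below is abbreviated and the names are assumed, so treat it as a sketch of the command rather than the full list used here.

* Sketch of a Pearson correlation matrix with pairwise deletion (names assumed, list abbreviated).
CORRELATIONS
  /VARIABLES=facebook spss regression americanidol medianage
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.
* Somewhat counterintuitively, the NOSIG keyword corresponds to the flag-significant-correlations option in the dialog.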
03:58Now I asked for a lot of variables and so what I get here is a very large table.
04:03You can see that it goes down a long way and it goes across a long way.
04:08You can also tell that the labels aren't there and when we scroll down it's hard to see.
04:12But that's okay, and what you see here is that every variable is listed down this side.
04:19We have Facebook to SPSS to Regression as Google Searches, and we have the same
04:23variables listed across the top: Facebook, SPSS, Regression, and so on.
04:28Then what you have is a cell that gives information about the
04:31association between each one.
04:33In each cell the top number is the Pearson correlation.
04:37That's the actual correlation coefficient.
04:39Its absolute value goes from 0 to 1, where 0 means no linear relationship and 1 indicates a perfect
04:46linear relationship.
04:47It can be positive or negative.
04:50The positive or negative has nothing to do with the strength of the relationship.
04:53It only indicates whether it's an uphill or downhill relationship.
04:57The second number is labeled Sig. (2-tailed).
05:00This is the probability value that's associated with the significance test for
05:04the correlation, and the third one is the N or the number of cases that go into
05:10calculating that particular correlation.
05:13This dataset has complete data for all 51 cases.
05:16That's the 50 states and Washington, D.C.
05:19Additionally, you see that down the diagonal we have a series of 1s and blanks and 51s.
05:26That's because it's each variable correlated with itself which will always be a
05:30perfect positive correlation, and truthfully some programs just don't put
05:34anything there at all.
05:35But let's say I'm interested in the relative interest in each state in
05:41searching for Facebook.
05:43Then what I want to do is I want to go down this first column.
05:46It says Facebook at the top and I want to scroll down and I want to look for
05:49statistically significant correlations.
05:52Now SPSS makes this easy, because they will put asterisks next to
05:56statistically significant correlations.
05:58So you see for instance the top is Facebook correlated with itself.
06:02That doesn't really mean anything.
06:03Facebook and SPSS have a correlation of -.184.
06:08It's not a very strong correlation.
06:10It's closer to 0 than it is to + or -1 and you can tell that its
06:14probability value is .196.
06:15It's nowhere close to statistically significant.
06:19However, we do see that in the next few we have statistically significant
06:24negative correlations.
06:25The higher a state's interest in Facebook the lower its interest in searching on
06:31Google for regression or statistically significant or business intelligence.
06:35We can scroll down and see some more.
06:37Similarly, they show lower interest in data visualization, and they're also less likely to use
06:42the term totally lost.
06:44On the other hand, states that show a relatively high interest in Facebook also
06:48show a relatively high interest in searching for American Idol.
06:52That's the correlation of .516, and that probability value of .000 is not
06:58actually a 0; it means that it rounds off to less than .001.
07:03As we scroll down we see that Modern Dance correlates with it, and so does NBA.
07:07Interestingly, NFL does not correlate, but the NBA and FIFA do.
07:13Also, as we scroll down we can see that states that have an NFL team show a
07:18lower interest in Facebook, and similarly for NBA and MLS teams.
07:22It's just a whole series of correlations that show things that can be used to
07:26predict the level of interest in a particular item.
07:30Now the most important thing probably to remember here is that correlations are
07:35simply associations.
07:36They don't explain why the variables are associated.
07:39It's simply a predictor.
07:41The matter of explaining why they are correlated is a whole different issue
07:45about causation and something that we need to be careful about.
07:49So in summary, correlations are a great way to look at the strength of associations
07:53between two variables.
07:55Correlations are general purpose: they can be used with scale variables,
07:58ordinal variables or dichotomous variables, and they can give a good way to
08:02compare associations across a number of procedures.
08:05For that reason it's a good idea to always include correlations in your analyses.
08:10However, there are also some more specialized procedures that are helpful to use
08:14and we will turn to those next.
Computing a bivariate regression
00:00In the last movie, we used correlations to look at the strength of association
00:05between two variables.
00:06However, correlations are standardized measures.
00:10That is, they don't involve a unit of measurement.
00:12It's not a correlation of 0.78 meters or anything.
00:16It's just a correlation of 0.78.
00:19And while that can be really handy, because it makes it easier to compare
00:22associations across different kinds of variables, it can also be really nice
00:26to put the association back into the original metric.
00:30To do that we'll look at another procedure that's very closely related to
00:33correlation and that has many of its advantages, but that also uses the original
00:38units of measurement.
00:39That is bivariate linear regression.
00:42As a note SPSS has a wonderful new procedure called Automatic Linear Modeling
00:47that also performs linear regression which we'll cover a little bit later.
00:51For now though, it makes more sense to stick to the standard linear regression,
00:54because we're only using one predictor variable and automatic linear modeling
00:58seems a little like overkill for that.
01:01And second, automatic linear modeling does an awful lot of work behind the
01:05curtains and it's kind of nice to keep things visible for right now.
01:08With that in mind, here's how to do a bivariate linear regression in SPSS.
01:14For this example, we'll be using the Google Search data again, Searches.sav,
01:17where we will be using the percentage of people in a state with bachelor's
01:22degrees or higher as a way of predicting the relative level of interest in
01:27Facebook as a Google Search topic.
01:30To do this we go first to Analyze and then we come down to Regression and we go
01:36to the second one down, Linear.
01:39We need to take our outcome variable, that is the thing we're trying to predict,
01:43and put it in the Dependent box.
01:45This means dependent variable or the variable that depends on other variables.
01:49In this case, that's going to be Facebook, that is Facebook as a relative
01:55interest in Google searches.
01:57Independent is the variables that we're going to use as predictors, in this
02:01particular case I'm going to be using the Percent of Population with a
02:05bachelor's degree or higher.
02:08Now the linear regression command is actually tremendously sophisticated and
02:12gives tons of options.
02:14None of which I'm going to use at this particular moment. I'm doing the simplest
02:18possible version here of simply using the Percent of Population with a bachelor's
02:24degree or higher to predict Facebook interest on Google Searches.
02:27And I'm going to do nothing else at this moment. All I'm going to do now is press OK.
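If you want to keep a record of this model in a syntax file, a minimal sketch looks like the following, where facebook and bachelors are assumed names for the outcome and the education variable.

* Sketch of a bivariate linear regression predicting Facebook interest from education (names assumed).
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT facebook
  /METHOD=ENTER bachelors.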
02:33And I get a table that tells me that the percent of population with a bachelor's
02:37degree or higher was entered as the predictor, with Facebook interest as the dependent variable.
02:42The next table down gives me an indication of the association. We have a
02:47correlation here of 0.644.
02:49That's the R. Note the capital R here, because that actually stands for multiple
02:53correlation which means you can use several variables to correlate with a single outcome.
02:58Although in this case we only have two variables so it's still bivariate.
03:01And then you have another one here that's called R Square, and that
03:050.415 is the square of the number next to it, the 0.644.
03:10And the reason you do this is because you can't really compare correlation
03:15coefficients. They are not linear.
03:17A correlation of 0.4 is not twice as strong as a correlation of 0.2, even though
03:22the number is twice as big.
03:24Instead, if you square them then you get numbers that are directly comparable
03:29and a correlation of 0.4 squared becomes 0.16 and a correlation of 0.2
03:33squared becomes 0.04.
03:36And so the other correlation is actually four times as strong.
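To make that arithmetic concrete, here is the comparison written out, using the values quoted above:

$$R = 0.644 \;\Rightarrow\; R^2 = 0.644^2 \approx 0.415$$
$$r_1 = 0.4 \;\Rightarrow\; r_1^2 = 0.16, \qquad r_2 = 0.2 \;\Rightarrow\; r_2^2 = 0.04, \qquad \tfrac{0.16}{0.04} = 4$$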
03:39You also have something called Adjusted R Squared.
03:41Sometimes people report R Squared, sometimes they report Adjusted R Squared.
03:45An Adjusted R Squared changes the number according to the ratio of
03:50observations to predictors.
03:52We also have the Standard Error of the Estimate that goes into the
03:55probability values.
03:58And the next table is the ANOVA table. That's short for analysis of
04:02variance and it's an indication of the statistical significance of the model as a whole.
04:07If we had more than one predictor then this would be an important thing, but
04:11because we have only one predictor and we know it's statistically significant it
04:14doesn't really tell us anything extra right now.
04:17The next one down from that is coefficients, and what we see here is the slope
04:23and the intercept that we are familiar with from charting relationships.
04:28The Unstandardized Coefficients are the slope and the intercept in original units.
04:33And so what we see is if we're trying to predict the level of interest in
04:37Facebook on a state-by-state basis we have an intercept here of 3.240.
04:44That says give everybody an interest of about three standard deviations above the
04:48mean, but then for every percentage point of the population that has a bachelor's
04:53degree or higher, subtract about a tenth of a point from that. That's the -0.119.
05:00And that means it's a downhill.
05:02The higher the level of education, the lower the interest in Facebook as a
05:06Google search term.
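To see what those coefficients mean in practice, you could compute a predicted value for each state. The names predicted_fb and bachelors here are hypothetical, and the coefficients are simply the ones read off the table above.

* Sketch: predicted Facebook interest from the fitted intercept and slope (names assumed).
COMPUTE predicted_fb = 3.240 - 0.119 * bachelors.
EXECUTE.

For example, a state where 30% of the population holds a bachelor's degree would be predicted at roughly 3.240 - 0.119 * 30, which is about -0.33, or a third of a standard deviation below the national average.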
05:07This will become clearer if I quickly make a scatterplot of the association
05:11between the two variables.
05:12I've already shown how to make a scatterplot, so I'm going to go through this
05:15a little bit quickly.
05:16I come to Graphs to Chart Builder to Scatter, where I'm going to put level
05:24of education here in the X, and I'm going to put Facebook here in the Y and
05:30I'll just click OK. And it's clear.
05:34It's a very strong negative association.
05:37The higher the percentage of the population with a bachelor's degree, the lower
05:41the relative interest in Facebook as a search term.
05:45So the similarities between bivariate correlation and bivariate regression, which
05:50we just did, are pretty easy to see in this example.
05:53They both give the same standardized effects and the same P values.
05:57The difference is that the regression model also gives the intercept and slope
06:01for the model which is a nice piece of information.
06:04Also in a later section we'll see how this procedure can be very easily
06:09adapted to having several predictor variables, in which case it's called
06:12Multiple Regression.
06:14And while it's possible to use categorical predictors in linear regression,
06:18the basic approach doesn't work well when the outcome variable is categorical.
06:22Instead, it's more common to use cross tabulations, which we'll turn to next.
Creating crosstabs for categorical variables
00:00In the last two movies we looked at ways to assess the relationships between two variables.
00:05We looked at correlations, which work for pretty much any kind of variable, and we
00:10looked at bivariate linear regression, a closely related procedure, but one that
00:14doesn't work with categorical outcome variables.
00:16If you do have a categorical outcome variable and a categorical predictor, you
00:21can still use correlations as long as those variables are coded as 01
00:25indicator variables.
00:27But it's more common to use what's called a crosstabulation or crosstab for short.
00:31This is simply a table with rows and columns that crosses, hence the name
00:36crosstabulation, the combinations of categories in the two variables.
00:41Each box or cell in the table simply indicates how many people have that
00:44particular combination of the two categories.
00:48To do this example, I'm going to use the GSS dataset and I'm going to show the
00:53relationship between marital status in this particular dataset and overall
00:58levels of happiness.
01:00To do this, I first come up to Analyze, to Descriptive Statistics.
01:04Now this one right here, Tables, refers to Custom Tables, which is a separate
01:08add-in that you pay for in SPSS.
01:10But the one that comes standard in everything is right here under Descriptive
01:14Statistics, to Crosstabs.
01:16That's the one I'm going to use in this example.
01:18All I need to do is specify the variables that I want to depict the rows and the columns.
01:24In this particular example, I'm going to use Married to separate the rows, so
01:31those will be the ones going across.
01:33The columns, which I'll use for my outcome variable, is going to be the indicator
01:37of happiness, and that is near the bottom of the dataset.
01:40It's this one called Self-rated Happiness.
01:44I'm going to drag that up to the columns.
01:48Now if I do this, it will simply give me the number of people who fall into each category.
01:52There are generally a couple of things I want to add.
01:56The first one is under Statistics.
01:59I want to add a measure of association for this with something called a Chi-square.
02:04I click on that.
02:06That's a statistic that tests whether the distribution changes
02:09across the categories of the two variables. Press Continue.
02:13The next one is what numbers I actually want to have in the cells.
02:16Now sometimes the two groups, like for instance Married and Not Married, can be
02:21very different sizes in which case it's hard to compare the raw frequencies.
02:25Instead what I might want to do is break down the percentages so I know what
02:29percentage of people who say they're married, say they're not too happy, or
02:34pretty happy or very happy.
02:36And the easiest way to do that is with what's called a Row Percentage, because I
02:40want to get the percentage of people going across who fall into each column.
02:45Now if I have my data organized differently, I might want column percentages,
02:48where I look at the percentage of people in each column who fall into particular rows.
02:53Either way. In this one I just want to use a row percentage.
02:56So I'm going to press Continue now and then I'll just press OK.
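For reference, the equivalent syntax is compact; married and happy below are assumed names for the marital status and self-rated happiness variables.

* Sketch of a crosstab with row percentages and a chi-square test (names assumed).
CROSSTABS
  /TABLES=married BY happy
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW.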
03:02And what I have here first is the Case Processing Summary.
03:06This tells me that we had complete data from 349 people.
03:10Now I actually have complete data on these particular variables.
03:13If any of my cases were missing a value on one or the other of these variables,
03:18they wouldn't be included.
03:19So crosstabs only work with complete data.
03:22This next table is the crosstabulation itself and what we have on the left is
03:27that says whether people reported that they were married or not married, so it's
03:31married yes and no.
03:33Across the top we have self-rated happiness with not too happy, pretty
03:36happy, and very happy.
03:38And what we see at the end of that is the totals, so there is a 170 people who
03:43were married and 179 were not married.
03:46It's coincidental that we have very close numbers on these ones.
03:50And what you can see as we go across is the percentage of people who were
03:54married, who said for instance they were very happy, was 44.7%.
03:59That's 76 people out of 170.
04:02On the other hand, of the people in this dataset who were not married, 44 of them
04:06said that they were very happy, which is 24.6%, so it's a lower percentage.
04:11The percentages of people who said they were pretty happy are close to each
04:14other for the two groups, 51.2% for those who are married, and 55.3% for those who weren't.
04:21And the percentage of people who are not too happy changes also.
04:25We have 4.1% of the people who are married saying they weren't too happy and 20.1%
04:30of the people who weren't married saying they weren't too happy.
04:33The last table is called Chi-Square Tests.
04:36That's the inferential statistic here and we're looking at the top one that says
04:40Pearson Chi-Square. The actual value of the test statistic is 28.653.
04:47The next number is what's called the degrees of freedom and it has to go into
04:50the calculations of the probability levels.
04:53It has 2 degrees of freedom in this case.
04:55And this third number is the asymptotic significance level, 2-sided.
04:59That's the probability level that goes into the hypothesis test.
05:03In this case, it shows up as .000.
05:06It's not actually 0 all the way through, but it's a number that is smaller than .001.
05:11And what this shows us is that the distribution of self-rated happiness is
05:15different for the two groups on the marital status variable.
05:19It's important to remember again, this is simply showing a correlation of
05:23self-reported variables.
05:25And why there might be an apparent association between these two is a whole
05:29different issue, but that's true of any measure of association.
05:33And so a crosstabulation is a great way to show the relationship between two
05:37categorical variables.
05:39By selecting the row or column percentages, you can make it easier to
05:42compare the groups.
05:44And the chi-square inferential test lets you know whether any differences you
05:48see are large enough to become statistically significant.
05:51And again, it's worth remembering that if your categories are dichotomies with
05:54only two groups, like yes/no or male/female and if the variables are coded as 01
05:59indicator variables, then you can also get a correlation coefficient for the
06:03association that will have the same result on the significance test.
06:07That is, it'll have the same probability value and the same result in terms of
06:11rejecting or retaining the null hypothesis.
06:14However, the row and column percentages are a nice perk of the crosstabs
06:18procedure and in any case, if your variables have more than two categories, then
06:22you would want to do the crosstab and Chi-square anyhow.
06:25And with that in mind, the next several movies will address ways to investigate
06:30the mean scores on scale variables for different groups.
Comparing means with the Means procedure
00:00In the last few movies we've discussed a few different ways to look at the
00:04association between pairs of variables.
00:06We looked at the correlation coefficient, which is an excellent general purpose tool,
00:10and we looked at bivariate regression which works really well when your
00:13outcome variable is a scale variable.
00:16We also looked at cross tabulations for when you have two categorical variables.
00:20But another very common situation is when you want to compare the means of two
00:23or more groups, or one group at more than one point in time.
00:27Although it's possible to do this with correlations and regression, if you code
00:30group membership as 01 indicator variables, it's often easier to use specialized
00:35procedures for comparing group means for a few reasons.
00:39First, they generally give you the group means along with the inferential tests
00:43and maybe even charts of the means.
00:44So you can get more done on a single command.
00:47Second, these procedures often provide explicit tests for the assumptions behind
00:51the tests, such as the groups having equal spread in their scores.
00:55Third, the test statistics that they give, often the t-test or an analysis of
01:00variance, depending on which procedure you use, are the most common statistics
01:04for group comparisons, and so they may be more familiar to more people.
01:09Now one of the recent additions to SPSS is the flexible means procedure.
01:13What's nice about this is that previously you had to choose different tests
01:17if you're comparing two groups or if you are comparing the means of more than two groups.
01:23And we will in fact cover these procedures in the next few movies.
01:26The means procedure on the other hand can handle either situation, and let's see how it works.
01:32For this example, I'm going to be using the GSS dataset, General Social Survey
01:36that I've used before.
01:37And to compare means, I need to come up to Analyze, to Compare Means, then I
01:43choose the first one, Means.
01:45And from here I need to choose the variable that I want to look at as a
01:49dependent or the outcome variable, the thing that I think group
01:52membership affects.
01:54In this particular case, I'm going to use Family Income.
01:56So I can click that and I can drag it up there.
02:00Then I need to look at the Independent list.
02:03Those are the variables that I think will be associated and produce changes or
02:08simply be associated with family income.
02:11In this particular one, I'm going to choose a cultural variable, I'm going to
02:15scroll down here, and I'm going to choose whether a person attended a dance
02:20performance in the last year.
02:21I'll click and move that into the Independent list.
02:23Then I'm going to come up to Options and I have the possibility here of getting
02:29a huge number of statistics, including some relatively esoteric things like the
02:34harmonic mean and the geometric mean.
02:36The mean, the number of cases, and the standard deviation, on the other hand, are
02:39good defaults, though I'd like to have them in a slightly different order.
02:42So what I'm going to do is I'm going to click to get these out, just
02:46double-clicking, and then I'll bring them back in with a number of cases first
02:50and then the mean and then the standard deviation.
02:53Also I'm going to come down to the bottom here where it says Statistics for
02:57First Layer and check the first box for Anova table and eta.
03:02Anova is short for Analysis of Variance and it will give me an inferential test
03:07about whether the means for the groups differ.
03:10And eta is similar to the correlation coefficient except it can be used when
03:14there is more than two groups.
03:16So I'm going to select that one and I'm going to press Continue and then
03:20I'll press OK again.
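For reference, the same analysis can be run from a couple of lines of syntax, where income and dance are assumed names for family income and the dance-attendance variable.

* Sketch of the Means procedure with an ANOVA table and eta (names assumed).
MEANS TABLES=income BY dance
  /CELLS=COUNT MEAN STDDEV
  /STATISTICS=ANOVA.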
03:23And what I get is several tables that show up.
03:26The first table is the Case Processing Summary and it lets me know that I had
03:30complete data for all 349 cases in the dataset, so that's good.
03:34The second table labeled Report gives me the actual statistics, the descriptive
03:39statistics for my two groups on family income.
03:42So for instance, we see that there were 273 people who had not attended a dance
03:48performance in the previous year and their average family income was about
03:52$29,000 with a standard deviation of almost $26,000.
03:57On the other hand, there were 76 people who had attended the dance performance
04:01in the last year and their average income for the family was nearly $47,000,
04:05so that's much higher.
04:08And they had a standard deviation of about $36,000.
04:12So you can see there is a very substantial difference there in the means,
04:16although the standard deviations are also rather large.
04:19The next table, labeled ANOVA Table, for Analysis of Variance, is
04:25the inferential test to let us know whether these two means differ statistically
04:28significantly from each other.
04:30The important number here is in the very last column under Sig.
04:34That's the probability level or the significance level of this particular
04:38result and it says .000.
04:40It's not literally 0.
04:42It simply is less than .001.
04:45And this tells us that there is a statistically significant difference
04:49between these two means.
04:50On the other hand, there's also the question of how big is the effect and
04:54that's what we get from the fourth table that says Measures of Association.
04:58It looks at the association and gives us a statistic called eta.
05:02And that is a version of the correlation, or analogous to the correlation, that
05:06can be used when there are more than two groups.
05:09Now our value here is .252.
05:10Eta, like the correlation coefficient, goes from 0 to 1.
05:16And here we see that it's not terribly high but it is above zero, and the Eta
05:21Squared is an indication of how much of the variance in the family income can be
05:26explained by group membership, by knowing whether a person attended a
05:30dance performance in the last year or not.
05:32And here we see it's .064.
05:34That can be read as a proportion, about 6%.
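Written out, the link between eta and eta squared quoted here is simply:

$$\eta = 0.252 \;\Rightarrow\; \eta^2 = 0.252^2 \approx 0.064,$$

that is, roughly 6% of the variance in family income.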
05:37So what we see is that there is a statistically significant difference in the
05:41means between the two groups.
05:43It's not huge because the standard deviations are large, but it does let us know
05:47that there is an association, that people who saw dance performances generally
05:52had higher family incomes than people who had not attended dance performances
05:57for whatever reason that might be.
05:59So the means procedure is a handy way to compare the means of any number of
06:03groups on any number of variables.
06:05Not only does it give the descriptive statistics and an inferential test,
06:09it also gives a measure of association.
06:11This makes the means procedure a flexible and easy way to get a lot of
06:15tests done quickly.
06:17In the next two movies, we'll look at the specialized procedures for comparing
06:20the means of two groups or two or more groups, each of which may provide some
06:25information and options that aren't available in the means procedure.
06:28So they may be more useful for you as you explore your own data.
Comparing means with the t-test
00:00In the last movie we looked at the general purpose means procedure, which is a
00:04recent addition to SPSS's bag of analytic tricks.
00:08That procedure allowed us to compare for example, the means of two groups
00:11on scale variables.
00:13However, SPSS also has a specialized procedure for this comparison
00:17that's been around since version 1.0, back in the mainframe and punch card days. That is
00:21the Independent Groups T-Test.
00:23It's called the Independent Groups because it's comparing the means of two
00:27different groups as opposed to for example, the means of the same group on two
00:31different variables or two different points in time which we will cover later.
00:35Because this procedure gives a few more pieces of information than the means
00:38procedure does, we will take a close look at it too.
00:41For this example, I am going to use the same GSS data set and the same variables
00:46that I did in the last one, when we look at the means procedure, so you can
00:49compare the results of the two of them directly.
00:52To compare the means with the Independent Means T-Test I go to Analyze, come
00:57down to Compare Means, and I go to the third choice which is the Independent Samples T-Test.
01:04From there, I need to pick the Test Variable, those are the ones that I am
01:08looking at for outcomes.
01:10In this particular case, I am going to look at Family Income.
01:13You can see, however, that I can do a lot more at once.
01:18Then I have the Grouping Variable, sometimes called the independent variable or
01:21the predictive variable.
01:23It's the groups with the different means.
01:25In this case, I am going to use the Dance Performance question.
01:28So I can click that and I move it into Grouping Variable.
01:30However, with this procedure, I need to explicitly tell SPSS what the codes are
01:36for the two different groups.
01:37So I click on Define Groups and I tell it that I am using a 0 and a 1.
01:43Now the interesting thing about this is that it means that if you have more
01:46than two groups, you could select two at a time to compare them here.
01:51Also, if you are using a scale variable as your predictor, you can select a cut point.
01:56For instance, people above 7 on a 0 to 10 scale.
02:00But I am just going to put in that this is a 01 indicator variable and I will Continue.
02:05Under Options, it asks me what Confidence Interval Percentage I want to use.
02:1095% is the default and it's used when you have at least reasonably large samples.
02:16It may be that if you have a small sample, you would want to use a smaller number
02:20like 90% or maybe even 80% but generally we stay with 95%.
02:25Also there's a question about whether I want to Exclude cases, analysis by
02:29analysis, that's if I had several variables I was looking for group differences
02:33on, and it means that if a case were missing, for instance, the first
02:37variable, it wouldn't be included there but it would be included for the other
02:40ones, or whether I want to exclude cases listwise, which means that if they're missing
02:44a score on any of the variables, they get left out entirely.
02:48That gives you consistent sample size across tests.
02:51Now I'm only doing one outcome variable.
02:53So it would give the exact same thing anyhow. I am just going to leave it as default.
02:57I will press Continue, then I will press OK.
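If you prefer syntax, the independent-samples t-test boils down to a single command. The names income and dance are assumed, and the 0 and 1 are the group codes defined above.

* Sketch of an independent-samples t-test comparing family income across the two groups (names assumed).
T-TEST GROUPS=dance(0 1)
  /VARIABLES=income
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.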
03:02And what I have here are a couple of tables.
03:04The first one is the Group Statistics.
03:07Now this is the same as what we saw in the Means procedure.
03:10This tells me that 76 people said they saw a dance performance in the last year
03:13and that their average income, their Mean, was about $47,000 with a Standard
03:19Deviation of 36,000.
03:21On the other hand, we have a new column here called the Standard Error
03:25of the mean and that actually is the standard deviation divided by the square
03:29root of the sample size. But it's something that's used as part of the
03:33inferential procedure.
03:34So we usually don't need to deal with that one directly.
03:37The second table, it says Independent Samples Test and this is where we have the
03:41inferential procedure. What's interesting though about doing this command in
03:45SPSS is that it actually gives us two procedures. The first one, in the Columns
03:50it says Levene's Test for Equality of Variances.
03:53This is a specific test of an assumption for a valid t-test, and the idea here
03:59is that the groups shouldn't be too different from each other in how spread
04:02out their scores are.
04:04And what we see here in the top table is that the one group had a
04:08standard deviation of 36,000 and the other group had a standard deviation of about 26,000.
04:14And what the Levene's test tells us is that these two groups do not have equal
04:18variances, which are related to the standard deviation.
04:21As such I really shouldn't use a standard t-test, which is the one across the top;
04:27and instead I should use the one that uses something called fractional degrees of
04:31freedom, and that's the one on the second row.
04:33On the other hand, they give functionally the same output.
04:38Let's look at this test.
04:40It says T-test for Equality of Means, and we have three numbers. We have the T, that's
04:44the actual value of the test statistic, then we have the Degrees of Freedom,
04:48which is used in calculating the probability value. The third one, Sig. (2-tailed),
04:53is the actual probability value and the result of the inferential test.
04:57In both cases, it comes out as .000. Again, it's not literally zero.
05:02It's just less than .001.
05:04So, regardless of which test we use, we find that there is a highly significant
05:09difference in the means between these two groups. And if I scroll over to the
05:13right a little bit, I can see the rest of this table and what it does is it's
05:17giving me a 95% confidence interval of the difference between the two groups.
05:23And you can see it's slightly different for these two versions of the T-Test, but in
05:27either case, we have a large difference in the means.
05:30It's about $18,000, and the confidence interval runs somewhere between a $9,000 and
05:36$27,000 difference between those who say they have seen a dance performance in the
05:41last year and those who haven't.
05:43So the specialized procedure for comparing the means of the two different
05:47groups, the independent samples t-test, is a convenient test.
05:51It provides a few extra options over the general purpose means procedure and if
05:56you have more than two groups you may want to look at another specialized
06:00procedure called the one-way analysis of variance, which we will turn to next.
Comparing means with a one-way ANOVA
00:00In the last movie we looked at a procedure to compare the means of two different
00:04groups on a scale variable using what's called the Independent Samples T-Test.
00:09On the other hand, if you want to compare the means of more than two groups, you
00:12would want to use something called the Analysis of Variance or ANOVA.
00:17And although you can use ANOVA with two group comparisons, and there's a simple
00:21conversion formula between the ANOVA results and the T-Test, it's more common to
00:25reserve it for times when you have three or more groups.
00:28What the Analysis of Variance does is look for any kind of difference between
00:32the means of the various groups.
00:34That might mean that Group A is different from Group B is different from Group
00:37C, or it might mean that A and B together are different from Group C or any of
00:43several other possible combinations.
00:45For this reason you'll want to do a couple of things when you do an Analysis of Variance.
00:49First, you'll want to look at the group means, such as with a bar chart of the
00:53means to see if any natural groupings emerge.
00:56Second, you'll want to do something called a Post Hoc Test.
00:59That's for after the fact.
01:01That can tell you where the differences specifically are.
01:04We will look at both of these in this example.
01:07For this demonstration I am going to use the Google Searches information in
01:10Searches.sav, and to get the Analysis of Variance what we need to do is go up to
01:16Analyze, to Compare Means, to what's called the One-Way Analysis of Variance.
01:22It's called One-Way because we're going to use a single categorical variable or
01:26factor to differentiate between the groups.
01:28This is because there are other versions of the Analysis of Variance where you
01:31can have more than one categorical variable.
01:33We have just one, so this is the One-Way Analysis of Variance.
01:37You can check more than one variable at a time by putting it into the Dependent List.
01:40These are the outcome variables where you're looking for differences.
01:44In this particular case I'm just going to use one and I'm going to use the
01:47relative interest in searching for the NFL in Google, and I am going to look
01:52for regional differences on that.
01:54So I find the regions of the U.S.,
01:57that's Census Bureau Regions, and I put that under Factor.
02:00In the Analysis of Variance the categorical variable is called a factor and the
02:04categories within that variable are called levels.
02:07So we have four groups within the Census Bureau Region, so we will have four
02:11levels in the factor of region.
02:13Now we come up and we check a few other things.
02:15The first possibility is Contrasts.
02:19Now, this is something that we can ignore, because it's for specialized
02:23comparisons, like changes over time or mathematical combinations of groups,
02:27something called planned contrasts, and we're not doing any of that, so we can
02:30just ignore this one for right now.
02:31I will press Cancel.
02:33The second one that we want to look at is called Post Hoc, again for after the fact.
02:38Now, we have a lot of choices here.
02:41The most common choices are what are called the Bonferroni and the Scheffe Tests.
02:46They're common, but statistically speaking, they're not perfect.
02:49They tend to be a little overconservative and their output can be a little
02:53complicated in SPSS.
02:55For that reason, I prefer to use a test called the Tukey test.
02:59It's named after John Tukey, the statistician, and its full name is actually the
03:02Tukey Honestly Significant Difference Test or HSD Test, which is what you'll see in the output.
03:07So I am going to click on the Tukey Test.
03:10Then I will just come down and hit Continue.
03:12Now let's take a quick look at the other Options.
03:16I click on the Options and I can get Descriptive Statistics, which are helpful
03:20for this kind of analysis.
03:22I can also get a Means Plot.
03:24It's a simple line plot, but it's still helpful for looking at a graphical
03:28representation of the differences between the means.
03:31So I am going to click on Means Plot and then I will click Continue.
03:34Now we're back in the main dialog and I will click OK.
03:39Here we have several tables that show up.
03:42The first one is the Descriptive Statistics.
03:44It gives me the mean for each of the four groups in this Factor.
03:48It tells me, for instance, that the relative interest in searching for the NFL
03:52in the Northeast is below average. It's -.36.
03:55That means it's about one-third of a standard deviation below the national average
04:00for states in relative interest in searches for the NFL.
04:04The Midwest, on the other hand, is much higher.
04:06It's three quarters of a standard deviation above the mean, with a mean of 0.75.
04:12The South is slightly below 0 at -.07.
04:16And the West is, again, about a third of a standard deviation below 0, at -.33.
04:22The next column over is the Standard Deviations and they go from about .8 to
04:261.1, and they're not hugely different, and they feed into the Standard Error,
04:30which is used for the inferential tests.
04:32But otherwise we can ignore these.
04:35Now, this is the Analysis of Variance table or ANOVA table and what it does is
04:39on the top corner it tells me that it's looking at the variable NFL and you see
04:44that it's statistically significant. In the last column under Sig it has .020.
04:48That's the probability value for this test, and the general guideline is if it's
04:52under .05, it's statistically significant.
04:55Beneath that are the results for the Tukey Post Hoc Test.
04:59Now, this first table of Multiple Comparisons is kind of complicated and we can ignore it.
05:03Let's go to the one beneath it.
05:06This one is called Homogeneous Subsets and what this does is it places the
05:10groups like with like, and this tells us that the Northeast and the West
05:16and the South are all relatively similar to each other in terms of their
05:20searching for the NFL in Google.
05:22You can see they all have negative means.
05:25On the other hand, the second group is kind of interesting.
05:28Midwest is much higher, so that makes sense.
05:30The South is still with it, and the reason for that is even though the South
05:34and the Midwest have different means, they still have some overlap
05:38given the standard deviations.
05:39So they are not significantly different from each other, and this becomes clear
05:42if we go down one more and look at the Means Plot.
05:45Here you can see that the Midwest is much higher, and the South, while it's down
05:49lower, is still above the West and the Northeast.
05:51So the Northeast, the South, and the West all form a group, but the Midwest and
05:56the South actually combine as well.
05:58But the point here is we are able to do a lot of comparisons and get a lot of
06:02information from this one test.
06:04The Analysis of Variance is a very flexible and useful procedure for comparing
06:08the means of several different groups.
06:10In combination with a graphical analysis and Post Hoc Tests, you can get a lot
06:14of insight in a little bit of time.
06:16In the next movie, however, we'll backtrack just a little to look at a variation
06:20on the T-Test, one in which you can look at changes over time for a single group
06:24of people or look at differences between two different variables using what's
06:28called the Paired T-Test.
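For reference, clicking Paste in the One-Way ANOVA dialog produces ONEWAY syntax roughly like the following. The names nfl and region are placeholders for whatever the NFL search-interest and Census Bureau Region variables are actually called in Searches.sav; the subcommands mirror the choices made above, descriptive statistics, a means plot, and the Tukey HSD post hoc test.

* Hypothetical names: nfl is the scale outcome, region is the four-level factor.
ONEWAY nfl BY region
  /STATISTICS=DESCRIPTIVES
  /PLOT=MEANS
  /POSTHOC=TUKEY ALPHA(0.05)
  /MISSING=ANALYSIS.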
Comparing paired means
00:00In the last few movies, we have looked at procedures that can compare the
00:03average score of two or more groups on a single variable.
00:07However, there may be times when you are more interested in comparing the same
00:11group on two variables, either the same idea measured at two points in time or on
00:16two related variables that are on the same scale.
00:20In that case, you will want to use something called a paired t-test also known
00:24as a within subjects t-test or repeated measures t-test.
00:28The nice thing about this test is that each person serves as their own little
00:31comparison or control group which makes it much more precise.
00:35In fact, what's really going on with this test is that you are getting the
00:38difference between the two variables for each person and you're looking at that
00:43change between the two and then simply doing a one-sample t-test on those
00:48difference scores, just like we did in an earlier section.
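In symbols, the idea sketched here is that for each person i we take the difference between the two scores and then run a one-sample t-test on those differences:

d_i = x_{i1} - x_{i2}, \qquad t = \frac{\bar{d}}{s_d / \sqrt{n}}

where \bar{d} is the mean difference, s_d is its standard deviation, and n is the number of pairs, with n - 1 degrees of freedom.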
00:51For this example, I am going to be using a new dataset that's
00:54called Success.sav.
00:55This is from a survey of adults in the Midwest on how much money they felt a
01:01person needed to earn annually to be considered successful as the first variable,
01:06and then also how much money they felt a person needed to earn annually in order
01:10to be happy, and we are looking at whether there's a difference in the means
01:15between these two variables.
01:15To do this, we come up to Analyze, to Compare Means, to the Paired Samples
01:21T-test, and what you need to do is select two variables at a time.
01:25This is easy because we only have two variables.
01:28So I select both of them over here, I am Shift+clicking, and then you move
01:32them over to the right as a paired variable.
01:34Now let's take a quick look at the Options.
01:37You get a Confidence Interval of the difference as 95% by default.
01:41You can change it to 90 or something if you have a really small sample.
01:45Also, you can choose how you exclude cases, either analysis by analysis or
01:49listwise, but since we are only making one comparison, these will be the same.
01:53So I am just going to press Continue and then here I will press OK.
01:58We get a few tables of output from these procedures.
02:00The first one gives the Descriptive Statistics for the two variables.
02:04So for instance, we see that for this particular sample, the average amount of
02:08money that people felt a person needed to make annually to be considered
02:12successful was $64,000. That had a standard deviation of about $35,000.
02:18On the other hand, the amount of money that people thought a person needed in
02:21order to be happy was lower at about $42,000 a year, with a standard
02:26deviation of about $33,000.
02:28That table also has the standard error of the mean at the end. That simply
02:31goes into calculating the inferential statistics and we don't need to deal with it directly.
02:35The second table is the Paired Samples Correlations. Because these are the scores
02:39for the same group of people, each person answered both of them, you can
02:43calculate a correlation, and we see here that we have a statistically significant
02:48positive correlation.
02:49What that means is people who put a high answer for one question, for instance,
02:53how much is needed to be successful, are also more likely to put a high answer
02:57for how much is needed to be happy, and vice versa.
03:00People who put a low answer would generally put a lower answer for both of them.
03:04But the important question about whether people put different answers for the
03:07two of these is answered in the next one.
03:10We see that if we take each person's response to the question how much money
03:13you need to be successful, and subtract the amount of money you need to be
03:17happy, the difference between those is about $22,000 a year with a standard
03:22deviation of $30,602.
03:26The standard error for that difference is next, but we can ignore that, and then
03:30we have a Confidence Interval for the difference, and this lets us know that
03:33while the difference in this particular sample was about $22,000 a year,
03:38in the larger population the difference between the amount of money you need to
03:42be successful and to be happy could be anywhere between $16,700 and $27,600.
03:49The next column says t. That's the actual inferential test.
03:52That's the one sample t-test.
03:54We have a value of 8.079.
03:57The next column, the degrees of freedom, is related to how many people there are in the sample.
04:02The last one of interest here is the significance level, the
04:05probability value for the hypothesis test.
04:08In this case, it says .000, and that means it is actually less than .001.
04:13It's a very small probability value and this means that this is a statistically
04:16significant difference.
04:18On the other hand, looking up at the top table where the first mean was $64,000
04:22and the second mean was $42,000, we can see there is a big difference of
04:26$22,000 between what people believe you need to make to be successful and what
04:31you need to be happy.
04:33So this example shows another variation on the procedure that SPSS gives you to
04:38compare means, only this time it compares means on two different variables for
04:42a single group of people.
04:44I should mention it's also possible to look at changes across several points in
04:47time or differences in the evaluations of several different products or
04:50variables, but those procedures become rather complicated and we won't address
04:54them in this course.
04:56We will, however, start looking at ways to explore the relationships of three or
04:59more variables at a time, starting with the next movie.
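For reference, the pasted syntax for this analysis looks roughly like the following; the names successful and happy are placeholders for the two variables in Success.sav.

* Hypothetical names: successful and happy are the two paired income judgments.
T-TEST PAIRS=successful WITH happy (PAIRED)
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.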
9. Charts for Three or More Variables
Creating clustered bar charts for frequencies
00:00Up to this point, we've covered methods for looking at one variable at a time as
00:05well as methods for looking at the associations between pairs of variables.
00:09In each case and consistent with good analytical practices, we started with charts
00:13because data is usually much easier to understand visually.
00:17Then we've done numerical descriptions of the variables and associations, and
00:22finally, we've done inferential statistics to generalize beyond the given data.
00:27In these last few sections, we'll take that pattern one more step by looking at
00:31methods for exploring the relationships of three or more variables, first with
00:35graphs and then with numbers.
00:37A quick word about terminology is in order: when you look at one variable at a
00:41time it's called a univariate analysis.
00:44When you look at the associations between pairs of variables, it's called
00:48a bivariate analysis.
00:49Therefore it would make sense that when you're looking at multiple variables, it
00:54would be called a multivariate analysis.
00:56However, that term multivariate is typically reserved for situations where you
01:01specifically have more than one outcome variable.
01:05Those kinds of statistics are much, much more complicated than what we're
01:08going to be doing, which is using more than one predictor variable with a
01:13single outcome variable.
01:15So I will generally avoid the term multivariate and instead just talk
01:19about multiple variables.
01:21With that in mind, let's look at our first chart for multiple variables.
01:25And just like when we did charts for one variable or pairs of variables, we'll
01:29begin with bar chart for categorical variables.
01:32Just this time, we'll have three categorical variables.
01:35To demonstrate this, I'll use the General Social Survey dataset in GSS.sav.
01:41What we need to do is begin by going up to Graphs in the menu bar and we
01:46come to Chart Builder.
01:47Then we come down to Bar, except instead of Simple, we're going to use
01:51Clustered this time.
01:53So I drag the Clustered bar chart up to the canvas.
01:56What we're going to look at as an outcome variable in this particular example is
02:02a person's self-rated happiness.
02:04Sometimes the easiest way to look at your outcome variable is to make it so that
02:07the colors of the bars go there.
02:09So I'm going to take self-rated happiness and I'm going to drag it over to
02:12Cluster on X: set color.
02:15Then we need a categorical variable on the X-axis.
02:19I thought it would be interesting to see whether a person had attended a live
02:22drama in the last year.
02:24I'll put that on the X-axis.
02:27So that's two categorical variables for using attendance at a live drama to
02:31predict self-rated happiness, but that's just two variables.
02:34We need a third one and to do that, we have to come down to this tab that says
02:38Groups and Point ID.
02:40I click on that, then I come down to either adding a Rows panel variable or a
02:45Columns panel variable.
02:47And all that influences is whether the charts show up one above the other or
02:50one next to the other.
02:52In order to keep it compact, I'm going to do a Rows panel variable.
02:56Then I need to add one more variable that creates pairs of charts.
03:01And I'm going to use gender.
03:02I'm just going to come right up here to this one that says Male and drag that over here.
03:10And so you see what I'll end up with is four groups of three bars.
03:14Now I just come down to OK and I can make the chart.
03:18There is a lot of code that goes into that, and we can save that for future reference.
03:23And then what we have here is bar charts.
03:26On the left, we have whether people attended a live drama in the last year.
03:29More people have not. It's about 3:1.
03:32And then on the right are people who say they have attended one.
03:36The top two are for women.
03:38The bottom two are for men.
03:40The blue bars are not too happy, the green bars are pretty happy, and the beige
03:45bars are very happy.
03:47We do have one small problem with this chart and that is that a lot smaller
03:52number of people have seen a live drama in the last year.
03:55That's because we're charting counts here.
03:57A really handy feature in SPSS is the ability to chart percentages as well.
04:02So I'm going to show you how to go back and do that.
04:04I'm going to come back up to our most recent command, to Graphs, to Chart
04:09Builder, and then here in the Element Properties, I have Statistics and it says Count.
04:15That's how many people are in each category.
04:17I'm going to click on that and instead I'm going to go to Percentage and that
04:21has a question mark because I have to set a parameter over here.
04:25I find the most helpful one as each X-axis category.
04:29So what this'll do, it'll make things add up to 100% for those who have and for
04:33those who have not seen drama.
04:36So I select that. I click Continue.
04:38I have to come down here and press Apply and then I come over here and press OK.
04:45And what you'll see now is that the chart will look slightly different.
04:48The biggest difference is that the bars on the right side,
04:51for those who have seen live drama in the last year, are much larger than they
04:55were before because using percentages has equalized the two groups and it makes
05:02it much easier to see the pattern.
05:03For instance, we see that among those who attended a live drama last year,
05:08interestingly, for men the percentage who
05:14are very happy is smaller than the percentage who are pretty happy.
05:18On the other hand, for women, the percentage of people who are very happy is
05:23slightly higher than the percentage of people who are pretty happy among those
05:26who have seen a drama in the last year.
05:28On the other hand, for those who have not seen a drama, the patterns are nearly
05:31identical for men and for women. Where most people are pretty happy, the next
05:36group is very happy, and the least common is not too happy.
05:40A clustered bar chart could be a handy way to depict the relationships of these
05:44three categorical variables.
05:46However, you'll probably want to chart percentages instead of counts, but your
05:50choice of denominator can make a big difference on how the final chart looks.
05:54This gets back to a point that data analysis is probably best thought of as a
05:59form of storytelling and you want to choose displays that help you tell your
06:03story well or that help the data tell you something interesting and unexpected.
06:09It's worth noting that if your outcome variable is a dichotomous indicator
06:13variable, that's a 0/1, yes/no variable, then you can sometimes make things
03:17easier by charting the mean of the outcome, which for a 0/1 indicator variable will
03:22be the proportion of people who got 1s, for example, the proportion who are
06:26returning customers as opposed to first-time customers.
06:29And this leads us to the next chart we'll cover, the clustered bar chart for means.
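Chart Builder itself pastes a longer GGRAPH/GPL block, but as a rough sketch, the same kind of clustered, paneled bar chart can also be written in legacy GRAPH syntax along these lines. The names drama, happy, and sex are placeholders, the /PANEL subcommand is my assumption about how the row paneling would be expressed, and the percentage denominator would still be set through Element Properties as shown above.

* Hypothetical names: drama is attendance, happy is self-rated happiness, sex is the panel variable.
GRAPH
  /BAR(GROUPED)=COUNT BY drama BY happy
  /PANEL ROWVAR=sex ROWOP=CROSS.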
Creating clustered bar charts for means
00:00In the last movie, we looked at how you can make a clustered bar chart to show
00:05the association between three different categorical variables.
00:09In this movie, we'll look at how to show the associations between two
00:13categorical predictor variables and a single outcome variable that is scaled or quantitative.
00:18For example, you may want to show the average purchase price of items bought by
00:22men and women in two different retail categories.
00:25Surprisingly, this kind of chart is even simpler than the categorical version
00:29we just covered, because that one required that we use panels to show all three variables.
00:34With the scaled outcome though, we can use just a single panel like this.
00:39In this example, I am going to be using the General Social Survey data. GSS.sav again.
00:45To make the chart, let's go up to Graphs to Chart Builder.
00:50From there, we are going to come down to the Gallery to Bar Charts and choose a
00:54clustered bar chart.
00:56We'll drag that up here, and what we are going to do is get our two predictor
01:00variables placed, one on the X-axis and one on Cluster on X: set color,
01:07and the third one, on the Y-axis, will be our outcome variable.
01:10In this case, I'm going to try to predict family income. That will be my outcome variable.
01:16So I'll just grab family income and take it over to the Y-axis and I am going to
01:20use two variables to predict that.
01:21One is whether a person is a male or female.
01:26I am going to drag that down here to the X-axis. And another one is whether a
01:30person has children or not.
01:32I'll bring that over here.
01:35I think it'd also be helpful to put on error bars and I'll click Apply.
01:43Then I'll come back over here and click OK.
01:47There is a lot of code that goes into this so we can save and reuse later if we want.
01:53But here's the actual chart.
01:55So what we have here is women on the left and men on the right.
02:00People who do not have children are in blue and people who do have children are
02:04in green, and what's charted on the Y-axis is the mean family income that people reported.
02:10What's interesting about this is we have an interaction and that is, for women,
02:16those who do not have children reported a slightly higher average family income
02:21than those who do have children, although the standard deviation, the spread on
02:26these, is pretty big.
02:28On the other hand, for men, the exact opposite is observed:
02:32men who have children report a substantially higher family income than
02:37those who do not have children.
02:38That's about $25,000 to $40,000.
02:42Now again, all this chart is showing us is that there is an association between the variables.
02:46It doesn't explain why those differences are there.
02:48There are a lot of reasons that go into that and it could actually require some
02:51pretty nuanced investigation.
02:54Nevertheless, this is a very simple chart that shows how two predictor
02:58variables, male/female as one category, and having children, yes or no, as
03:02another, can be used to predict scores on a third quantitative or scale
03:07variable, in this case, family income.
03:10So clustered bar charts for means are an easy and informative way to show how two
03:14categorical predictors are associated with a scaled outcome or an indicator
03:19outcome if you are using a 0/1 variable.
03:21They also give a good idea of what the results of the inferential test would be.
03:25This kind of clustered bar chart can be one of the most effective tools that you
03:29have in exploring, analyzing, and presenting your own data.
03:33In the next movie, we'll look at another simple variation on a chart for when
03:37you have just one categorical variable and two scaled variables.
03:41In this case, the scatter plot with group markers.
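As with the previous chart, Chart Builder pastes GGRAPH/GPL code, but a rough legacy-syntax sketch of a clustered bar chart of means would look something like this. The names income, sex, and haschild are placeholders, and the /INTERVAL subcommand for the error bars is an assumption worth checking against the pasted syntax.

* Hypothetical names: income is the scale outcome, sex and haschild are the categorical predictors.
GRAPH
  /BAR(GROUPED)=MEAN(income) BY sex BY haschild
  /INTERVAL CI(95.0).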
Creating scatterplots by group
00:00In the last pair of movies we've looked at the variations on the bar chart that
00:04let you use two categorical variables to predict scores on a third categorical
00:08variable or on a scale variable.
00:11In this movie, we'll change the balance a little by looking at a chart for times
00:15when you have two scaled variables and one category.
00:18This calls for a simple variation on the scatter plot that we covered in this
00:22section on bivariate graphs.
00:24The only big difference is that we'll be adding group markers for the
00:27categorical variable.
00:29In this example, I'll use the Google Searches information from Searches.sav.
00:34To get this, I need to go over to Graphs to Chart Builder.
00:39From there, I go down to the bottom-left in the Gallery and I go to Scatter.
00:42Now I want to use the second one on the top, which is called a Grouped Scatter,
00:48and I drag that out to the canvas.
00:50From there, I need to get my predictor variables, one that's a scale and one
00:55that's a category, and my outcome variable, which is a scaled variable.
01:00For this example, I am going to use interest in the NBA as a search term.
01:04So I am going to come over here and get NBA as a Google search term. I am going
01:11to drag that over to the Y-axis.
01:13Then I am going to use two predictors.
01:15One is the median age of people who live in the state. That's Median Age.
01:24That's a scaled variable,
01:24so I am going to put it on the X-axis, and then it makes sense to me that
01:29interest in the NBA would be related to whether a state has an NBA team.
01:33So I am going to get Has NBA, that's a 0/1 indicator variable, and drag that over to Set Color.
01:40Finally, in a scatter plot you can sometimes find unusual points and you want
01:43to see which cases they are.
01:44So I am going to come down to the tab for Groups and Point ID.
01:48There I am going to click on Point ID Label at the bottom.
01:52Back on the canvas, this adds the box for the Point Label Variable, and I am going
01:56to use the state code.
01:58So I'll just drag that over and now I am ready to go.
02:03Press OK and I get a slightly complicated chart because of all the data labels.
02:08I am going to edit those out for a moment, but because I've used a label variable,
02:12I'll be able to bring some of them back if I want.
02:14So I'll double click on it.
02:17I can just select the names and I hit Delete for right now.
02:20So what I have is a bunch of blue circles and a bunch of green circles.
02:25The blue circles are for states that do not have NBA teams.
02:29The green circles are for states that do.
02:31To make these a little bit easier to see, I am going to modify them and make them solid.
02:35I'll just click on one.
02:36It looks like I better click again to get just the green ones and click on
02:41Fill and make that the same shade, green, and that actually has an effect of
02:47making all of them solid.
02:48Now what I can do is add regression lines.
02:52Up on the menu bar here, the second option is called Add Fit Line at
02:57Subgroups, that is, a regression line drawn separately for each group.
03:01I can click on that and I get two lines.
03:03One in green for the states that do have NBA teams and one in blue for
03:08the states that don't.
03:10I also see that we have an outlier, and what I am going to do is I am going to
03:14come over to the left of this toolbar to the little target icon.
03:17That's the Data Label Mode.
03:19I can click on that and now because earlier I said that I was going to use the
03:22state abbreviations as data labels, I can come right down here, click on this
03:27outlier, and I can see that it's Utah.
03:30Now, there's something there.
03:32The Utah Jazz seems to elicit unusual levels of fan support.
03:37Also people in Utah tend to be rather young on average.
03:40I am going to close this chart because I am done editing it, and now I can see
03:45that there is an association between age and whether a state has an NBA team
03:51that can predict their level of interest in NBA as a search term.
03:55Just as we saw with bivariate graphs, scatter plots are a great way to show the
03:59relationship between two scaled variables, and then by simply changing the
04:04markers, you can add a third categorical variable and you can even see how that
04:09new variable changed the relationship between the other two.
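A rough legacy-syntax equivalent of this grouped scatterplot is shown below; medage, nba, and hasnba are placeholder names for the median-age, NBA-search, and has-NBA-team variables in Searches.sav, and the point labels and fit lines would still be added in the Chart Builder and Chart Editor as described above.

* Hypothetical names: medage on the X-axis, nba on the Y-axis, hasnba as the 0/1 grouping variable.
GRAPH
  /SCATTERPLOT(BIVAR)=medage WITH nba BY hasnba
  /MISSING=LISTWISE.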
Creating 3-D scatterplots
00:00If you have three scale variables that you want to graph, then one interesting
00:05option in SPSS is a 3D scatterplot where you have variables on three different
00:10axes, the X and the Y and the Z axis.
00:13In theory it's a straightforward variation on the regular 2D scatterplot.
00:17In practice though, it can get a little confusing and this will become clear
00:21after we look at one.
00:23For this example I am going to stay with the Google Searches data and
00:26Searches.sav and I am going to chart the relationship between three particular
00:31search terms, between searches for SPSS, for business intelligence, and for
00:36the term "totally lost."
00:38To do this, I go up to Graphs in the menu bar and I click on Chart Builder.
00:44I come down in the gallery to scatter, then I am going to choose the third one
00:49here which is a 3D scatterplot.
00:52Interestingly, there is an option here of adding a categorical variable on top
00:56of it all which actually makes it four variables depicted at once, but I am not
01:00going to work with that one right now.
01:02I am just going to show you what's called the Simple 3D Scatter.
01:05I'll click on that, and drag it up to the canvas.
01:09Then I need to pick my three variables, the X and the Y and the Z, and what I am
01:14going to choose is SPSS as our Y axis, Business Intelligence as our X axis, and
01:24Totally Lost as the Z axis, and from there I can simply click OK.
01:31When we first get the chart it's a rather chunky looking orthographic
01:35projection of a bunch of circles floating in what appears to be a 3D space.
01:40Unfortunately it's hard to read and there are two ways of getting some sense of
01:44depth in this. One doesn't work very well and the other one works slightly
01:48better. I will show you both.
01:49First, we're going to need to edit the chart by double-clicking on it and then I
01:53am going to clean things up a little bit by getting rid of the decimal places on
01:57the axes, click on those, then I come up to Number Format and I am just going to
02:02put 0 and press Apply.
02:05I will do it for the other ones and then there is the last one.
02:14Okay, now to try to get a sense of depth here, one choice is to click on these
02:20then come over to Spikes and choose Floor and click Apply, and that I think you
02:26can tell is not helpful.
02:28We have a bunch of pinpoints here but it just seems to make things much more complicated.
02:32So I am going to click on those again, go back to Spikes, and deselect them.
02:37On the other hand, we do have another option.
02:40Now, I am going to first take these markers.
02:42I am just going to make them solid so they are a little easier to see as we
02:47take care of things, and what I can do with a 3D chart is actually
02:51add motion. You can make this a dynamic chart.
02:54If I come over to the chart and I right-click on it, the second choice is this
02:59one that says 3D Rotation.
03:02And now, see how the cursor has turned into a hand? I can click
03:05on that and I can start moving things around.
03:08Now, it's kind of fun. I can see there is an outlier there of some kind, right
03:16over here, and I believe from past experience that is Washington, D.C. I can
03:22get state labels and confirm that.
03:24But right now what I am going to do is I am just moving this around and when
03:27it's moving you can get a sense of a three-dimensional cloud of data, and it's
03:33kind of a neat way to do it.
03:34The problem of course is it's really hard to read.
03:38I don't really know what's what except there seems to be an outlier there and
03:42there appears to be some kind of association between the variables.
03:46I can see that there is an association in 3D, but it's hard to read.
03:51A rotating three-dimensional interactive scatterplot can be a lot of fun.
03:55You can even add a fourth variable with colored markers, and it helps you
03:59to identify cases that are multivariate outliers, that is, cases that have unusual
04:04combinations of scores.
04:05On the other hand, the problem is once the 3D chart stops rotating, it becomes
04:10just another flat 2D chart that's very hard to read.
04:14And for this reason, a better option might be to employ what are called multiple
04:19static 2D charts in a scatterplot matrix which is what I will show you next.
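For reference, a simple 3D scatterplot can also be requested in legacy syntax along these lines; the three names are placeholders, and the mapping of the variables onto the X, Y, and Z axes should be checked against what Chart Builder pastes.

* Hypothetical names for the three search-term variables.
GRAPH
  /SCATTERPLOT(XYZ)=bus_intel WITH spss WITH totally_lost
  /MISSING=LISTWISE.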
Creating scatterplot matrices
00:00In the last movie we looked at a way of showing three scaled variables and maybe
00:05even a fourth categorical variable on top using the 3D scatterplot.
00:10Well, that seems like an intuitive approach and while they certainly are a lot
00:14of fun to play with while rotating the display, they can get confusing and
00:18also once they stop rotating, they're just another static 2D display that's poorly labeled.
00:24Nevertheless, it's important to be able to see the relationships between
00:28groups or variables.
00:29Fortunately, a slightly lower tech, but more effective solution is available by
00:34taking advantage of what the data visualization people call small multiples.
00:38That is, we can make an entire collection of 2D scatterplots that are connected
00:43to each other in a matrix, which makes it easier to see the relationships
00:47between about as many variables as you have screen space for.
00:50Let's see how this works.
00:53I'm going to again use the Google Search data in Searches.sav.
00:57I need to go up to Graphs, then to Chart Builder. From there I come down on
01:02the gallery on the left to Scatter, and the third on the bottom is called
01:09Scatterplot Matrix.
01:10I am going to click that and drag it up to the canvas.
01:13Now it looks a little funny here and on the bottom it just says Scatter Matrix.
01:17You'll see there's only one place to add variables.
01:20That's because I can add more than one variable to that list.
01:23In this particular case what I am going to do is I am going to choose let's
01:27say five variables.
01:28I am going to take SPSS.
01:30I am going to take Business Intelligence and I just drag it down.
01:36You see how it turns into a red plus there.
01:39I'll get Totally Lost.
01:41I will also get Facebook.
01:47And finally, I think I'll give an indication of level of education.
01:54So what I've done is I've dragged five variables into this box at the bottom.
01:59Just in case I need it I'm going to come to Groups and Point ID and I am going
02:05to add a point ID label.
02:06I will use the state code and drag that here to the Point Label Variable and
02:13then I can click OK.
02:16I get an extremely complicated looking chart, but this can be fixed.
02:21We need to edit it a little bit.
02:22I am going to double-click on it.
02:26The first thing I am going to do is remove these data labels.
02:29I may need those later, but for right now I can take them out.
02:34Then the next thing I am going to do is I am going to make the chart bigger.
02:38Right now the chart size is 375x468.
02:40I am just going to make it, say for instance, 500, and that gets the other dimension to 625.
02:50When I do that and I maximize this window, I can actually read all of the labels.
02:56I can see things more clearly.
02:57Next I am going to make these dots smaller. I'll click on those.
03:01Let's go to 3 point and I will make them solid.
03:07And now it's a little easier to distinguish them from each other.
03:12The next thing I will do is add a regression line, and it'll go
03:17through all of them.
03:18Let me click on this and there we have it.
03:22I can close this all now.
03:24Now I'm going to change the color of that regression line.
03:26I will make it a dark red instead of red so it doesn't jump out quite so much.
03:34What you have is each variable paired with the others by going across.
03:38So for instance on the top row where it says SPSS on the side, this is the
03:43relative importance of SPSS as a Google Search term.
03:47That's SPSS on the Y axis for all of the other ones.
03:51So, for instance, on the top row in the second column that's Business
03:55Intelligence across the bottom and SPSS up the side.
03:59The one next to it is Totally Lost across the bottom of the X axis and SPSS on the Y axis.
04:05What you can see is that when the regression lines are sloped, there are
04:10associations between the variables.
04:12So for instance there's a very strong association in the top row between
04:16SPSS and Totally Lost.
04:18That's the one in the middle on the top.
04:21On the other hand there's a little bit less of an association between SPSS and
04:25Facebook, the one right next to it.
04:27That line is relatively flat.
04:29On the other hand we do have outliers showing on some of these and it might be
04:33interesting to see who that is.
04:35So I am going to double-click on the chart.
04:36I am going to turn on the Data Label mode by clicking in the menu bar here.
04:42I am going to find our little outlier here and just click on it and it will
04:47label it in all of the charts.
04:50And as is frequently the case it's Washington D.C.
04:52So we can see Washington D.C. is an outlier in most of these charts.
05:00A scatterplot matrix in SPSS is a great way to see the connections between
05:05multiple variables all at once.
05:08It's easier to read than a 3D scatterplot and it lets you include more variables
05:12than you might otherwise be able to do.
05:14It's also a great tool to get a lot of visual detail from your data all at once,
05:18which is after all the purpose of data graphics.
05:20Now that we've covered several different combinations of variables and charts,
05:25we will turn next to the descriptive and inferential statistics that can be used
05:29when looking at the associations of three or more variables.
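A legacy-syntax sketch of the same scatterplot matrix is below; the five names are placeholders for the variables dragged into the Scattermatrix box above.

* Hypothetical names for the five variables in the matrix.
GRAPH
  /SCATTERPLOT(MATRIX)=spss bus_intel totally_lost facebook degree
  /MISSING=LISTWISE.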
10. Descriptive Statistics for Three or More Variables
Using Automatic Linear Models
00:00In the last section, we looked at ways to chart the relationship of three or
00:05more variables at a time.
00:07In this section, we'll look at ways to give precise numerical descriptions to
00:11those relationships as well as inferential tests to check the reliability of our numbers.
00:17The very first procedure that we're going to cover here is one of the most
00:20impressive features that SPSS has added for version 19.
00:24It's called Automatic Linear Modeling.
00:27It's a huge step towards making data analysis a little easier, a little more
00:31accurate, and a lot more interpretable for a lot more people.
00:34Don't worry if you have an earlier version of SPSS. I'll also show you how to
00:39accomplish the same goals using procedures that are available in every version
00:43of SPSS in the next video.
00:46The goal of SPSS's Automatic Linear Modeling function, and linear regression in
00:50general, is to take an entire group of predictor variables.
00:55These can be scale variables, or ordinal, or dichotomous indicator variables.
00:59That's the 0/1 variables.
01:01You can even use multiple group categories if you break them down into a series
01:05of dichotomous variables.
01:07But the goal of linear regression is to take these predictors and find the best
01:11way to combine them to predict values on a single scaled outcome variable.
01:16While the mathematics behind this can get very involved and there are plenty of
01:20decisions that can be made, the Automatic Linear Modeling procedure has been
01:24developed to keep most of that in the background and to let you focus on
01:27interpreting your data.
01:29This is how it works.
01:31To get to the Automatic Linear Modeling, we first go to Analyze, then down
01:36to Regression, and then over to Automatic Linear Modeling, which is the first choice.
01:42From this, SPSS takes the information that we gave it about the variables' roles:
01:46whether they were predictors,
01:49that is, input variables, or whether they were targets, or
01:51whether they were both.
01:53So this is a situation where the role that we gave a variable in the dataset
01:57makes a difference in how things work out.
01:59The first thing we need to do is pick our target variable.
02:02I'm going to use searches for the term SPSS.
02:05That will be my target variable.
02:08Now, it's going to ask me what I want my predictor variables to be.
02:12I'm going to add a bunch of these ones about other searches in Google.
02:17I can put those in here.
02:20I can leave those in with the other indicators about whether they have an NFL
02:24team, or an NBA team, or a Major League Soccer team.
02:27I can have this information about Census Bureau Region.
02:31I'm going to remove these four about Census Bureau Division, because that's just
02:35subcategories of the region.
02:37So I'm going to remove that.
02:38Then these three, Northeast, Midwest, and South, are indicator variables that
02:43I use for the region.
02:44However, the nice thing about Automatic Linear Modeling is you can put
02:48categorical variables with several categories in them and it will break them up
02:52in a way that makes best sense for the data.
02:55So you can leave categorical variables in there as they are.
02:58I don't need these dichotomous ones as a backup.
03:01So this is the list of potential variables that I can use as predictors, to try
03:07to get the relative importance by state of SPSS as a search term in Google.
03:14I'm then going to come up here to Build Options.
03:17It asks about our objective, and we want to create a standard model.
03:21That's what we're going to do.
03:22The other ones that are called Boosting, and Bagging, and the Large Datasets,
03:27those are technical things that we don't need to worry about.
03:29However, I am going to come to Basics, and this is asking me whether I want it
03:34to automatically prepare data and truthfully, this is a wonderful thing.
03:37It's a great way to deal with outliers and to transform variables and to
03:41make substitutions and it's one of the big perks of the Automatic Linear Modeling approach.
03:46The next thing I'm going to go to is Model Selection.
03:49This is where things can get very complicated in regression.
03:54It's asking the Model Selection Method.
03:56That is, how it decides which variables to put into the regression model.
04:01I have several options. Forward Stepwise,
04:03one that says to just put them all in and then leave them there, and another
04:07one called Best Subsets.
04:09Now, when we get to the Linear Regression Command that's separate from this one,
04:13you'll see that we have some different options.
04:15I'm just going to leave this at Forward Stepwise, because it can make life
04:19a little bit simpler.
04:20There is also an issue here about what criterion it wants to use.
04:24There are several choices here.
04:26The AICc, there is also the F-statistic, and adjusted R-squared.
04:31Let's not worry about that.
04:32Let's just use the Information Criterion.
04:34Then we can ignore these other options, and then these ones are about Ensembles
04:39and about Advanced, we can just ignore.
04:41So the last thing I need to do is go to Model Options, and we don't
04:46need to worry about these options. We can just leave the defaults here.
04:49So now we can come down to the bottom and we can press Run to see what it gives us.
04:54Automatic Linear Modeling produces this one small chart and it doesn't look
04:58like a huge amount, but this is a Model Viewer.
05:01When you click on it, it's interactive and it does a lot of other things.
05:05So I'm going to double-click on this to open up what's called the Model Viewer
05:10window. Maximize that.
05:13What you see here is first it says what's the target variable, the thing that
05:17we're trying to predict, and that is SPSS and its relative importance as a search
05:21term in Google on a state-by-state basis.
05:24The Model Summary also tells us that it's using automatic data preparation and
05:28it's using a Forward Stepwise model selection method for deciding which
05:32variables go into the model.
05:34Now, the bottom one, the information criterion, has a number.
05:37That's not really inherently meaningful in and of itself, but the lower the number,
05:41that is, we have negative numbers, so the greater the absolute value of the
05:44negative number, the better the prediction.
05:47Beneath that, it shows that we're able to predict with about 79% accuracy in this model.
05:53So that's good.
05:55What I'm going to do now is I'm going to come over to the little list of
05:58thumbnails on the left and start going through these one at a time.
06:01That's the one we're at right now.
06:05The second one shows what the Automatic Data Preparation did and what it is, is
06:09that we have a lot of outliers and what it's done is it's trimmed the outliers.
06:13Actually, it didn't really trim them, because trimming means throwing away that data.
06:17Instead, technically what SPSS did is something called Winsorising where it
06:22takes the outliers scores and simply replaces them with the highest or lowest
06:26non-outlier scores.
06:27So it brings them in.
06:28This is not an uncommon practice in business settings, so it's a nice way to do it.
06:34Also, when we have categorical variables like the Region, SPSS is able to merge
06:39categories in a way that maximizes their predictability.
06:43So that's a nice thing.
06:45So that's what the Automatic Data Preparation has done.
06:47The third window shows us what's called Predictor Importance.
06:53Predictor Importance is actually a rather sophisticated statistical calculation.
06:58There are a number of things that go into it.
06:59It's not just a matter of probability values.
07:02It's not just a matter of correlations with the outcome. There is much more to it than that.
07:08But the relative importance is a very easy thing to understand.
07:12What this is telling us is that there are three variables that have a lot
07:16of importance in explaining the levels of relative interest in SPSS as a
07:21Google search term.
07:22The first is the use of Regression as a search term.
07:26That's not surprising, because that's a major thing that SPSS is used for.
07:30The second one amazingly is Totally Lost, which seems to show up a lot with SPSS.
07:36The third one is the percent of population with a Bachelor's degree or higher.
07:40So these are the three major variables.
07:42We're going to have more about those.
07:44The next chart is the Diagnostic Plot.
07:47It lets us know the observed value of SPSS interest for each of the 50 states and
07:54Washington, D.C., along with its predicted value.
07:57The idea here is that they should stay close together, that the observed and the
08:00predicted should be pretty close. Other than that, we don't need to worry about this.
08:05This is a histogram of Residuals.
08:07That's how far off the predictions were.
08:09Again, if we had a thing that looked really unusual here like a big spike at
08:13one end or the other, we might have a problem, but we're not going to worry about this one.
08:17I'm going to scroll down a little and I'll go to the next little page.
08:22This is a list of particular outliers and it tells us what their score was.
08:26For instance we had one place that had a score on SPSS of 3.364 and what that means
08:33is that state showed a relative interest in SPSS as a Google search term that
08:37was 3.364 standard deviations above the national average.
08:42There is another measure that's related called Cook's Distance and this doesn't
08:46necessarily mean that these were outliers in this absolute sense, but they are
08:50the most extreme cases.
08:52The next one down is a graph of the effects of various predictor variables.
08:58We have Regression as a search term, but transformed because the procedure removed the
09:02outliers, and then Totally Lost, and then Degree, which was also transformed by removing outliers.
09:09This is a Diagram View.
09:11You can also get a Table View and you can even expand this to see the various terms.
09:19If you need an analysis of variance table for whatever purpose, here it is.
09:23I'm going to skip over to the next box and here we have coefficients.
09:28The coefficients are the actual numbers that you use to multiply things by.
09:32The Intercept is in there and then we have Regression, and Totally Lost, and Degree.
09:37Please note the Degree 1 is a different color because it's a
09:39negative coefficient.
09:41This would become clearer if we come down and instead of having the diagram
09:45we look at the table.
09:47Here, we can now see the coefficients.
09:49The Intercept, that is the standard value that we give to everybody, is 0.87.
09:54So we assume that a state is 0.87 standard deviations above the mean in
09:59their interest in SPSS.
10:01Then for every standard deviation above the mean on Regression, we add another half
10:08of a standard deviation.
10:09For every standard deviation above the mean on Totally Lost, we add a little over a half, 0.58.
10:15On the other hand, for every percentage point of the population that has a
10:20Bachelor's degree or higher, we subtract 0.03 standard deviations, and so this
10:25is another way of looking at the relative contribution of the variables.
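Putting the coefficients just described together, the fitted model amounts to roughly the following, where Regression and Totally Lost are the standardized search indices, Degree is the percentage of the population with a bachelor's degree or higher, and the 0.5 is approximate as read from the table:

\widehat{\mathrm{SPSS}} \approx 0.87 + 0.5\,(\mathrm{Regression}) + 0.58\,(\mathrm{Totally\ Lost}) - 0.03\,(\mathrm{Degree})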
10:29I am going to scroll down a little further.
10:31We have another one here that gives estimated means charts and these are
10:34straight lines, because these are just the slopes of the lines that we give
10:37in the coefficients.
10:39I don't think there is anything terribly important there, so I'll skip to the next one.
10:43This is a table that shows us the three variables that got included and then
10:47across the top is the information criterion, and you can see that the number goes down.
10:52It starts at -52 and when it adds Totally Lost, it goes to -73.
10:55Then it adds Degree.
10:56It goes down to -75, and that was the criterion for deciding whether to include a
11:03variable: whether it lowered the value of the information criterion.
11:08The very last thing is just a quick summary.
11:12You can click on it to see what got included and what the options were.
11:19Just a quick written summary of the entire model.
11:22So the Automatic Linear Modeling function in SPSS is a fabulous option for those
11:27who want to make a sophisticated analysis and have thorough reporting options
11:32without having to make a million decisions on their own.
11:35It makes it much, much easier to sift through a large dataset and see what
11:40useful patterns might emerge.
11:42I encourage you to spend some time to check out all of its options because there
11:45is more than I've covered here and explore how it might be able to help you in
11:50understanding your own data.
Calculating multiple regression
00:00In the last movie we covered SPSS's new Automatic Linear Modeling function,
00:06which takes a lot of the stress out of statistical analysis.
00:09It can also let you control almost everything manually should you so desire.
00:12On the other hand, you may be using an older version of SPSS that doesn't have
00:16Automatic Linear Modeling, because that's something that's new with version 19,
00:21or you may want to include some options in your analysis that it doesn't have,
00:25such as something like Hierarchical Blocking, which I use frequently.
00:29In that case, you'll want to turn to SPSS's Standard Linear Regression function,
00:34which is what we'll discuss in this movie.
00:36The goal of regression is pretty simple.
00:39Take a collection of predictor variables, multiply all of them by certain
00:43weights called regression coefficients, which are related to the impact that
00:47each variable has on the outcome.
00:49Add them all up and predict scores on a single scaled outcome variable.
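In other words, the prediction takes the familiar linear form, with one coefficient per predictor:

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k

where b_0 is the intercept and b_1 through b_k are the regression coefficients described here.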
00:53The actual work involved in this process can of course get much more
00:57complicated, but the general concepts remain the same.
01:01Now in this particular movie, we're going to look at the most basic form of
01:04multiple regression where all of the variables are entered at the same time in the equation.
01:08It is after all the variable selection and entry that causes most of the fuss in
01:12statistics, and here's how it works.
01:15I'm going to be using the same Google Search data set, which is similar to the
01:19kind of marketing research people would be trying to do in terms of determining
01:22the mind share of particular ideas in Google searches.
01:27What we need to do is go up to Analyze and then down to Regression, and we're
01:32going to go to the second choice here, Linear.
01:35Linear means straight line.
01:36It's going to try to put straight lines through the data, and what we need to
01:40do is get our one dependent or outcome variable, the thing that we're trying to predict.
01:44I'll use interest in SPSS as a search term in Google, and then we pick the
01:50independent variables, those things that will be used to predict the levels.
01:55I'm going to use a bunch of other search terms from the Regression down through FIFA.
02:00I'm also going to use some dichotomous variables.
02:03Whether they have an NFL team, an NBA team, or a Major League Soccer team. Put those in.
02:09Scroll down a little bit.
02:10The Percentage of the Population with a bachelor's degree or higher, whether they
02:13have an outline for high school statistics, the Median Age.
02:18Now in the Automatic Linear Modeling I was able to simply include a categorical
02:22variable of the Census Bureau region.
02:24It has four regions and that procedure, Automatic Linear Modeling, was able to
02:29compensate for the fact that we had four different categories of no particular order.
02:34In the Standard Linear Regression we can't do that.
02:37The predictors need to either be scaled variables, they can't be ordinal
02:40variables, or they need to be dichotomous, 01 indicator variables.
02:45Now when you have a categorical variable, you don't need the same number of
02:50indicator variables as you have categories.
02:53The same way, for instance, to indicate gender as either male or female we
02:57only need one indicator.
02:58If we want to indicate four different regions in the United States, we only need
03:02three indicator variables, because if it's zero on all three of them, then the
03:06fourth category is implied.
03:09So I'm going to use these three indicator variables.
03:12Northeast, Midwest and South.
03:14I'm going to add those as well.
03:16Now let's come over for just a moment to Statistics and see if there is anything
03:21in here that we need for right now, and there isn't. There are times when having
03:25the R squared model change can be a very handy statistic, but we're using what's
03:29called Simultaneous Entry where we put everything in the model at once so there
03:33isn't a possibility of a change.
03:35I'm going to hit Cancel.
03:37These are some diagnostic plots that we could get.
03:40I don't think we need any of those.
03:42If we wanted to save the predicted scores or other diagnostic statistics, we
03:47could do those with the Save menu.
03:50We don't need any of these for right now.
03:52Let's look at the other options.
03:54Now these are criteria that are used for entering and removing variables.
03:59Now we're not using an automatic procedure. We're simply entering everything at once.
04:04If we wanted to replicate the procedure that was used in Automatic Linear
04:07Modeling, we would use a Forward Stepwise Regression and then these criteria
04:12for entry would matter.
04:14But now we're not going to worry about them.
04:16I'll just press Cancel now.
04:17And so really we're just using the defaults.
04:20I picked my one dependent variable, which needs to be scale variables, and then
04:23I put in a whole collection of independent variables, and now I'll press OK.
04:28And we get a bunch of tables out of this one.
04:31The first table, which indicates variables entered and removed, is not helpful.
04:34You can just ignore that.
04:36The second table, called Model Summary, gives what's called the Multiple
04:40Correlation. The capital R in the second column tells you what the correlation
04:44is between all of the variables together.
04:46It's an analog of the individual correlation, which is usually lowercase r.
04:51This is 0.937, which is a huge correlation, considering it goes from 0 to 1.
04:56The R squared, which is often a better indicator, because you can read it as a
05:00proportion of the variance in the outcome that could be predicted by the
05:05predictor variables, 88% is enormous.
05:08The next one, the Adjusted R squared, is also sometimes reported.
05:11You'll see that it's smaller.
05:13This has to do with the ratio of predictor variables to the number of cases.
05:17Now truthfully, I've probably used more predictor variables than I should,
05:20because really I only have 51 cases, the 50 states and Washington, DC, but it
05:26still works for my purposes.
05:27The next table is the Analysis of Variance Table and that provides a
05:30statistical hypothesis test for whether the entire model as a whole can predict at better than 0%.
05:38And the answer of course is that yes.
05:40I'm looking at the number that's on the far right under Sig, where it says .000.
05:45If that number is less than .05, and this one isn't literally 0,
05:48it's just less than .001, then the model is statistically significant as a whole.
05:54The table below that gives the actual regression coefficients.
05:58You have what are called Unstandardized Coefficients, which are in the original metric.
06:03So for instance, if it were years, that says for every year add this much more to
06:08your predicted value.
06:10If it were dollars, it says for every dollar, add this much to the predicted value.
06:14Now the Google Search terms, which are in quotes, are already standardized,
06:19but go down to Has an NFL team or Has an NBA team.
06:23The coefficient for Has an NFL team is .068, and what that says is that for a state that
06:29has an NFL team, you add .068 standard deviations to the prediction of its interest
06:37in SPSS relative to other terms in Google searches.
06:41Next to those is the standard error, which is an indication of how spread out
06:44the variation is; if you take the B weight, the regression weight, and
06:48divide it by its standard error, you actually get the t statistic a few columns over.
06:51The next column gives what are called the standardized coefficients, or beta weights,
06:53which are the B weights rescaled into standard-deviation units.
06:57And those are actually really nice, because they are similar to correlations.
06:58They typically fall between -1 and +1, they can be positive or negative, and they indicate the degree of a
07:02linear relationship.
07:04Next to those are the T-tests.
07:06Those are individual inferential statistics for each one of the regression
07:11coefficients, and next to those is their significance level.
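For reference, two standard relationships tie these columns together (textbook formulas, not anything printed by SPSS): for predictor j,

    t_j = \frac{b_j}{SE(b_j)}, \qquad \beta_j = b_j \cdot \frac{s_{x_j}}{s_y}

where b_j is the unstandardized weight, SE(b_j) is its standard error, and s_{x_j} and s_y are the sample standard deviations of the predictor and the outcome.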
07:14So we can go down to that column at the end, the Significance levels, and look
07:18for ones that are less than .05.
07:20We see, for instance, that Regression is a statistically significant predictor of
07:25interest in SPSS as a search term.
07:29And if we scroll down, we see that really there are only two in that
07:33collection that do it.
07:34Now you may recall in Automatic Linear Modeling we had three or four that
07:39mattered, but that's because it used a different procedure where it was
07:43selective about what it entered, and it also had a different criterion, where it was
07:47looking at the overall changes in the information criteria.
07:50This time we're just using probability values for individual
07:53regression coefficients.
07:55Now a really important thing here is the beta coefficients I said are like
07:59correlation coefficients.
08:00That's true to a certain point, but the big difference is that correlation
08:04coefficients are only valid on their own.
08:06Each correlation coefficient is calculated separately with the outcome.
08:10These, however, are only valid taken as a group; each one of these influences the others.
08:15So this can be very different from the correlation coefficients and it can be
08:20helpful to compare the two of them.
08:22This is the most basic version of multiple regression.
08:26It doesn't have to be an impossibly complicated rocket science affair.
08:30Instead, it can provide quick insight into what could be a large and very
08:35complicated data set.
08:36It can give you some real clarity to start with.
08:39The Automatic Linear Modeling function can do a lot of this and a lot more
08:43without too much direction from you, but there are situations where you
08:46would want to use the legacy command, and I especially find the standardized
08:50coefficients to be priceless, so I can compare them with correlation coefficients.
08:55I recommend that you take a little time and see how SPSS's linear regression
08:59feature can help you deal with the complexities of your own data.
Collapse this transcript
Comparing means with a two-factor ANOVA
00:00The last inferential test that we'll look at in this course is a variation on
00:04the Analysis of Variance, or ANOVA.
00:08As we discussed in the sections on associations, the Analysis of Variance is
00:12a very flexible and powerful procedure and there are probably dozens of
00:16permutations on it.
00:18In this movie we're going to talk about the version that is designed for
00:21situations where two categorical variables are used jointly to predict scores on
00:27a scaled or quantitative outcome variable.
00:30Because categorical variables are generally referred to as factors in the
00:34Analysis of Variance and the categories that make them up are called levels,
00:39this version of the Analysis of Variance is usually called the Factorial ANOVA,
00:44or more colloquially, a Two-Factor ANOVA.
00:47An important thing to note is that when you have two separate factors like
00:50gender and educational category and you're looking at levels of discretionary
00:54spending, an Analysis of Variance will give you three different results.
00:59The first result will let you know whether spending differs by gender,
01:02ignoring educational level.
01:04The second result will let you know whether spending differs by educational
01:08level ignoring gender.
01:10These are both known as the main effects, where effect has to do with the
01:14statistical association and main because each factor has an effect on its own.
01:19However, an Analysis of Variance also gives you one more important result.
01:24It lets you know whether the two factors interact.
01:27That is, it lets you know if for example, women with college degrees spend more
01:32than women without college degrees, but for men, their spending is the same with
01:35and without a degree.
01:37By the way, I'm just making that up. I don't really know what the association
01:40between those variables is, but I'm sure that some of you actually do.
01:44In some domains, the interactions are particularly interesting and can take
01:48precedence over the main effects.
01:50However, it all comes down to interpretability and applicability and that will
01:55depend on what you are trying to do with your data.
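In textbook notation (not anything SPSS shows on screen), the two-factor model with an interaction is

    Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}

where \alpha_i and \beta_j are the two main effects and (\alpha\beta)_{ij} is the interaction term that the third result tests.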
01:58With that in mind, let's see how a Two-Factor ANOVA can work in SPSS.
02:03To do the Analysis of Variance, we need to go to Analyze and down to
02:07General Linear Model.
02:09Now that actually is an interesting term, and the idea here is that all of the
02:12procedures that we've done, T-Tests and Regression and Multiple Regression are
02:17all variations on what's called the General Linear Model, a way of predicting
02:21scores on a single outcome.
02:23Let's do this one over here, Univariate.
02:27Now what we need to do is pick our dependent variable.
02:31That's the outcome variable, the thing that we're trying to predict.
02:35In this particular example, I thought I might use interest in NBA as a search
02:40term, so I'll put that up in the dependent variable.
02:43And then I'm going to use two categorical variables as predictors of interest
02:48in searching for NBA.
02:50The first one that makes a lot of sense to me is whether a state has an NBA team.
02:55So I'll put that here under Fixed Factor(s).
02:58When the categories are set, like yes or no, a state has an NBA
03:02team or it doesn't, then it's a fixed factor.
03:04You can also have what are called random factors in the Analysis of Variance,
03:08but in many situations, those are unusual and I've never used them.
03:12A covariate is where you want to throw in another quantitative or
03:16scale variable, but putting covariates into the analysis can complicate the
03:20results dramatically.
03:22The last one is if you want to weight cases, and we're not going to deal with that.
03:25I'm just going to go back and find my second predictor category and that's
03:29going to be region of the United States.
03:32And I can just click that one and put it in there.
03:34Now it's okay that there are four levels in this category. The Analysis of
03:37Variance is able to deal with that just fine.
03:39Let's take a quick look at some of the options here.
03:42Under Model, I can specify whether I want something called a full
03:46factorial model or custom.
03:47We don't need to worry about that. We can Cancel.
03:50Under Contrasts, I can decide if there are special ways I want to compare
03:55the results, and I don't need to worry about that.
03:58Under Plots, I could get Profile Plots, but these can get a little complicated,
04:03so I'm going to cancel that.
04:04Post Hoc lets me look at the group differences in more detail. I'm not going to
04:09do that on this one.
04:11If I want to save the predicted values or if I want to save some other
04:15statistics for diagnostics, I could do that, but I'm going to skip it for now.
04:19And finally under Options, there are some here that I might want to do.
04:22I might want to get what are called descriptive statistics and estimates of effect size.
04:27I think those two are really helpful.
04:28Then I'm going to press Continue.
04:31And I've got it set up the way I need, so I'll just click OK.
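If you are following along with syntax instead of the dialogs, a rough equivalent is sketched below; the names nba, has_nba, and region are assumptions based on the dialog labels, not necessarily the exact names in searches.sav.

    * Two-factor ANOVA with descriptives and effect sizes; variable names are assumed.
    UNIANOVA nba BY has_nba region
      /PRINT=DESCRIPTIVE ETASQ
      /DESIGN=has_nba region has_nba*region.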
04:35And so here are my results.
04:37The first thing is I get an indication of what are called the
04:40Between-Subject Factors.
04:42These are the things that separate one group from another.
04:44One factor is whether a state has an NBA team and you can see that 23 of them
04:49do and 28 of them don't.
04:52The second thing is the Census Bureau region.
04:55You see that I have nine states in the Northeast, 12 in the Midwest, and so on.
04:59Below that, I have the actual descriptive statistics for the search interest in NBA.
05:07It's breaking them down by whether they have an NBA team and by the
05:11Census Bureau region.
05:12So the states in the Northeast that do not have NBA teams have a mean of
05:17minus .42. That means that they are about half a standard deviation below the
05:22rest of the country in relative interest in searching for NBA teams.
05:26On the other hand, if you go to the Northeast states that do have NBA teams, you
05:31see that they have a score of +.39.
05:35That means they're about four-tenths of a standard deviation above the national
05:39average in relative interest in searching for NBA on Google.
05:43And then you can run through and see the various combinations there.
05:47The next table is the actual analysis of variance table, and what it has is
05:52several different results here.
05:53The first one that says Corrected Model simply tells me how well the model as a
05:59whole works and it predicts rather nicely.
06:01You can see that it has a Significance level in the first row of .000.
06:06And it also has something called a Partial Eta Squared.
06:09Again, it's like a correlation that's squared and it's .492.
06:13In fact, if you look at the footnote at the bottom of that table, you'll see it
06:16says R Squared = .492.
06:19And what it means is that if we know the region of the country that a state is
06:22in and whether that state has an NBA basketball team, then we can accurately
06:27predict about 50% of the variance in interest in NBA as a Google search term.
06:34So that's the entire model.
06:35The next step down on that table is Intercept, and that just tests whether the
06:40starting score is different from 0, which is not terribly interesting in and of itself.
06:45What's funny here is that it actually is close to 0.
06:47The next one is whether a state has an NBA team, has_nba, and you can see there
06:53that it's highly significant.
06:55Its probability value is .000 and the Partial Eta Squared is .412.
07:01And what this lets us know is that most of the interest in NBA as a search term
07:06has to do with whether a state has an NBA team.
07:10So that's a major predictor.
07:12The next one is region.
07:14Is there region by region interest?
07:16The significance level is .079 and that's above the standard cutoff of .05, so
07:22we would say that on the whole, no, the region that a state is in does not make
07:26a big difference in terms of their interest. On the other hand, whether they
07:30had an NBA team did.
07:31Those are the two main effects that an Analysis of Variance gives us.
07:36There is however the third thing that I talked about: the statistical interaction.
07:41And that is whether the region interacts with whether a state has an NBA team to
07:45predict overall interest.
07:46And you see that on this one, the significance level on the second to last
07:50column, the last entry is .049, which is just barely beneath the .05 cutoff, and
07:57that's enough to be considered statistically significant.
08:00Now what we're going to need to do is very quickly make a chart to show what
08:04these differences look like.
08:05I'm going to do that really quickly in the graph.
08:09Go to Graphs, to Chart Builder, I'll get a Clustered Bar Chart.
08:15And from there I'll take interest in NBA as a search term, I'll take
08:21whether they have an NBA team and make that the cluster, and I'll put the Region on the X axis.
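A similar chart can also be requested with the legacy graph syntax, sketched below with assumed variable names; the Chart Builder itself pastes the more verbose GGRAPH syntax.

    * Clustered bar chart of mean NBA search interest by region, clustered by NBA team status (assumed names).
    GRAPH
      /BAR(GROUPED)=MEAN(nba) BY region BY has_nba.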
08:29And when I do that, you see what's going on here.
08:35The bars in green are for states that have an NBA team, and you see that in every region
08:40the states with NBA teams show above-average interest in searching for
08:43NBA, and it makes sense.
08:45The states that don't have NBA teams are in blue and they have below-average
08:50interest, regardless of the region, except you do see an interesting thing.
08:54In the South, the states that have NBA teams, and there are several, are barely
08:59above the national average in terms of interest.
09:03But in the West, the states that have NBA teams have huge amounts of
09:08interest, much higher.
09:09And so you can see that the effect of having an NBA team varies according to region.
09:15And that's the idea of a statistical interaction.
09:18It's one of the benefits of an Analysis of Variance.
09:21And so, for our final inferential test, the Factorial Analysis of Variance, you
09:26see this is an excellent way of looking at the association between two
09:30categorical predictor variables and a single scale outcome variable.
09:35It lets you look at the statistical effect of each of the categorical variables
09:39on its own, as well as the interaction of the two, which can often be more
09:43interesting and more important.
09:45And with that, we'll conclude our last section on statistical graphing and testing.
09:50In the next and last section, we'll wrap things up a little and talk about how
09:54you can get all of your results out of SPSS and format them, so they'll be as
09:58clear and as communicative as possible.
Collapse this transcript
11. Formatting and Exporting Tables and Charts
Formatting descriptive statistics
00:00In the last several dozen movies, we have talked about ways that you
00:03could explore your data with graphics and descriptive statistics and
00:07inferential procedures.
00:09And while that's a great way for you as the analyst to get a thorough
00:12understanding of your data, if you really want your analysis to accomplish
00:16something useful you will have to communicate it to others.
00:19Now we've already discussed ways to modify charts as we covered them;
00:22however, tables can also be an important part of communicating information.
00:27In fact, when I'm writing a research report I try to put all of the results into
00:31graphs and tables and then use the text to simply describe the patterns without
00:35including the numbers there.
00:36In this movie, we will look at a way to format your tables to make them easier
00:40to follow and easier to communicate to others.
00:43In the next one, we will talk about ways to show correlation matrices and the
00:48results from regression analysis, and then finally we will have a movie that
00:51talks about how to export tables for use in other programs like word processors
00:56and spreadsheets and presentation software and webpages.
00:59For this example, I am going to be using the Google searches information,
01:03searches.sav, that I have used in several other movies.
01:06I am going to start by getting some descriptive statistics here.
01:09I am just going to come up to Analyze, to Descriptive Statistics to Frequencies,
01:16and what I am going to do is I am going to get the information about several
01:19variables that I could use, for instance, to try to predict people's interest in
01:23SPSS as a search term in Google.
01:26I find it helpful to begin with the outcome variable.
01:29We will take SPSS and move that over.
01:31I might want to include Business Intelligence and Data Visualization.
01:35I might also want to include my Education Variable, the Percentage of each
01:39State's Population with a Bachelor's Degree or Higher, and then I might want
01:42to include the Age.
01:43Now you see these are all scale variables. We've got a little measuring-stick icon
01:47right next to each one.
01:48I would also want to use these three region variables, but because those are
01:54dichotomous indicator variables I don't need the same kinds of statistics for them.
01:58So I am going to skip them for right now.
02:00Then what I am going to do is I am going to choose the statistics that I want,
02:04I want the Mean and the Standard Deviation and then I want what's called the
02:07Five number summary.
02:09That's the five quartile scores, the Minimum, the Maximum, the first
02:13quartile, the second quartile, which is also the median or the 50th
02:16percentile, and the third quartile.
02:19I get those by clicking on the Minimum, the Maximum and Quartiles, and now I am
02:23ready. I can press Continue.
02:25And I don't want the frequency tables and I don't need any charts right now, so
02:29I am just going to press OK.
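The equivalent SPSS syntax is sketched below with assumed variable names; /FORMAT=NOTABLE suppresses the frequency tables and /NTILES=4 requests the quartiles.

    * Descriptive statistics for the five scale variables; variable names are assumed.
    FREQUENCIES VARIABLES=spss business_intelligence data_visualization pct_bachelors median_age
      /FORMAT=NOTABLE
      /STATISTICS=MEAN STDDEV MINIMUM MAXIMUM
      /NTILES=4.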
02:31And there we go. I have a short table.
02:33This is pretty easy to follow; however, there are too many decimal places and
02:37some of the statistics are out of order and I don't like the way the labels are.
02:41The easiest way to take this is to simply right-click on the table and copy it,
02:46and once that's copied I can go into Microsoft Excel and I'm going to go to the
02:51second column and I'll paste the table there.
02:54The reason I used the second column is because I find it very helpful to have
02:58one column that can maintain the original order of things.
03:02I just type in a couple of numbers and then I can drag down and propagate the list.
03:06Then I can start deleting information that I don't need.
03:09I don't need this title, which just says Statistics.
03:13I do want to rearrange and use different names for the statistics that are in
03:17columns B and C. However, you'll see that SPSS has merged some of the cells
03:22which makes it harder to deal with.
03:23So what I'm going to do is I'm going to insert a new column and I will just call
03:28it Statistics, and then I'll put the names of the statistics.
03:34You may want to call them different things.
03:35I have a particular set of abbreviations I frequently use.
03:38N is common for the sample size, and Missing I'm going to delete in a moment so
03:42I am not even going to add that.
03:43M for Mean, SD for Standard Deviation.
03:47Then the next five numbers are quartiles.
03:49Now I have a personal preference. This is not a common way of doing it but I like it.
03:54I refer to them as Q0 through Q4.
03:57So the Minimum is Q0 because it's the 0th quartile. There's nothing below it.
04:02The Maximum is Q4 because everybody is below it, and the other ones are Q1, Q2, and Q3.
04:10And once I have got those, I can actually take these two columns right here
04:14and I can delete them.
04:16Now the only problem is that these statistics are out of order.
04:19We have 2, 3, 4, 5, 6, 7, but then these ones need to be slightly different.
04:25I can get that if I just change this one to a 12, and then I select this column
04:30and sort, and now the Q4 goes to the bottom.
04:33I don't need this column anymore. I can delete it.
04:36Then I can delete the outlines around here.
04:40I can center everything.
04:42I can make these columns slightly wider and now I am going to deal with the
04:49issue of decimal places.
04:51I don't need this many decimal places for the Percentage of the Population
04:55with the Bachelor's Degree or Higher and the Median Age. I think it's okay to
04:58have these two statistics, the Mean and Standard Deviation, go down to two decimal places.
05:03That's usually adequate for most purposes.
05:05And then for quartile statistics I actually prefer to take them down to no
05:10decimal places, and then over here for the three Google search terms we do have
05:15a separate issue, in that
05:16these are numbers that inherently have a lot of decimal places.
05:19So what I'm going to do is I'm going to bring all of these down to two
05:23decimal places as well.
05:24I am going to delete this column for the missing values.
05:30You can arrange things slightly differently, but what I want you to see is that by
05:33copying and pasting from SPSS into Excel, I get a lot more flexibility in
05:38terms of rearranging things, changing the decimal places, renaming, and I can
05:43take the information and put it manually into a form that I feel is going to be
05:47easier to communicate to others.
05:48Now in the next video, I am going to show you how to deal with the table of results
05:53from a correlation and then from a regression, and you can combine these to make an
05:58overall presentation of your data.
Collapse this transcript
Formatting correlations
00:00In the last movie, we looked at how to take a table of descriptive statistics
00:04in SPSS and then copy and paste it into a spreadsheet, and then in that
00:09spreadsheet to rearrange, delete, and modify the values in there to make them
00:14easier to communicate.
00:16In this movie, I want to show you how to take one particular kind of table, a
00:20correlation matrix, and work with that in a spreadsheet to clean it up and make
00:25it much easier to deal with, where you can go from potentially thousands of
00:29numbers to a small handful and present them in a way that makes them much,
00:33much easier to follow.
00:34For this example, I'm going to be using the same dataset and the same variables
00:38I did in the last one, the Google searches information and searches.sav.
00:42And the first thing I need to do is get a correlation matrix, so I'll come up to
00:47Analyze, to Correlate, to Bivariate Correlations.
00:52Now I find it helpful to take the outcome variable and put that in first so it
00:56shows up in the left column.
00:58In this case, that's the relative interest in SPSS as a Google search term.
01:03The other terms that I used were Business Intelligence and Data Visualization.
01:08I also used an indication of education with the percentage of the state's
01:13population with a bachelor's degree or higher.
01:15I used the Median Age and then I used three indicator variables for the region
01:21of the United States.
01:22Now even though there are four regions with indicator variables, you only need
01:27one less indicator than the number of categories.
01:30So for instance, when we have the two categories of gender, we only need a
01:33single indicator variable to indicate one or the other.
01:36With four categories, we only need three because the fourth category is implied
01:41by zeros on the three variables.
01:43But I can highlight the three of those and move them over and now I just click OK.
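A syntax sketch of the same request is below; as before, the variable names are assumed stand-ins for the ones in searches.sav. By default the output flags statistically significant correlations with asterisks.

    * Correlation matrix for the outcome, the predictors, and the three region indicators (assumed names).
    CORRELATIONS
      /VARIABLES=spss business_intelligence data_visualization pct_bachelors median_age
          northeast midwest south.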
01:49Now I have a correlation matrix here and as far as correlation matrices go it's not huge.
01:53I've had ones with hundreds of variables on each side.
01:56But you see that we have the variables listed down this side and the same
01:59variables across the top, and we have several statistics in the cells for each one.
02:04Please note that at this point the statistically significant correlations have
02:08asterisks next to them.
02:11What I'm going to do is right-click on this table and copy it.
02:14Then I'm going to go to a spreadsheet.
02:17I'm using Excel in this particular case and I'm going to paste this not into
02:23cell A1 but into B1.
02:29And the reason I'm going to do that is I find it very helpful to have an index
02:33column at the beginning that allows me to restore the order of things.
02:37So I have 1, 2. I can select those and drag down and propagate the order list. Great!
02:45And now what I can do is I can start deleting and reformatting.
02:48So for instance, you see in row one, the word Correlations is a single merged cell.
02:52That's going to make it difficult to sort things.
02:55So I'm going to simply delete that.
02:59Then you can see that in column B, the search terms are merged cells across three rows.
03:04This also causes problems.
03:06The way to deal with that is to simply delete the column.
03:12So I've lost the names of the variables but I can get those back because I have
03:16the same variables listed across the top.
03:18However, I don't need the Pearson correlation and the probability level and the sample size.
03:23All I really want is the correlation, so I'm going to get rid of the other two.
03:28Simply click on a cell in that row and then I can sort the entire table.
03:33Now I have the Ns. They're all 51, so I don't need those in my table.
03:38Then I have the Pearson Correlations, then I have the Sig. (2-Tailed).
03:42Those are the probability levels.
03:44I don't need those.
03:45I do need to indicate significance in some way, but I'm going to delete them for right now.
03:50So now all I have are the correlation coefficients themselves.
03:54I'm going to sort this again to try to get the titles on the top.
04:00I'm going to cut this and then insert it back beneath the titles.
04:05Then in order to get the variable list back on the side where it says Pearson
04:09Correlation, I highlight the list here, I copy that, I come back to this first
04:16one and right-click, and I do Paste Special and Transpose.
04:21And that switches it from horizontal to vertical.
04:25And so you see now I have the variables listed again.
04:27Now I'm going to do something else.
04:29I don't need all of these variables here.
04:32I'm mostly interested in just predicting SPSS, so I can highlight all of those
04:37and I can delete them.
04:39Also I don't need the SPSS correlated with itself.
04:43Now I can remove the borders. I can get this one flush left.
04:48I'm going to stretch this out a little bit, but this one is too long so I'm
04:53just going to call it Degree, and I'll make these other two a little shorter
04:59and center this one.
05:02I don't need three decimal places. Two is plenty.
05:06But now I need to indicate which ones are statistically significant.
05:09I'm going to delete this column also.
05:14Unfortunately, we had asterisks in the SPSS table to indicate which correlations
05:19were statistically significant, but we lost them when we pasted into Excel.
05:24That's not a big problem though.
05:25We could go back and manually check, but I know another way of doing this.
05:30I've provided a spreadsheet called Correlation-Probability-Formulas and what
05:36you can do with this one is you simply enter the sample size.
05:39in this particular case we have 51, and it will tell you what absolute value of
05:44correlation is statistically significant.
05:46In this case, it's 276.
05:49So anything that is greater than an absolute value of 276, so negative that goes
05:54past or positive that goes past it, is statistically significant.
05:58So I can go back to my table here and I can do a quick conditional formatting.
06:04Now it's a little silly when I only have seven numbers here.
06:07But the point is this works just as well as thousands of numbers.
06:10I highlight the numbers, I come over to Conditional Formatting, I click on that,
06:15and I'm going to create a new rule.
06:19And I want to format only cells that contain values that are not between
06:24-0.276 and positive 0.276.
06:30So the values have to be more extreme than that.
06:33Then I go to Format and I can choose Fill and maybe I'll make them yellow and I press OK.
06:42And when I do that, I see that the top three correlations are all statistically
06:46significant because they have absolute values greater than 0.276.
06:51Now it's also helpful to create a legend and highlight it in the same color, so
06:59it's clear that that color means something.
07:02If I want to, I can put a border around this.
07:05Many of you will have training in designing graphics and you'll find ways to
07:10make this even clearer.
07:11But what I've done here is I've taken-- let's look back at the
07:14original correlation matrix.
07:17It's huge. There's hundreds of numbers here.
07:21And I've boiled it down to seven numbers and even then I've highlighted the ones
07:25that are statistically significant to make it easier to find.
07:28So this is one way to take the output of SPSS and transform it into a form
07:34that makes it easier to communicate and easier to understand.
07:38In the next video, I'm going to show you how to integrate the results of a
07:42regression analysis, compare them with this, and try to make the patterns clear across
07:46the two ways of analyzing the data.
Collapse this transcript
Formatting regression
00:00In the last two movies, we've looked at ways to take output from SPSS and
00:05reformat it by pasting it into a spreadsheet and working with it to get it so
00:10it's clear, simpler, and easier to communicate.
00:13In the first movie, we looked at formatting a table of descriptive statistics.
00:17In the second one, we looked at how to deal with a correlation matrix.
00:22In this third one, I want to show you how to take the results of a multiple
00:25regression and compare them with the results of correlation coefficients, as a
00:31way of communicating the different perspectives that these analyses can give
00:35you and to make it clearer how to interpret them in a meaningful way.
00:40To do this, I'm going to be using the same data sets, Google searches, and the
00:44same variables that I used in the last two examples.
00:47I need to get a linear regression output.
00:49To do this, I come up to Analyze and go to Regression, to Linear.
00:55I need to take my dependent variable.
00:57That's my outcome variable or the thing I'm trying to predict.
00:59That's SPSS and I put that into Dependent.
01:03Then I take all the variables that I want to use as my predictors, the things
01:07that I think will explain interest in SPSS.
01:11And in this case, I'm going to be using the same ones that were used before: searches
01:14for Business Intelligence, searches for Data Visualization.
01:18And then I'm going to come down to the degree variable, Percentage of a state's population
01:25with a Bachelor's Degree or more, the Median Age, and then my three dichotomous
01:30indicators for Region.
01:33Now I've mentioned before that Region has four categories and the reason we
01:38used three indicator variables for this is because the fourth category, which
01:44would be West, is implied by 0s in all of these.
01:47In the other analyses, it's okay to have a fourth indicator for West, but in
01:51linear regression it's not.
01:53That introduces something called multicollinearity, and it can really wreak
01:57havoc with the results if you have variables that are entirely correlated with each other.
02:02So that's why we don't do that.
02:04Now to make this one simple, I'll leave it as Enter.
02:07That means it's going to give me a regression coefficient for all of these at once.
02:11I just leave everything at the default and I press OK.
02:16And I have a number of statistics here. The one I'm going to go to right now is
02:20this one that says Coefficients.
02:22Really there is one column here that's of most interest.
02:25It's the one that says Standardized Coefficients Beta.
02:28It's third from the right.
02:29There's an inferential statistic next to it, the T-Test, and then there's a
02:33Significance value next to that.
02:35What I really want is the Beta Coefficients, because those are the ones that are
02:39most comparable to correlation coefficients.
02:43And then I'm going to indicate the statistical significance by highlighting the
02:46ones that are significant.
02:48I'm also going to use some of the information from the two tables above that,
02:53the Model Summary and the ANOVA.
02:54I'll show you those in a moment.
02:56So what I'm going to do is I'm going to right-click on my Coefficients table,
03:00copy it, and I'm going to go to the same Excel spreadsheet that I used for
03:04modifying the correlation coefficients, except for this moment I'm going to
03:08start with the second sheet.
03:10I'll go to B1 and paste the results in.
03:14Again, because that allows me to put in a column, so I can reconstitute
03:19the order if I need to.
03:21And then I'm going to start getting rid of some information.
03:23I don't need this merged cell that says Coefficients on the top.
03:27I don't need this giant merged cell that says Model here on the side.
03:31And then I don't need this one that says t and I don't need the
03:40Unstandardized Coefficients.
03:41So these are the ones in the original metric, but I'm just going to leave those
03:44out for right now, because the standardized coefficients, which are also called
03:48the Beta Weights, are the ones that are most easily compared with the
03:53correlation coefficients.
03:54Now the Constant, the Intercept term, doesn't have a standardized
03:58regression Beta Weight.
03:59That's fine, so we can just leave that out.
04:01And in fact, what I'm going to do is I'm going to put here
04:05Predictor, Beta, and then I'm going to put p right here, and I don't need one for the Intercept.
04:13That way I can delete these merged cells up here and I have just these ones left.
04:20I don't need to worry too much about the formatting of the labels here, because
04:27I'm going to use the ones on the other page.
04:30In the last one, I highlighted everything that was statistically significant at the .05 level.
04:35I'm also going to highlight the ones here that are statistically significant.
04:40An easy way to do that is to come in here to the p values and sort.
04:45And so now all the small p values, the ones that are statistically
04:48significant, are right here.
04:50And then I can highlight those and then if all goes well, I can sort them again.
04:57Now I can delete the p values. All I need are these ones, and I'm going to
05:04copy those and I'm going to go to the first page where I have my correlation coefficients.
05:13And I just want to make sure that everything is in the same order. It is.
05:18These I need to label as correlations and these as beta coefficients.
05:25A beta coefficient is a standardized regression coefficient, and then here I've
05:31got Predicting SPSS.
05:35And so now what I have, I'm going to remove the borders that I actually put in
05:39earlier, and I'll get those all centered.
05:45Here's an interesting thing.
05:46The correlations and the beta coefficients, and I'm going to change the decimal
05:50places here, are approximately the same kind of statistic.
05:54Now what's interesting about putting the correlation coefficients in one column
05:58and the beta coefficients next to them is you can see actually that there's a
06:01huge contrast between the two of these.
06:04In the correlations, we had three variables that individually had high
06:08correlations with the relative interest in SPSS as a Google search term.
06:13They were Business Intelligence, Data Visualization, and the proportion of a
06:17state's population that had degrees.
06:19All three of those are significantly and positively correlated, and the age and
06:24the region variables were not.
06:26However, when we go over to the regression results, we get a very different pattern.
06:31For one thing, Business Intelligence is no longer a significant
06:34predictor; its coefficient has gone negative, but it's not significant, so we'll
06:37treat it as functionally 0.
06:39Degree has also gone negative, but it's not significant.
06:42Data visualization on the other hand is still statistically significant and it
06:47has actually gone much, much higher.
06:49Beta coefficients are like correlations in that they typically fall between -1 and +1.
06:53They can be positive or negative.
06:54This one is almost as strong as it gets.
06:57Data Visualization becomes a huge predictor.
06:59And then what's really shocking is that these three region variables, which
07:04individually had no correlation with interest in SPSS, have all three
07:08become statistically significant in the regression.
07:12What this lets us know is that region as a whole does matter and mostly because
07:17the three of these are contrasting with the West, we would want to look at the
07:21relative interest of SPSS in the four regions.
07:24The other thing to keep in mind is that the correlation coefficients are
07:28valid individually.
07:30The correlation of Business Intelligence to SPSS of .49 is calculated on its own.
07:35The next one down between Data Visualization and SPSS, where we have
07:38a correlation of .60,
07:40that's correlated on its own.
07:42However, for the regression the seven beta coefficients are
07:46calculated simultaneously.
07:49If we removed any one of these, all of the others would change.
07:53They're taken as a combination and their values and their probability values are
07:58only valid when taken as a group.
08:01And so that's one of the reasons why you can get very different patterns when you
08:05put in a linear regression result versus a correlation.
08:09Now there's one other thing I want to add for the linear regression.
08:13And that is this thing up here, under Model Summary where it gives the R Squared.
08:19And that is an indication of the proportion of variance in the outcome
08:23variable, which is SPSS searches, that can accurately be predicted by the
08:27combination of the other variables.
08:29And what we have here is an R Squared of .589 and what that means is that nearly
08:3460% of the variance in SPSS searches can be predicted by these other seven
08:40variables collectively.
08:42So I'm going to take that .589, I'm just going to insert a row, and I'll label
08:47it R Squared, and I'm going to put down the .589. I'll just round it off right
08:52now and you can actually put that down as a percentage.
08:55And I'm going to leave it highlighted, I'll change that one to a percentage,
09:00and I'm going to leave it highlighted in yellow, because it is
09:04statistically significant.
09:05What that means is it's different from 0, and the way I can tell that is by
09:09the result in the next table, the Analysis of Variance table, where the model
09:13as a whole has a significance value shown as .000, which really means less than .001.
09:18And so I know that that R Squared value of .589 is statistically significant.
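Behind that ANOVA table is the standard F test on R squared (a textbook formula, not something shown in the output):

    F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}

and with R^2 = .589, k = 7 predictors, and n = 51 cases, that works out to roughly F = 8.8 on 7 and 43 degrees of freedom, which is why the significance value is so small.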
09:23What I have here is a result that says that those seven variables collectively
09:28predict a lot of the interest in SPSS as a Google search term.
09:33What's funny about it is that the pattern from the individual correlations to
09:37the combined regression coefficients changes dramatically.
09:41And it's not the case that one of these is accurate and the other is inaccurate.
09:45They are both accurate; they are just very different perspectives on the issue,
09:49the individual versus the group predicting.
09:52Anyhow, this can be one step in trying to tell an analytic story about your data.
09:57It can get complicated.
09:58it can require some insight and some judgment in how best to interpret it.
10:02But this is a way of taking a huge amount of numbers and a huge number of
10:07tables and boiling them down to a very small concise way of presenting the
10:11results, which I think makes it much easier for you to articulate your story,
10:16your vision of your data analysis.
Collapse this transcript
Exporting charts and tables
00:00In this final video, I want to show you how to take the charts that you create
00:04in SPSS and export them as HTML and as image files, either JPEG or PNG or some
00:10other format that you can then use to integrate into your word processor
00:14documents, into your presentations, or into your web pages, as a way of sharing
00:19the results of your analysis.
00:20For this example, I use the same data set, Searches.sav, and what I am going
00:25to do is I will just make two or three sample charts very quickly and then
00:28show how to export them.
00:30In this particular case, I'll make a bar chart.
00:32I go into Graphs, and then the Chart Builder, then I am going to make a bar
00:37chart of regional variation and interest in SPSS, because that showed up in
00:42our regression results.
00:43So I am going to come down and get the Census Bureau Region, put that in the
00:47x-axis, and get SPSS and make that the variable being charted here.
00:52I put error bars on it and click OK.
00:58And what I see is that the West has much, much lower interest in SPSS as a
01:04relative search term than the other three regions, which would explain the
01:07curious results of our output in the linear regression.
01:11I am going to change these just for a moment, just a small amount.
01:15Really I think all I am going to do is change the colors.
01:19You can change them however you want.
01:21You can make individual bars of different colors.
01:23I will just press Close and close that.
01:26So there's one chart.
01:27Next thing, I am going to make a scatter plot.
01:30Go to Graphs, back to the Chart Builder.
01:33This time I will choose Scatter, and I will bring that up, and I am going to
01:37look at the association between Business Intelligence and interest in SPSS.
01:41Now I'll hit OK and I have got a scatter plot there.
01:46I am going to clean it up slightly. I don't need all those decimal places.
01:50So I am going to number format and change those to zeros.
01:53I will do the same thing over here, and then what I am going to do is I am going
01:59to change those to solid red circles.
02:02Then I am going to add two lines because I can.
02:06There is a regression line, but what I am going to do with that regression line
02:09is actually change it to what's called a Smoother, which follows the pattern a
02:13little more closely.
02:14Then I am going to change the color of that to Grey.
02:18And then I will also add a linear regression line.
02:23It's added a Quadratic. That's okay.
02:24I just change it to Linear.
02:27I can delete that, and I am going to change the color of the linear regression line.
02:31I will make it grey also, perhaps a darker grey, and there's my scatter plot.
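The basic scatter plot, before the interactive Chart Editor tweaks, can also be produced with legacy syntax; the variable names below are assumed.

    * Scatter plot of SPSS search interest against Business Intelligence search interest (assumed names).
    GRAPH
      /SCATTERPLOT(BIVAR)=business_intelligence WITH spss.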
02:37And now what I can do is I can take my charts and I can export them.
02:41Now it's easiest to just simply export everything in the output.
02:45However, you may not want to have all of the texts and all the other information.
02:49So for instance, this right here is the Log. You see when I click on that Log, it's
02:53highlighted. I can delete that if I want.
02:56This is the title of the chart.
02:57We have something called Notes that doesn't show up.
03:00It's there but it's hidden.
03:01I find it convenient sometimes to just come over here, and get everything I
03:05don't want and delete it.
03:07You can do that or you can leave it in. I will delete them for one and leave
03:11them in for the other.
03:13But what I am going to do now is I am going to save my output and then come
03:17to File, to Export.
03:19And what you have is a lot of options here.
03:21You can export them as a Microsoft Word document.
03:24You can also export them as an Excel file, as a PDF, or straight into PowerPoint.
03:30Now I personally find the easiest way of dealing with these is to export them as
03:34an HTML file because what that does is it exports the entire output as a single
03:39HTML file, but it also downloads all the graphics as individual files.
03:45You can do JPEGS if you want. On the other hand, if you're going to be putting
03:48this up on the web, PNG files can be more helpful.
03:51The entire output is a single HTML file and each chart is a separate PNG file.
03:56All I need to do now is tell it where I want to save things.
03:59I click on Browse and I created a folder already called SPSS Output in HTML and PNG.
04:06So I am going to double-click on that and then I'll just call it Exported
04:11Output, press Save, and I will press OK.
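If memory serves, the same export can also be scripted with the OUTPUT EXPORT command; treat the sketch below as an assumption to check against your version's syntax reference, and note that the path is made up for illustration.

    * Export the visible output to HTML; the document path here is assumed.
    OUTPUT EXPORT
      /CONTENTS EXPORT=VISIBLE
      /HTML DOCUMENTFILE='C:\SPSS Output in HTML and PNG\Exported Output.htm'.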
04:14We have Exporting progress.
04:16There are times when that can take quite a while. This is a very short output.
04:20Now I'll show you if I go to the folder that I have created, SPSS output in HTML
04:25and PNG, I can double-click on that.
04:27Then you see we have an HTML file here. I double-click on that.
04:32This has the entire results.
04:35See, these, for instance, are the Notes that don't show in the Viewer but are
04:39there, and it has its graphics also.
04:43On the other hand, I also have each chart as a separate PNG file right here and
04:49I can open it with the Windows Photo Viewer and there it is.
04:53In that way, I can take these graphics and put them into whatever program I want,
04:57however I feel will best present them.
04:59That ends the final presentation on how to take the results of your analysis and
05:04get a way to present them to others that will make it easier for you to tell
05:08your analytic narrative, to make sense out of your results, to find surprises
05:12hopefully and insights that will give you an advantage in conducting your own
05:16work, and make it easier to sell your points to others.
Collapse this transcript
Conclusion
What's next
00:00So that ends our course on SPSS Statistics Essential Training.
00:05Thanks for joining me.
00:06I hope that this course has been insightful and enjoyable.
00:09I also hope that you've been able to expand your analytical abilities so that
00:13you're better able to work with critical data in your research and professional work.
00:17Now, here are some recommendations for further development.
00:20Your first stop should be the excellent help applications that are included in SPSS.
00:25These are more than just help files.
00:27SPSS also offers presentations to walk you through advanced procedures and
00:31provides illustrated case studies.
00:33I strongly encourage you to explore those resources and see how they can help
00:37you find ways to make the most of SPSS in your work.
00:40With that, it's time to let your data talk to you and for you to have some fun
00:45telling your own analytic narrative. Best of luck!
00:48We look forward to seeing you again soon!
Collapse this transcript

