
In the last movie we covered SPSS's new Automatic Linear Modeling function, which takes a lot of the stress out of statistical analysis. It can also let you control almost everything manually should you so desire. On the other hand, you may be using an older version of SPSS that doesn't have Automatic Linear Modeling, because that's new with version 19, or you may want to include some options in your analysis that it doesn't have, such as Hierarchical Blocking, which I use frequently. In that case, you'll want to turn to SPSS's Standard Linear Regression function, which is what we'll discuss in this movie.
The goal of regression is pretty simple. Take a collection of predictor variables, multiply all of them by certain weights called regression coefficients, which are related to the impact that each variable has on the outcome. Add them all up and predict scores on a single scaled outcome variable. The actual work involved in this process can of course get much more complicated, but the general concepts remain the same. Now in this particular movie, we're going to look at the most basic form of multiple regression where all of the variables are entered at the same time in the equation.
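To make the "multiply by weights and add them up" idea concrete, here is a minimal sketch in Python. The variable names, the weights other than .068, the case values, and the constant (intercept) term are all hypothetical and chosen only to show the arithmetic. They are not output from the SPSS analysis in this movie.

```python
# Hypothetical example: the names, weights, and values below are illustrative
# assumptions, not results from the Google Search data set.

def predict(intercept, coefficients, case):
    """Return the predicted outcome: intercept + sum of (weight * predictor value)."""
    return intercept + sum(coefficients[name] * case[name] for name in coefficients)

# Regression coefficients (weights), one per predictor variable.
coefficients = {"regression": 0.5, "has_nfl_team": 0.068}

# One case (for example, one state) with a score on each predictor.
case = {"regression": 1.2, "has_nfl_team": 1}

predicted = predict(0.1, coefficients, case)
print(round(predicted, 3))  # 0.1 + 0.5 * 1.2 + 0.068 * 1 = 0.768
```

Every case gets the same weights; only the predictor values change from case to case.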
It is, after all, the variable selection and entry that causes most of the fuss in statistics, and here's how it works. I'm going to be using the same Google Search data set, which is similar to the research that marketing people would do to determine the mind share of particular ideas in Google searches. What we need to do is go up to Analyze and then down to Regression, and we're going to go to the second choice here, Linear. Linear means straight line. It's going to try to put straight lines through the data, and what we need to do is pick our one dependent or outcome variable, the thing that we're trying to predict.
I'll use interest in SPSS as a search term in Google, and then we pick the independent variables, those things that will be used to predict the levels. I'm going to use a bunch of other search terms, from Regression down through FIFA. I'm also going to use some dichotomous variables: whether they have an NFL team, an NBA team, or a Major League Soccer team. Put those in. Scroll down a little bit. The Percentage of the Population with a bachelor's degree or higher, whether they have an outline for high school statistics, the Median Age.
Now in the Automatic Linear Modeling I was able to simply include a categorical variable of the Census Bureau region. It has four regions and that procedure, Automatic Linear Modeling, was able to compensate for the fact that we had four different categories of no particular order. In the Standard Linear Regression we can't do that. The predictors need to be either scale variables (they can't be ordinal variables) or dichotomous 0/1 indicator variables. Now when you have a categorical variable, you don't need the same number of indicator variables as you have categories.
The same way, for instance, to indicate gender as either male or female we only need one indicator. If we want to indicate four different regions in the United States, we only need three indicator variables, because if it's zero on all three of them, then the fourth category is implied. So I'm going to use these three indicator variables. Northeast, Midwest and South. I'm going to add those as well. Now let's come over for just a moment to Statistics and see if there is anything in here that we need for right now, and there isn't. There are times when having the R squared model change can be a very handy statistic, but we're using what's called Simultaneous Entry where we put everything in the model at once so there isn't a possibility of a change.
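The indicator coding described above can be sketched in a few lines of Python. The function name is hypothetical. The fourth region, West, is the remaining Census Bureau region and acts as the implied reference category, since it is the case where all three indicators are zero.

```python
# Sketch of 0/1 indicator (dummy) coding for a four-category variable.
# region_indicators is a hypothetical helper, not an SPSS function.

REGIONS = ["Northeast", "Midwest", "South", "West"]
INDICATORS = ["Northeast", "Midwest", "South"]  # West is implied when all three are 0

def region_indicators(region):
    """Return the three 0/1 indicator values for one Census Bureau region."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return {name: int(region == name) for name in INDICATORS}

print(region_indicators("Midwest"))  # {'Northeast': 0, 'Midwest': 1, 'South': 0}
print(region_indicators("West"))     # {'Northeast': 0, 'Midwest': 0, 'South': 0}
```

Adding a fourth indicator for West would be redundant, because its value is already determined by the other three.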
I'm going to hit Cancel. These are some diagnostic plots that we could get. I don't think we need any of those. If we wanted to save the predicted scores or other diagnostic statistics, we could do those with the Save menu. We don't need any of these for right now. Let's look at the other options. Now these are criteria that are used for entering and removing variables. Now we're not using an automatic procedure. We're simply entering everything at once. If we wanted to replicate the procedure that was used in Automatic Linear Modeling, we would use a Forward Stepwise Regression and then these criteria for entry would matter.
But now we're not going to worry about them. I'll just press Cancel now. And so really we're just using the defaults. I picked my one dependent variable, which needs to be a scale variable, and then I put in a whole collection of independent variables, and now I'll press OK. And we get a bunch of tables out of this one. The first table, which indicates variables entered and removed, is not helpful. You can just ignore that. The second table, called Model Summary, gives what's called the Multiple Correlation. The capital R in the second column tells you what the correlation is between the outcome and all of the predictor variables together.
It's an analog of the individual correlation, which is usually lowercase r. This is 0.937, which is a huge correlation, considering it goes from 0 to 1. The R squared is often a better indicator, because you can read it as the proportion of the variance in the outcome that can be predicted by the predictor variables, and 88% is enormous. The next one, the Adjusted R squared, is also sometimes reported. You'll see that it's smaller. This has to do with the ratio of predictor variables to the number of cases.
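As a rough sketch of how these three numbers relate, the Python below squares the multiple R of 0.937 and applies the standard textbook adjustment formula. The case count of 51 is the size of this data set, but the predictor count of 15 is a placeholder assumption, because the exact number of predictors is not stated here.

```python
# Sketch of R squared and adjusted R squared.
# n_predictors=15 is an assumed placeholder, not the count used in the video.

def r_squared(multiple_r):
    """Proportion of outcome variance predicted by the model."""
    return multiple_r ** 2

def adjusted_r_squared(r2, n_cases, n_predictors):
    """Shrink R squared to account for the number of predictors relative to cases."""
    return 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)

r2 = r_squared(0.937)
adj = adjusted_r_squared(r2, n_cases=51, n_predictors=15)

print(round(r2, 3))  # 0.878, which is the 88% mentioned above
print(adj < r2)      # True: the adjusted value is always smaller when predictors are present
```

The more predictors you use relative to the number of cases, the larger the gap between the two values becomes.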
Now truthfully, I've probably used more predictor variables than I should, because really I only have 51 cases, the 50 states and Washington, DC, but it still works for my purposes. The next table is the Analysis of Variance Table, and that provides a statistical hypothesis test for whether the entire model as a whole can predict at better than 0%. And the answer of course is yes. I'm looking at the number that's on the far right under Sig, where it says .000. If that number is less than .05, and this one isn't literally 0, it's just less than .001, then the model is statistically significant as a whole.
The table below that gives the actual regression coefficients. You have what are called Unstandardized Coefficients, which are in the original metric. So for instance, if it were years, that says for every year add this much more to your predicted value. If it were dollars, it says for every dollar add this much to the predicted value. Now the Google Search terms, which are in quotes, are already standardized, but go down to Has an NFL team or Has an NBA team. The coefficient for Has an NFL team is .068, and what that says is that for a state that has an NFL team, add .068 standard deviations to the prediction of its interest in SPSS relative to other terms in Google searches.
Next to those is the standard error, which is an indication of how much uncertainty there is in the estimate of each coefficient. Next to that is what's called a standardized coefficient or a beta weight, which is the regression weight rescaled into standard deviation units. And those are actually really nice, because those are similar to correlations. They usually fall between -1 and 1, and they indicate the degree of a linear relationship. Next to those are the t-tests. If you take the B weight or the regression weight and divide it by the standard error, you get the t value. Those are individual inferential statistics for each one of the regression coefficients, and next to those is their significance level.
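For reference, here is a small sketch of how the columns of a standard regression coefficients table relate to one another: the t value is B divided by its standard error, and the beta weight is B rescaled by the ratio of the predictor's standard deviation to the outcome's. Only the B of .068 comes from this analysis. The standard error and the two standard deviations are hypothetical placeholders.

```python
# Sketch of the relationships among B, standard error, beta, and t.
# All inputs except b = 0.068 are hypothetical values for illustration.

def t_statistic(b, standard_error):
    """t value for one coefficient: the B weight divided by its standard error."""
    return b / standard_error

def standardized_beta(b, sd_predictor, sd_outcome):
    """Beta weight: B rescaled into standard deviation units."""
    return b * sd_predictor / sd_outcome

b = 0.068                 # unstandardized coefficient for Has an NFL team
standard_error = 0.05     # hypothetical
sd_predictor = 0.5        # hypothetical standard deviation of the 0/1 indicator
sd_outcome = 1.0          # hypothetical standard deviation of the outcome

t_value = t_statistic(b, standard_error)
beta = standardized_beta(b, sd_predictor, sd_outcome)
print(round(t_value, 2), round(beta, 3))  # 1.36 0.034
```

When a predictor and the outcome are both already standardized, the two standard deviations are 1 and the beta weight equals B.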
So we can go down to that column at the end, the Significance levels, and look for ones that are less than .05. We see for instance that Regression is a statistically significant predictor of interest in SPSS as a search term, and so is one other search term. And if we scroll down, we see that really those are the only two in that collection that do it. Now you may recall in Automatic Linear Modeling we had three or four that mattered, but that's because it used a different procedure where it was selective about what it entered, and it also had a different criterion, looking at the overall changes in the information criterion.
This time we're just using probability values for individual regression coefficients. Now a really important thing here is that the beta coefficients, as I said, are like correlation coefficients. That's true to a certain point, but the big difference is that correlation coefficients are valid on their own. Each correlation coefficient is calculated separately with the outcome. These, however, are only valid taken as a group; each one of these influences the others. So these can be very different from the correlation coefficients, and it can be helpful to compare the two of them.
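The difference between beta weights and correlations is easiest to see in the two-predictor case, where the standardized betas can be computed directly from the three correlations using a standard textbook formula. This is a simplified sketch with made-up correlation values, not the procedure SPSS runs on the full set of predictors.

```python
# Two-predictor sketch: standardized betas computed from correlations.
# The correlation values below are hypothetical.

def betas_from_correlations(r_y1, r_y2, r_12):
    """Return (beta1, beta2) given each predictor's correlation with the outcome
    (r_y1, r_y2) and the correlation between the two predictors (r_12)."""
    denominator = 1 - r_12 ** 2
    if denominator == 0:
        raise ValueError("predictors are perfectly correlated")
    beta1 = (r_y1 - r_y2 * r_12) / denominator
    beta2 = (r_y2 - r_y1 * r_12) / denominator
    return beta1, beta2

# Correlated predictors: each beta is smaller than its correlation.
beta1, beta2 = betas_from_correlations(r_y1=0.6, r_y2=0.5, r_12=0.5)
print(round(beta1, 3), round(beta2, 3))  # 0.467 0.267

# Uncorrelated predictors: the betas equal the correlations.
print(betas_from_correlations(0.6, 0.5, 0.0))  # (0.6, 0.5)
```

Because each beta depends on the other predictor in the model, adding or removing a predictor can change every other beta, while the individual correlations stay the same.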
This is the most basic version of multiple regression. It doesn't have to be an impossibly complicated rocket science affair. Instead, it can serve as a quick insight into what could be a large and very complicated data set. It can give you some real clarity to start with. The Automatic Linear Modeling function can do a lot of this and a lot more without too much direction from you, but there are situations where you would want to use the legacy command, and I especially find the standardized coefficients to be priceless, so I can compare them with correlation coefficients.
I recommend that you take a little time and see how SPSS's linear regression feature can help you deal with the complexities of your own data.