In the last section, we looked at ways to chart the relationship of three or more variables at a time. In this section, we'll look at ways to give precise numerical descriptions to those relationships as well as inferential tests to check the reliability of our numbers. The very first procedure that we're going to cover here is one of the most impressive features that SPSS has added for version 19. It's called Automatic Linear Modeling. It's a huge step towards making data analysis a little easier, a little more accurate, and a lot more interpretable for a lot more people.
Don't worry if you have an earlier version of SPSS. In the next video, I'll also show you how to accomplish the same goals using procedures that are available in every version of SPSS. With SPSS's Automatic Linear Modeling function, and linear regression in general, we start with an entire group of predictor variables. These can be scale variables, or ordinal, or dichotomous indicator variables. Those are the 0/1 variables. You can even use multiple group categories if you break them down into a series of dichotomous variables.
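As a rough illustration of breaking a multi-category variable into dichotomous 0/1 indicators, here is a minimal Python sketch. The function name `dummy_code`, the four-row sample, and the choice of West as the reference category are assumptions made for this example; they are not taken from the SPSS dataset used in the video.

```python
def dummy_code(values, reference):
    """Turn one categorical variable into 0/1 indicator columns.

    Every category except `reference` gets its own column. A row that
    belongs to the reference category is 0 in all of the columns.
    """
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical sample: four states, one from each Census Bureau Region.
regions = ["Northeast", "Midwest", "South", "West"]
indicators = dummy_code(regions, reference="West")

for name, column in indicators.items():
    print(name, column)
```

Four regions need only three indicators, because a state that scores 0 on all three must belong to the reference region.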
But the goal of linear regression is to take these predictors and find the best way to combine them to predict values on a single scaled outcome variable. While the mathematics behind this can get very involved and there are plenty of decisions that can be made, the Automatic Linear Modeling procedure has been developed to keep most of that in the background and to let you focus on interpreting your data. This is how it works. To get to the Automatic Linear Modeling, we first go to Analyze, then down to Regression, and then over to Automatic Linear Modeling, which is the first choice.
From this, SPSS takes the information that we gave it about the variables: whether they were predictors (that is, input variables), whether they were targets, or whether they were both. So this is a situation where the role that we gave a variable in the dataset makes a difference in how things work out. The first thing we need to do is pick our target variable. I'm going to use searches for the term SPSS. That will be my target variable. Now, it's going to ask me what I want my predictor variables to be.
I'm going to add a bunch of these ones about other searches in Google. I can put those in here. I can leave those in with the other indicators about whether they have an NFL team, or an NBA team, or a Major League Soccer team. I can have this information about Census Bureau Region. I'm going to remove these four about Census Bureau Division, because that's just subcategories of the region. So I'm going to remove that. Then these three, Northeast, Midwest, and South, are indicator variables that I use for the region.
However, the nice thing about Automatic Linear Modeling is you can put categorical variables with several categories in them and it will break them up in a way that makes best sense for the data. So you can leave categorical variables in there as they are. I don't need these dichotomous ones as a backup. So this is the list of potential variables that I can use as predictors, to try to predict the relative importance, state by state, of SPSS as a search term in Google. I'm then going to come up here to Build Options.
It asks about our objective, and we have Create a standard model. That's what we're going to do. The other ones, which are called Boosting, and Bagging, and the Large Datasets, are technical things that we don't need to worry about. However, I am going to come to Basics, and this is asking me whether I want it to automatically prepare data and truthfully, this is a wonderful thing. It's a great way to deal with outliers and to transform variables and to make substitutions and it's one of the big perks of the Automatic Linear Modeling approach. The next thing I'm going to go to is Model Selection.
This is where things can get very complicated in regression. It's asking for the Model Selection Method. That is, how it decides which variables to put into the regression model. I have several options. There's Forward Stepwise. There's also one that says just put them all in and then leave them there, and another one called Best Subsets. Now, when we get to the Linear Regression Command that's separate from this one, you'll see that we have some different options. I'm just going to leave this at Forward Stepwise, because it can make life a little bit simpler.
There is also an issue here about what criterion it wants to use. There are several choices here. There is the AICc, there is also the F-statistic, and adjusted R-squared. Let's not worry about that. Let's just use the Information Criterion. Then we can ignore these other options, and then these ones about Ensembles and about Advanced, we can just ignore. So the last thing I need to do is go to Model Options and we don't need to worry about these options. We can just leave the defaults here. So now we can come down to the bottom and we can press Run to see what it gives us.
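For reference, one common textbook form of the corrected Akaike information criterion (AICc) for a least-squares model is given below. This is a sketch under stated assumptions: the symbols n (number of cases), SSE (sum of squared residuals), and k (number of estimated parameters) are introduced here, and SPSS's exact definition may differ in constants or in how it counts parameters.

```latex
\mathrm{AIC_c} = n \ln\!\left(\frac{\mathrm{SSE}}{n}\right) + 2k + \frac{2k(k+1)}{n-k-1}
```

Under this form the first term is negative whenever SSE/n is below 1, so negative criterion values are not a sign of trouble. The last two terms grow with k, so an extra predictor lowers the criterion only if it reduces the SSE by enough to offset that penalty.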
Automatic Linear Modeling produces this one small chart and it doesn't look like a huge amount, but this is a Model Viewer. When you click on it, it's interactive and it does a lot of other things. So I'm going to double-click on this to open up what's called the Model Viewer window. Maximize that. What you see here is first it says what's the target variable, the thing that we're trying to predict, and that is SPSS and its relative importance as a search term in Google on a state-by-state basis.
The Model Summary also tells us that it's using automatic data preparation and it's using a Forward Stepwise model selection method for deciding which variables go into the model. Now, the bottom one, the information criterion, has a number. That's not really inherently meaningful in and of itself, but the lower the number, the better the prediction. Here we have negative numbers, so the greater the absolute value of the negative number, the better. Beneath that, it shows that we're able to predict with about 79% accuracy in this model. So that's good.
What I'm going to do now is I'm going to come over to the little list of thumbnails on the left and start going through these one at a time. That's the one we're at right now. The second one shows what the Automatic Data Preparation did and what it is, is that we have a lot of outliers and what it's done is it's trimmed the outliers. Actually, it didn't really trim them, because trimming means throwing away that data. Instead, technically what SPSS did is something called Winsorising where it takes the outlier scores and simply replaces them with the highest or lowest non-outlier scores.
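The Winsorising idea can be sketched in a few lines of Python. This is only an illustration of the replacement rule described here, not SPSS's internal procedure: the function name, the standard-deviation cutoff used to flag outliers, and the sample scores are all assumptions made for the example.

```python
import statistics


def winsorize(values, cutoff=3.0):
    """Replace outliers with the nearest non-outlier value.

    A value counts as an outlier here if it lies more than `cutoff`
    sample standard deviations from the mean (an assumed rule).
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = [v for v in values if abs(v - mean) <= cutoff * sd]
    low, high = min(inside), max(inside)
    # Clamp every value into the range spanned by the non-outliers.
    return [min(max(v, low), high) for v in values]


# Hypothetical scores with one extreme value at the end.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
cleaned = winsorize(scores, cutoff=2.0)
print(cleaned)
```

Note that no case is dropped: the list keeps its original length, and only the extreme value is pulled in to the highest non-outlier score.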
So it brings them in. This is a not-uncommon practice in business settings, so it's a nice way to do it. Also, when we have categorical variables like the Region, SPSS is able to merge categories in a way that maximizes their predictability. So that's a nice thing. So that's what the Automatic Data Preparation has done. The third window shows us what's called Predictor Importance. Predictor Importance is actually a rather sophisticated statistical calculation.
There are a number of things that go into it. It's not just a matter of probability values. It's not just a matter of correlations with the outcome. There is much more to it than that. But the relative importance is a very easy thing to understand. What this is telling us is that there are three variables that have a lot of importance in explaining the levels of relative interest in SPSS as a Google search term. The first is the use of Regression as a search term. That's not surprising, because that's a major thing that SPSS is used for.
The second one, amazingly, is Totally Lost, which seems to show up a lot with SPSS. The third one is the percent of population with a Bachelor's degree or higher. So these are the three major variables. We're going to have more about those. The next chart is the Diagnostic Plot. It lets us know the observed value of SPSS interest for each of the 50 states and Washington, D.C., along with its predicted value. The idea here is that they should stay close together, that the observed and the predicted should be pretty close. Otherwise we don't need to worry about this.
This is a histogram of Residuals. That's how far off the predictions were. Again, if we had a thing that looked really unusual here like a big spike at one end or the other, we might have a problem, but we're not going to worry about this one. I'm going to scroll down a little and I'll go to the next little page. This is a list of particular outliers and it tells us what their score was. For instance we had one place that had a score on SPSS of 3.364 and what that means is that state showed a relative interest in SPSS as a Google search term that was 3.364 standard deviations above the national average.
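To make "standard deviations above the average" concrete, here is a minimal Python sketch of a standardized score. The numbers are hypothetical and are not the state search data from the video; the helper name `z_scores` is also an assumption.

```python
import statistics


def z_scores(values):
    """Express each value as standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical raw scores with mean 5 and standard deviation 2.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(raw)
print(standardized)
```

The largest raw score, 9, sits two standard deviations above the mean, so its standardized score is 2.0. A score of 3.364 is read the same way.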
There is another measure that's related called Cook's Distance and this doesn't necessarily mean that these were outliers in an absolute sense, but they are the most extreme cases. The next one down is a graph of the effects of various predictor variables. We have Regression as a search term but transformed because it's removed the outliers and then Totally Lost and then Degree was also transformed by removing outliers. This is a Diagram View. You can also get a Table View and you can even expand this to see the various terms.
If you need an analysis of variance table for whatever purpose, here it is. I'm going to skip over to the next box and here we have coefficients. The coefficients are the actual numbers that you use to multiply things by. The Intercept is in there and then we have Regression, and Totally Lost, and Degree. Please note that the Degree one is a different color because it's a negative coefficient. This would become clearer if we come down and instead of having the diagram we look at the table. Here, we can now see the coefficients.
The Intercept, that is the standard value that we give to everybody, is 0.87. So we assume that a state is 0.87 standard deviations above the mean in their interest in SPSS. Then for every standard deviation above on Regression, we add another half of a standard deviation. For every standard deviation above on Totally Lost, we add a little over a half, 0.58. On the other hand, for every percentage point of the population that has a Bachelor's degree or higher, we subtract 0.03 standard deviations, and so this is another way of looking at the relative contribution of the variables.
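The way these coefficients combine into a prediction can be sketched as a small Python function. The intercept, 0.58, and -0.03 are the values read from the table; the 0.50 for Regression is a rounding of "another half", since the exact figure is not stated. The function name and the 29% example input are assumptions for illustration, and the real model applies to the transformed (Winsorised) predictors.

```python
INTERCEPT = 0.87        # value given to every state
B_REGRESSION = 0.50     # per SD of "Regression" searches (rounded, assumed)
B_TOTALLY_LOST = 0.58   # per SD of "Totally Lost" searches
B_DEGREE = -0.03        # per percentage point with a Bachelor's or higher


def predict_spss_interest(regression_z, totally_lost_z, pct_bachelors):
    """Predicted SPSS search interest, in standard deviations."""
    return (INTERCEPT
            + B_REGRESSION * regression_z
            + B_TOTALLY_LOST * totally_lost_z
            + B_DEGREE * pct_bachelors)


# Hypothetical state: average on both search terms, 29% with a degree.
print(round(predict_spss_interest(0.0, 0.0, 29.0), 2))
```

The intercept looks large only because the degree variable is in raw percentage points rather than standard deviations: for this hypothetical state the 0.87 is cancelled almost exactly by 29 times 0.03, leaving a prediction of about zero.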
I am going to scroll down a little further. We have another one here that gives estimated means charts and these are straight lines, because these are just the slopes of the lines that we give in the coefficients. I don't think there is anything terribly important there, so I'll skip to the next one. This is a table that shows us the three variables that got included and then across the top is the information criterion and you can see that the number goes down. It starts at -52 and when they add Totally Lost, it goes to -73. Now, it adds Degree. It goes down to -75 and that was the criterion for deciding whether to include a variable: whether it lowered the value on the information criterion.
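The "add a variable only if it lowers the criterion" rule can be sketched in plain Python. This is a simplified illustration, not SPSS's algorithm: it only ever adds variables, it uses one common textbook form of AICc, and the data, the names x1 to x3, and all helper functions are assumptions made for the example.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sse(columns, y):
    """Sum of squared residuals of a least-squares fit with intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(rows[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, rows[i]))) ** 2
               for i in range(n))


def aicc(rss, n, k):
    """Assumed AICc form: n*ln(RSS/n) + 2k + 2k(k+1)/(n-k-1)."""
    return n * math.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)


def forward_select(predictors, y):
    """Add the best remaining predictor while it lowers the AICc."""
    n, chosen = len(y), []
    best = aicc(sse([], y), n, 1)
    trace = [("(intercept only)", best)]
    while len(chosen) < len(predictors):
        scored = []
        for name in predictors:
            if name not in chosen:
                cols = [predictors[c] for c in chosen + [name]]
                scored.append((aicc(sse(cols, y), n, len(cols) + 1), name))
        score, name = min(scored)
        if score >= best:  # no candidate lowers the criterion: stop
            break
        chosen.append(name)
        best = score
        trace.append((name, score))
    return chosen, trace


# Hypothetical data: y depends strongly on x1, weakly on x2, not on x3.
x1 = [float(i) for i in range(20)]
x2 = [float((2 * i) % 5) for i in range(20)]
x3 = [float((3 * i) % 4) for i in range(20)]
noise = [((4 * i) % 7 - 3) * 0.1 for i in range(20)]
y = [2 * a + b + e for a, b, e in zip(x1, x2, noise)]

chosen, trace = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
for step_name, value in trace:
    print(step_name, round(value, 1))
```

Each row of the printed trace plays the same role as a column in the table described here: the criterion value after one more variable has been admitted, with each value lower than the one before it.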
The very last thing is just a quick summary. You can click on it to see what got included and what the options were. It's just a quick written summary of the entire model. So the Automatic Linear Modeling function in SPSS is a fabulous option for those who want to make a sophisticated analysis and have thorough reporting options without having to make a million decisions on their own. It makes it much, much easier to sift through a large dataset and see what useful patterns might emerge.
I encourage you to spend some time to check out all of its options because there is more than I've covered here and explore how it might be able to help you in understanding your own data.