From the course: Machine Learning & AI Foundations: Linear Regression

Building effective scatter plots in Chart Builder - SPSS Tutorial

- [Instructor] Okay, let's get started by talking about scatter plots. Our broader subject is simple linear regression, which is the prediction of one scale variable with one other variable, and there's no better way to get a first look at that relationship than scatter plots. So, in our resources folder, there is a file called Auto MPG Modified, and we can simply double-click on that, and that's gonna launch SPSS. And you'll notice as it's loading that it says IBM SPSS Statistics Subscription. IBM started offering the subscription with version 25, but everything I'm gonna be showing you would apply with any recent version.

Okay, so the first thing we're gonna do is go to the data window. And let's take a look at this dataset. Auto MPG is a modified version of a file that I got from the well-known and very useful UCI data repository. Here it is. I've just made a couple of minor modifications to it, and that's what we're gonna be working with for this scatter plot. I really recommend this repository. It's a great source for practice files. Let's take a quick look at the file. We see on the far right-hand side we have the name of the vehicle, and we've got miles per gallon, cylinders, displacement, and several others. What we're gonna do is pretend that our focus at the moment is predicting miles per gallon, so that will be our dependent variable, using one of the other variables. And I'm gonna go ahead and choose weight as my single independent variable.

SPSS is a large, complicated piece of software, so there are often a lot of options for doing the same thing. I'm gonna recommend Chart Builder, and that's what we're gonna use. Let's briefly talk about this warning message, and then I'm gonna choose the option not to show this again. It reads, "Before you use this dialog, measurement level should be set properly for each variable in your chart."
Let me check off "Don't show this dialog again," click on OK, and I'm gonna briefly cancel out of Chart Builder, and walk you through what it's talking about. See the symbols next to the variables here. We have miles per gallon, it has a ruler next to it. Cylinders has some circles. Model year has these three bars. It's terribly important that those are declared properly. If you go to the Variable View, you can see where they can be declared. Chart Builder will automatically adjust its settings based upon these variable types. A scale variable is something like height or weight, where decimal places and so on make sense. Nominal variables are separate and distinct categories, and ordinal variables like model year are also separate and distinct categories, but ones that are meaningfully ranked. Since we don't talk about a model year like 85 1/2, it really should be declared as ordinal, and not as scale.

If we return to Chart Builder, those same symbols are visible here, and we can start making our scatter plot. So, we're gonna go down to Scatter/Dot, and there's a new feature in version 25, a choice called simple scatter plot with fit line. You may find that that choice is not available to you. If not, don't worry about it. You can add the fit line in a later step. But I'm gonna choose this one. Okay, so I've dragged the scatter-plot symbol up to the canvas, and now I'm gonna choose miles per gallon as my Y, and weight as my X, which I'm gonna drag over here. Now, your dependent variable always goes on the Y-axis, and your independent variable always goes on the X-axis. It's just a rule, it's always done that way. Your audience would be very disoriented by any report where you didn't follow that convention. I'm just gonna go ahead and click OK now.

And congratulations, we've made our first scatter plot. So, let's just briefly pause and observe here. We can see that there's a regression line that's been added, that thin line that's going diagonally through it.
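For readers who want to see what a fit line like this represents numerically, here is a minimal sketch, assuming the line is an ordinary least-squares fit (the transcript does not spell out the method). The `fit_line` helper and the five data points are hypothetical and are not rows from the Auto MPG file.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical values for illustration only (not taken from the Auto MPG file).
weight = [2000, 2500, 3000, 3500, 4000]  # independent variable, X-axis
mpg = [34.0, 28.0, 24.0, 19.0, 15.0]     # dependent variable, Y-axis

intercept, slope = fit_line(weight, mpg)
print(f"mpg is roughly {intercept:.1f} + ({slope:.4f}) * weight")
```

The negative slope corresponds to a line running downward from left to right: in this made-up data, heavier cars get fewer miles per gallon.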
The line seems to fit the data reasonably well, but there's a bit of a curve here, so it's worth investigating some of the many Chart Builder features that allow us to dig deeper into this scatter plot. So, what we can do is double-click on the chart, double-click on the line, and we see, I wasn't close enough. Double-click on the line, there we go. And you see that a straight line is not the only choice. We could see if a quadratic fit was a better fit. It looks like, in this case, it might possibly be. We're not gonna further investigate curvilinearity now, but there are lots of features hiding inside Chart Builder that you can interact with, that help you explore and better understand your data.

Let's take a look at another one of those features. I'm gonna close this, return to Chart Builder, and now do a colored scatter plot. Drag the colored symbol up, and I'm gonna make origin the color. Just as it's important that you declare nominal, ordinal, and scale, you must also label your data when necessary, and I've added labels for Japan, Europe, and USA. I did that step, the raw data did not have it. By adding that color, we actually can see a pattern right away. There are almost no red or green dots above 3,500 in weight. The European and Japanese cars tend to be lighter. Now, there are a number of light USA cars, but the heavy cars seem to all be American cars.

There are other features hiding within Chart Builder that can help us further understand this. For instance, we've seen this filter region over on the right-hand side. I can revert back to a black-and-white chart, and drag origin into that area, and tell it that I wanna see only Japan. I get a very different scatter plot. I could use other variables, and filter in that way. Let's do one final variation on our scatter plot. I'm gonna remove origin from the filter, and I'm gonna go to Groups and Point ID, and I'm gonna ask for a row panel variable.
And now I'm gonna try origin over here, and I actually get three different scatter plots, showing each of the three regions separately. Chart Builder can be a powerful way to explore your data, and you always wanna begin with visualization, particularly when you're building a more complicated regression.
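Outside of SPSS, the filter and panel ideas come down to selecting or splitting rows by a categorical variable before plotting. The following is a rough sketch of that idea in plain Python, not a description of how Chart Builder works internally. The `cars` rows and the `panel_by_origin` helper are hypothetical and are not taken from the Auto MPG file.

```python
from collections import defaultdict

# Hypothetical (origin, weight, mpg) rows for illustration only.
cars = [
    ("USA", 4300, 14.0),
    ("USA", 3400, 19.0),
    ("USA", 2200, 30.0),
    ("Europe", 2100, 29.0),
    ("Europe", 2700, 24.0),
    ("Japan", 2000, 33.0),
    ("Japan", 2400, 31.0),
]


def panel_by_origin(rows):
    """Split rows into one list of (weight, mpg) points per origin, like a panel variable."""
    panels = defaultdict(list)
    for origin, weight, mpg in rows:
        panels[origin].append((weight, mpg))
    return dict(panels)


# Filtering: keep a single category, as when showing only Japan.
japan_only = [row for row in cars if row[0] == "Japan"]

# Paneling: one group of points per category, each of which would get its own plot.
panels = panel_by_origin(cars)

# A simple per-group summary: the heaviest car in each origin group.
heaviest = {origin: max(w for w, _ in points) for origin, points in panels.items()}
print(heaviest)
```

Each entry in `panels` holds the points for one of the three separate scatter plots, and the per-group maximum weight is the kind of pattern the colored plot makes visible at a glance.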
