From the course: Machine Learning and AI Foundations: Clustering and Association

Looking at the data with a 2D scatter plot

- [Instructor] We're going to start with something pretty straightforward. We're going to simply open our data file and look at two of the variables visually using a scatter plot. And then, we're going to talk about what cluster analysis does with the data. So in the Resources folder, there is a data file that's all prepped and ready to go for cluster analysis called ReadyForCluster. That's the file we're going to use.

Okay, let me orient you to the data file. We've got a series of variables. Product category, sales amount, entertainment. We have product category, game consoles, sales amount. So we have one, two, three, four, five, six, seven product categories in which we have the customers' total spend. And then, over on the right-hand side of the data file, we have the same seven repeated, but not in dollar amounts. It's a spending ratio. For now, we're going to focus on the dollar amounts, in other words, the values closer to the original raw data, not the ratios.

Later in the course, we're going to learn two additional things about this data file. We're going to learn why, in cluster analysis, you're going to want to use the ratios and not the dollar amounts. And then, also, I'm going to be walking you through how I created this dataset from the raw data, which basically would be receipts generated at point of sale.

Okay, so let's go into Graphs and Chart Builder. And we're going to do a very simple scatter plot here. If you're using a tool other than SPSS, you'll certainly be able to generate a scatter plot very easily. And we're going to go ahead and choose two of the variables. I've chosen them because they're rather different from each other in the pattern that they show in the data. Video Games and Hardware, just those two. And we're going to click on OK. Here's our scatter plot. There are a couple of things about this data that make for a rather unattractive scatter plot.
It just doesn't look the way we expect a scatter plot to look. Certainly, this isn't showing a strong linear relationship between these two variables. But here's what's real-world about this. The first thing that's real-world about it is that there are a lot of cases. And your real-world data's going to have a lot of cases too. Even though this is customers and not transactions, that's going to make it hard to read a scatter plot like this. The other thing that makes it a little bit hard to see what's going on is that both of these variables are highly, highly skewed, meaning that there are tons of cases that have near zero spend and then a handful of outliers that are spending way more than everybody else. That is absolutely what you're going to see in your real data. This kind of point-of-sale data, of course, is not the only raw material you might use for cluster analysis, but real-world data tends to be skewed this way with lots of zeros.

So let's add a little bit of visual help here to see what's going on. I'm going to double-click on this to put it into Chart Editor. And I'm going to go up to Elements, Fit Line at Total, but I'm going to make two requests. I'm going to ask for the mean of Y, 'cause I'm not looking for a regression line here. I'm not looking for a trend line. I want the mean of Y and I don't need the formula or anything. So that will be kind of a visual indicator of where the typical spend is. And then, I'm going to do the same with a vertical line as well. I'm going to go ahead and add one. So I'm going to add a vertical reference line and I'm going to have that set to the mean.

So this is really typical raw material that you might try to send to a cluster analysis. So again, what's typical about this is we have a ton of data piled up at zero, zero. So let's take a look at this visually. We've essentially broken the data into four categories or segments, haven't we?
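
For anyone following along outside SPSS, the two mean reference lines and the four regions they create can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spend values are synthetic (mostly zeros with a long right tail), and the names `synthetic_spend`, `video_games`, `hardware`, and `quadrant` are hypothetical and not part of the course files.

```python
import random
import statistics

random.seed(0)  # fixed seed so the synthetic example is repeatable

def synthetic_spend(n, p_zero=0.7, scale=50.0):
    """Hypothetical skewed spend: mostly zeros plus a long right tail."""
    return [0.0 if random.random() < p_zero else random.expovariate(1 / scale)
            for _ in range(n)]

# Made-up stand-ins for the two columns plotted in the video.
video_games = synthetic_spend(1000)
hardware = synthetic_spend(1000)

# The two reference lines sit at the mean of each variable.
mean_vg = statistics.mean(video_games)
mean_hw = statistics.mean(hardware)

def quadrant(vg, hw):
    """Label a customer by which side of each mean line they fall on."""
    vg_side = "high" if vg > mean_vg else "low"
    hw_side = "high" if hw > mean_hw else "low"
    return f"{vg_side} video games / {hw_side} hardware"

# Count how many customers land in each of the four regions.
counts = {}
for vg, hw in zip(video_games, hardware):
    label = quadrant(vg, hw)
    counts[label] = counts.get(label, 0) + 1

for label, count in sorted(counts.items()):
    print(f"{label}: {count}")

# With skewed data the median sits well below the mean.
print("median video games spend:", statistics.median(video_games))
print("mean video games spend:", round(mean_vg, 2))
```

With data this skewed, the median falls well below the mean, which is why so many points end up in the low/low corner of the plot.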
The first segment we can think of is all those folks that are near zero on Hardware and near zero on Video Games. That constitutes a clump or a cluster of our data. The next group that we could describe is the folks that had high spend on Video Games, but near zero on Hardware. Then obviously, we have the folks that spend a lot on Hardware, but near zero on Video Games. Finally, we have this large area, where they've spent above average on Hardware and above average on Video Games.

Let's pause for a moment and take stock. This all seems fairly straightforward and obvious, and you know, it is. Let's think about what we've just done. We're looking at just two variables and we're describing what dots are close to what other dots. Bottom line, folks, you don't run cluster analysis on two variables or frankly, even three. What cluster analysis is doing is using math to find the same kinds of groups, and they're very often defined by low-low, high-high, low-high, and the reverse, high-low, right. But we're doing it on 10 or 20 or 30 variables, which is way beyond what we can do visually. So that's really all cluster analysis is. It's looking for which cases are proximate to which other cases, literally measuring the distance.
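
To make "literally measuring the distance" concrete, here is a minimal Python sketch that finds each customer's nearest neighbor across seven spend variables. It assumes Euclidean distance, which is one common choice; the transcript does not say which distance measure is used. The customer names and spend figures are made up for illustration.

```python
import math

# Hypothetical customers: spend in seven product categories (made-up numbers).
customers = {
    "A": [0, 0, 5, 0, 0, 0, 0],
    "B": [0, 2, 4, 0, 0, 1, 0],
    "C": [250, 0, 0, 0, 300, 0, 0],
    "D": [240, 10, 0, 0, 310, 0, 5],
}

def nearest(name):
    """Return the other customer with the smallest Euclidean distance."""
    others = (other for other in customers if other != name)
    return min(others, key=lambda other: math.dist(customers[name], customers[other]))

for name in customers:
    print(name, "is closest to", nearest(name))
```

Here A and B pair up as low spenders, while C and D pair up as heavy spenders in the same two categories. The same calculation works unchanged with 10, 20, or 30 variables, where a scatter plot no longer helps.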
