Learn how to perform a cluster analysis using Python and how to interpret the results.
- [Instructor] In R, we grouped our customer data into three consumer cohorts for segmentation. Here in Python, we're going to pop the hood a little more on the overall concept. I've brought our packages in: some of the usual suspects you've seen before in this course and will use often, such as pandas, NumPy, and matplotlib, plus our k-means algorithm. There are two different approaches a cluster analysis can take. There's flat clustering, where you specify how many clusters you want.
And there's hierarchical, or taxonomy, clustering, where the algorithm decides for us. Our algorithm here takes the former approach. Similar to what we did in R, we're going to specify how many groups are made. So let's go ahead and bring in our packages; I'm going to press Shift+Enter here. And let's connect to our data, so I'm going to select this second cell and press Shift+Enter. Now let's have a quick look at our data. I'm going to type myClusterData, call the head command, and pass in a value of three, so we'll get the first three rows of our data set as a sample.
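As a rough sketch of the first cells described above: the transcript does not name the data file, so this builds a small stand-in DataFrame instead of loading one. The values are invented for illustration; only the column names b1, b3, and cta come from the video.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in for the course data set: these values are invented for illustration.
myClusterData = pd.DataFrame({
    "b1": [1, 2, 2, 3, 1],
    "b3": [4, 4, 5, 3, 5],
    "cta": [0.12, 0.48, 0.33, 0.91, 0.27],
})

# head(3) returns the first three rows, which is enough for a quick sanity check.
preview = myClusterData.head(3)
print(preview)
```

In the actual notebook, the DataFrame would come from the course file rather than being typed in by hand.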
So that looks good. Now, there's a new concept I need to introduce at this stage of the game: numerical data can be categorized as either continuous or discrete. So, a quick definition of each. Continuous data can take any value within a range. Cost per acquisition is a good example, because it can range from only a few cents to hundreds, if not thousands, of dollars, depending on the business case. Discrete data, on the other hand, can be grouped into buckets, and there are a finite number of them.
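One informal way to tell the two kinds of data apart is to count how many distinct values a column holds. The sketch below is a rule of thumb, not a formal test: the helper name looks_discrete, its max_distinct threshold, and the sample values are all assumptions made up for illustration.

```python
def looks_discrete(values, max_distinct=10):
    """Rough heuristic (an assumption, not a formal test): treat a column
    as discrete when it holds only a handful of distinct values."""
    return len(set(values)) <= max_distinct

# Illustrative values only, not taken from the course data.
cost_per_acquisition = [0.42, 3.17, 18.95, 126.40, 1042.08, 7.63]
creative_executions = [1, 2, 2, 3, 1, 3, 2, 1]

print(looks_discrete(cost_per_acquisition, max_distinct=3))  # False: 6 distinct values
print(looks_discrete(creative_executions, max_distinct=3))   # True: 3 distinct values
```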
The number of creative executions in a campaign is a good example, because generally these are limited and categorized in some way. Generally speaking, k-means is going to provide the most value when you're working with continuous data. Let me give you an example. Let's go ahead and plot two of the columns from our data set. I'm going to plot b1 and b3, so I'll do that with the plt command and we'll generate a scatter plot.
We'll call in our data here, specifically our subset b1, and then we'll plot that against b3, so myClusterData.b3. Let's go ahead and show that plot with plt.show and press Shift+Enter. What we're seeing here shows us that the data we just plotted is discrete.
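The plotting step just described might look roughly like this. The data is a stand-in with invented values, and the Agg backend line is an assumption added so the sketch also runs without a display; in a notebook you would drop that line and finish with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data with invented values; the video loads myClusterData from the course file.
myClusterData = pd.DataFrame({
    "b1": [1, 2, 2, 3, 1, 3, 2, 1],
    "b3": [4, 4, 5, 3, 5, 3, 4, 4],
})

fig, ax = plt.subplots()
ax.scatter(myClusterData.b1, myClusterData.b3)
ax.set_xlabel("b1")
ax.set_ylabel("b3")

# Few distinct values per column means the points stack on a coarse grid.
print(myClusterData.b1.nunique(), myClusterData.b3.nunique())
```

With discrete columns like these, many points land on exactly the same coordinates, which is why the scatter plot looks like a grid rather than a cloud.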
There are a finite number of values in these two columns, so this would not be a great candidate for k-means. Again, we're going to get under the hood here. Instead of just calling our data in, I've gone ahead and explicitly stated the data, so that we have an array of numbers for x and another for y. This is a small sample from our b3 column for the x and cta for the y, so let's go ahead and run this. And now let's plot these values.
So I'm going to type in plt again. We want a scatter plot, specifically of the x and y values that we just loaded in the previous cell. Let's go ahead and show that with Shift+Enter. Here we can see in our plot that we have a bit more variety; our data seems to be much less categorical. And we can see what appear to be these random groups. This is the shape of data that tends to work best for a cluster analysis of this sort.
Now, I've gone ahead and sorted the x and y values from above and organized them into the two-dimensional array that you see here. We did the same sort of transformation on the data in R; it's just that our algorithm managed all of that for us. And here again, what we're trying to do is pop the hood a little bit, so you can see more of what goes into the process of creating these groups. So let's go ahead and run this to load those data points in. And now let's write a set of procedures that are going to do a few things.
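The video types the two-dimensional array out by hand. As a sketch of the same reshaping done programmatically, np.column_stack pairs the two lists into one row per observation. The numbers are invented for illustration, and the name X for the combined array is an assumption.

```python
import numpy as np

# Illustrative values only; the video uses a small sample of b3 for x and cta for y.
x = [1.0, 1.5, 5.0, 8.0, 1.2, 9.0]
y = [2.0, 1.8, 8.0, 8.0, 0.6, 11.0]

# One row per observation, one column per feature: shape (n_samples, 2).
X = np.column_stack((x, y))
print(X.shape)  # (6, 2)
```

This rows-by-features layout is what scikit-learn estimators expect as input.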
We're going to specify how many clusters we want our algorithm to generate. We're going to run that algorithm. We're going to assign our centroids, because I'd like to visualize those so you can see how they work. And then we're going to label our groups. So let's go ahead and declare a variable called myGroups and tell our k-means package that we want three clusters. To do that, we type myGroups = KMeans, which comes from our KMeans import above, and we specify the number of clusters by typing n_clusters and setting it to three.
Let's now run that algorithm. We type myGroups.fit and pass it x, which is what we called our two-dimensional array above. Next, we're going to create a variable that takes the output from our algorithm and holds the centroids. We mentioned those in the R video, but now you'll be able to visualize them: when we plot the output from our algorithm, you'll see them on the screen.
So we type centroids, which declares a variable, and assign it the cluster centers from myGroups. That takes the cluster_centers_ attribute of our fitted model. Now let's assign the labels from our definition of myGroups above. We'll do that with labels = myGroups.labels_, and we'll run this. Our next cell creates a for loop that plots each point on the graph.
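Put together, the steps just described could be sketched as follows. The data points are invented for illustration, and the n_init and random_state arguments are additions (not mentioned in the video) that make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented sample points forming three loose groups; not the course data.
X = np.array([
    [1.0, 2.0], [1.5, 1.8], [1.2, 0.6],
    [5.0, 8.0], [6.0, 9.0], [5.5, 8.5],
    [9.0, 1.0], [9.5, 1.5], [10.0, 0.5],
])

# n_clusters=3 is the choice made in the video; n_init and random_state are assumptions.
myGroups = KMeans(n_clusters=3, n_init=10, random_state=0)
myGroups.fit(X)

centroids = myGroups.cluster_centers_  # one (x, y) row per cluster
labels = myGroups.labels_              # cluster index assigned to each row of X

print(centroids.shape)  # (3, 2)
print(len(labels))      # 9
```

The cluster numbers in labels are arbitrary: the same grouping can come back numbered differently from run to run, so compare group membership rather than the label values themselves.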
This set of code essentially lets us visualize the results of all the commands we just wrote. We've declared a color palette here. Then we have this for loop that goes through and plots each of our data points, and after that we plot the centroids so we can see those too. So, I'm going to go ahead and run this. X marks the spot, and we can see that each of our three groups is organized around its respective centroid, which is the mean in k-means.
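A minimal sketch of that plotting cell, assuming a palette of matplotlib format strings and invented data points. The Agg backend line is an assumption so the sketch runs without a display; in a notebook you would drop it and finish with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Invented sample points; not the course data.
X = np.array([
    [1.0, 2.0], [1.5, 1.8], [1.2, 0.6],
    [5.0, 8.0], [6.0, 9.0], [5.5, 8.5],
    [9.0, 1.0], [9.5, 1.5], [10.0, 0.5],
])
myGroups = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = myGroups.cluster_centers_
labels = myGroups.labels_

# Color palette: one matplotlib format string (color + point marker) per cluster.
colors = ["g.", "r.", "b."]

fig, ax = plt.subplots()
for i in range(len(X)):
    # Plot each point in the color of the cluster it was assigned to.
    ax.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Draw the centroids on top as large X markers.
ax.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=3, zorder=10)
```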
It establishes the best fit by calculating and settling on the most efficient mean for each of the groups we told it to create. In a later video, we will discuss the best practice of running market tests. The output from our clustering algorithm can provide us with a hypothesis to test, which would be the next step in determining the efficacy of our segmentations.
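To see why the centroid is "the mean in k-means", here is a from-scratch sketch of the basic loop: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. This is a teaching illustration with invented data and hypothetical function names, not the scikit-learn implementation, which adds smarter initialization and multiple restarts.

```python
def mean_point(points):
    """Return the coordinate-wise mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def sq_dist(a, b):
    """Squared Euclidean distance between two (x, y) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def simple_kmeans(points, centroids, max_iter=100):
    """Basic assign-then-update loop, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centroids.append(mean_point(members) if members else centroids[k])
        if new_centroids == centroids:
            break  # centroids stopped moving, so the grouping is stable
        centroids = new_centroids
    return centroids, labels

# Invented points and starting centroids, for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = simple_kmeans(points, [(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each final centroid is simply the average position of the points in its group, which is what the X markers in the plot represent.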
In this course, discover how to gain valuable insights from large data sets using specific languages and tools. Follow Chris DallaVilla as he walks through how to use R, Python, and Tableau to perform data modeling and assess performance. As Chris dives into these concepts, he shares specific case studies that come directly from his own work with clients. Plus, he shares three essential—and practical—best practices for data-driven marketing that you can use to bolster your organization's marketing performance.
- Installing R, Python, and Tableau
- Navigating the UI for R, Python, and Tableau
- Using R, Python, and Tableau
- Exploratory analysis
- Performing regression analysis
- Performing a cluster analysis
- Performing a conjoint assessment
- Stakeholder alignment