Learn the commonly used K-means clustering algorithm to group subsets of data according to similarity.
- [Narrator] Let's work with the KMeans clustering algorithm. I'll start an instance of pyspark, and I'll clear the screen, and as usual, we'll import some code. The first package I want to import is the vectors from the linear algebra package. And I also want to import something called the vector assembler.
And then finally, I want to import the KMeans algorithm. Now, I have all the code I need imported, so I'm going to create a data frame with some data to work with. If you have access to the exercise files, then you'll be able to load the clustering data set CSV file. I've copied it to my home directory, and I'll load it from there. So, I'll create cluster_df, or cluster data frame, and I'll reference the Spark session, and read a CSV file.
This file has a header, so I'll specify header equals true. I also want to infer the schema, because I'm working with numeric data, so I'll specify inferSchema, and specify true. So, I've loaded the cluster data, and we'll see that it has three columns, and we can take a look at some of the data here. What you'll notice is, in the first 20 rows of data all of the values are between one and 10.
If I show more of the data, and I show for example all 75 rows, we'll notice that the data is grouped into three clusters. The first 25 rows or so all have values between one and 10. The second set of 25 rows all have data values between about 15 and 60. And then finally, the third set have data values between 60 and 100.
So, this naturally groups into three different clusters. Now, the next thing I want to do is do a little transformation to put these columns into a feature vector, and I'm going to do that by using the vector assembler. So, the first thing I'll do is clear the screen, and I'm going to create a vector assembler, which is one of those preprocessing transformations that's really quite handy. What it does is it takes a list of input columns, in our case, the columns are simply named col1 through col3.
And, I'm going to map that to a single vector column, and that output column will be called features. And now, I'm going to create a new data frame, which I'll call vcluster_df, and that's short for vectorized cluster data frame. And I'll call the vector assembler that I just created, and I'll apply transformation, and I'm going to transform the data that's in cluster_df.
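In plain Python terms, what the assembler does is pack each row's input columns into a single features value alongside the original columns. This is just a conceptual sketch, not Spark code, and the sample rows are made up for illustration:

```python
# Plain-Python sketch of what VectorAssembler does conceptually: each
# row's input columns are packed into a single "features" vector.
rows = [
    {"col1": 7, "col2": 4, "col3": 1},
    {"col1": 85, "col2": 90, "col3": 72},
]

def assemble(row, input_cols=("col1", "col2", "col3")):
    out = dict(row)  # keep the original columns, like the transformer does
    out["features"] = [row[c] for c in input_cols]
    return out

vectorized = [assemble(r) for r in rows]
print(vectorized[0]["features"])  # [7, 4, 1]
```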
So, now let's take a look at this vcluster data frame. What we have, I'll just scroll up a little bit here, is our three original columns, plus a fourth column called features, which is a vector, and that vector contains the values that were in col1, col2, and col3. Now, we've done that because the KMeans algorithm works with that feature vector column. So now, let's set up the KMeans algorithm. First thing we'll do is create a KMeans object, by calling the KMeans constructor, and we'll set the number of clusters, or K, to three.
Now, another thing we can do is to set the seed, which determines where the KMeans algorithm starts, and this is useful if you're doing testing, to have consistency. So, I'll take the KMeans object and set the seed to one. The next thing I'll do is fit our data. So, we'll create a model, we'll call it kmodel, and this will be built using KMeans; we'll call fit, and the data we want to fit to this model is vcluster_df, because that contains the feature vector.
Now, you'll notice a couple of error messages, or warning messages here. This is just indicating that the BLAS library, which is a basic linear algebra library, wasn't able to load. That has no effect on the outcome. BLAS is useful for speeding up some linear algebra operations, but is by no means needed for the volume of data that we're working with. So, I'm just going to clear the screen now. And now what I want to do is find the centers of these clusters. So, I'll create an object called centers, and I'll go to my kmodel, and call the function clusterCenters.
And let's take a look at centers. What you'll notice here are three points. These points represent the centers of the three different clusters. One cluster is centered around the point 35, 31, 34, so that's our mid cluster. Another is centered around five, five, and five; that's our smallest cluster, our first set of 25 rows. And then the third point is the center of the cluster for the final set of rows, which have values between 60 and 100.
Now in this case, the center point is around 80, 80, 80. So intuitively, this makes sense. KMeans has been able to discover three clusters, and it has centered them as we would expect.
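For intuition about how those centers are found: k-means alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. Here is a minimal plain-Python sketch of that loop, with a deterministic initialisation for repeatability; Spark's actual implementation differs:

```python
import math

def kmeans(points, k, iters=20):
    """Minimal Lloyd's-algorithm sketch (not Spark's implementation):
    assign each point to its nearest center, move each center to the
    mean of its assigned points, and repeat."""
    # Deterministic farthest-point initialisation keeps the demo repeatable.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group.
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

# Synthetic points echoing the three ranges described in the narration.
points = [(2, 3, 4), (5, 5, 5), (6, 4, 7),
          (33, 35, 31), (36, 30, 34), (35, 33, 36),
          (79, 82, 80), (81, 78, 84), (80, 80, 79)]
centers = sorted(kmeans(points, 3))
print(centers)  # one center near each group: ~(4,4,5), ~(35,33,34), ~(80,80,81)
```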