Learn how to import the MLlib package for regression analysis and create labels and features.
- [Instructor] One of the key steps in machine learning is actually getting your data ready to be used, so we're going to take a look now at preparing data for machine learning. Step one, we'll pull in some external data: we'll read it in and cleanse it, cleaning up any irregularities we might find; we'll aggregate it and convert it to a format we can use; and lastly we'll build the features and labels we use for regression analysis. I'm over here in Databricks now, and I've loaded the exercise file, 4.2, Preparing Data for Machine Learning.
Now, if you're looking for a cluster and you don't have one, it's because it's died: in the Community Edition, a cluster terminates after 60 minutes of inactivity. You can create a new one directly from here, and for this machine learning example we want a version that is 1.6. I do that because not everything from version 1.6 for machine learning has been ported over to the latest version, Spark 2.0.1. So I want to teach it in this version so that it's clean and you can follow along, at least until all of the functions have been ported over.
So once you've created the cluster there, you can select it, see that it's pending, and wait for that to finish before we move on. With our cluster created, first, as I mentioned, we're going to download some external data. Here we're using another magic function, %sh, which runs a shell command: something you would execute in a terminal window, or at the command line if you're familiar with Windows. And we're going to issue a curl command. curl is a really old program that basically just goes out to a specific location and retrieves whatever is at that location.
So here, what we're doing is saying curl -O (that's a capital letter O, not a zero), and giving it the location of a CSV file that we're going to download. So we're actually going to go out and download that data. I hit Play on this, and you can see the output down below: it actually went out and downloaded that data for us. Now, that data is stored in the databricks slash driver location, and we can go through and try to find it if you want. You could go take a look at any other files in there, but this is how you access that data. I have my path; now I'm going to actually read that data into a DataFrame.
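The download step above can be sketched as follows. The actual course cell points at a real CSV URL, which isn't reproduced here, so this self-contained version fetches a local file through curl's file:// scheme; the -O flag behaves exactly the same for an http(s) URL, saving the file under its remote name in the current directory.

```shell
# Sketch of the %sh / curl download step. The course cell uses a real
# course-asset URL; here we fetch a local file via file:// so the
# example runs anywhere. -O saves the file under its remote name.
src=$(mktemp -d)
printf 'OrderMonthYear,SaleAmount\n2010-01-01,1234.5\n' > "$src/sales.csv"
cd "$(mktemp -d)"                   # stand-in for /databricks/driver
curl -s -O "file://$src/sales.csv"  # same flag you would use for https://...
ls -l sales.csv
```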
So, sqlContext.read.format, giving it CSV. Then I have a backslash here to continue onto a new line. And I'm going to tell it there is a header and to infer the schema, and then load the path that I created already. Next I'll cache that, so it will actually be kept in memory and we can reuse it and perform operations much faster. We'll call the dropna function, which removes rows with missing values; so we're cleansing our data a little bit. And then we'll display the results there at the bottom. Okay.
So let's hit Play. You can see the Spark jobs executing, and when they're done, we have our data set, so we know we have something that's going to be good to work with. Next, we're going to aggregate and convert it. Just like before, we're going to use the DataFrame API here to actually perform some operations. We're going to create a new DataFrame called summary: we select OrderMonthYear and SaleAmount, group by OrderMonthYear, sum the sale amounts, order by OrderMonthYear, and then convert it to a DataFrame.
So we're performing several operations here in one line. Now that we've pulled that data in and aggregated it, we have the monthly sales totals. Next, we want to convert that first column to just a number. So we have a little function here: we're doing a map, which is kind of like a for loop in Python, with a lambda function that takes each row. From there we take the OrderMonthYear, replace the dashes, and convert it using the int function. So whereas before you had something like 2010-01-01.
So we're going to remove all of that so it's just numbers. Then we return the SaleAmount, and lastly convert the result to a DataFrame. We'll take a look at the results here: two jobs that took about four seconds, and we now have essentially two different DataFrames we can use to do our machine learning. Okay, the last step is to actually convert the DataFrame to features and labels. This is something specific to regression, but it's what we're going to use throughout the rest of this chapter. So we need to import from the pyspark.mllib.regression module.
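The per-row conversion just described (strip the dashes, cast to int, keep the sale amount) is plain Python inside the lambda, so it can be checked without a cluster. The function name here is just for illustration:

```python
# The logic inside the map/lambda, pulled out as a plain function so it
# can be tested without Spark. In the notebook the same expression runs
# inside summary.map(lambda r: ...).
def to_row(order_month_year, sale_amount):
    # "2010-01-01" -> 20100101: remove the dashes, then convert to int
    return (int(order_month_year.replace("-", "")), sale_amount)

print(to_row("2010-01-01", 1234.5))  # -> (20100101, 1234.5)
```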
We're going to import LabeledPoint, a class we can use to build the labels and feature points that we use to do linear regression. So we import that module, and then we create a new DataFrame: we take the results from above, the ones we aggregated and converted to integers, and use this LabeledPoint class to create the actual features that we're going to use to then perform our regression. Lastly, we call toDF, which converts it to a DataFrame, and we display our results.
I click Play on this, and you can see the results are a little bit different from what you might expect, and that's exactly what we wanted to happen: we created a feature whose value is that integer representing the year, month, and day, and our actual sales amount is now found in the label column.