From the course: Spark for Machine Learning & AI

Gradient-boosted tree regression - Apache Spark Tutorial

Gradient-boosted tree regression

- [Instructor] In this video, we're going to continue from where we left off in the previous video on decision tree regression. What I'd like to do here is introduce gradient-boosted tree regression. In addition to introducing another algorithm, I also want to demonstrate how quickly we can evaluate different algorithms once we have our data loaded and set up in our various data frames.

So the first thing I'll do is import GBTRegressor. Now that I have my regressor imported, I'm going to create an instance of it, and I'll simply call it GBT. I need to specify our features column, which, if you'll recall, is called features. We also need to specify our label column, and in our case that is PE. Now I'm going to create a gradient-boosted tree model by calling fit on the GBT instance we just created, passing in our training data.

So now we have our model, which means we can create predictions. We create predictions by calling transform on our GBT model and passing in our test data.

Now we have our predictions, so we can evaluate them. We already have evaluators created, like the DT evaluator, which we could reuse, but I like to keep my naming consistent. So I'm going to create a GBT evaluator by calling RegressionEvaluator, and I'm going to specify the label column, which is PE, our prediction column, which is prediction, and our metric name, which is root mean squared error, or RMSE for short. So now we have our evaluator. Next I'm going to get the root mean squared error for my gradient-boosted tree model. I'll save that in a variable called GBT_RMSE, and I'll get it by calling our evaluator's evaluate function and passing in the predictions. And now let's look at the root mean squared error. We notice it's about four, so it's a little better than both the decision tree and the linear regression.
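The steps described above could look roughly like the following PySpark sketch. It assumes train_df and test_df are data frames prepared as in the earlier videos, with a vector column named features and a label column named PE. The function name evaluate_gbt and the lowercase variable names are illustrative and are not taken from the course files.

```python
def evaluate_gbt(train_df, test_df, features_col="features", label_col="PE"):
    """Fit a gradient-boosted tree regressor and return its RMSE on test_df.

    Assumption: train_df and test_df are Spark DataFrames that already
    contain the assembled feature vector and the label column.
    """
    # Imported inside the function so the sketch can be loaded without
    # PySpark installed; actually calling it requires PySpark.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import GBTRegressor

    # Create the regressor, naming the features and label columns.
    gbt = GBTRegressor(featuresCol=features_col, labelCol=label_col)

    # Fit a model on the training data.
    gbt_model = gbt.fit(train_df)

    # Apply the model to the test data; this adds a "prediction" column.
    gbt_predictions = gbt_model.transform(test_df)

    # Build an evaluator that compares the label to the prediction using RMSE.
    gbt_evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse"
    )

    # Compute the root mean squared error of the predictions.
    gbt_rmse = gbt_evaluator.evaluate(gbt_predictions)
    return gbt_rmse
```

Calling evaluate_gbt(train_df, test_df) returns a single number. The sketch leaves every GBTRegressor setting at its default, which may or may not match what the instructor used.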
The thing I really want to emphasize, though, is that in just a few lines of code, we were able to evaluate yet another regression algorithm. So keep this in mind as you're working with regression algorithms and classification algorithms. It's very easy to experiment with several of them and see which works best for your data sets.
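One way to make that kind of experimentation routine is a small helper that runs the same fit, transform, evaluate sequence over several algorithms. The following is a minimal sketch, not code from the course: compare_rmse is a hypothetical name, and it only assumes that each estimator has a fit method returning a model with transform, and that the evaluator has evaluate, which is the pattern Spark ML regressors and RegressionEvaluator follow.

```python
def compare_rmse(estimators, evaluator, train_df, test_df):
    """Fit each estimator on train_df and score its predictions on test_df.

    estimators: dict mapping a display name to an unfitted estimator.
    evaluator:  object whose evaluate(predictions) returns an error score.
    Returns a dict of name -> score, ordered from lowest to highest error.
    """
    results = {}
    for name, estimator in estimators.items():
        model = estimator.fit(train_df)          # train on the training split
        predictions = model.transform(test_df)   # predict on the test split
        results[name] = evaluator.evaluate(predictions)
    # Lowest error first, which suits error metrics such as RMSE.
    return dict(sorted(results.items(), key=lambda item: item[1]))


# Hypothetical usage with PySpark (needs the train/test data frames from
# the earlier videos; column names "features" and "PE" as in the transcript):
#
#   from pyspark.ml.evaluation import RegressionEvaluator
#   from pyspark.ml.regression import (
#       DecisionTreeRegressor, GBTRegressor, LinearRegression)
#
#   cols = dict(featuresCol="features", labelCol="PE")
#   evaluator = RegressionEvaluator(
#       labelCol="PE", predictionCol="prediction", metricName="rmse")
#   scores = compare_rmse(
#       {"linear": LinearRegression(**cols),
#        "decision tree": DecisionTreeRegressor(**cols),
#        "gbt": GBTRegressor(**cols)},
#       evaluator, train_df, test_df)
```

Sorting by ascending score is a choice that fits RMSE, where lower is better; for a metric such as r2 the order would need to be reversed.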
