From the course: Spark for Machine Learning & AI

Preprocessing the Iris data set

- [Narrator] It's now time to download and preprocess a data set for our work with classification algorithms. We'll start by downloading the Iris data set from the University of California, Irvine (UCI) Machine Learning Repository. This data set contains data about three species of irises. The features are measurements of two parts of the flower, the sepal and the petal. There is a length and a width measurement for both the sepal and the petal, for a total of four features. The label in this data set is the name of the species. Now, I've already downloaded the data set and saved it to my home directory, so I'll load it from there. I'll verify my directory and start pyspark. Now, I am going to import a number of libraries that we'll be using during this preprocessing video. I'm first going to import some SQL functionality from pyspark SQL, and I'm also going to import the vector assembler, and another preprocessing tool, called the string…
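
To make the row layout described above concrete, here is a minimal plain-Python sketch rather than the pyspark session from the video. It assumes the raw file is comma-separated with no header row, four numeric measurements followed by the species name. The column names in `FEATURE_NAMES` and the helper `parse_iris` are hypothetical, chosen only for illustration, and the two sample rows stand in for the downloaded file.

```python
import csv
import io

# Two sample rows in the assumed layout of the raw Iris file:
# four measurements, then the species name, with no header row.
SAMPLE = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"

# Hypothetical column names; the raw file is assumed to have no header.
FEATURE_NAMES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def parse_iris(text):
    """Split each CSV row into a list of four floats and a species label."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines, such as a trailing newline
            continue
        features = [float(value) for value in record[:4]]
        rows.append({"features": features, "species": record[4]})
    return rows


rows = parse_iris(SAMPLE)
for row in rows:
    # Pair each measurement with its (hypothetical) column name for display.
    print(dict(zip(FEATURE_NAMES, row["features"])), row["species"])
```

In a Spark workflow the same grouping is typically handled by the vector assembler, which combines the four measurement columns into a single vector column, while the species name stays as the label.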
