From the course: Spark for Machine Learning & AI

Bucketize numeric data

- [Instructor] Now let's take a look at how we can organize continuous ranges of data into buckets, or partitions. First, I'll verify my working directory and start pyspark. I'll use Ctrl+L to clear the screen. Next, I'm going to import some code that we need. It lives in pyspark.ml.feature, and from there I want to import the transformation called Bucketizer. Bucketizer allows us to group data based on boundaries, so I need to provide a list of boundaries for Bucketizer to work with. I call those boundaries splits, and I'm going to provide them as a list. At the lower end, I would like anything starting at negative infinity to go in the first bucket. To specify negative infinity, I use this syntax: -float("inf"). From negative infinity up to -10 will be one bucket, from -10 to zero will be another bucket, from zero to 10 will be my next bucket, and everything that's greater than 10 and up to positive infinity…
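
The splits described above can be sketched roughly as follows. This is a minimal illustration, not the instructor's exact code: the variable name `splits`, the helper `bucket_index`, the function `bucketize_with_spark`, and the column names `features` and `bucket` are all assumptions made for the example. The pure-Python helper mimics how Bucketizer assigns values, where each bucket covers a half-open range [x, y) and the last bucket also includes its upper bound. The Spark function is only run if you call it with an active SparkSession.

```python
from bisect import bisect_right

# Boundaries as described in the transcript: four buckets in total.
# -float("inf") is negative infinity, float("inf") is positive infinity.
splits = [-float("inf"), -10.0, 0.0, 10.0, float("inf")]


def bucket_index(value, boundaries=splits):
    """Pure-Python illustration of the bucket a value would fall into.

    Each bucket is [lower, upper); the last bucket also includes its upper bound.
    """
    idx = bisect_right(boundaries, value) - 1
    # Clamp so that a value equal to the final boundary lands in the last bucket.
    return min(idx, len(boundaries) - 2)


def bucketize_with_spark(spark, values):
    """Apply Bucketizer to a list of floats (requires pyspark and a SparkSession).

    The column names "features" and "bucket" are hypothetical.
    """
    from pyspark.ml.feature import Bucketizer

    df = spark.createDataFrame([(float(v),) for v in values], ["features"])
    bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucket")
    return bucketizer.transform(df)


if __name__ == "__main__":
    for v in [-800.0, -10.5, -1.7, 0.0, 8.2, 90.1]:
        print(v, "->", bucket_index(v))
```

In the pyspark shell, where a SparkSession is typically available as `spark`, calling `bucketize_with_spark(spark, [-800.0, -10.5, -1.7, 0.0, 8.2, 90.1]).show()` should display each value alongside its bucket number. Note that a value sitting exactly on a boundary, such as 10, goes into the bucket that starts at that boundary.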
