From the course: Spark for Machine Learning & AI

Introduction to preprocessing

- [Instructor] There are two types of pre-processing: numeric and text pre-processing.

Normalizing maps data values from their original range to the range of zero to one. It's used to avoid problems when some attributes have large ranges and others have small ranges. For example, salaries have a large range, but years of employment has a small range.

Standardizing rescales data values so that they have a mean of zero and a standard deviation of one. The rescaled values are not confined to a fixed range such as negative one to one, and the shape of the distribution does not change: data that already follows a bell curve still does, and skewed data stays skewed. It's used when attributes have different scales, and the machine learning algorithm you're using assumes roughly normally distributed inputs.

Partitioning maps continuous data values to buckets, like the bins of a histogram. Deciles and percentiles are examples of buckets. It's useful when you want to work with groups of values instead of a continuous range of values. If you work…
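
To make the three numeric transformations concrete, here is a minimal sketch in plain Python. It is not code from the course. The function names and the sample data are assumptions chosen for illustration, and the standardizing step uses the population standard deviation, which is one of two common conventions. In Spark itself, the `pyspark.ml.feature` module provides `MinMaxScaler`, `StandardScaler`, and `Bucketizer` for these jobs.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def normalize(values):
    """Min-max scaling: map values linearly onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score scaling: shift to mean 0 and rescale to standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def bucketize(values, splits):
    """Map each value to a bucket index.

    `splits` is a sorted list of interior boundaries. Bucket 0 holds values
    below splits[0]; bucket i holds values in [splits[i-1], splits[i]);
    the last bucket holds values at or above splits[-1].
    """
    return [bisect_right(splits, v) for v in values]


# Hypothetical sample data: a large-range and a small-range attribute.
salaries = [30000, 45000, 60000, 120000]
years_employed = [1, 3, 5, 10]

print(normalize(salaries))
print(normalize(years_employed))
print(standardize(years_employed))
print(bucketize(years_employed, [2, 6]))  # prints [0, 1, 1, 2]
```

After normalizing, salaries and years of employment both run from zero to one, so neither attribute dominates because of its units. The standardized values can fall outside negative one to one, which is why standardizing is described by its mean and standard deviation and not by a fixed range.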
