Understand how to filter collections in parallel.
- [Instructor] Sometimes, when we have large collections, we want to filter them. Scala makes it easy to filter collections so you can find all the members of a collection that meet some criteria. So for example, let's create an array of numbers. We'll create val v, and we'll make this one to 10,000, and let's make it an array. And let's create a parallel version by using the par method. Now let's just check the length of the collections.
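The setup described above can be sketched as follows. This assumes a Scala 2.12 REPL, where `.par` is available on standard collections without extra dependencies; on Scala 2.13+ you would add the scala-parallel-collections module and `import scala.collection.parallel.CollectionConverters._` first.

```scala
// Build an array holding the numbers 1 through 10,000
val v = (1 to 10000).toArray

// Create a parallel version of the collection with the par method
val pv = v.par

// Both versions should contain the same number of elements
val vLen  = v.length   // 10000
val pvLen = pv.length  // 10000
```

The parallel version has the same elements; only the execution strategy for bulk operations changes.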
v.length and pv.length. OK, they're the same. The numbers appear to be the same. So we'll just clear the screen, and we'll move on to our next step. So we have a collection of 10,000 elements. What I'd like to do now is create another value that has the elements from pv, the parallel collection, that are greater than 5,000. So I'm going to make a new value, and I'll call it pvf for the filtered version of pv. And I'm going to define that as pv.filter.
So I'll apply a filter. So for each element of the collection, I want to do a test and see if it is greater than 5,000. And now I have a value called pvf, which contains all the values greater than 5,000. And if we check the length, we'll see that it's 5,000 long, which is what we would expect. Scala parallel collections also have a filterNot method for applying the negation of the filter.
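The filter step above looks like this in the REPL (pv is rebuilt here so the snippet stands alone, assuming the same Scala 2.12 setup):

```scala
// pv is the parallel collection built earlier
val pv = (1 to 10000).toArray.par

// Keep only the elements greater than 5,000
val pvf = pv.filter(_ > 5000)

pvf.length  // 5000: the values 5001 through 10000
```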
So let's create another val, pvf2 we'll call it. And for this we will use pv.filterNot, applied to all the members of the collection, and the condition is greater than 5,000. So we would expect all the values that are 5,000 or less, and it appears we have that. Let's just double-check the length to make sure we got them all.
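And the negated filter, again as a self-contained sketch under the same assumptions:

```scala
val pv = (1 to 10000).toArray.par

// filterNot keeps the elements for which the predicate is false
val pvf2 = pv.filterNot(_ > 5000)

pvf2.length  // 5000: the values 1 through 5000
```

Note that filterNot(_ > 5000) keeps 5,000 itself, since 5000 > 5000 is false.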
Great, we have 5,000. The filter and filterNot methods can take custom functions that return a boolean value. Let's define a function that takes an integer as input and returns a boolean. So first I'll clear the screen so I have a little room to work. Now I'm going to create this function called div3, and I'm going to pass in a single integer. We'll call that x. Now the function itself will return a boolean, so we'll specify that.
And that will define the function, and we will say that we have a value called y, which is an int, and it's defined to be x, that parameter we passed in, modulo 3. So that essentially gives us the remainder of division by three. And then we return the value of a relational check. We want to know if y is equal to zero, 'cause if it is, then we have a number that's divisible by three.
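The function described above can be written like this:

```scala
// Returns true when x is evenly divisible by three
def div3(x: Int): Boolean = {
  val y: Int = x % 3  // remainder of division by three
  y == 0              // relational check: divisible when the remainder is zero
}
```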
So let's just check that. Let's call div3 with three. True, good. That's divisible by three. Div3 of nine should also be true. Yup. But div3 of five should not. Great. So div3 seems to be working correctly. Now let's apply it to our value pv, and let's filter using div3. And again, we want to apply this to each member of the collection. So we'll use the anonymous placeholder, underscore, and then execute, and you'll notice all of the values that are returned in the ParArray are multiples of three.
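Here are those spot checks and the final filter together, with pv and div3 repeated so the snippet runs on its own (same Scala 2.12 assumption as before):

```scala
// The function and collection defined earlier
def div3(x: Int): Boolean = (x % 3) == 0
val pv = (1 to 10000).toArray.par

// Spot-check the predicate
div3(3)  // true
div3(9)  // true
div3(5)  // false

// The underscore stands in for each element of the collection
val multiplesOf3 = pv.filter(div3(_))
```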
So the filter and filterNot methods are handy ways of selecting a subset of elements from a parallel collection.
Dan also focuses on using Scala with Spark, a distributed processing platform. He first describes how to work with Resilient Distributed Datasets (RDDs)—a fundamental Spark data structure—and then explains how to use Scala with Spark DataFrames, a new class of data structure specially designed for analytic processing. He wraps up the course by providing a summary of advantages of using Scala for data science.