Learn how to use multiple cores in CPUs to process large data sets.
- [Instructor] Let's consider the advantages of parallel collections. Multi-core processors are common today. Many desktop machines have two or four cores and servers typically have multiple times as many. Scala makes it easy to take advantage of multiple cores and hyper-threaded processors with the use of parallel collections. A common programming practice is to use for-loops to process each element of a collection, one at a time. This works well for small collections, but when we have thousands or more items in a collection, the processing time can begin to add up.
Like an assembly line, we can process data in a collection faster if we work on multiple elements at a time. A parallel collection is a collection that allows us to do just that. Let's consider a case where we have an array of 1000 numbers and we need to multiply each number by two. If we use a for-loop and multiply each number one at a time, it will take, let's say, a thousand units of time. Now if we split the array in two, and process both halves at once, we could finish in 500 units of time.
On a quad-core processor with hyper-threading, we could run eight processes in parallel and finish the task in 125 units of time. The primary advantage of using parallel collections is that they allow us to finish computations faster than we would with sequentially processed collections. Another advantage is ease of use. Other programming languages have support for parallel processing, but Scala makes parallel processing as easy as sequential processing.
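The doubling example above can be sketched in a few lines of Scala. This is a minimal illustration, not code from the course: it doubles 1000 numbers sequentially with `map`, then does the same work in parallel by calling `.par` on the array. Note that on Scala 2.12 and earlier `.par` is built into the standard library; on Scala 2.13+ it requires the separate scala-parallel-collections module and an extra import.

```scala
// Doubling every element of an array, sequentially and in parallel.
// On Scala 2.13+, add the scala-parallel-collections dependency and:
// import scala.collection.parallel.CollectionConverters._

object DoubleDemo extends App {
  val numbers = (1 to 1000).toArray

  // Sequential: each element is processed one at a time.
  val doubledSeq = numbers.map(_ * 2)

  // Parallel: the array is split into chunks, the chunks are mapped
  // on multiple cores, and the results are recombined.
  val doubledPar = numbers.par.map(_ * 2)

  // Both produce the same values; only the execution strategy differs.
  assert(doubledSeq.sum == doubledPar.sum)
}
```

For a task this small the parallel version may actually be slower, since splitting and recombining the work has its own cost; the speedup shows up as the per-element work and collection size grow.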
The overhead of using Scala parallel collections is fairly low. For some collection types, using the parallel version does not incur any noticeable overhead compared to using the sequential version. Scala has a variety of parallel collection types, including the parallel array, or ParArray, as well as ParVector, ParHashMap, and ParSet. Additional parallel collections are described in the Scala documentation. In our discussion here, we'll focus on using parallel arrays and parallel vectors.
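As a quick sketch of those types, the snippet below constructs a ParArray and a ParVector directly, and also converts an existing sequential collection with `.par`. The class names and packages are from the Scala parallel collections library (bundled with Scala 2.12 and earlier; the separate scala-parallel-collections module on 2.13+); the example values are illustrative.

```scala
// ParArray lives in the mutable package, ParVector in the immutable one.
import scala.collection.parallel.mutable.ParArray
import scala.collection.parallel.immutable.ParVector

object ParTypes extends App {
  // Construct parallel collections directly from elements...
  val pa = ParArray(1, 2, 3, 4)
  val pv = ParVector("a", "b", "c")

  // ...or convert an existing sequential collection with .par.
  val fromSeq = Vector(10, 20, 30).par

  // Standard collection operations work as usual, but may run
  // across multiple cores under the hood.
  assert(pa.map(_ * 10).sum == 100)
  assert(pv.map(_.toUpperCase).mkString == "ABC")
  assert(fromSeq.sum == 60)
}
```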
Dan also focuses on using Scala with Spark, a distributed processing platform. He first describes how to work with Resilient Distributed Datasets (RDDs)—a fundamental Spark data structure—and then explains how to use Scala with Spark DataFrames, a new class of data structure specially designed for analytic processing. He wraps up the course by providing a summary of advantages of using Scala for data science.
- The advantages of Scala for data science
- Scala data types
- Scala arrays, vectors, and ranges
- Parallel processing in Scala
- Mapping functions over parallel collections
- When and when not to use parallel collections
- Using SQL in Scala
- Scala and Spark RDDs
- Scala and Spark DataFrames
- Creating DataFrames