From the course: Scala Essential Training for Data Science

Getting Started with Spark RDDs

- [Instructor] Spark has a data structure called the Resilient Distributed Dataset, or RDD for short. RDDs are immutable, fault-tolerant distributed collections, organized into logical partitions. Data in an RDD may be kept in memory or persisted to disk. RDDs are like parallel collections in a lot of ways. Both are groups of data of the same type or structure, both process the data in parallel, and both are generally faster than processing the same data sequentially.

Now, there are some differences between RDDs and parallel collections. RDDs are partitioned by a hash function, while parallel collections are broken into subsets and distributed across cores or threads within a single server at run time. RDDs are distributed across multiple servers, whereas parallel collections work within a single server. With RDDs, the data can also be easily persisted to permanent storage while you work with the RDD. RDDs are broken up again into…
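
The points above about partitions, persistence, and hash partitioning can be sketched in a few lines of Scala. This is a minimal illustration rather than code from the course: it assumes Spark is on the classpath and runs in local mode (`local[2]`), and the object name `RddSketch`, the helper `sumOfSquares`, and the sample data are all hypothetical.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch object; names and data are illustrative only.
object RddSketch {

  // Sums the squares of 1..n (n >= 1) using an RDD split into `slices` partitions.
  def sumOfSquares(sc: SparkContext, n: Int, slices: Int = 4): Long = {
    // parallelize turns a local collection into an RDD with `slices` partitions.
    val numbers = sc.parallelize(1 to n, slices)

    // map returns a new RDD; the original is immutable and left unchanged.
    // persist asks Spark to keep the result in memory, spilling to disk if needed.
    val squares = numbers
      .map(i => i.toLong * i)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // reduce is an action: it triggers the parallel computation.
    squares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Local mode with two worker threads; a real cluster would use another master URL.
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    println(sumOfSquares(sc, 1000))

    // Key-value RDDs can be explicitly repartitioned by the hash of the key.
    val byLastDigit = sc.parallelize(1 to 1000, 4)
      .map(i => (i % 10, i))
      .partitionBy(new HashPartitioner(4))
    println(byLastDigit.getNumPartitions)

    spark.stop()
  }
}
```

In local mode the partitions are spread across threads on one machine, so this only approximates the multi-server distribution described in the transcript; the same code would run against a cluster by changing the master setting.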
