From the course: Cloud Hadoop: Scaling Apache Spark


Spark data interfaces


- [Instructor] As I introduced in the previous movie, there are now three Spark data interfaces, and we'll be working with all three in subsequent movies. The first is the RDD, the resilient distributed dataset. This is the original low-level interface to a sequence of data objects, most commonly held in memory. The next interface is the DataFrame, a distributed collection of row objects with a schema, similar to data frames in R or in Python's Pandas library. The newest Spark data interface is the Dataset, available as of Spark 2.0. It's a distributed collection that combines DataFrame and RDD functionality. So if you've already been working with Spark a little bit, or discovered it before this course, a recommendation I'm seeing more and more is to focus your coding around the Dataset, because it is a richer abstraction.
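The three interfaces can be seen side by side in a short Scala sketch (a minimal illustration, assuming a local Spark 2.x+ installation; the `Reading` case class and the sample values are hypothetical, and in a `spark-shell` session the `SparkSession` is already provided as `spark`):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Reading(sensor: String, value: Double)

val spark = SparkSession.builder()
  .appName("SparkDataInterfaces")
  .master("local[*]")        // run locally on all cores
  .getOrCreate()
import spark.implicits._     // enables .toDF() and .as[T] conversions

// 1. RDD: the original low-level interface, a distributed sequence of objects.
val rdd = spark.sparkContext.parallelize(
  Seq(Reading("a", 1.0), Reading("b", 2.0)))

// 2. DataFrame: a distributed collection of Row objects with a schema.
val df = rdd.toDF()

// 3. Dataset: typed rows, combining RDD-style typing with DataFrame optimization.
val ds = df.as[Reading]

// Dataset operations can use plain Scala functions on the typed objects.
ds.filter(_.value > 1.0).show()
```

Note that the Dataset API is available in Scala and Java; in Python, the DataFrame plays the equivalent role.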
