From the course: Cloud Hadoop: Scaling Apache Spark
Spark data interfaces
- [Instructor] As I just introduced in the previous movie, there are now three Spark data interfaces, and we'll be working with all three in subsequent movies. The first is the RDD, the resilient distributed dataset. This is the original low-level interface to a sequence of data objects, most commonly held in memory. The next interface is the DataFrame. This is a distributed collection of row objects, similar to a data frame in R or in Python's pandas library. The newest Spark data interface is the Dataset. This is available as of Spark 2.0 or greater, and it's a distributed collection that combines the DataFrame and RDD functionality. So if you've actually been working with Spark a little bit, or maybe discovered it before this, a recommendation that I'm seeing more and more is to focus your coding around the Dataset, because it is a richer, more capable abstraction.
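To make the three interfaces concrete, here is a minimal Scala sketch that builds the same small collection as an RDD, a DataFrame, and a Dataset. This is an illustrative example, not code from the course: the `Sale` case class, the sample data, and the application name are assumptions, and running it requires a Spark installation (2.0 or later, since it uses the Dataset API).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this sketch.
case class Sale(region: String, amount: Double)

object DataInterfacesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-interfaces-sketch") // illustrative name
      .master("local[*]")                // run locally for the demo
      .getOrCreate()
    import spark.implicits._

    // 1. RDD: the original low-level interface -- a distributed
    //    sequence of objects, most commonly held in memory.
    val rdd = spark.sparkContext.parallelize(
      Seq(Sale("east", 100.0), Sale("west", 250.0)))

    // 2. DataFrame: a distributed collection of row objects,
    //    similar to a data frame in R or pandas.
    val df = rdd.toDF()
    df.printSchema()

    // 3. Dataset: available as of Spark 2.0; combines the typed,
    //    functional style of RDDs with DataFrame functionality.
    val ds = df.as[Sale]
    val total = ds.map(_.amount).reduce(_ + _)
    println(s"total = $total")

    spark.stop()
  }
}
```

Note how the same data moves between interfaces: `toDF()` turns the RDD into a DataFrame, and `as[Sale]` recovers typed objects as a Dataset, which is part of why the Dataset is the recommended abstraction to code against.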