From the course: Apache PySpark by Example

The DataFrame API - Spark DataFrames Tutorial

The DataFrame API

- [Instructor] There are two main APIs that we'll be looking at in this course: DataFrames and resilient distributed datasets, or RDDs. DataFrames are the high-level API and RDDs are the low-level API. DataFrames are easy to get started with and cover a good chunk of what you'll need to know on the job. Once you're comfortable with DataFrames, we'll look at RDDs.

Now, when Spark was first open sourced, it enabled distributed data processing using RDDs. This provided a simple API for distributed data processing, so big data engineers who were familiar with MapReduce jobs could now leverage the power of distributed processing using general-purpose programming languages such as Java, Python, and Scala. The challenge was that if Apache Spark wanted to attract a wider audience onto its platform, including data analysts and data scientists, then it was going to have to create something that they would be familiar with. What better thing than a DataFrame? If there's one thing that data scientists with an R or Pandas background are familiar with, it's a DataFrame.

So just as a refresher, in Spark a DataFrame is a distributed collection of objects of type Row. You can think of this as a table in a relational database or an Excel spreadsheet, except there are some significant optimizations taking place under the hood. While a table in Excel sits on a single computer, a Spark DataFrame can sit across hundreds of computers. You can also create a DataFrame from a wide variety of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

Before we head over to the exercise files to start exploring DataFrames, let's talk a little bit about the Dataset API, which you might have heard about if you've used Scala. Datasets are an API that you use with a statically typed language like Java or Scala. Because Python is a dynamically typed language, it doesn't support the Dataset API, but fortunately many of the benefits of the Dataset API are already available via DataFrames. So on that happy note, let's head over to our exercise files to start working with DataFrames.
