Join Jonathan Fernandes for an in-depth discussion in this video, Working with DataFrames, part of Apache PySpark by Example.
- [Instructor] In the next few videos, I'll provide both the Pandas syntax and the PySpark syntax. If you're familiar with Pandas, it'll make the transition from Pandas to PySpark that little bit easier. When working with Pandas, we need to import the Pandas library. With Apache Spark, we need a Spark session as our interface, so all we do is import PySpark and get access to the Spark session via spark. Assuming we're reading in a CSV file, we can create a DataFrame by loading it. If you look at the Pandas documentation, you can see that there are significantly more options available to you when reading a CSV file.
Spark allows you to read a CSV file by just typing spark.read.csv and the path to that file. In Pandas, you can view the first few rows of your DataFrame by specifying the DataFrame name and the number of rows you want to view. In this instance, we want to view the first three rows of the DataFrame df. In Spark, you have a couple of options. df.take with the argument 3 will return a list of the Row objects. df.collect will get all of the data from the entire DataFrame, and you'll need to be careful when using it.
This is because if you have a large data set, when you run collect you can easily crash the driver node. And finally, if you want Spark to print out your DataFrame in a nice format, then try df.show with the number of rows that you want to see. The limit function returns a new DataFrame by taking the first n rows. The difference between this function and head is that head returns a list while limit returns a new DataFrame. Now if you're a little confused about the differences between these, let's take a quick look at the documentation and some code.
So let's head over to the Apache website, select API docs, select Python, and let's do a search for limit. We want limit that's related to the DataFrame. And let's select source for source code. If you look at the limit function, you can see that it's returning a DataFrame. Next let's look at take and we can see that the take function calls collect on the limit function.
Let's look for the head function. And you can see from the head function that it calls the take function. So head and take are very similar, as they both return a list. And finally, let's look for the show function. And you can see that show will just print out the data in a nice format. In the next video, we'll look at schemas.