Join Jonathan Fernandes for an in-depth discussion in this video What you should know, part of Apache PySpark by Example.
- [Narrator] I have designed this course so there are plenty of practice exercises, and it does exactly what it says on the tin, which is to learn PySpark by example. We're going to be using Google's Colab to run our PySpark environment in the cloud. Now, if you haven't used any cloud environment before, don't worry, it's really very easy and I'll show you how to do it. You're also welcome to install your own version of Spark locally and run the exercise files from there. I'm using Spark version 2.3, but you can easily use another version as long as it's at least version 2. If you try running Apache Spark locally and end up with a whole lot of Java errors, I suggest you switch to the Google Colab environment for now.
I think your time is better spent learning how to use Spark rather than learning how to install it. I assume most of you have some experience working with Python's pandas, so I've included a section on what you should do in PySpark and what the pandas equivalent would be, to make the transition easier. If you're new to pandas, you can check out my other course on pandas in the LinkedIn library. But don't worry if you don't know pandas, you can still learn PySpark from scratch by following along.
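To give a flavor of the pandas-to-PySpark mapping the course covers, here is a minimal sketch. The DataFrame and its column names are made up for illustration; the pandas code runs as-is, and the comment shows what the analogous PySpark DataFrame API call would look like.

```python
import pandas as pd

# A small, made-up dataset for illustration
df = pd.DataFrame({"name": ["Anna", "Ben", "Cara"], "age": [34, 19, 27]})

# pandas: boolean-mask filtering, then column selection
adults = df[df["age"] >= 21][["name"]]

# The rough PySpark DataFrame API equivalent would be:
#   df.filter(df["age"] >= 21).select("name")

print(adults["name"].tolist())
```

As the course shows later, many pandas idioms have close PySpark counterparts, which makes the transition gentler than it first appears.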
- Benefits of the Apache Spark ecosystem
- Working with the DataFrame API
- Working with columns and rows
- Leveraging built-in Spark functions
- Creating your own functions in Spark
- Working with Resilient Distributed Datasets (RDDs)