Spark allows users to work with large datasets in a distributed environment. In this video, learn how PySpark allows Python users to benefit from the powerful features of Spark.
- [Narrator] So we've taken a look at what Apache Spark is all about and what is available in its ecosystem. The next question you probably have is, what is PySpark? Well, remember that Spark is written in Scala. So PySpark is just a Python wrapper around the Spark core. So one of the things you're probably thinking is I've heard about Spark, but I've also heard about pandas, Hadoop and Dask. Why would I use Spark instead of these? And more importantly, when would I use one of these? So let's take a quick look at pandas, Hadoop and Dask if you haven't worked with them before.
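To make that concrete, here's a minimal sketch of what getting started with PySpark looks like, assuming you have the pyspark package installed and a local Spark available; the app name and file path are just placeholders.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session -- the entry point for PySpark
spark = SparkSession.builder.appName("intro-to-pyspark").getOrCreate()

# Read a CSV file into a Spark DataFrame (the path is a placeholder)
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
df.show(5)  # print the first five rows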
Pandas is great for tabular data. As far as data wrangling is concerned, it has more options and features than Spark because it's been around longer. Pandas can handle hundreds of thousands, if not millions, of rows. But what happens when your data is so large that it has to be stored across several computers, or your computer just can't process it quickly enough? Unfortunately, at that point you have to move away from pandas, as it isn't a distributed system, and find another solution.
And that's exactly what Apache Spark does. And PySpark makes it easy to use Apache Spark if you're familiar with Python. So your next question is probably going to be, since Hadoop is also a distributed cluster, why not Hadoop instead of Spark? Until a couple of years ago, Hadoop was the big data platform. Hadoop has a compute system called MapReduce and a storage system called the Hadoop Distributed File System. This let you get the benefits of clustering several commodity servers together, and it was designed around local storage.
The only problem is that the two are closely integrated, so it's really difficult to run one without the other. The public cloud is a useful contrast: on AWS, Azure or Google Cloud you can get storage separately from compute. Spark has the advantage that you can use it on Hadoop storage, or you can use it in a public cloud environment. The thing is, if you have a single machine, it's unlikely to crash, but if you're working with several machines, one of them will probably crash at some point. So how can you make sure you don't lose any data if one of the machines crashes? Distributed systems like Hadoop have the Hadoop Distributed File System, or HDFS, which splits files into chunks called blocks and then replicates the blocks across several machines.
If one of the machines fails, HDFS will just request that block from another machine that has it. Now, one of the key differences between Spark and Hadoop lies in their approach to processing. Spark can do it in memory, while Hadoop MapReduce has to read from and write to disk. As a result, the processing speeds differ significantly, and Spark can be up to 100 times faster. The general rule of thumb for an on-prem installation is that Hadoop requires more disk and Spark requires more RAM, meaning that setting up Spark clusters can be more expensive.
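As a rough illustration of what working in memory looks like from the PySpark side, here's a small sketch; it assumes an existing SparkSession, and the Parquet path and column name are made up for the example.

# Assuming `spark` is an existing SparkSession; the path is a placeholder
df = spark.read.parquet("data/events.parquet")

# Ask Spark to keep this DataFrame in memory after the first action
df.cache()

# The first count materialises the cache; later actions reuse the in-memory copy
df.count()
df.filter(df["status"] == "error").count()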
You would choose Hadoop mainly for disk-heavy operations using the MapReduce paradigm, whereas Spark tends to be a more flexible, but more costly, in-memory processing architecture. If you haven't heard of Dask before, it's a library for parallel computing in Python. I'm going to take a little longer on this comparison, as it's less obvious when you should be using PySpark versus Dask. As you know by now, Spark is written in Scala, but it has support for Java, Python, R and SQL, and it interoperates well with JVM code. Dask, on the other hand, is written in Python and only really supports Python.
In terms of how well they scale, both can go from a single node to a 1,000-node cluster. Let's talk a little bit about the ecosystem, and let's use the Apache Spark ecosystem to help us compare the two. I'll just bring it up in case you've forgotten it. It's got DataFrames and SQL, streaming, the machine learning library and graph computation. Okay, so back to our comparison. Spark is an all-in-one project, so it has its own ecosystem. Dask, in contrast, is part of the larger Python ecosystem and works really well with other Python libraries such as NumPy, pandas and scikit-learn.
Spark's DataFrame has its own API and implements a good chunk of the SQL language. It also has a high-level query optimizer for complex queries. Dask, on the other hand, uses the pandas API, so it's strong at the things pandas is great at, such as time series operations and indexing, but Dask doesn't support SQL. Spark's support for streaming is brilliant, and you can get great performance on large streaming operations. Dask arguably lets you build more complex streaming use cases, but it requires a lot more work.
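For instance, the same aggregation can be expressed through the DataFrame API or through SQL on a temporary view; this is a minimal sketch that assumes an existing SparkSession and a DataFrame with made-up country and amount columns.

# Assuming `spark` is a SparkSession and `df` has `country` and `amount` columns
from pyspark.sql import functions as F

# DataFrame API version
df.groupBy("country").agg(F.sum("amount").alias("total")).show()

# Equivalent SQL version against a temporary view
df.createOrReplaceTempView("sales")
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()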
Spark MLlib has great support for common machine learning operations. Dask, on the other hand, relies on and interoperates with Python's well-known scikit-learn library, and you might get slightly better performance there. Finally, Spark's graph library, GraphX, allows you to do graph processing, whereas Dask doesn't have a library for graph processing. So to summarize, you might want to use Spark if you have really strong Scala and SQL skills, you have a JVM or legacy infrastructure, and you want an all-in-one solution.
- Benefits of the Apache Spark ecosystem
- Working with the DataFrame API
- Working with columns and rows
- Leveraging built-in Spark functions
- Creating your own functions in Spark
- Working with Resilient Distributed Datasets (RDDs)