From the course: Big Data Analytics with Hadoop and Apache Spark


Managing partitions


- [Explainer] One of the key aspects to understand about Spark internals is partitioning. This is different from HDFS partitioning. When Spark reads a file, it creates internal partitions equal to the default parallelism set up for Spark. Transformations maintain the same number of partitions, but actions will create a different number, usually equal to the default parallelism set up for this Spark instance. Typically, parallelism is two on a local node and 200 in cluster mode. Having too many or too few partitions will impact performance. As discussed in the earlier video, the ideal number of partitions should be equal to the total number of cores available to Spark. We can change the number of partitions by repartitioning and coalescing. Let's run the exercise code first and then review the results. We first print the default parallelism set up for this cluster. It's two, this number can…
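A minimal PySpark sketch of the ideas above, checking the default parallelism, inspecting a DataFrame's partition count, and changing it with repartition() and coalesce(). The session name and sample data are illustrative assumptions, not the course's exercise files.

    from pyspark.sql import SparkSession

    # Start a Spark session (app name is illustrative, not from the course).
    spark = SparkSession.builder.appName("partition-demo").getOrCreate()
    sc = spark.sparkContext

    # Print the default parallelism configured for this Spark instance.
    print("Default parallelism:", sc.defaultParallelism)

    # Build a small DataFrame and check how many internal partitions it has.
    df = spark.range(0, 1000)
    print("Partitions after load:", df.rdd.getNumPartitions())

    # repartition() can raise or lower the partition count; it triggers a full shuffle.
    df_more = df.repartition(8)
    print("After repartition(8):", df_more.rdd.getNumPartitions())

    # coalesce() can only lower the count, but avoids a full shuffle.
    df_fewer = df_more.coalesce(2)
    print("After coalesce(2):", df_fewer.rdd.getNumPartitions())

    spark.stop()

Because coalesce() merges existing partitions in place rather than redistributing every row, it is the cheaper choice when you only need to reduce the partition count.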
