From the course: Big Data Analytics with Hadoop and Apache Spark
Managing partitions
- [Explainer] One of the key aspects to understand about Spark internals is partitioning. This is different from HDFS partitioning. When Spark reads a file, it creates internal partitions equal to the default parallelism set up for Spark. Transformations maintain the same number of partitions, but actions will create a different number, usually equal to the default parallelism set up for this Spark instance. Typically, in local mode, parallelism is two, and in cluster mode, it's 200. Having too many or too few partitions will impact performance. As discussed in the earlier video, the ideal number of partitions is equal to the total number of cores available to Spark. We can change the number of partitions by repartitioning and coalescing. Let's run the exercise code first and then review the results. We first print the default parallelism set up for this cluster. It's two. This number can…