From the course: Big Data Analytics with Hadoop and Apache Spark
Improving joins
- [Instructor] Joins in Spark combine two data sets to provide better insights. As important as they are in analytics, they can also cause significant delays. A join between two data frames requires that the partitions of those data frames be shuffled, so that rows matching the join condition end up on the same executor nodes. The Spark optimizer does a lot of behind-the-scenes work to optimize joins. One technique is called a broadcast join. If the size of one of the joined data frames is below spark.sql.autoBroadcastJoinThreshold, Spark broadcasts that entire data frame to all executor nodes where the other data frame resides. The join then becomes a local activity within each executor, since a full copy of the smaller data frame is available on that node. In this example, we read product_vendor.csv into the products data frame. This data frame is very small. We then join it with sales data…