From the course: Big Data Analytics with Hadoop and Apache Spark


Improving joins


- [Instructor] Joins in Spark help combine two datasets to provide better insights. As important as they are in analytics, they can also cause significant delays. A join between two data frames requires that the partitions of these data frames be shuffled so that rows matching the join condition end up on the same executor nodes. The Spark optimizer again does a lot of behind-the-scenes work to optimize joins. One of its techniques is called a broadcast join. If the size of one of the joined data frames is less than spark.sql.autoBroadcastJoinThreshold, Spark broadcasts the entire data frame to all executor nodes where the other data frame resides. The join then becomes a local activity within each executor, since a full copy of the smaller data frame is available on that node. In this example, we read product_vendor.csv into the products data frame. This data frame itself is very small. We then join it with sales data…
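The mechanics described above can be sketched in plain Python: the small table is copied (broadcast) once, a hash map is built from it, and each partition of the large table then joins locally with no shuffle. The column names and sample rows here are illustrative assumptions, not the course's actual data; in PySpark you would simply hint the join with `pyspark.sql.functions.broadcast`.

```python
# Conceptual sketch of a broadcast hash join (not the course's code).
# In real PySpark, the equivalent hint is:
#   from pyspark.sql.functions import broadcast
#   sales.join(broadcast(products), "product_id")

def broadcast_hash_join(small_rows, big_partitions, key):
    # "Broadcast" step: build one hash map from the small side;
    # Spark ships this map to every executor.
    lookup = {row[key]: row for row in small_rows}
    joined = []
    # Each partition of the big side joins locally - no shuffle needed.
    for partition in big_partitions:
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:  # inner-join semantics
                joined.append({**match, **row})
    return joined

# Hypothetical sample data standing in for products and sales.
products = [
    {"product_id": 1, "vendor": "Acme"},
    {"product_id": 2, "vendor": "Globex"},
]
sales_partitions = [
    [{"product_id": 1, "amount": 10}],
    [{"product_id": 2, "amount": 5}, {"product_id": 3, "amount": 7}],
]

result = broadcast_hash_join(products, sales_partitions, "product_id")
# Product 3 has no vendor row, so only two rows survive the inner join.
```

In Spark itself this path is chosen automatically when one side is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default), and the `broadcast()` hint forces it regardless of the estimate.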
