From the course: Big Data Analytics with Hadoop and Apache Spark
Pushing down filters
- [Instructor] Similar to projection pushdowns, Spark can identify the subset of rows that is actually required for processing and fetch only those rows into memory. If the subset of rows corresponds to specific partitions, Spark will read only those partitions. In this example, we try two filters. One is a filter on Product, which is a partition column. The other is a filter on a non-partition column called Customer. Spark will push both filters down to the file scan, so it will not read unnecessary rows into memory. Let's execute the code and review the results. In the case of the filter on a partition column, we see partition filters being used. The partition count is also one. This means that Spark has identified that the filter is on a partition column and only attempts to read files for that partition. In the case of the filter on a non-partition column, we see pushed filters on the file scan. But the partition…