From the course: Big Data Analytics with Hadoop and Apache Spark

Pushing down filters

- [Instructor] Similar to projection pushdowns, Spark is capable of identifying the subset of rows that is actually required for processing and fetching only those into memory. If the subset of rows corresponds to specific partitions, Spark will only read those partitions. In this example, we try two filters. One is a filter on Product, which is a partition column. The other is a filter on a non-partition column called Customer. Spark will push down both of these filters to the file scan, so it will not read unnecessary rows into memory. Let's execute the code and review the results. In the case of the filter on the partition column, we see partition filters being used. The partition count is also one. This means that Spark has identified that the filter is on a partition column and only attempts to read files for that partition. In the case of the filter on a non-partition column, we see pushed filters on the file scan. But the partition…
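
The code the narration refers to is not shown on this page. Below is a minimal PySpark sketch of the same idea, assuming a Parquet dataset partitioned by the Product column and containing a non-partition Customer column; the input path and the literal filter values are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterPushdown").getOrCreate()

# Hypothetical path: a Parquet dataset partitioned by Product.
sales_df = spark.read.parquet("data/sales_parquet")

# Filter on the partition column: shows up under PartitionFilters in the
# physical plan, so Spark reads only the matching partition directory.
sales_df.filter(sales_df["Product"] == "Mouse").explain()

# Filter on a non-partition column: shows up under PushedFilters on the
# file scan, so unneeded rows are skipped while the files are read.
sales_df.filter(sales_df["Customer"] == "ACME").explain()

In both physical plans, the file scan node lists PartitionFilters and PushedFilters; the narration's "partition count" of one corresponds to Spark reading only the single partition that matches the Product filter, while the Customer filter appears only as a pushed filter on the scan.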
