From the course: Big Data Analytics with Hadoop and Apache Spark

Pushing down projections

- [Instructor] In this chapter, we will review some of the techniques that can be used during data processing to optimize Spark and HDFS performance. The code for this chapter is available in the notebook code_05_XX Optimizing Data Processing. We will start with pushing down projections. Projection here means the set or subset of columns that are selected from a dataset. Typically, you read an entire file with all its columns into memory and then use only a subset of those columns later for computations. During lazy evaluation, Spark is smart enough to identify the subset of columns that will actually be used and fetch only those into memory. This is called projection pushdown. In this example, we read the entire Parquet file into the sales data DataFrame. Later, we select only the product and quantity columns. Spark identifies this and fetches only those columns into memory. Let's run this code and review the…
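To make this concrete, here is a minimal PySpark sketch of projection pushdown under the assumptions of the example above: the file path sales.parquet and the variable name sales_data are illustrative, while the product and quantity columns come from the transcript.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProjectionPushdown").getOrCreate()

# Lazily read the Parquet file; no data is fetched into memory yet.
# The path "sales.parquet" is an assumption for illustration.
sales_data = spark.read.parquet("sales.parquet")

# Select only the columns we need. Because evaluation is lazy, Spark's
# optimizer pushes this projection down to the Parquet reader, so only
# the product and quantity columns are read from disk.
product_quantity = sales_data.select("product", "quantity")

# Print the physical plan to verify the pushdown.
product_quantity.explain()
```

Calling explain() prints the physical plan; in the Parquet scan node, the ReadSchema entry should list only product and quantity, confirming that the projection was pushed down to the file reader rather than applied after a full read.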
