From the course: Big Data Analytics with Hadoop and Apache Spark

Reading partitioned data

- [Instructor] In this video, we will read a partitioned data set into Spark and understand how it works. We will read the Parquet files under the directory, partitioned_parquet. The product name, which is the partition value, is not stored inside the files; it is only available in the directory names. The base path needs to be provided for Spark to read the product name as a column as well. We again time the operation and display the first five rows. We will also print the execution plan. Let's run this code and review the results. The most important addition to the physical plan is the partition count. This shows the number of partitions read into memory. More partitions mean more I/O and memory requirements. Reducing this count will lead to better performance. We will see techniques for this later in the course. Next, we read only one partition from the stored data. If we need to analyze only a subset of data,…
