From the course: Big Data Analytics with Hadoop and Apache Spark
Reading partitioned data
- [Instructor] In this video, we will read a partitioned data set into Spark and understand how it works. We have the Parquet files under the directory partitioned_parquet. The product name, which is the partition value, is not stored inside the files, as it is already available in the directory name. The base path needs to be provided for Spark to read the product name also as a column. We again time the operation and display the first five rows. We will also print the execution plan. Let's run this code and review the results. The most important addition to the physical plan is the partition count. This shows the number of partitions read into memory. More partitions mean more I/O and memory requirements. Reducing this count leads to better performance. We will see techniques for this later in the course. Next, we read only one partition from the stored data. If we need to analyze only a subset of data,…