From the course: Big Data Analytics with Hadoop and Apache Spark

Parallel writes with partitioning

- [Instructor] As reviewed in the earlier videos, partitioning data enables parallel reads and writes. It also helps filter out data while reading into memory. We will create a partitioned HDFS store based on the product column. There are only four unique products in the dataset, so it lends itself to easy partitioning. We simply need to add the partitionBy method in the write process to trigger partitioning while storing data. We then save this to the partitioned parquet directory. Let's run this code and examine the HDFS files created. When we navigate to the partitioned parquet directory, we see four subdirectories, one per partition. Each directory name shows the partition key and its value. These directory names can then be used to filter the data and focus only on the directories that contain relevant data. In the next video, I will show you…
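To make the steps described above concrete, here is a minimal PySpark sketch of a partitioned write followed by a pruned read. The session setup, the input DataFrame sales_df, and the paths raw_data.parquet and partitioned_parquet are assumptions for illustration, not the course's actual files; the product value in the filter is likewise hypothetical.

```python
from pyspark.sql import SparkSession

# Assumed setup; the course's actual session and input path may differ.
spark = SparkSession.builder.appName("PartitionedWrites").getOrCreate()
sales_df = spark.read.parquet("raw_data.parquet")  # hypothetical input path

# partitionBy("product") creates one subdirectory per distinct product,
# named like product=<value>, and writes each partition in parallel.
(sales_df.write
    .mode("overwrite")
    .partitionBy("product")
    .parquet("partitioned_parquet"))  # hypothetical output directory

# Filtering on the partition column lets Spark read only the matching
# product=<value> subdirectories (partition pruning) instead of the
# whole dataset.
pruned_df = (spark.read.parquet("partitioned_parquet")
             .filter("product = 'Mouse'"))  # hypothetical product value
pruned_df.show()
```

Because the partition key and value are encoded in the directory names, the filter on product never touches the other three product directories, which is the memory-saving behavior the video describes.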
