From the course: Big Data Analytics with Hadoop and Apache Spark
Writing to HDFS
- As discussed in the previous videos, CSV files cannot be used for parallel reads and writes. We need to convert them to other formats, like Parquet, for efficient processing of data in the later stages. In this video, we will write the raw sales DataFrame into a Parquet file in HDFS. The code for this is simple. We will use the write function available on the DataFrame. We then set the format to Parquet. The mode is set to overwrite, to overwrite any existing contents. In real pipelines though, append may be the better option if there are periodic additions to the data. We then use GZIP to compress the data. We save it to the raw Parquet directory under user/Raj_ops. Let's execute this code and review the results. First, notice the Spark job feature appearing at the top of the paragraph. You can click on this to open the Spark UI and look at how Spark executed this job. The Spark UI may launch with a fully qualified…
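The write step described above can be sketched in PySpark as follows. This is a minimal sketch, not the instructor's exact code: the DataFrame variable name (`raw_sales_df`), the exact HDFS path, and the helper function name are assumptions; only the format (Parquet), mode (overwrite), and GZIP compression come from the transcript.

```python
# Assumed output location under the user/Raj_ops directory mentioned
# in the video; the exact path is an illustration, not from the course.
HDFS_OUTPUT_PATH = "hdfs:///user/Raj_ops/raw_parquet"


def write_raw_parquet(df, path=HDFS_OUTPUT_PATH):
    """Write a Spark DataFrame to HDFS as GZIP-compressed Parquet."""
    (df.write
       .format("parquet")              # columnar format suited to parallel reads/writes
       .mode("overwrite")              # replace existing contents; use "append" for periodic loads
       .option("compression", "gzip")  # GZIP compression, as in the video
       .save(path))


if __name__ == "__main__":
    # Requires a running Spark installation; raw_sales_df stands in for
    # the raw sales DataFrame built in the earlier videos.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WriteToHDFS").getOrCreate()
    raw_sales_df = spark.read.csv(
        "hdfs:///user/Raj_ops/raw_csv",  # assumed source path for illustration
        header=True,
        inferSchema=True,
    )
    write_raw_parquet(raw_sales_df)
```

Choosing "overwrite" makes the job idempotent for full reloads, while "append" suits incremental pipelines where new sales data arrives periodically.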