From the course: Big Data Analytics with Hadoop and Apache Spark


Writing to HDFS


- As discussed in the previous videos, CSV files cannot be used for parallel reads and writes. We need to convert them to other formats, like Parquet, for efficient processing of the data in later stages. In this video, we will write the raw sales data frame into a Parquet file in HDFS. The code for this is simple. We will use the write function available on the data frame. We then set the format to Parquet. The mode is set to overwrite, to replace any existing contents. In real pipelines, though, append may be the better option if there are periodic additions to the data. We then use GZIP to compress the data. We save it to the raw Parquet directory under user/Raj_ops. Let's execute this code and review the results. First, notice the Spark job feature appearing at the top of the paragraph. You can click on this to open the Spark UI and look at how Spark executed this job. The Spark UI may launch with a fully qualified…
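In code, the write described here might look like the following PySpark sketch. The session setup, the data frame name raw_sales_df, and the CSV source path are assumptions for illustration; the video only names the raw Parquet output directory under user/Raj_ops.

from pyspark.sql import SparkSession

# Assumed session setup; in a Zeppelin notebook a `spark` session is usually provided.
spark = SparkSession.builder.appName("WriteToHDFS").getOrCreate()

# Hypothetical CSV source for the raw sales data; the actual file name may differ.
raw_sales_df = spark.read.csv(
    "hdfs:///user/Raj_ops/raw_data/sales.csv",
    header=True,
    inferSchema=True,
)

# Write as Parquet: overwrite replaces any existing contents,
# and gzip compresses the output files.
# Use mode("append") instead if new data arrives periodically.
(raw_sales_df.write
    .format("parquet")
    .mode("overwrite")
    .option("compression", "gzip")
    .save("hdfs:///user/Raj_ops/raw_parquet"))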
