From the course: Big Data Analytics with Hadoop and Apache Spark


Data loading

- [Instructor] In this video, we will look at loading the use case data into HDFS. This is pretty straightforward. First, we read the CSV file into the rawStudentData data frame. Writing this data frame back to Parquet can fail because some column names contain spaces, so we also rename those columns to remove the spaces. We print the schema and the data to make sure that the read went fine. Let's execute this code. Next, we create a partitioned data store in Parquet format with gzip compression, as required by the use case. For partitioning, there are two columns that are used frequently in the upcoming exercises: student and subject. We could have partitioned by either of them, or by both. Since subject has a limited number of distinct values compared to student names, we go with subject…
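The steps described above can be sketched in PySpark. The transcript does not show the actual code, so the file paths and the exact column names (including the partition column `subject`) are assumptions for illustration; only the data frame name rawStudentData comes from the transcript.

```python
def sanitize_column(name: str) -> str:
    """Replace spaces in a column name with underscores, since Parquet
    does not accept column names containing spaces."""
    return name.replace(" ", "_")


def load_and_store(spark, csv_path: str, parquet_path: str):
    # Read the raw CSV into a data frame, inferring the schema.
    rawStudentData = (spark.read
                      .option("header", "true")
                      .option("inferSchema", "true")
                      .csv(csv_path))

    # Rename any columns that contain spaces before writing to Parquet.
    for col_name in rawStudentData.columns:
        rawStudentData = rawStudentData.withColumnRenamed(
            col_name, sanitize_column(col_name))

    # Verify that the read went fine.
    rawStudentData.printSchema()
    rawStudentData.show(5)

    # Create a partitioned Parquet store with gzip compression.
    # "subject" is chosen as the partition column because it has far
    # fewer distinct values than student names.
    (rawStudentData.write
        .mode("overwrite")
        .option("compression", "gzip")
        .partitionBy("subject")
        .parquet(parquet_path))


if __name__ == "__main__":
    # Hypothetical paths; substitute the actual use case data location.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("DataLoading").getOrCreate()
    load_and_store(spark, "raw_student_scores.csv", "student_scores.parquet")
    spark.stop()
```

Partitioning by a low-cardinality column like subject keeps the number of Parquet subdirectories small; partitioning by student name would create one directory per student and produce many tiny files.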