From the course: Big Data Analytics with Hadoop and Apache Spark
Data loading
- [Instructor] In this video, we will look at loading the use case data into HDFS. This is pretty straightforward. First, we read the CSV file into the rawStudentData data frame. You may face problems writing this data frame back to Parquet, since some column names have spaces in them, so we also rename these columns to remove the spaces. We print the schema and the data to make sure that everything went fine with the reading process. Let's execute this code. Next, we create a partitioned data store in Parquet format with gzip compression, as required by the use case. For partitioning, we have two candidate columns that are used extensively in future exercises: student and subject. We could have gone with either of them, or both, for partitioning. Given that subject has a limited number of values compared to student names, we go with subject…
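The steps described above can be sketched in PySpark as follows. This is a minimal sketch, not the instructor's exact code: the column name "Subject" and the CSV/Parquet paths in the usage example are assumptions, and the helper `strip_spaces` is a hypothetical name introduced here for illustration.

```python
def strip_spaces(columns):
    """Replace spaces in column names with underscores.

    Parquet rejects column names containing spaces, so the raw CSV
    headers must be cleaned before writing.
    """
    return [c.replace(" ", "_") for c in columns]


def load_student_data(spark, csv_path, parquet_path):
    """Read the use-case CSV, clean the column names, and write a
    gzip-compressed Parquet store partitioned by subject.

    `spark` is an active SparkSession; both paths are assumptions.
    """
    # Read the CSV into a data frame, inferring the schema from the data
    raw_student_data = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Rename columns that contain spaces so the Parquet write succeeds
    for old, new in zip(raw_student_data.columns,
                        strip_spaces(raw_student_data.columns)):
        raw_student_data = raw_student_data.withColumnRenamed(old, new)

    # Sanity-check the read before writing
    raw_student_data.printSchema()
    raw_student_data.show(5)

    # Partition by subject (low cardinality, unlike student names) and
    # compress with gzip, as the use case requires
    (raw_student_data.write
        .partitionBy("Subject")          # assumed column name
        .option("compression", "gzip")
        .mode("overwrite")
        .parquet(parquet_path))
```

A call might then look like `load_student_data(spark, "raw_student_scores.csv", "hdfs:///data/student_scores")`, with both paths adjusted to your environment.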