From the course: Big Data Analytics with Hadoop and Apache Spark
Problem definition
- [Instructor] In this chapter, we will take up a use-case problem and build a solution using Apache Spark and Hadoop. Along the way, we will leverage the various tools and techniques we learned during the course. Here is the problem we need to solve. Our source data is in the Student_scores.csv file, available as part of the course resources. This file contains student scores by subject for one school year. There are four attributes in this data source: the student name, the subject, the class score, which is the score the student earned on their assignments, and the test score, which is the score the student earned on their final examination. The use-case actions to execute are as follows. Load the CSV into HDFS. The data should be stored in Parquet format, and the compression used should be GZIP. Choose a partitioning scheme that fits this data, based on the analytics problems solved in this use case. Next, read…