From the course: Big Data Analytics with Hadoop and Apache Spark


Problem definition


- [Instructor] In this chapter, we will take up a use-case problem and build a solution using Apache Spark and Hadoop. Along the way, we will leverage various tools and techniques we learned during the course. Here is the problem we need to solve. Our source data is in the Student_scores.csv file, available as part of the course resources. This file contains student scores by subject for a school year. There are four attributes in this data source: the student name, the subject, the class score, which is the score the student earned on their assignments, and the test score, which is the score the student earned on their final examination. The use-case actions to execute are as follows. Load the CSV into HDFS. The data should be stored in Parquet format with GZIP compression. Choose a partitioning scheme that fits this data based on the analytics problems solved in this use case. Next, read…
