From the course: Big Data Analytics with Hadoop and Apache Spark


Best practices for data extraction

- [Instructor] What are some of the key best practices for data extraction from HDFS into Spark for analytics? The first is to read only the required data into memory. This means reading only the relevant subdirectories, a subset of partitions, and a subset of columns. Less data means lower resource requirements and less time to execute. Use data sources and file formats that support parallel reads; Avro and Parquet are among the recommended ones. The number of partitions in the data files is important. Each partition can be read independently, in parallel, by a separate executor core. The number of parallel operations in a Spark cluster is the number of executor nodes multiplied by the number of CPU cores in each executor. If the number of partitions is at least this value, the read can achieve maximum parallelism. Please keep in mind that other jobs running at the same time will also compete for these resources. In the next chapter, I will focus on…
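The sketch below illustrates these practices in PySpark: reading a Parquet dataset, pruning partitions with a filter on the partition column, selecting only the needed columns, and comparing the partition count against the cluster's parallel capacity. The HDFS path, partition column, and column names are hypothetical placeholders, not taken from the course.

```python
# A minimal sketch, assuming a hypothetical Parquet dataset at
# hdfs:///data/sales that is partitioned by "order_date".
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extraction-best-practices").getOrCreate()

# Read only the required data: the filter on the partition column prunes
# whole partitions, and select() limits the read to the needed columns.
df = (
    spark.read.parquet("hdfs:///data/sales")      # Parquet supports parallel, columnar reads
    .filter("order_date >= '2024-01-01'")         # subset of partitions
    .select("order_id", "customer_id", "amount")  # subset of columns
)

# Compare the number of input partitions with the cluster's parallel capacity
# (executor nodes x CPU cores per executor, exposed as defaultParallelism).
print("input partitions:", df.rdd.getNumPartitions())
print("default parallelism:", spark.sparkContext.defaultParallelism)
```

If the partition count reported here is well below the default parallelism, some executor cores will sit idle during the read; if it is far higher, each core simply processes multiple partitions in turn.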