From the course: Big Data Analytics with Hadoop and Apache Spark

Best practices for data processing

- [Instructor] In this video, I will review the best practices for data processing with Spark and HDFS. Push filters and projections down to the data sources as much as possible: the smaller the data being transferred, the better the performance. Choose and design partition keys based on the columns most used in filters and aggregations; this speeds up both reading and processing data. Use repartitioning and coalescing wisely. These operations themselves take significant time, so only use them if there is a series of transforms that can take advantage of them. Avoid joins as much as possible; use denormalized data sources. If joins are required, use them judiciously and check the execution plans. Clock all operations with spark.time() on production-equivalent data to understand slow-running operations and take action. Use caching when appropriate. Caching takes memory and disk space, so choose it for intermediate…
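As an illustration of filter and projection pushdown, here is a minimal Scala sketch. The HDFS path, the "country" partition column, and the other column names are assumptions for the example, not part of the course.

```scala
import org.apache.spark.sql.SparkSession

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PushdownSketch")
      .getOrCreate()

    // Hypothetical Parquet dataset partitioned by "country".
    // select() and filter() are pushed down to the source, so Spark reads
    // only the required columns and the matching partitions from HDFS.
    val sales = spark.read.parquet("hdfs:///data/sales")
      .select("country", "product", "amount") // projection pushdown
      .filter("country = 'US'")               // filter pushdown / partition pruning

    // Verify in the physical plan: look for PushedFilters and
    // PartitionFilters in the FileScan node.
    sales.explain()

    sales.groupBy("product").sum("amount").show()

    spark.stop()
  }
}
```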
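For repartitioning and coalescing, the usual pattern looks like the following sketch: repartition once by the key that a chain of wide transforms will use, then coalesce before writing to avoid producing many small files. The partition counts, paths, and column names are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, sum}

val spark = SparkSession.builder().appName("RepartitionSketch").getOrCreate()

val orders = spark.read.parquet("hdfs:///data/orders") // hypothetical path

// Repartition once by the aggregation key so the downstream transforms
// reuse the same partitioning instead of reshuffling repeatedly.
val byCustomer = orders.repartition(200, col("customer_id"))

val totals = byCustomer
  .groupBy("customer_id")
  .agg(sum("amount").as("total_amount"), count("*").as("order_count"))

// coalesce() shrinks the partition count without a full shuffle,
// so the output lands as a few reasonably sized files.
totals.coalesce(8).write.mode("overwrite").parquet("hdfs:///out/customer_totals")
```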
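When a join cannot be avoided, the execution plan tells you whether a shuffle is involved. Continuing with the hypothetical orders DataFrame from the previous sketch, and assuming the dimension table is small enough to fit in executor memory, a broadcast hint keeps the large side from being shuffled:

```scala
import org.apache.spark.sql.functions.broadcast

// Hypothetical small dimension table joined to the large fact table.
val products = spark.read.parquet("hdfs:///data/dim_products")
val enriched = orders.join(broadcast(products), Seq("product_id"))

// Confirm that BroadcastHashJoin (not SortMergeJoin) was chosen.
enriched.explain()
```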
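Finally, a sketch of timing and caching, again reusing the hypothetical orders DataFrame: spark.time() prints the wall-clock time of the enclosed action, and persist() keeps an intermediate result around for reuse by later queries.

```scala
import org.apache.spark.storage.StorageLevel

// Cache an intermediate result that several downstream queries will reuse.
// MEMORY_AND_DISK spills to disk when executor memory runs short.
val cleaned = orders.filter("amount IS NOT NULL")
cleaned.persist(StorageLevel.MEMORY_AND_DISK)

// spark.time prints "Time taken: ... ms" for the enclosed action.
spark.time { cleaned.count() } // first run: reads HDFS and fills the cache
spark.time { cleaned.count() } // second run: served from the cache

cleaned.unpersist() // release memory and disk once it is no longer needed
```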
