From the course: Big Data Analytics with Hadoop and Apache Spark
Best practices for data processing
- [Instructor] In this video, I will review best practices for data processing with Spark and HDFS. Push down filters and projections to the data source whenever possible: the less data transferred, the better the performance. Choose and design partition keys based on the columns most often used in filters and aggregations; this speeds up both reading and processing the data. Use repartitioning and coalescing wisely. These operations themselves take significant time, so apply them only when a series of subsequent transforms can take advantage of them. Avoid joins as much as possible by using denormalized data sources; where joins are required, use them judiciously and check the execution plans. Clock all operations with spark.time() on production-equivalent data to identify slow-running operations and take action. Use caching when appropriate. Caching takes memory and disk space, hence choose it for intermediate…