From the course: Big Data Analytics with Hadoop and Apache Spark

Best practices for data storage

- [Instructor] In this video, I will walk through some of the best practices for designing HDFS schemas and storage. First, during the design stage, understand the most common read and write patterns for your data. Identify whether it is read-intensive, write-intensive, or both. For reads, analyze which filters are usually applied to the data. Determine what needs optimization and what can be compromised. Is it important to reduce storage requirements, or is it okay to compromise on storage for better read-write performance? Choose your options carefully, as these cannot be easily changed after the pipeline is deployed and data is created. Changing things like storage formats and compression codecs would require reprocessing all the data. Run tests on actual data to understand performance and storage characteristics. Experiment if required to compare the different storage options available. Choose partitioning and bucketing…
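The advice above to "run tests on actual data" before locking in a format or codec can be sketched in plain Python. This is a minimal, stdlib-only illustration of measuring the storage-versus-speed trade-off, not Spark or HDFS code; the function name `compare_compression` and the sample payload are hypothetical, and in practice you would run a similar experiment against your real data with the codecs your platform supports (e.g. Snappy, Gzip, ZSTD).

```python
import time
import zlib

def compare_compression(data: bytes, levels=(1, 6, 9)):
    """Compress the same payload at several zlib levels and record
    compressed size and elapsed time, so the storage-vs-speed
    trade-off can be measured rather than guessed."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results[level] = {"size": len(compressed), "seconds": elapsed}
    return results

# Simulated "actual data": repetitive delimited text, which is typical
# of the log/event records stored in big data pipelines.
sample = b"timestamp,user_id,action,status\n" * 10_000
report = compare_compression(sample)
for level, stats in report.items():
    print(f"level={level} size={stats['size']} bytes "
          f"time={stats['seconds']:.4f}s")
```

The same pattern — fix the payload, vary one storage option, record size and time — applies when comparing Parquet versus ORC or one compression codec against another before the pipeline is deployed.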
