From the course: Big Data Analytics with Hadoop and Apache Spark
Bucketing
- [Instructor] As seen in the previous video, partitioning is only optimal when a given attribute has a small set of unique values. What if we need to partition by a key with a large number of values without proliferating the number of partitions? Bucketing is the answer. Bucketing works similarly to partitioning, but instead of using the value of the attribute directly, it applies a hash function to convert the value into a hash key. Values that produce the same hash key end up in the same bucket, or sub-dataset. The number of unique buckets can be controlled and limited, which also ensures even distribution of values across all buckets. Bucketing is ideal for attributes that have a large number of unique values, like order number or transaction ID. Choose buckets for attributes that have a large number of unique values and that are most frequently used in query filters. Experiment with multiple bucket columns to find…
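The hashing idea described above can be sketched in plain Python: a hash function maps an unbounded set of attribute values onto a small, fixed number of buckets. This is a minimal illustrative sketch, not Spark's implementation — the `bucket_for` helper, the MD5 hash, and the 16-bucket count are assumptions chosen for clarity (Spark itself uses a Murmur3 hash internally, configured via `DataFrameWriter.bucketBy` when saving a table).

```python
import hashlib

def bucket_for(value, num_buckets):
    # Hash the attribute value, then take the modulo to pick a bucket.
    # MD5 is used here only because it gives a stable hash across runs;
    # Spark uses Murmur3 for its bucketing hash.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# 10,000 unique order IDs — far too many values to partition on directly.
order_ids = [f"ORD-{n}" for n in range(10_000)]

# With bucketing, they all land in just 16 buckets, roughly evenly.
counts = [0] * 16
for oid in order_ids:
    counts[bucket_for(oid, 16)] += 1
```

Because the hash is deterministic, the same order ID always maps to the same bucket, so a query filtering on a specific ID only needs to scan one bucket instead of the whole dataset.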