From the course: Big Data Analytics with Hadoop and Apache Spark

Bucketing

- [Instructor] As seen in the previous video, partitioning is only optimal when a given attribute has a small set of unique values. What if we need to partition by a key with a large number of values without proliferating the number of partitions? Bucketing is the answer. Bucketing works similarly to partitioning, but instead of using the value of the attribute directly, it uses a hash function to convert the value into a hash key. Values that have the same hash key end up in the same bucket, or sub-dataset. The number of unique buckets can be controlled and limited, which also ensures an even distribution of values across all buckets. It's ideal for attributes that have a large number of unique values, like order number or transaction ID. Choose bucketing for attributes that have a large number of unique values and those that are most frequently used in query filters. Experiment with multiple bucket columns to find…
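
The snippet below is a minimal sketch of bucketing in PySpark, illustrating the idea described above. It assumes a running SparkSession; the table name, column names, and bucket count are illustrative, not taken from the course.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-example").getOrCreate()

# Sample orders data; order_id has many unique values, so it is a
# good bucketing candidate (and a poor partitioning candidate).
orders = spark.createDataFrame(
    [(1001, "US", 250.0), (1002, "UK", 99.5), (1003, "US", 41.0)],
    ["order_id", "country", "amount"],
)

# Hash order_id into a fixed number of buckets (8 here) and persist
# the result as a managed table; rows whose order_id hashes to the
# same key land in the same bucket file.
(orders.write
    .bucketBy(8, "order_id")
    .sortBy("order_id")
    .mode("overwrite")
    .saveAsTable("orders_bucketed"))

# Filters and joins on the bucketed column can now avoid scanning
# or shuffling data from irrelevant buckets.
spark.table("orders_bucketed").filter("order_id = 1002").show()

Note that Spark only supports bucketing when writing with saveAsTable; the number of buckets is fixed at write time, which is what keeps the file count bounded even for high-cardinality columns.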
