What is Bucketing in Hive?


Bucketing is a technique in Hive for organizing data. It divides the data into a fixed number of parts known as buckets. Bucketing in Hive is helpful when partitioning alone becomes impractical, for example when a column has too many distinct values to create one partition per value. The bucket that a record belongs to is determined by the hash value of the bucketing column.

Partitioned tables can be bucketed to divide the data further so that queries run more efficiently. Every bucket is stored as a file within the table's or the partition's directory on HDFS. Records that have the same value in the bucketing column are always stored in the same bucket.
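
The assignment of records to buckets can be illustrated with a short Python simulation. This is only a sketch, not Hive's own code: it assumes an integer bucketing column whose hash is the integer itself, with the sign bit masked off before taking the remainder. The function name `bucket_for` and the sample `user_ids` are made up for illustration.

```python
def bucket_for(value: int, num_buckets: int) -> int:
    """Return the bucket index for an integer key.

    Assumption: the hash of an integer is the integer itself, and the
    sign bit is masked off before the modulus is taken.
    """
    return (value & 0x7FFFFFFF) % num_buckets


# Hypothetical values of a bucketing column, with one repeated value.
user_ids = [101, 102, 103, 104, 105, 101]

for uid in user_ids:
    print(uid, "-> bucket", bucket_for(uid, 4))
```

Both occurrences of 101 map to the same bucket, which is the property the paragraph above describes: equal values in the bucketing column always end up together.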

Advantages of Bucketing

Following are the advantages of bucketing −

  • Compared to non-bucketed tables, bucketed tables allow more efficient sampling.

  • Compared to non-bucketed tables, map-side joins are quicker in bucketed tables.

  • Bucketed tables give more efficient query responses, particularly for queries that filter or join on the bucketing column.

  • Bucketing offers the flexibility to keep the records in every bucket sorted by one or more columns.

Note − Bucketing by itself does not guarantee that the table is populated correctly. As a result, the user is responsible for loading the data so that each record lands in its proper bucket.
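
The map-side join advantage listed above can be sketched in Python. This is a simplified simulation under stated assumptions, not Hive's implementation: both tables are bucketed on the join key into the same number of buckets, keys are integers, and the hashing rule is the same assumed one (mask the sign bit, then take the remainder). All names (`bucketize`, `bucket_join`, `orders`, `users`) are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumption: both tables use the same bucket count


def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Assumed hashing rule for integer keys: mask the sign bit, then modulo.
    return (key & 0x7FFFFFFF) % num_buckets


def bucketize(rows, num_buckets: int = NUM_BUCKETS):
    """Group (key, value) rows into a list of buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_for(key, num_buckets)].append((key, value))
    return buckets


def bucket_join(left_rows, right_rows):
    """Join two row sets on key, comparing only matching bucket pairs."""
    left = bucketize(left_rows)
    right = bucketize(right_rows)
    joined = []
    for left_bucket, right_bucket in zip(left, right):
        # Build an in-memory lookup for one bucket only, not the whole table.
        lookup = defaultdict(list)
        for key, value in right_bucket:
            lookup[key].append(value)
        for key, value in left_bucket:
            for other in lookup.get(key, []):
                joined.append((key, value, other))
    return joined


orders = [(1, "book"), (2, "pen"), (5, "lamp")]
users = [(1, "alice"), (2, "bob"), (3, "carol")]
print(sorted(bucket_join(orders, users)))
```

Because equal keys always fall into the same bucket index on both sides, bucket i of one table only ever needs to be compared with bucket i of the other. Each lookup table therefore holds a single bucket rather than a whole table.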

Characteristics of Bucketing

Following are the major characteristics of bucketing −

  • Bucketing is based on applying a hash function to the bucketing column.

  • The hash function depends on the data type of the bucketing column. However, records with the same value in the bucketing column are always stored in the same bucket.

  • The CLUSTERED BY clause is used to divide a table into buckets.

  • Each bucket is stored as a single file in the table or partition directory.

  • Bucketing in Hive can be used both with and without partitioning.

  • The data files of a bucketed table are nearly equal in size, provided the values of the bucketing column hash evenly across the buckets.
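
How evenly the buckets fill up depends on the values in the bucketing column. The following Python sketch counts records per bucket for two made-up key sets. It assumes integer keys and the same simplified hashing rule (mask the sign bit, then take the remainder); the names `bucket_sizes`, `even_keys` and `skewed_keys` are illustrative only.

```python
from collections import Counter


def bucket_for(key: int, num_buckets: int) -> int:
    # Assumed hashing rule for integer keys: mask the sign bit, then modulo.
    return (key & 0x7FFFFFFF) % num_buckets


def bucket_sizes(keys, num_buckets: int):
    """Return the number of records that land in each bucket."""
    counts = Counter(bucket_for(k, num_buckets) for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]


# Consecutive ids spread evenly over 8 buckets.
even_keys = range(1, 1001)
print(bucket_sizes(even_keys, 8))

# Ids that are all multiples of 8 collapse into a single bucket.
skewed_keys = [k * 8 for k in range(1, 1001)]
print(bucket_sizes(skewed_keys, 8))
```

The first key set gives 125 records in every bucket, while the second puts all 1000 records into bucket 0. A bucketing column should therefore be chosen so that its values spread well over the chosen number of buckets.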

Difference Between Hive Partitioning and Hive Bucketing

Partitioning and bucketing are quite similar. Both divide the data before storing it. There are, however, some significant differences between them. Partitioning can create a large number of directories, one for each distinct value of the partition column, so it works best when that column has few distinct values. Bucketing places a roughly equal amount of data in each of a fixed number of buckets, which makes map-side joins quicker.

A table can have both partitions and buckets. In such cases, every partition directory contains one file per bucket.
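
The resulting layout can be pictured with a small Python sketch that lists the files of a table that is both partitioned and bucketed. Everything here is an assumption made for illustration: the table directory `/warehouse/sales`, the partition column `country`, and the file names of the form `000000_0`, which resemble the names Hive commonly gives bucket files but may differ between versions and execution engines.

```python
def bucket_file_paths(table_dir: str, partitions, num_buckets: int):
    """List one file path per bucket inside every partition directory."""
    paths = []
    for partition in partitions:
        for bucket in range(num_buckets):
            # Assumed naming scheme: zero-padded bucket index plus "_0".
            paths.append(f"{table_dir}/{partition}/{bucket:06d}_0")
    return paths


# Hypothetical table with two partitions and two buckets per partition.
layout = bucket_file_paths("/warehouse/sales", ["country=IN", "country=US"], 2)
for path in layout:
    print(path)
```

With two partitions and two buckets, the sketch lists four files, two inside each partition directory.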

Conclusion

In summary, bucketing in Hive is helpful for joins on large datasets, which would otherwise be impractical without high-end computing resources.

Bucketing in Hive is also more efficient for queries that filter on the bucketing columns. Overall, resources can be used much more efficiently with bucketed tables. With more buckets, each bucket holds less data and so needs less memory to process. But remember, too many buckets can cause unneeded parallelism.

Updated on: 25-Aug-2022
