Search code examples
hiveacid

Why does a hive table need to be bucketed to support ACID transactions?


I'd like to know why a hive table needs to be bucketed to support ACID transactions. Is it just some hive quirk? Or was there a reason behind it?


Solution

  • Here's something about hive's compactor:

    The compactor runs background MapReduce jobs to compact the delta and base files. There are two types of compaction: major and minor. The minor compaction merges many small delta files into one big delta file. The major compaction is more expensive, it takes delta files and merges them with the base files. All merging happens by creating a new file and removing the old ones. There is a special cleaning process to do so. The compaction is done for each bucket separately. Base and Delta files are created per bucket.

    More here: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

    So, the more buckets, the faster the compaction.