Tags: algorithm, rabin-karp

How does Rabin-Karp choose breakpoints in variable-length chunking?


I understand the Rabin-Karp algorithm and its use in string searching. What I don't quite understand is how it can dynamically slice a file into variable-length chunks. It's said to calculate the hash of a small window of data bytes (e.g. 48 bytes) at every single byte offset, and the chunk boundaries, called breakpoints, are wherever the last N (e.g. 13) bits of the hash are zero. This gives you an average chunk size of 2^N = 2^13 = 8192 bytes = 8 KB. Questions:

  1. Does the Rabin-Karp rolling hash start from the first 48 bytes and then roll forward one byte at a time?
  2. If so, isn't that too much computation for a large file, even with a simple hash function?
  3. Given unpredictable data, how can the last N bits of the hash be expected to hit zero before the maximum chunk size is reached?

Solution

    1. Yes, the sliding window is fixed-size and moves forward one byte at a time.
    2. Hashing the whole file is O(n) overall but O(1) per byte: each step only adds the incoming byte (one multiply and one add) and subtracts the contribution of the byte leaving the window. This constant-time update is the core of the Rabin rolling hash.
    3. That depends on the hash function. With a well-mixed hash, each offset is a breakpoint with probability about 2^-N, so chunk sizes follow a roughly geometric distribution around the 2^N average; nothing guarantees a breakpoint, which is why practical chunkers also enforce a maximum chunk size and force a cut there (see the sketch below). To reduce chunk-size variability, the Two Thresholds, Two Divisors (TTTD) algorithm was proposed, and more recent refinements can be found in academic research papers.
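To make points 1 through 3 concrete, here is a minimal Python sketch of content-defined chunking with a rolling hash. Everything in it is illustrative: the constants (48-byte window, 13-bit mask, 2 KB / 64 KB bounds), the `chunk_boundaries` helper, and the simple multiply-and-add hash modulo a large prime are stand-ins, not the polynomial-over-GF(2) arithmetic a real Rabin fingerprint uses.

```python
import random

WINDOW_SIZE = 48           # bytes in the sliding window (the "48 bytes" above)
N_BITS = 13                # breakpoint when the low 13 bits are zero
MASK = (1 << N_BITS) - 1   # -> ~2^13 = 8 KB average chunk size
MIN_CHUNK = 2 * 1024       # skip boundary tests below this size
MAX_CHUNK = 64 * 1024      # force a breakpoint at this size

# A simple polynomial rolling hash. Real Rabin fingerprints use polynomial
# arithmetic over GF(2) with an irreducible polynomial; multiply-and-add
# mod a large prime is enough to show the rolling mechanics.
PRIME = 257
MOD = (1 << 61) - 1
# Precompute PRIME^(WINDOW_SIZE-1) so the outgoing byte can be removed in O(1).
POW_OUT = pow(PRIME, WINDOW_SIZE - 1, MOD)

def chunk_boundaries(data: bytes):
    """Yield the end offset of each chunk (content-defined chunking)."""
    start = 0   # start of the current chunk
    h = 0       # rolling hash over the last WINDOW_SIZE bytes of the chunk
    for i, byte in enumerate(data):
        if i - start >= WINDOW_SIZE:
            # Window is full: drop the byte that slides out. O(1) per step.
            h = (h - data[i - WINDOW_SIZE] * POW_OUT) % MOD
        h = (h * PRIME + byte) % MOD   # pull in the new byte
        size = i - start + 1
        # Breakpoint: low N bits are zero (once past the minimum size),
        # or the maximum chunk size forces a cut.
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            yield i + 1
            start = i + 1
            h = 0   # the window restarts inside the next chunk
    if start < len(data):
        yield len(data)   # final partial chunk

# Usage: chunk 4 MiB of random bytes and check the average chunk size.
data = random.randbytes(4 << 20)
ends = list(chunk_boundaries(data))
sizes = [e - s for s, e in zip([0] + ends, ends)]
print(len(sizes), sum(sizes) / len(sizes))  # average near MIN_CHUNK + 8 KB
```

On random input the average chunk size lands near MIN_CHUNK + 2^13 bytes, since past the minimum each offset is a breakpoint with probability 2^-13. The MIN/MAX bounds are what answer question 3 in practice: a breakpoint is never guaranteed, so the maximum size forces one, and TTTD refines exactly this idea with a second, easier-to-hit divisor as a fallback.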