Search code examples
algorithmrsyncchecksum

overlapping chunks in rsync algorithm


I was going through rsync utility and in the section: Determining which parts of a file have changed it says: The sender computes the rolling checksum for every chunk of size S in its own version of the file, even overlapping chunks.

Here comes the doubts:

  • Why recipient calculates checksum for non-overlapping block but sender calculates overlapping block of size S?
  • And is the size S is same for recipient and sender?
  • What is overlapping block? Is it, suppose you have text as: abcdefgh and s=4 than receiver will calculate checksum for abcd and efgh and send those to sender. And than sender will calculate checksum for: abcd, bcde, cdef, defg, efgh (is this overlapping chunk) and compare with the receiver.
  • Then how it calculates the diff i.e. part which is different in sender and receiver?

Solution

  • If the recipient calculated rolling checksums for all the overlapping chunks, there would be almost as many checksums as the size of the file. So sending the checksums to the sender would be more expensive than sending the entire contents of the file, and there would be no savings. So the recipient splits the file into large, non-overlapping chunks, so it can send a small number of checksums and hashes.

    The size S has to be the same on the sender and recipient. Otherwise, the checksums and hashes could never be expected to match.

    A more detailed explanation of the algorithm can be found here. It makes it clear that S is the same on the sender and recipient.

    Here's an example. Suppose the file on the recipient is:

    abcdefghijkl
    

    and S = 4. The recipient will send checksums and hashes for abcd, efgh, ijkl to the sender.

    Suppose one character X is inserted into the file on the sender:

    abcdeXfghijkl
    

    It will calculate checksums for:

    abcd
    bcde
    cdeX
    deXf
    eXfg
    Xfgh
    fghi
    ghij
    hijk
    ijkl
    

    It will compare these to the checksums received from the recipient, and determine that abcd and ijkl have not changed. All it needs to send is eXfgh, with instructions that this should replace the efgh chunk.