Tags: algorithm, hash, language-agnostic, checksum

How to quickly determine if two sets of checksums are equal, with the same "strength" as the individual checksums


Say you have two unordered sets of checksums, one of size N and one of size M. Depending on the algorithm used to compare them, you may not even know the sizes; if you do, you can compare N != M for a quick abort.

The hashing function used for a checksum has some chance of collision, which as a layman I'm foolishly referring to as "strength". Is there a way to take two sets of checksums, all made from the same hashing function, and quickly compare them (so comparing element to element is right out) with the same basic chance of collision between two sets as there is between two individual checksums?

For instance, one method would be to compute a "set checksum" by XORing all of the checksums in the set. This new single hash is used for comparing with other sets' hashes, meaning the individual element checksums no longer need to be stored for the comparison. It can even be updated for the addition/removal of an element by XORing that element's checksum with the set's checksum, without recomputing the whole thing. But does that reduce the "strength" of the set's checksum compared to a brute-force comparison of all the original ones? Is there a way to conglomerate the checksums of a set that doesn't reduce the "strength" (as much?) but is still less complex than a straight comparison of the set elements' checksums?
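
For concreteness, here's a rough sketch of the XOR approach in Python (the function and variable names are just made up for illustration):

    # Sketch of the XOR "set checksum" idea: fold all element checksums
    # together with XOR, and update incrementally on add/remove.
    from functools import reduce

    def set_checksum(element_checksums):
        # Combine an unordered collection of equal-width checksums into one value.
        return reduce(lambda acc, c: acc ^ c, element_checksums, 0)

    def toggle_element(set_sum, element_sum):
        # XOR is its own inverse, so the same operation adds or removes an element.
        return set_sum ^ element_sum

    a = set_checksum([0x3F2A, 0x91B7, 0x04CC])
    b = set_checksum([0x04CC, 0x3F2A, 0x91B7])            # same elements, different order
    assert a == b                                         # order doesn't matter
    assert toggle_element(a, 0x91B7) == set_checksum([0x3F2A, 0x04CC])   # removal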


Solution

  • After my initial comment, I got to thinking about the math behind it. Here's what I came up with. I'm no expert so feel free to jump in with corrections. Note: This all assumes your hash function is uniformly distributed, as it should be.

    Basically, the more bits in your checksum, the lower the chance of collision. The more files, the higher.

    First, let's find the odds of a collision with a single pair of files XOR'd together. We'll work with small numbers at first, so let's assume our checksum is 4 bits (0-15), and we'll call that bit count n.

    With two sums, the total number of bits is 2n (8), so there are 2^(2n) (256) possible pairs in total. However, we're only interested in the collisions. To collide under XOR, a different pair has to flip the same bits in both sums, i.e. produce the same XOR value. For any given XOR value there are only 2^n (16) pairs that produce it, since choosing the first sum (2^n ways) fixes the second.

    So, the overall probability of a collision is 16/256, which is (2^n) / (2^(2n)), or simply 1/(2^n). That means the probability of a non-collision is 1 - (1/(2^n)). So, for our sample n, that means it's only 15/16 secure, or 93.75%. Of course, for bigger checksums, it's better. Even for a puny n=16, you get 99.998%.
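
    If you want to sanity-check that count, a quick brute-force sketch (Python, with n = 4 as above) enumerates every pair of sums and counts the ones that collide with one fixed reference pair:

        # Count, over all 2^(2n) pairs of 4-bit sums, how many XOR to the same
        # value as a fixed reference pair. Expect 2^n = 16 out of 256.
        n = 4
        reference = 0b1010 ^ 0b0110        # any fixed pair works; the count comes out the same
        collisions = sum(1 for a in range(2 ** n)
                           for b in range(2 ** n)
                           if a ^ b == reference)
        print(collisions, 2 ** (2 * n), collisions / 2 ** (2 * n))   # 16 256 0.0625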

    That's for a single comparison, of course. Since you're rolling them all together, you're doing f-1 comparisons, where f is the number of files. To get the total odds of avoiding a collision that way, you raise the non-collision odds from the first step to the power f-1.

    So, for ten files with a 4-bit checksum, we get pretty terrible results:

    (15/16) ^ 9 = 55.94% chance of non-collision

    This rapidly gets better as we add bits, even when we increase the number of files.

    For 10 files with a 8-bit checksum:

    (255/256) ^ 9 = 96.54%

    For 100/1000 files with 16 bits:

    (65535/65536) ^ 99 = 99.85%

    (65535/65536) ^ 999 = 98.49%

    As you can see, we're still working with small checksums. If you're using anything >= 32 bits, my calculator runs into floating-point rounding errors when I try to do the math on it.
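
    If you want to go past calculator precision, exact rational arithmetic sidesteps the rounding problem entirely. A minimal sketch using Python's fractions module (the parameter values below are just examples):

        # Compute ((2^n - 1) / 2^n) ^ (f - 1) exactly, so no floating-point
        # rounding creeps in even for large n. The n and f values are examples.
        from fractions import Fraction

        def non_collision_chance(n, f):
            per_comparison = Fraction(2 ** n - 1, 2 ** n)
            return per_comparison ** (f - 1)

        print(float(non_collision_chance(8, 10)))     # ~0.9654, matches the 8-bit example above
        print(float(non_collision_chance(32, 1000)))  # ~0.99999977 for a 32-bit sum and 1000 files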

    TL;DR:

    Where n is the number of checksum bits and f is the number of files in each set:

    nonCollisionChance = ( ((2^n)-1) / (2^n) ) ^ (f-1)
    collisionChance = 1 - ( ((2^n)-1) / (2^n) ) ^ (f-1)
    

    Your method of XOR'ing a bunch of checksums together is probably just fine.