Search code examples
securityhashcrc16

What is the possibility of CRC16 collisions on 20 bytes of data?


I am developing a system which needs to store a hash for a structure 20 bytes maybe less in length. However, in order to optimize the process of looking up the hash in a series of hashes, we want to reduce the size of the hash as much as possible.

So my question is this, is there a relationship between the amount of data we feed into a crc16 hash and the probability of its collision with another entry of the same length? If so, which is the most optimal length for this?

18 of the bytes fall within the ascii table (a-z, 0-9), and the remaining range between 0 and 10


Solution

  • The following simple script runs an infinite loop that fetches 2 random 20-byte sequences, calculates CRC16 and checks if there is a collision. Continuous evaluation of this loop de facto estimates collision percentage:

    #!/usr/bin/env perl
    
    use Digest::CRC qw(crc16);
    
    open(my $f, '<', '/dev/urandom');
    my $n = 0;
    my $coll = 0;
    
    while (1) {
        read $f, $randstr1, 20;
        read $f, $randstr2, 20;
        my $crc1 = crc16($randstr1);
        my $crc2 = crc16($randstr2);
    
        $n++;
        $coll++ if $crc1 == $crc2;
    
        printf "percent of collisions = %.6f%%\n", $coll * 100.0 / $n if ($n % 100000 == 0);
    }
    

    From what I get on my computer, collision percentage seems to be around 0.0016% (or 1e-5, or "1 in 100_000"), which is way worse than predicted estimates based on ideal hashing distribution of a 16-bit hash (such as 2^16 / 2^160).

    UPDATE: I see you've clarified that 20 bytes are not just fully random bytes, but fall into range of [a-z0-9]. Here's the updated version that estimates collisions in that alphabet:

    #!/usr/bin/env perl
    
    use Digest::CRC qw(crc16);
    
    my $n = 0;
    my $coll = 0;
    my @chars = ('a'..'z', '0'..'9');
    
    sub randstr() {
        my $res;
        foreach (1..20) { $res .= $chars[rand @chars]; }
        return $res;
    }
    
    while (1) {
        my $crc1 = crc16(randstr());
        my $crc2 = crc16(randstr());
    
        $n++;
        $coll++ if $crc1 == $crc2;
    
        printf "percent of collisions = %.4f%%\n", $coll * 100.0 / $n if ($n % 100000 == 0);
    }
    

    It yields approximately the same result, about 0.0016%.