Search code examples
stringhashstring-hashing

How to hash variable-length strings


I am very much a beginner in encryption/hashing. And I want to know how to hash a variable length string, maybe 10 or 100 letters to a fixed length code, e.g. 128-bit binary, regardless of the underlying programming language, while achieving relatively equal collisions among the bins.

Specifically, how to deal with inputs of different inputs, and make the hashcode evenly distributed?


Solution

  • There are many different ways to do this.

    For non-cryptographic applications, it's common to hash strings by iterating over the characters in sequence and applying some operation to mix in the bits of the new character with the accumulated hash bits. There are many variations on how exactly you'd carry this out. One common approach is shown here:

    unsigned int kSmallPrime = /* some small prime */;
    unsigned int kLargePrime = /* some large prime */;
    
    unsigned int result = 0;
    
    for (char ch: string) {
        result = (result * kSmallPrime + ch) % kLargePrime;
    }
    

    More complex combination steps are possible to get better distributions. These approaches generally don't require the string to have any specific length and work for any length of string. The number of bits you get back depends on what internal storage you use for mixing up the bits, though there's not necessarily a strong theoretical reason (other than empirical evidence) to believe that you have a good distribution.

    For cryptographic applications, string hash functions are often derived from block ciphers. Constructions like Merkle-Damgard let you start with a secure block cipher and produce a secure hash function. They work by padding the string up to some multiple of the block size using a secure padding scheme (one that ensures that different strings end up different after padding), breaking the string apart into blocks, and hashing them in a chain. The final output is then derived from the underlying block cipher, which naturally outputs a large number of bits, and the nice distribution comes from the strength of the underlying block cipher, which (in principle) should be indistinguishable from random.