Search code examples
javainteger-overflow

Is there a way to adjust for integer overflow?


I'm noodling through an anagram hash function, already solved several different ways, but I'm looking for extreme performance as an exercise. I already submitted a solution that passed all the given tests (beating out 100% of all competitors by at least 1ms), but I believe that although it "won", it has a weakness that just wasn't triggered. It is subject to integer overflow in a way that could affect the results.

The gist of the solution was to combine multiple commutative operations, each taking some number of bits, and concatenate them into one long variable. I chose xor, sum, and product. The xor operation cleanly fits within a fixed number of bits. The sum operation might overflow, but because of the way overflow is addressed, it would still arrive at the same result if letters and their corresponding values are rearranged. I wouldn't worry, for example, about whether this function would overflow.

private short sumHash(String s) {
    short hash=0;
    for (char c:s.toCharArray()) {
        hash+=c;
    }
    return hash;
}

Where I run into trouble is in the inclusion of products. If I make a function that returns the product of a list of values (such as character values in a String), then, at the very least, the result could be rendered inaccurate if the product overflowed to exactly zero.

private short productHash(String s) {
    short hash=1;
    for (char c:s.toCharArray()) {
        hash*=c;
    }
    return hash;
}

Is there any safe and performant way to avoid this weakness so that the function gains the benefit of the commutative property of multiplication to produce the same value for anagrams, but can't ever encounter a product that overflows to zero?


Solution

  • You will only hit zero if

    a * b = 0 mod 2^64
    

    which is equivalent to there being an integer k such that

    a * b = k * 2^64 
    

    That is, we get in trouble if factors divide 2^64, i.e. if factors are even. Therefore, the easiest solution is ensuring that all factors are odd, for instance like this:

    for (char ch : chars) {
      hash *= (ch << 1) | 1;
    }
    

    This allows you to keep 63 bits of information.

    Note however that this technique will only avoid collisions caused by overflow, but not collisions caused by multipliers that share a common factor. If you wish to avoid that, too, you'll need coprime multipliers, which is easiest achieved if they are prime.