Tags: java, hash, non-deterministic

Is it OK for hashCode to return different values between different runs?


I'm trying to learn the full story behind hashCode. In most implementations hashCode is fully deterministic, as in the StringUTF16 class:

public static int hashCode(byte[] value) {
    int h = 0;
    int length = value.length >> 1;
    for (int i = 0; i < length; i++) {
        h = 31 * h + getChar(value, i);
    }
    return h;
}

I think such an implementation is not great: it's easy to construct inputs that share the same hashCode. For example, a user of a system could submit many strings with the same hashCode as a DoS attack. That doesn't work against String keys, since String implements Comparable (and HashMap, an over-hacked mess, falls back to balanced trees for long collision chains of Comparable keys), but that defense won't help with key classes that don't implement Comparable.
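The collision problem is easy to demonstrate: since 'A' * 31 + 'a' equals 'B' * 31 + 'B', the strings "Aa" and "BB" collide under String.hashCode, and concatenating colliding pairs multiplies the number of collisions:

```java
public class CollisionDemo {
    public static void main(String[] args) {
        // 'A' * 31 + 'a' == 65 * 31 + 97 == 2112
        // 'B' * 31 + 'B' == 66 * 31 + 66 == 2112
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112

        // Because hash("XY") == hash(X) * 31^2 + hash(Y) for 2-char blocks,
        // every concatenation of colliding blocks also collides:
        System.out.println("AaAa".hashCode() == "BBBB".hashCode()); // true
        System.out.println("AaBB".hashCode() == "BBAa".hashCode()); // true
    }
}
```

With n colliding blocks you get 2^n colliding strings, which is exactly what a hash-flooding attack exploits.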

A better approach seems to be using a random multiplier (instead of 31), so that the user doesn't know how to construct colliding inputs (it also has some nice theoretical properties), like this:

class ImmutableArray {
    // Note the static keyword: it guarantees that, within one run,
    // all instances use the same multiplier x.
    private static final int x = generateRandomPrime();

    private final int[] values;

    @Override
    public int hashCode() {
        int res = 5;
        for (int v : values) {
            res = res * x + v;
        }
        return res;
    }

    ...

}
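The generateRandomPrime() helper is not shown above; one possible sketch of it, assuming java.math.BigInteger.probablePrime is acceptable for picking the per-run multiplier:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

class PrimeSource {
    // Hypothetical implementation of the generateRandomPrime() from the
    // question: a random 31-bit probable prime, chosen once per JVM run.
    static int generateRandomPrime() {
        // A 31-bit positive value always fits in an int.
        return BigInteger.probablePrime(31, new SecureRandom()).intValueExact();
    }

    public static void main(String[] args) {
        int p = generateRandomPrime();
        System.out.println(p); // some random 31-bit prime, different each run
    }
}
```

(For hashing purposes an odd multiplier is the important part; primality is a convenient way to guarantee oddness and avoid small common factors.)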

Now, my question: is there anything bad about this implementation? The only problem I can see is that it will return different hashCodes on different runs of the program, but I can't think of a concrete scenario where that would cause trouble.


Solution

  • It is NOT a requirement that hashCode gives the same values in different JVMs. For example, the HashMap class does not persist the hashCode values of the map's keys when it is serialized. Instead, the hashCode values are recomputed when the map is deserialized.

    The only potential problem I can see is that recomputing the hashCode on each call is inefficient. You can address that by computing it lazily (like String::hashCode does for example).

But if you implement lazy hashCode calculation, you need to declare the field where you cache it as transient. Otherwise, the cached hashCode in a deserialized key instance won't match the hashCode computed for another instance that is "equal" to the key. (In other words, the equals / hashCode contract is broken!) This will lead to lookup failure.

If you do this properly, there should be no problem vis-a-vis serialization of HashMap. For example, you could follow the approach of String::hashCode and use zero as the cached value that tells the hashCode() method "the code still needs to be calculated".

    (If your key class doesn't have a field to hold a cached hashCode value, the problem with persisting that value doesn't arise.)
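Putting those points together, a sketch of a key class with a lazily computed, transient hashCode cache; the fixed multiplier X here is just a stand-in for the per-run random prime from the question:

```java
import java.io.Serializable;
import java.util.Arrays;

class ImmutableArray implements Serializable {
    // Stand-in for the per-run random prime multiplier from the question.
    private static final int X = 1_000_003;

    private final int[] values;

    // transient: the cached hash must NOT be serialized, because it was
    // computed with this run's multiplier and would be stale after
    // deserialization. 0 means "not yet computed", as String does.
    private transient int hash;

    ImmutableArray(int[] values) {
        this.values = values.clone();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {      // recompute lazily (benign data race, like String)
            h = 5;
            for (int v : values) {
                h = h * X + v;
            }
            hash = h;      // an array hashing to exactly 0 is simply recomputed each call
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ImmutableArray
                && Arrays.equals(values, ((ImmutableArray) o).values);
    }
}
```

After deserialization the transient field is back to 0, so the first hashCode() call recomputes the value with the current run's multiplier, and HashMap lookups work.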


The other thing to note is that modifying the key class to implement Comparable would be another defense against DoS attacks. For your example class, the implementation of the compareTo method is simple. Note that the ordering you implement doesn't need to be semantically meaningful; it just needs to be stable and consistent.
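For the ImmutableArray class in the question, one way such an ordering could look, using Arrays.compare (Java 9+) for lexicographic comparison:

```java
import java.util.Arrays;

class ImmutableArray implements Comparable<ImmutableArray> {
    private final int[] values;

    ImmutableArray(int[] values) {
        this.values = values.clone();
    }

    // The ordering need not be semantically meaningful, only stable and
    // consistent: lexicographic element-by-element comparison suffices,
    // and lets HashMap tree-ify buckets that an attacker floods.
    @Override
    public int compareTo(ImmutableArray other) {
        return Arrays.compare(this.values, other.values);
    }
}
```

With this in place, even if an attacker finds many keys with equal hashCodes, HashMap can organize the colliding bucket as a balanced tree, so lookups degrade to O(log n) instead of O(n).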