I'm trying to learn the full story behind hashCode. In most implementations hashCode is fully deterministic, as in the StringUTF16 class:
public static int hashCode(byte[] value) {
    int h = 0;
    int length = value.length >> 1;
    for (int i = 0; i < length; i++) {
        h = 31 * h + getChar(value, i);
    }
    return h;
}
I think such an implementation is not great: it's easy to construct inputs that share the same hashCode. For example, a user of a system could deliberately submit words with the same hashCode to mount a DoS attack. That doesn't work against String, since it implements Comparable (and HashMap is an over-hacked mess), but it won't help with classes that don't implement Comparable.
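For instance, "Aa" and "BB" both hash to 2112, so every string built by concatenating those two blocks collides with every other string of the same length. A quick demo:

public class Collisions {
    public static void main(String[] args) {
        // "Aa" and "BB" share the hash 2112, so all four combinations below
        // print the same hash code (2031744).
        String[] blocks = {"Aa", "BB"};
        for (String a : blocks) {
            for (String b : blocks) {
                String s = a + b;
                System.out.println(s + " -> " + s.hashCode());
            }
        }
    }
}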
A better approach seems to be to use a random multiplier (instead of 31), so that a user doesn't know how to construct bad examples (and it also has some theoretical properties), like this:
class ImmutableArray {
    // Note the static keyword. It guarantees that within the same run all
    // objects use the same x.
    private static final int x = generateRandomPrime();

    int[] values;

    @Override
    public int hashCode() {
        int res = 5;
        for (int v : values) {
            res = res * x + v;
        }
        return res;
    }
    ...
}
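(generateRandomPrime is left undefined here; one possible sketch, assuming java.math.BigInteger and java.security.SecureRandom, would be:)

import java.math.BigInteger;
import java.security.SecureRandom;

final class RandomPrimes {
    // Hypothetical helper: pick a random 31-bit probable prime once per run.
    static int generateRandomPrime() {
        return BigInteger.probablePrime(31, new SecureRandom()).intValue();
    }
}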
Now, my question: is there anything bad about this implementation? The only problem I can see is that it will return different hashCodes for different runs of the program, but I can't imagine a concrete scenario where something can go wrong.
It is NOT a requirement that hashCode gives the same values in different JVMs. For example, the HashMap class does not persist the hashCode values of the map's keys when it is serialized. Instead, the hashCode values are recomputed when the map is deserialized.
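You can see this with a simple round trip (a sketch using a hypothetical value-based Key class): the lookup succeeds because HashMap recomputes each key's hash while reading the entries back, not because any hash value was stored in the stream.

import java.io.*;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class RehashOnDeserialize {
    // Hypothetical key class with value-based equals and hashCode.
    static final class Key implements Serializable {
        final int[] values;
        Key(int... values) { this.values = values; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(values, ((Key) o).values);
        }
        @Override public int hashCode() {
            return Arrays.hashCode(values);
        }
    }

    public static void main(String[] args) throws Exception {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key(1, 2, 3), "hello");

        // Serialize and deserialize the map.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(map);
        }
        Map<Key, String> copy;
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            copy = (Map<Key, String>) in.readObject();
        }

        // HashMap.readObject re-inserts each entry, recomputing the key's
        // hash, so the lookup works without any stored hash value.
        System.out.println(copy.get(new Key(1, 2, 3))); // hello
    }
}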
The only potential problem I can see is that recomputing the hashCode on each call is inefficient. You can address that by computing it lazily (like String::hashCode does, for example).
But if you implement lazy hashCode calculation, you need to declare the field where you store it as transient. Otherwise, the hashCode value in a de-persisted key instance (computed with the previous run's random factor) won't == the hashCode value computed for another instance that is "equal" to the key. (In other words, the hashCode / equals contract is broken!) This will lead to lookup failures.
If you do this properly, there should be no problem vis-a-vis serialization of HashMap. For example, you could follow the approach of String::hashCode and use zero as the cached value that signals "the hash code still needs to be calculated" to the hashCode() method.
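A minimal sketch of that pattern for the ImmutableArray class from the question (the field name cachedHash, the constructor, and the inlined random-prime expression are mine):

import java.io.Serializable;

class ImmutableArray implements Serializable {
    // Same idea as generateRandomPrime() in the question: a random 31-bit
    // prime chosen once per run.
    private static final int x =
            java.math.BigInteger.probablePrime(31, new java.security.SecureRandom()).intValue();

    private final int[] values;

    // transient: the cached hash must not be serialized, because x (and
    // therefore the hash) can differ between runs. Zero means "not yet
    // computed", just like in String::hashCode.
    private transient int cachedHash;

    ImmutableArray(int[] values) {
        this.values = values.clone();
    }

    @Override
    public int hashCode() {
        int h = cachedHash;
        if (h == 0) {
            h = 5;
            for (int v : values) {
                h = h * x + v;
            }
            cachedHash = h; // caveat: a hash that is genuinely 0 is recomputed on every call
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ImmutableArray
                && java.util.Arrays.equals(values, ((ImmutableArray) o).values);
    }
}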
(If your key class doesn't have a field to hold a cached hashCode value, the problem with persisting that value doesn't arise.)
The other thing to note is that modifying the key class to implement Comparable would be another defense against DoS-based attacks. In your example class, the implementation of the compareTo method is simple. Note that the ordering that you implement doesn't need to be semantically meaningful; it just needs to be stable and consistent.
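For instance, a sketch of such an ordering for ImmutableArray, leaning on Arrays.compare (available since Java 9), could look like this:

import java.util.Arrays;

class ImmutableArray implements Comparable<ImmutableArray> {
    int[] values;

    @Override
    public int compareTo(ImmutableArray other) {
        // Lexicographic comparison of the backing arrays. The ordering has
        // no semantic meaning; it only needs to be stable and consistent.
        return Arrays.compare(this.values, other.values);
    }
}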