
Discussion - Best way to hash a mix of (CustomClass1, CustomClass2, Collection<CustomClass3>)


I was reading about the implementation of equals() and hashCode() in Java and concluded that collisions are possible (I have never actually hit one in my case), but I was wondering whether there is a better way to get a unique identifier, like a GUID, for an object that I have -

I have [CustomClass1 cc1, CustomClass2 cc2 and a Set scc3] coming in from a method. I wrap them in an object1 and call object1.hashCode(), and the hashCode() is working fine (it handles the order of objects in the Set as well, so I get the same hash code regardless of the order of the objects in the set - say 1,2,3 or say 2,3,1 - as long as they are "1", "2" and "3").

But I was wondering whether there is a better approach - converting object1 to a GUID instead of calling object1.hashCode().

Is there a better way? If yes, share your thoughts.


Solution

  • There isn't a pragmatic better way.

Given a hashing algorithm that simply isn't designed well, clashes are common, and it doesn't matter whether the hash value is 32-bit (as with Java's hashCode()) or 128-bit (UUIDs).

Java's built-in types, such as String or List or HashSet or whatnot, do as good a job as is reasonable, so you shouldn't worry about them being badly designed. String's hashCode could be better, but changing it would be backwards incompatible. That isn't to say the current algorithm baked into java.lang.String is particularly bad, though. It's fine.

Once we've moved past that, a clash in 32-bit space is highly unlikely - it happens roughly once in 4 billion entries. And a clash is no problem anyway: Java doesn't flat out do the wrong thing. Try it!

    class StupidHashCodeImpl {
      private final String name;
    
      public StupidHashCodeImpl(String name) {
        if (name == null) throw new NullPointerException("name");
        this.name = name;
      }
    
      @Override public String toString() {
        return this.name;
      }
    
      @Override public int hashCode() {
        return 1; // all StupidHashCodeImpl hashes collide!
      }
    
      @Override public boolean equals(Object other) {
        if (other == null || other.getClass() != StupidHashCodeImpl.class) return false;
        return this.name.equals(((StupidHashCodeImpl) other).name);
      }
    }
    
    and then:
    
    Set<StupidHashCodeImpl> set = new HashSet<>();
    set.add(new StupidHashCodeImpl("foo"));
    set.add(new StupidHashCodeImpl("bar"));
    System.out.println(set.size()); // prints 2 as expected.
    set.contains(new StupidHashCodeImpl("foo")); //true
    set.contains(new StupidHashCodeImpl("bar")); //true
    set.contains(new StupidHashCodeImpl("baz")); //false
    

    See? Works great. No problems here, at all. The rules are clear:

    • If a.equals(b), then it must hold that a.hashCode() == b.hashCode() - failure to adhere to this rule means HashSet starts handing out wrong answers.
    • The reverse is not true. If a.hashCode() == b.hashCode(), then it's totally fine if !a.equals(b). HashSet deals with this just fine.
    • Hash codes must be stable. That means mutating an object while it is in a HashSet, or while it is used as a key in a HashMap, breaks things.
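That third rule is worth seeing in action. Here's a sketch with a hypothetical mutable class (MutableKey is invented for this demo) whose hash code depends on a mutable field - mutate it while it sits in a set, and the set loses track of it:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical class to illustrate the "hash codes must be stable" rule.
class MutableKey {
  String name; // mutable - that's the problem

  MutableKey(String name) {
    this.name = name;
  }

  @Override public int hashCode() {
    return Objects.hashCode(name); // changes when name changes!
  }

  @Override public boolean equals(Object other) {
    return other instanceof MutableKey && Objects.equals(name, ((MutableKey) other).name);
  }
}

class MutationDemo {
  public static void main(String[] args) {
    Set<MutableKey> set = new HashSet<>();
    MutableKey key = new MutableKey("foo");
    set.add(key);
    System.out.println(set.contains(new MutableKey("foo"))); // true - found in the right bucket

    key.name = "bar"; // mutate while the object is inside the set

    // The entry is now stranded in the bucket chosen for hash("foo"),
    // but lookups compute hash("bar") and search the wrong bucket:
    System.out.println(set.contains(new MutableKey("bar"))); // false
    System.out.println(set.contains(key)); // false, even though it IS in the set
  }
}
```

The object hasn't been removed - it's still in there, just unfindable, because the set filed it under its old hash.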

    The only downside to a 'bad hashing' algorithm as above, is that the map/set becomes inefficient. Normally, if you shove a million keys into a hashset and then do some set.contains() operations, those are significantly faster versus shoving a million entries in a list and running .contains() on that. However, if you shove a million entries in a set and they all have colliding hashes, then HashSet is no better (in fact, somewhat worse) in performance than list, for obvious reasons.

    But, UUIDs don't solve this problem. At all:

    • If your hash algorithm is bad, UUIDs aren't going to reduce the odds of clashes. If you want to fix it, fix the bad algorithm.
    • If your hash algorithm is good, clashes occur at a rate of about once in billions, so any performance downside to clashes always gets divided by a billion or so, reducing it to insignificance.
    • Consider also that comparing 128-bit values is considerably more expensive than comparing 32-bit values. So that cost of clashes you pay once in a billion? Sure, you can get rid of it with UUIDs, but then you're paying a small price on every single lookup.

    You'd have to write it and do some performance testing with JMH to be sure, but I'd bet quite a lot on the idea that a UUID-based HashSet impl is significantly slower than java's own HashSet.

    Writing your own 'good' hash algorithm is simple:

    Use Project Lombok's @Value or @Data or @EqualsAndHashCode and call it a day. And if you don't want to do that, or you just want to know how it works, click the link and check the example code, which shows precisely what's going on. Alternatively, ask your IDE - most of them (including the 2 most popular IDEs, IntelliJ and Eclipse) have 'generate equals and hashCode' somewhere in their menu structures. They do something similar to Lombok.
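If you'd rather not pull in Lombok at all, the JDK itself ships java.util.Objects.hash, which applies the same recipe for you. A minimal sketch, with hypothetical field types standing in for the question's CustomClass1, CustomClass2 and Set (note that HashSet's own hashCode is already order-independent, which gives the "1,2,3 same as 2,3,1" behavior the question wants):

```java
import java.util.Objects;
import java.util.Set;

// Hypothetical wrapper; the field types stand in for the question's classes.
class Wrapper1 {
  private final String cc1;       // stand-in for CustomClass1
  private final Integer cc2;      // stand-in for CustomClass2
  private final Set<String> scc3; // Set.hashCode() is order-independent by spec

  Wrapper1(String cc1, Integer cc2, Set<String> scc3) {
    this.cc1 = cc1;
    this.cc2 = cc2;
    this.scc3 = scc3;
  }

  @Override public int hashCode() {
    return Objects.hash(cc1, cc2, scc3); // JDK helper; handles nulls for you
  }

  @Override public boolean equals(Object other) {
    if (!(other instanceof Wrapper1)) return false;
    Wrapper1 w = (Wrapper1) other;
    return Objects.equals(cc1, w.cc1)
        && Objects.equals(cc2, w.cc2)
        && Objects.equals(scc3, w.scc3);
  }
}
```

Objects.hash boxes its arguments into a varargs array, so for extremely hot code a hand-written version (below) can be marginally faster, but for most code it's the pragmatic choice.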

    The general gist is:

    • Pick some constant. Preferably a prime number.
    • Start at 1.
    • "Integrate" each hash of each component (and generally all fields are components - everything that needs to affect the hashvalue) by first multiplying our current hash value by the prime number, then adding the hash of this new component.
    • Take special care with data types that have known-broken hashCode impls. For example, don't call someArray.hashCode(), which doesn't do what you think it does; instead call Arrays.deepHashCode(someArray). Hash a long by XORing its upper 32 bits with its lower 32 bits. Hash booleans by picking 2 arbitrary numbers (one for true and one for false).
    • Hash null to something. Just pick a value. Probably not 0.

    That's all there is to it.

    This will for example ensure that 1,2,3 and 2,3,1 have different hashCodes.
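The recipe above, written out by hand for a hypothetical class (the class and its fields are invented for illustration) with one object field, a long, a boolean and an array - covering each of the special cases mentioned:

```java
import java.util.Arrays;

// Hypothetical class demonstrating the hand-rolled multiply-and-add recipe.
class Recipe {
  private final String name;   // ordinary object component
  private final long id;       // long: fold high and low 32 bits together
  private final boolean active;// boolean: two arbitrary constants
  private final String[] tags; // array: needs Arrays helper, not tags.hashCode()

  Recipe(String name, long id, boolean active, String[] tags) {
    this.name = name;
    this.id = id;
    this.active = active;
    this.tags = tags;
  }

  @Override public int hashCode() {
    final int PRIME = 31;            // pick some constant, preferably prime
    int result = 1;                  // start at 1
    // Integrate each component: multiply by the prime, then add its hash.
    result = PRIME * result + (name == null ? 43 : name.hashCode()); // hash null to something
    result = PRIME * result + (int) (id ^ (id >>> 32));  // XOR upper and lower 32 bits
    result = PRIME * result + (active ? 1231 : 1237);    // two arbitrary numbers for booleans
    result = PRIME * result + Arrays.hashCode(tags);     // array helper, not tags.hashCode()
    return result;
  }

  @Override public boolean equals(Object other) {
    if (!(other instanceof Recipe)) return false;
    Recipe r = (Recipe) other;
    return id == r.id
        && active == r.active
        && (name == null ? r.name == null : name.equals(r.name))
        && Arrays.equals(tags, r.tags);
  }
}
```

Because each step multiplies before adding, the position of each component matters, which is exactly what makes 1,2,3 and 2,3,1 hash differently.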