
What is the key in the Set's underlying HashMap: hash code or the object itself?


Is this sentence from an OCP book correct?

A HashSet stores its elements in a hash table, which means the keys are a hash and the values are an Object.

I created the code snippet below (imports added so it compiles; note that on JDK 16+ it must be run with --add-opens java.base/java.util=ALL-UNNAMED, or setAccessible throws InaccessibleObjectException):

import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

class A
{
    private String s;

    public A(String s)
    {
        this.s = s;
    }

    public static void main(String[] args) throws NoSuchFieldException, IllegalAccessException
    {
        HashSet<A> elems = new HashSet<>(Set.of(new A("abc"), new A("cba"), new A("a")));
        // HashSet is backed by a private HashMap field named "map"
        Field field = elems.getClass().getDeclaredField("map");
        field.setAccessible(true);
        HashMap<?, ?> privateMap = (HashMap<?, ?>) field.get(elems);
        System.out.println(privateMap.keySet()); // inspect the actual keys
    }
}

And as a result I retrieved the following HashMap:
[screenshot: debugger view of the retrieved HashMap, showing the A objects themselves as keys]

It looks like it's the object itself being a key in the HashMap instead of its hash being the key. Is there an error in the book or am I looking at something irrelevant?


Solution

  • Of course it is the object and not the hash. Here are some facts, each of which is obvious on its own or at least easy to verify:

    • Hash codes are a fixed size: a Java hashCode() is an int, so only 2^32 distinct hashes exist. For types such as String there are far more than 2^32 distinct values (infinitely many, in principle), so it is a mathematical certainty that different strings exist that hash to the same value. This is called the pigeonhole principle, if you prefer to watch some videos or some such to explain the concept in more detail.
    • It would be incredibly annoying if HashMap/HashSet considered different strings equal just because their hash codes happen to collide. In fact, that would mean these types don't conform to their own specification (the javadoc), which clearly states that collisions slow things down but aren't otherwise a problem.
    • Hence, HashSet/HashMap needs the original object so that it can use .equals() to verify that things are actually equal.
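    To see a collision concretely, here is a small sketch using the well-known fact that the Java strings "Aa" and "BB" happen to share a hashCode (both hash to 2112) while being obviously unequal:

```java
import java.util.HashSet;
import java.util.Set;

public class CollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" are a classic hashCode collision in Java:
        // 'A'*31 + 'a' == 'B'*31 + 'B' == 2112
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println("Aa".equals("BB")); // false

        // A HashSet keeps both, because after the hash lookup it falls
        // back to equals() - the hash alone is not the key.
        Set<String> set = new HashSet<>();
        set.add("Aa");
        set.add("BB");
        System.out.println(set.size()); // 2
    }
}
```

    If the hash were the key, adding "BB" after "Aa" would overwrite or be rejected; instead both survive.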

    The way HashMap/HashSet works is as follows:

    • Use the hashCode of an object to find the 'bucket' to look in.
    • Then just scan, one at a time, each object in that bucket. Yes, this is slow - hence you don't want a ton of hash collisions.
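    The two steps above can be sketched as a toy chained hash set. This is purely illustrative (the class and its structure are made up for this answer, not the real java.util.HashMap source, which uses arrays, resizing, and tree bins), but the hash-then-equals logic is the same:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of a chained hash set - illustrative only,
// not the real java.util.HashMap/HashSet implementation.
class TinyHashSet<E> {
    private final List<List<E>> buckets = new ArrayList<>();

    TinyHashSet(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    public boolean add(E e) {
        List<E> bucket = bucketFor(e);
        for (E existing : bucket) {
            if (existing.equals(e)) return false; // already present
        }
        bucket.add(e); // collisions just share a bucket
        return true;
    }

    public boolean contains(Object o) {
        // Step 1: hashCode picks the bucket...
        for (E existing : bucketFor(o)) {
            // Step 2: ...equals() does the real comparison, one
            // element at a time. Collisions only make this scan longer.
            if (existing.equals(o)) return true;
        }
        return false;
    }

    private List<E> bucketFor(Object o) {
        int index = Math.floorMod(o.hashCode(), buckets.size());
        return buckets.get(index);
    }
}
```

    Note that the bucket stores the elements themselves; the hash is only ever used to pick which bucket to scan.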

    From this we can conclude:

    • Given two objects a and b, if a.equals(b) is true, then a.hashCode() == b.hashCode() HAS to be true. Otherwise you can't use these objects in a HashSet/HashMap: bizarre things happen, such as keys/objects you just added "disappearing" into thin air (contains/get can't find them) while still showing up when you iterate, and so on.
    • The reverse DOES NOT HOLD - if a.hashCode() == b.hashCode() is true, a.equals(b) doesn't have to be true. It could just be a collision, which is fine.
    • That means @Override public int hashCode() { return 1; } is theoretically valid. Indeed it is; HashMap and HashSet will operate as described, as long as the equals method works correctly. However, such maps/sets become O(n) - their performance is linear in the number of items they contain. That defeats the purpose: the whole point of HashSet/HashMap is that their performance doesn't meaningfully change as they grow. HashMap/HashSet aren't magic - they need a good hashCode() implementation to fulfil their performance promises.
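    To see what "bizarre things" look like when the contract is broken, here is a hypothetical class (BrokenKey is made up for this sketch) whose equals() compares state but whose hashCode() deliberately returns a different value for every instance:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical contract-breaking class: equals() compares state,
// but hashCode() is different for every instance, so two "equal"
// objects land in different buckets.
class BrokenKey {
    private static int nextHash = 1;
    private final int hash = nextHash++; // deliberately per-instance
    final String value;

    BrokenKey(String value) { this.value = value; }

    @Override public boolean equals(Object o) {
        return o instanceof BrokenKey && ((BrokenKey) o).value.equals(value);
    }

    @Override public int hashCode() { return hash; } // violates the contract
}

public class ContractDemo {
    public static void main(String[] args) {
        Set<BrokenKey> set = new HashSet<>();
        set.add(new BrokenKey("abc"));

        // An equal key hashes to a different bucket, so the lookup
        // scans the wrong bucket and misses it:
        System.out.println(set.contains(new BrokenKey("abc"))); // false

        // ...yet the element is still there when you iterate:
        for (BrokenKey k : set) {
            System.out.println(k.value); // abc
        }
    }
}
```

    The element has effectively "disappeared" for contains/remove while remaining visible during iteration - exactly the failure mode described above.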

    If you read docs or books that say the hash code is the key, then either that section is oversimplifying (accent on the over - that's probably not a good thing to put in a tutorial or explanation), or the author misunderstands how this works, or it's describing a hash map implementation in another system/language that takes the highly dubious shortcut of treating "equal hashes" as "therefore equal". Given the pigeonhole principle, that is a bad idea. If you MUST do things that way, use a good hash algorithm and far more than 32 bits!