Regarding the example in Oracle's java online tutorial that uses HashMap to store anagram

I was reading the example in Oracle's online java tutorial that uses HashMap to store anagrams:

    // Read words from file and put into a simulated multimap
    Map<String, List<String>> m = new HashMap<String, List<String>>();

    try {
        Scanner s = new Scanner(new File(args[0]));
        while (s.hasNext()) {
            String word = s.next();
            String alpha = alphabetize(word);
            List<String> l = m.get(alpha);
            if (l == null)
                m.put(alpha, l=new ArrayList<String>());
            l.add(word);
        }
    } catch (IOException e) {
        System.err.println(e);
        System.exit(1);
    }

    // Print all permutation groups above size threshold
    for (List<String> l : m.values())
        if (l.size() >= minGroupSize)
            System.out.println(l.size() + ": " + l);
}

private static String alphabetize(String s) {
    char[] a = s.toCharArray();
    Arrays.sort(a);
    return new String(a);
}

}

Since HashMap is implemented with Hash Table, I think that each sorted alphabetized string should have a unique hash code after compression(otherwise the linked list that stores the values in the HashMap will store a value that's not a anagram of the sorted alphabetized string).

I am not exactly sure how Java's implementation of HashMap can satisfy this - I assume that they use the hash code of a string (a1*31^n-1 + a2*31^n-2 + ... + an). This might guarantee the uniqueness of the hash code if we are talking about strings with only lower case chars. However, one also has to compress the hash code before putting the value of the key to the bucket in the hash table (otherwise you would have a hugggggge hash table that can't be handled in memory, just thinking of how big 31^10 is). Among this compression, I would think that there will be collision. In other words, two different strings that are not true anagram will end up being stored in the same bucket (which should be only used to store a list of true anagrams)...

Could anyone help me to understand what I might be missing? Or if the online tutorial is missing something?

thanks!

Jason

Solution

Never, ever assume that hash codes are unique -- but realize that HashMap already knows that hash codes are non-unique, and deals with them appropriately.

In other words, even if a.hashCode() == b.hashCode(), if !a.equals(b), then a HashMap will not mix up the value associated with a and the value associated with b.