Search code examples
javaguavahashcodemultiset

Efficient hash code for multiset in Java


I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag). It may not contain null elements, although that's not crucial to my question. The equals contract defined by the interface is as you would expect:

  • obj instanceof MyInterface
  • obj contains the same elements as this (by equals)
  • obj contains the same number of duplicates for each element
  • order of elements is disregarded

Now I want to write my hashCode method. My initial idea was:

int hashCode = 1;
for( Object o : this ) {
    hashCode += o.hashCode();
}

However, I noticed that com.google.common.collect.Multiset (from Guava) defines the hash code as follows:

int hashCode = 0;
for( Object o : elementSet() ) {
    hashCode += ((o == null) ? 0 : o.hashCode()) ^ count(o);
}

It strikes me as odd that an empty Multiset would have hash code 0, but more importantly I don't understand the benefit of ^ count(o) over simply adding up the hash codes of every duplicate. Maybe it's about not calculating the same hash code more than once, but then why not * count(o)?

My question: what would be an efficient hash code calculation? In my case the count for an element is not guaranteed to be cheap to obtain.


Solution

  • Update

    Let's say, for example, in the case where we'd have an array that we want to treat as a multiset.

    So you have to process all the entries as they come, you can't use count, and can't assume the entries come in a known order.

    The general function I'd consider is

    int hashCode() {
        int x = INITIAL_VALUE;
        for (Object o : this) {
            x = f(x, o==null ? NULL_HASH : g(o.hashCode()));
        }
        return h(x);
    }
    

    Some observations:

    • As already stated in the other answers, the INITIAL_VALUE doesn't matter much.
    • I wouldn't go for NULL_HASH=0 since this would ignore null values.
    • The function g can be used in case you expect the hashes of the members to be in a small range (which can happen in case they are e.g., single characters).
    • The function h can be used to improve the result, which is not very important since this already happens e.g. in HashMap.hash(int).
    • The function f is the most important one, unfortunately, it's quite limited as it obviously must be both associative and commutative.
    • The function f should be bijective in both arguments, otherwise you'd generate unnecessary collisions.

    In no case I'd recommend f(x, y) = x^y since it'd make two occurrences of an element to cancel out. Using addition is better. Something like

    f(x, y) = x + (2*A*x + 1) * y
    

    where A is a constant satisfies all the above conditions. It may be worth it. For A=0 it degenerates to addition, using an even A is not good as it shift bits of x*y out. Using A=1 is fine, and the expression 2*x+1 can be computed using a single instruction on the x86 architecture. Using a larger odd A might work better in case the hashes of the members are badly distributed.

    In case you go for a non-trivial hashCode() you should test if it works correctly. You should measure the performance of your program, maybe you'll find simple addition sufficient. Otherwise, I'd for for NULL_HASH=1, g=h=identity, and A=1.

    My old answer

    It may be for efficiency reasons. Calling count may be expensive for some implementations, but entrySet may be used instead. Still it might be more costly, I can't tell.

    I did a simple collision benchmark for Guava's hashCode and Rinke's and my own proposals:

    enum HashCodeMethod {
        GUAVA {
            @Override
            public int hashCode(Multiset<?> multiset) {
                return multiset.hashCode();
            }
        },
        RINKE {
            @Override
            public int hashCode(Multiset<?> multiset) {
                int result = 0;
                for (final Object o : multiset.elementSet()) {
                    result += (o==null ? 0 : o.hashCode()) * multiset.count(o);
                }
                return result;
            }
        },
        MAAARTIN {
            @Override
            public int hashCode(Multiset<?> multiset) {
                int result = 0;
                for (final Multiset.Entry<?> e : multiset.entrySet()) {
                    result += (e.getElement()==null ? 0 : e.getElement().hashCode()) * (2*e.getCount()+123);
                }
                return result;
            }
        }
        ;
        public abstract int hashCode(Multiset<?> multiset);
    }
    

    The collision counting code went as follows:

    private void countCollisions() throws Exception {
        final String letters1 = "abcdefgh";
        final String letters2 = "ABCDEFGH";
        final int total = letters1.length() * letters2.length();
        for (final HashCodeMethod hcm : HashCodeMethod.values()) {
            final Multiset<Integer> histogram = HashMultiset.create();
            for (final String s1 : Splitter.fixedLength(1).split(letters1)) {
                for (final String s2 : Splitter.fixedLength(1).split(letters2)) {
                    histogram.add(hcm.hashCode(ImmutableMultiset.of(s1, s2, s2)));
                }
            }
            System.out.println("Collisions " + hcm + ": " + (total-histogram.elementSet().size()));
        }
    }
    

    and printed

    Collisions GUAVA: 45
    Collisions RINKE: 42
    Collisions MAAARTIN: 0
    

    So in this simple example Guava's hashCode performed really bad (45 collisions out of 63 possible). However, I don't claim my example is of much relevance for real life.