Search code examples
c#hashhash-collisionhash-code-uniqueness

Will this hash function collide unusually frequently?


I had the following code to generate a hash of an object:

public int GetHashCode(MyType obj)
{
   return (obj.Prop1.GetHashCode() + obj.Prop2.GetHashCode() + obj.Prop3.GetHashCode()).GetHashCode();
}

I.e. I add all the properties' hash codes and then take the hash of this.

In review, a coworker suggested that this will collide too frequently. I'm not sure that this is true because:

  1. Given that hash codes are chosen with equal frequency among positive and negative numbers and they wrap around, I don't think there's any additional information we gain about the likelihood of these numbers' sum as opposed to the numbers themselves
  2. To the extent that their sum is non-random, hash codes are designed to make numbers that are "close together" become "far apart", so feeding a non-uniformly-distributed value into the function shouldn't be an issue

Who is correct?

It is in C#, in case the answer is language-specific.


Solution

  • Yes.

    Just suppose Prop1, Prop2 etc are of type int. Usually only the lower range of integers is used. Your sum approach will collide more often than necessary.

    The HasCode of 7 is 7, which makes perfect sense when hashing int by it self. But with your code the tuples <7, 3>, <3, 7> and <8, 2> will all have the same Hash. The same with simple XOR instead of Addition.

    The common approach is to add some (prime) numbers and shifting:

    public int GetHashCode(MyType obj)
    {
      int hash = 0;
      unchecked
      {         
         hash += 19 * obj.Prop1.GetHashCode();
         hash += 31 * obj.Prop2.GetHashCode();
         hash += 37 * obj.Prop3.GetHashCode();
      }
      return hash;
    }
    

    The numbers 19, 31, 37 are not too critical. And if you prefer you can use OR or XOR instead of + .