I am trying to get a better understanding of how the internals of hashed sets, e.g. HashSet&lt;T&gt;, work and why they are performant. I discovered the following article, which implements a simple example with a bucket list: http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/.
As far as I understand that article (and I also thought that way before), the bucket list groups a certain number of elements in each bucket. A bucket is identified by the hash code, namely by GetHashCode, which is called on the element. I thought the better performance was due to there being fewer buckets than elements.
Now I have written the following naive test code:
public class CustomHashCode
{
    public int Id { get; set; }

    public override int GetHashCode()
    {
        //return Id.GetHashCode(); // Way better performance
        return Id % 40; // Bad performance! But why?
    }

    public override bool Equals(object obj)
    {
        return ((CustomHashCode) obj).Id == Id;
    }
}
And here is the profiling method:
public static void TestCustomHashCode(int iterations)
{
    var hashSet = new HashSet<CustomHashCode>();
    for (int j = 0; j < iterations; j++)
    {
        hashSet.Add(new CustomHashCode() { Id = j });
    }

    var chc = hashSet.First();
    var stopwatch = new Stopwatch();
    stopwatch.Start();
    for (int j = 0; j < iterations; j++)
    {
        hashSet.Contains(chc);
    }
    stopwatch.Stop();
    Console.WriteLine(string.Format("Elapsed time (ms): {0}", stopwatch.ElapsedMilliseconds));
}
My naive thought was: let's reduce the number of buckets (with a simple modulo), that should increase performance. But it is terrible: on my system it takes about 4 seconds with 50000 iterations. I also thought that if I simply returned the Id as the hash code, performance would be poor, since I would end up with 50000 buckets. But the opposite is the case; I guess I simply produced tons of so-called collisions instead of improving anything. But then again, how do the bucket lists actually work?
A Contains check basically:

1. computes the item's hash code by calling GetHashCode,
2. reduces that hash code to a bucket index,
3. walks the entries in that bucket, calling Equals on each one until it either finds a match or exhausts the bucket.
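That lookup path can be sketched with a hand-rolled chained-bucket set (a simplified model for illustration only, not the actual HashSet&lt;T&gt; implementation; NaiveBucketSet is a hypothetical name):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified chained-bucket set to illustrate the lookup path;
// the real HashSet<T> uses a more compact array-based layout internally.
public class NaiveBucketSet<T>
{
    private readonly List<T>[] buckets;

    public NaiveBucketSet(int bucketCount)
    {
        buckets = new List<T>[bucketCount];
    }

    private int BucketIndex(T item)
    {
        // Map the (possibly negative) hash code onto a bucket index.
        int h = item.GetHashCode() & 0x7FFFFFFF;
        return h % buckets.Length;
    }

    public void Add(T item)
    {
        int i = BucketIndex(item);
        if (buckets[i] == null) buckets[i] = new List<T>();
        if (!buckets[i].Contains(item)) buckets[i].Add(item);
    }

    public bool Contains(T item)
    {
        int i = BucketIndex(item);
        if (buckets[i] == null) return false;
        // This linear scan is what grows when many items share one bucket:
        foreach (T candidate in buckets[i])
            if (candidate.Equals(item)) return true;
        return false;
    }
}
```

With Id % 40 as the hash code, at most 40 buckets are ever populated, so each Contains call degenerates into a linear scan over roughly iterations / 40 entries.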
By restricting the number of buckets, you've increased the number of items in each bucket, and thus the number of items that the hashset must iterate through, checking for equality, in order to see if an item exists or not. Thus it takes longer to see if a given item exists.
You've probably decreased the memory footprint of the hashset; you may even have decreased the insertion time, although I doubt it. You haven't decreased the existence-check time.
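You can make that cost visible by counting Equals calls instead of milliseconds. This is a sketch under assumptions: CountingId and its static EqualsCalls counter are illustrative names I made up, not part of your code, and the counter is not thread-safe.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class CountingId
{
    // Illustrative global counter to observe collision cost; not thread-safe.
    public static int EqualsCalls;

    public int Id { get; set; }
    public int Modulus { get; set; } // how many values the hash code is folded into

    public override int GetHashCode() { return Id % Modulus; }

    public override bool Equals(object obj)
    {
        EqualsCalls++;
        return ((CountingId) obj).Id == Id;
    }
}

public static class Demo
{
    // Mirrors the question's profiler, but returns how often Equals ran
    // during the Contains loop instead of measuring wall-clock time.
    public static int CountEqualsCalls(int n, int modulus)
    {
        var set = new HashSet<CountingId>();
        for (int j = 0; j < n; j++)
            set.Add(new CountingId { Id = j, Modulus = modulus });

        CountingId.EqualsCalls = 0;
        var probe = set.First();
        for (int j = 0; j < n; j++)
            set.Contains(probe);
        return CountingId.EqualsCalls;
    }

    public static void Main()
    {
        // With the hash folded into 40 values, lookups scan long chains;
        // with effectively unique hash codes, each lookup needs ~1 comparison.
        Console.WriteLine(Demo.CountEqualsCalls(5000, 40));
        Console.WriteLine(Demo.CountEqualsCalls(5000, int.MaxValue));
    }
}
```

Since Id % int.MaxValue is just Id for these values, the second run behaves like returning Id directly: the hash check rules out almost every candidate before Equals is ever needed, which is exactly why your Id.GetHashCode() variant is fast.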