Search code examples
.netdictionarygethashcode

new KeyValuePair<UInt32, UInt32>(i, j).GetHashCode(); High Rate of Duplicates


In search of a fast composite key for Dictionary I came upon anomaly I cannot understand nor justify.

In limited testing

Dictionary<KeyValuePair<UInt32, UInt32>, string>

is significantly slower (200:1) than

Dictionary<KeyValuePair<UInt16, UInt16>, string>

Test on two loops from 0 to 1000 Populate and then ContainsKey

         Poplulate     ContainsKey  
UInt32    92085         86578  
UInt16     2201           431

The problem is that

new KeyValuePair<UInt32, UInt32>(i, j).GetHashCode();

yields MANY duplicates.
In looping i and j 1024 only 1024 unique hash values are created.

Based on avalanche comment from CasperOne tried i*31 and j*97 (two prime numbers) and that resulted in 105280 unique on 1024X1024. Still a lot of duplicates. CasperOne I know that is not the same as random. But it is not my job to randomize the input. GetHashCode() is supposed to randomize the output.

Why the high number of duplicates?

Same loop on

new KeyValuePair<UInt16, UInt16>(i, j).GetHashCode();

yields 1024 X 1024 unique hash codes (perfect).

Int32 has the same problem.

These duplicate hash values kill

Dictionary<KeyValuePair<UInt32, UInt32>, string> 

Tuple also generates a lot of duplicates it does not degrade at Int32 compared to Int16.

Time for generating the raw KVP and the raw KPV.GetHashCode is similar.

Same anomaly with HashSet.

Dictionary<KeyValuePair<UInt32, UInt32>, string> dKVPu32 = new Dictionary<KeyValuePair<UInt32, UInt32>, string>();
Dictionary<KeyValuePair<UInt16, UInt16>, string> dKVPu16 = new Dictionary<KeyValuePair<UInt16, UInt16>, string>();
KeyValuePair<UInt32, UInt32> kvpUint32;
KeyValuePair<UInt16, UInt16> kvpUint16;
int range = 1000;
Int32 hashCode;
HashSet<Int32> kvpUint32Hash = new HashSet<Int32>();
HashSet<Int32> kvpUint16Hash = new HashSet<Int32>();

Stopwatch sw = new Stopwatch();
sw.Start();
for (UInt32 i = 0; i < range; i++)
{
    for (UInt32 j = 0; j < range; j++)
    {
        kvpUint32 = new KeyValuePair<UInt32, UInt32>(i, j);
    }
}
Console.WriteLine("UInt32  raw " + sw.ElapsedMilliseconds.ToString());
//  7
sw.Restart();
for (UInt16 i = 0; i < range; i++)
{
    for (UInt16 j = 0; j < range; j++)
    {
        kvpUint16 = new KeyValuePair<UInt16, UInt16>(i, j);
    }
}
Console.WriteLine("UInt16  raw " + sw.ElapsedMilliseconds.ToString());
//  6
sw.Restart();
for (UInt32 i = 0; i < range; i++)
{
    for (UInt32 j = 0; j < range; j++)
    {
        hashCode = new KeyValuePair<UInt32, UInt32>(i, j).GetHashCode();
        kvpUint32Hash.Add(hashCode);
    }
}
Console.WriteLine("UInt32  GetHashCode " + sw.ElapsedMilliseconds.ToString() + "  unique count " + kvpUint32Hash.Count.ToString());
//  285   1024
sw.Restart();
for (UInt16 i = 0; i < range; i++)
{
    for (UInt16 j = 0; j < range; j++)
    {
        hashCode = new KeyValuePair<UInt16, UInt16>(i, j).GetHashCode();
        kvpUint16Hash.Add(hashCode);
    }
}
Console.WriteLine("UInt16  GetHashCode " + sw.ElapsedMilliseconds.ToString() + "  unique count " + kvpUint16Hash.Count.ToString());
//  398 1000000
sw.Restart();
Console.ReadLine();
for (UInt32 i = 0; i < range; i++)
{
    for (UInt32 j = 0; j < range; j++)
    {
        dKVPu32.Add(new KeyValuePair<UInt32, UInt32>(i, j), String.Format("{0} {1}", i.ToString(), j.ToString()));
    }
}
Console.WriteLine("hsKVPu32 pop " + sw.ElapsedMilliseconds.ToString());
//  92085
sw.Restart();
for (UInt32 i = 0; i < range; i++)
{
    for (UInt32 j = 0; j < range; j++)
    {
        if (!dKVPu32.ContainsKey(new KeyValuePair<UInt32, UInt32>(i, j))) Debug.WriteLine("Opps"); ;
    }
}
Console.WriteLine("hsKVPu32 find " + sw.ElapsedMilliseconds.ToString());
//  86578
dKVPu32.Clear();
dKVPu32 = null;
GC.Collect();
sw.Restart();
for (UInt16 i = 0; i < range; i++)
{
    for (UInt16 j = 0; j < range; j++)
    {
        dKVPu16.Add(new KeyValuePair<UInt16, UInt16>(i, j), String.Format("{0} {1}", i.ToString(), j.ToString()));
    }
}
Console.WriteLine("hsKVPu16 pop " + sw.ElapsedMilliseconds.ToString());
//   2201
sw.Restart();
for (UInt16 i = 0; i < range; i++)
{
    for (UInt16 j = 0; j < range; j++)
    {
        if (!dKVPu16.ContainsKey(new KeyValuePair<UInt16, UInt16>(i, j))) Debug.WriteLine("Opps"); ;
    }
}
sw.Stop();
Console.WriteLine("hsKVPu16 find " + sw.ElapsedMilliseconds.ToString());
//  431

P.S. The fastest is to pack .E.G. ((UInt32)int1 << 16) | int2;

The hash of first UInt32 column equals hash of KVP of the next two.

2281371105 8 992
2281371104 8 993
2281371107 8 994

2281371145 0 0
2281371147 0 2
2281371149 0 4
2281371151 0 6
2281371137 0 8

2281371144 0 1
2281371146 0 3
2281371148 0 5
2281371150 0 7
2281371136 0 9

2281371144 1 0
2281371145 1 1
2281371146 1 2
2281371147 1 3
2281371148 1 4
2281371149 1 5
2281371150 1 6
2281371151 1 7
2281371136 1 8
2281371137 1 9

2281371147 2 0
2281371146 2 1
2281371144 2 3
2281371151 2 4
2281371150 2 5
2281371149 2 6
2281371148 2 7
2281371139 2 8

The only pattern I have found is that either the sum or difference or the KVP matches.
But could not find a pattern for when to sum and when to subtract.
It is a bad hash so knowing what it is is of little value.


Solution

  • Firstly, we can dispense with the timing aspect of this - it feels to me like this is really just about hash collisions, as obviously those will kill the performance.

    So, the question is really why there are more hash collisions for KeyValuePair<uint, uint> than KeyValuePair<ushort, ushort>. To help find out a bit more about that, I've written the following short program:

    using System;
    using System.Collections.Generic;
    
    class Program
    {
        const int Sample1 = 100;
        const int Sample2 = 213;
    
        public static void Main()
        {
            Display<uint, ushort>();
            Display<ushort, ushort>();
            Display<uint, uint>();
            Display<ushort, uint>();
        }
    
        static void Display<TKey, TValue>()
        {
            TKey key1 = (TKey) Convert.ChangeType(Sample1, typeof(TKey));
            TValue value1 = (TValue) Convert.ChangeType(Sample1, typeof(TValue));
            TKey key2 = (TKey) Convert.ChangeType(Sample2, typeof(TKey));
            TValue value2 = (TValue) Convert.ChangeType(Sample2, typeof(TValue));
    
            Console.WriteLine("Testing {0}, {1}", typeof(TKey).Name, typeof(TValue).Name);
            Console.WriteLine(new KeyValuePair<TKey, TValue>(key1, value1).GetHashCode());
            Console.WriteLine(new KeyValuePair<TKey, TValue>(key1, value2).GetHashCode());
            Console.WriteLine(new KeyValuePair<TKey, TValue>(key2, value1).GetHashCode());
            Console.WriteLine(new KeyValuePair<TKey, TValue>(key2, value2).GetHashCode());
            Console.WriteLine();
        }
    }
    

    The output on my machine is:

    Testing UInt32, UInt16
    -1888265981
    -1888265981
    -1888265806
    -1888265806
    
    Testing UInt16, UInt16
    -466800447
    -459525951
    -466800528
    -459526032
    
    Testing UInt32, UInt32
    958334947
    958334802
    958334802
    958334947
    
    Testing UInt16, UInt32
    -1913331935
    -1913331935
    -1913331935
    -1913331935
    

    You can obviously try varying the sample values to see where there are collisions.

    The results of KeyValuePair<ushort, uint> are particularly worrying, and the results of KeyValuePair<ushort, ushort> are surprisingly good.

    In fact, KeyValuePair<ushort, uint> isn't just bad - it's ludicrously bad as far as I can see - I haven't to find any value which doesn't have the same hash code of -1913331935 when running the 64 bit CLR. Running the 32 bit CLR I get a different hash code, but still the same hash code for all values.

    It appears that in .NET 4.5 (which is what I'm running) the default implementation of GetHashCode doesn't just take the first instance field of the struct, as previously documented. I suspect that for at least some types, it just uses the first 4 bytes of memory beyond the header in the boxed value (and there will be boxing for every call here), and that ends up sometimes being just the first field (if that field is a uint), sometimes being more than one field (e.g. for ushort, ushort where both fields can fit "inside" 4 bytes) and sometimes being no fields at all (ushort, uint).

    (Actually, this doesn't explain why you get 1024 different hash codes in the uint, uint case instead of just 1000. I'm still unsure on that.)

    Ultimately, using a value type which doesn't override GetHashCode as a dictionary key seems like it's just a bad idea, unless you've tested to ensure that it's suitable for your specific requirements. There's just too much which is black magic to be confident about it, IMO.