Search code examples
c#linqjoininner-joinhashset

Need help understanding unexpected behavior using LINQ Join with HashSet<T>


I encountered some odd behavior using C# HastSet with LINQ's Join method that I don't understand. I've simplified what I am doing to help focus on the behavior I am seeing.

I have the following:

 private HashSet<MyClass> _mySet; // module level

 IEnumerable<ISearchKey> searchKeys; // parameter.
 // Partial key searches are allowed.

 private IEqualityComparer<ICoreKey> _coreKeyComparer; // Module level.
 // Compares instances of MyClass and ISearchKey to determine 
 // if they match.

Given that

  1. There is a 1-to-many relationship between searchKeys and _mySet.
  2. MyClass implements interface IPartialKey and ICoreKey.
  3. ISearchKey inherits from IPartialKey and ICoreKey.
  4. MyClass and ISearchKey instance both override the GetHashCode method.
  5. MyClass's hash code value is based on its full key value, which includes its ICoreKey and IPartialKey values plus other fields.
  6. The full keys used by MyClass are not unique. Two different MyClass instances can have the same hash code.
  7. ISearchKey's hash code value is based only on its ICoreKey and IPartialKey values. i.e. The ISearchKey hash code might differ from the hash code for a matching MyClass instance. (Side note: in the case where I first encountered the problem, the ISearchKey's IPartialKey values match the MyClass full key, so the GetHashCode methods would return the same value for both ISearchKey and MyClass. I included the additional complexity to better illustrate the underlying logic on what I am doing.)
  8. The _coreKeyComparer.GetHashCode method returns the same value for matching instances of ISearchKey and MyClass using only their ICoreKey values.
  9. The _coreKeyComparer.Equals method cast the parameters to MyClass and ISearchKey respectively and returns true if their IPartialKey values match. (Side note: _coreKeyComparer has been HEAVILY tested and works correctly.)

I expected a join between the two collections should result in something like:

{searchKey_a, myClass_a1},
{searchKey_a, myClass_a2},
{searchKey_a, myClass_a3},
{searchKey_b, myClass_b1},
{searchKey_b, myClass_b2},
{searchKey_c, myClass_c1},
{searchKey_c, myClass_c2},
{searchKey_c, myClass_c3},
{searchKey_c, myClass_c4},
etc....

i.e The same ISearchKey instance would occur multiple times, once for each matching MyClass instance it is joined to.

But when I do a join from searchKeys to _mySet:

        var matchedPairs = searchKeys
          .Join(
            _mySet,
            searchKey => searchKey,
            myClass => myClass,
            (searchKey, myClass) => new {searchKey, myClass},
            _coreKeyComparer)
            .ToList();

I only get one MyClass instance per searchKeyClass instance. i.e. The matchedPairs collection looks like:

    {searchKey_a, myClass_a1},
    {searchKey_b, myClass_b1},
    {searchKey_c, myClass_c1},
etc....

However if I reverse the join, go from _mySet to searchKeys:

   var matchedPairs = _mySet
          .Join(
            searchKeys,
            myClass => myClass,
            searchKey => searchKey,
            (myClass, searchKey) => new {searchKey, myClass},
            _coreKeyComparer)
            .ToList();

I get the correct matchedPairs collection. All the matching records from _mySet are returned along with the searchKey which they matched against.

I checked the documentation and examined multiple examples and don't see any reason why the searchKeys-to-_mySet Join gives the wrong answer, while the _mySet-to-searchKeys gives the correct/different answer.

(Side note: I also tried GroupJoin from searchKeys to _myset and go similiar results. i.e. Each searchKeyClass instance found at most one result from _mySet.)

Either I don't understand how the Join method is supposed to work, or Join works differently with HashSet than it does with List or other type of collection.

If the former, I need clarification so I don't make mistakes using Join in the future.

If the latter, then is this differing behavior a .Net bug, or is this the correct behavior with HashSet?

Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.

Just to be clear, I've already fixed my code so it return the correct results, I just want to understand why I got incorrect results initially.


Solution

  • Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.

    Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.

    Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join does, and perhaps that will help.

    Join takes the following:

    • An "outer" collection -- the receiver of the Join.
    • An "inner" collection -- the first argument of the extension method
    • Two key extractors, that extract a key from the outer and inner collections
    • A projection, that takes a member of the inner and outer collections whose keys match, and produces the result for that match
    • A comparison operation that compares two keys for equality.

    Here's how Join works. (This is logically what happens; the actual implementation details are somewhat optimized.)

    First, we iterate over the "inner" collection, exactly once.

    For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.

    Thus, we now have a lookup from TKey to IEnumerable<TInner>.

    Second, we iterate over the "outer" collection, exactly once.

    For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.

    We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.

    That is, Join behaves like this pseudocode implementation:

    static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
      (IEnumerable<TOuter> outer, 
      IEnumerable<TInner> inner, 
      Func<TOuter, TKey> outerKeySelector, 
      Func<TInner, TKey> innerKeySelector, 
      Func<TOuter, TInner, TResult> resultSelector, 
      IEqualityComparer<TKey> comparer) 
    {
      var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
      foreach(TInner innerItem in inner)
      {
        TKey innerKey = innerKeySelector(innerItem);
        lookup.Add(innerItem, innerKey);
      }
      foreach (TOuter outerItem in outer) 
      {
        TKey outerKey = outerKeySelector(outerItem);
        foreach(TInner innerItem in lookup[outerKey])
        {
          TResult result = resultSelector(outerItem, innerItem);
          yield return result;
        }
      }
    }
    

    Some suggestions:

    • Replace all your GetHashCode implementations so that they return 0, and run all your tests. They should pass! It is always legal to return zero from GetHashCode. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value of GetHashCode, then you have a bug.
    • Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: the equality of A and B must be the same as B and A, (3) transitivity: if A equals B and B equals C then A must equal C. If these rules are not met then Join can behave weirdly.
    • Replace your Join with a SelectMany and a Where. That is:

      from o in outer join i in inner on getOuterKey(o) equals getInnerKey(i) select getResult(o, i)

    can be rewritten as

    from o in outer
    from i in inner
    where keyEquality(getOuterKey(o), getInnerKey(i))
    select getResult(o, i)
    

    That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.

    Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.