I encountered some odd behavior using C# HastSet with LINQ's Join method that I don't understand. I've simplified what I am doing to help focus on the behavior I am seeing.
I have the following:
private HashSet<MyClass> _mySet; // module level
IEnumerable<ISearchKey> searchKeys; // parameter.
// Partial key searches are allowed.
private IEqualityComparer<ICoreKey> _coreKeyComparer; // Module level.
// Compares instances of MyClass and ISearchKey to determine
// if they match.
{searchKey_a, myClass_a1},
{searchKey_a, myClass_a2},
{searchKey_a, myClass_a3},
{searchKey_b, myClass_b1},
{searchKey_b, myClass_b2},
{searchKey_c, myClass_c1},
{searchKey_c, myClass_c2},
{searchKey_c, myClass_c3},
{searchKey_c, myClass_c4},
etc....
i.e The same ISearchKey instance would occur multiple times, once for each matching MyClass instance it is joined to.
var matchedPairs = searchKeys
.Join(
_mySet,
searchKey => searchKey,
myClass => myClass,
(searchKey, myClass) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I only get one MyClass instance per searchKeyClass instance. i.e. The matchedPairs collection looks like:
{searchKey_a, myClass_a1},
{searchKey_b, myClass_b1},
{searchKey_c, myClass_c1},
etc....
var matchedPairs = _mySet
.Join(
searchKeys,
myClass => myClass,
searchKey => searchKey,
(myClass, searchKey) => new {searchKey, myClass},
_coreKeyComparer)
.ToList();
I get the correct matchedPairs collection. All the matching records from _mySet are returned along with the searchKey which they matched against.
I checked the documentation and examined multiple examples and don't see any reason why the searchKeys-to-_mySet Join gives the wrong answer, while the _mySet-to-searchKeys gives the correct/different answer.
(Side note: I also tried GroupJoin from searchKeys to _myset and go similiar results. i.e. Each searchKeyClass instance found at most one result from _mySet.)
Either I don't understand how the Join method is supposed to work, or Join works differently with HashSet than it does with List or other type of collection.
If the former, I need clarification so I don't make mistakes using Join in the future.
If the latter, then is this differing behavior a .Net bug, or is this the correct behavior with HashSet?
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Just to be clear, I've already fixed my code so it return the correct results, I just want to understand why I got incorrect results initially.
Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.
Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior.
Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join
does, and perhaps that will help.
Join
takes the following:
Join
.Here's how Join
works. (This is logically what happens; the actual implementation details are somewhat optimized.)
First, we iterate over the "inner" collection, exactly once.
For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.
Thus, we now have a lookup from TKey
to IEnumerable<TInner>
.
Second, we iterate over the "outer" collection, exactly once.
For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.
We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.
That is, Join
behaves like this pseudocode implementation:
static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
(IEnumerable<TOuter> outer,
IEnumerable<TInner> inner,
Func<TOuter, TKey> outerKeySelector,
Func<TInner, TKey> innerKeySelector,
Func<TOuter, TInner, TResult> resultSelector,
IEqualityComparer<TKey> comparer)
{
var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
foreach(TInner innerItem in inner)
{
TKey innerKey = innerKeySelector(innerItem);
lookup.Add(innerItem, innerKey);
}
foreach (TOuter outerItem in outer)
{
TKey outerKey = outerKeySelector(outerItem);
foreach(TInner innerItem in lookup[outerKey])
{
TResult result = resultSelector(outerItem, innerItem);
yield return result;
}
}
}
Some suggestions:
GetHashCode
implementations so that they return 0
, and run all your tests. They should pass! It is always legal to return zero from GetHashCode
. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value of GetHashCode
, then you have a bug. A
and B
must be the same as B
and A
, (3) transitivity: if A
equals B
and B
equals C
then A
must equal C
. If these rules are not met then Join
can behave weirdly.Replace your Join
with a SelectMany
and a Where
. That is:
from o in outer
join i in inner on getOuterKey(o) equals getInnerKey(i)
select getResult(o, i)
can be rewritten as
from o in outer
from i in inner
where keyEquality(getOuterKey(o), getInnerKey(i))
select getResult(o, i)
That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.
Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.