
HashSet: Why ValueType.Equals(Object) is so slow


Having this method:

public void RunFD(object fileObj, ISet<object> visited)
{
    if (!visited.Add(fileObj)) return;

    if (fileObj is IEnumerable enumerableFileObj)
    {
        foreach (var aFileObj in enumerableFileObj.OfType<object>())
        {
            RunFD(aFileObj, visited);
        }
        }
    }
    else
    {
        AddToFolder(fileObj);
    }
}

fileObj is a simple object that can be a string or a custom object. Cycles such as Parent1 -> child -> Parent1 may occur, hence the use of the HashSet named visited. Here, both occurrences of Parent1 refer to the exact same instance.

Although this method is functioning as expected, it took surprisingly long to execute, and the bulk of the execution time was spent in visited.Add(fileObj).

Profiling showed that the most time-consuming path was HashSet.AddIfNotPresent -> ObjectEqualityComparer.Equals -> ValueType.Equals(Object), which accounted for 95% of my method's total execution time.

I am puzzled, as I expected HashSet operations to be O(1) on average, merely checking whether the object is the same reference as one already in the set (which is exactly what I need).

Considering that I have numerous types of classes for fileObj, should I override the Equals method on all of them? Is there a more efficient way to check whether the object is the same?

I tried to find another solution, but the only one that came up is to override the Equals and GetHashCode methods of the object. However, this is very painful, as there are many possible object types for this method.


Solution

  • Why ValueType.Equals(Object) is so slow

    One potential problem to be aware of is that ValueType.Equals can use reflection. If a value type does not override the equality members and, for example, contains reference-type fields/properties (see the list in this GitHub issue), it will use reflection to perform the comparison:

    // if there are no GC references in this object we can avoid reflection
    // and do a fast memcmp
    if (CanCompareBitsOrUseFastGetHashCode(RuntimeHelpers.GetMethodTable(obj))) // MethodTable kept alive by access to object below
    {
        return //...
    }
     
    FieldInfo[] thisFields = GetType().GetFields(BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
    
    for (int i = 0; i < thisFields.Length; i++)
    {
        object? thisResult = thisFields[i].GetValue(this);
        object? thatResult = thisFields[i].GetValue(obj);
        // ...
    }
    

    This is obviously a relatively costly operation.
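    To illustrate the first mitigation, here is a minimal sketch (FileKey is a hypothetical struct, not a type from the question): because it contains a reference-type field (the string), the default ValueType.Equals would take the reflection path, so it overrides the equality members itself:

    ```csharp
    using System;

    // Hypothetical struct for illustration only. The string field is a GC
    // reference, so without these overrides ValueType.Equals would fall
    // back to the reflection-based field-by-field comparison.
    public readonly struct FileKey : IEquatable<FileKey>
    {
        public string Path { get; }
        public int Size { get; }

        public FileKey(string path, int size) { Path = path; Size = size; }

        // Strongly-typed overload: no boxing, no reflection.
        public bool Equals(FileKey other) => Path == other.Path && Size == other.Size;

        // Also override the object-taking overload so the default comparer
        // used by HashSet never reaches the slow ValueType.Equals path.
        public override bool Equals(object obj) => obj is FileKey other && Equals(other);

        public override int GetHashCode() => HashCode.Combine(Path, Size);
    }
    ```

    Implementing IEquatable<T> additionally lets HashSet<FileKey> use the generic comparer, which avoids boxing the struct on every lookup.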

    Mitigations - define custom GetHashCode and Equals overrides on the struct, or provide a custom IEqualityComparer to the set.
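    Since the question only needs "same instance" semantics, the second mitigation can be as simple as the following sketch, using ReferenceEqualityComparer.Instance (a real API available since .NET 5):

    ```csharp
    using System;
    using System.Collections.Generic;

    // ReferenceEqualityComparer compares items by reference and hashes
    // them via RuntimeHelpers.GetHashCode, so ValueType.Equals is never
    // invoked. Caveat: value-typed items get boxed on insertion, and each
    // box is a distinct instance, so this only makes sense when the items
    // are reference types.
    var visited = new HashSet<object>(ReferenceEqualityComparer.Instance);

    var parent = new object();
    Console.WriteLine(visited.Add(parent)); // True  - first time seen
    Console.WriteLine(visited.Add(parent)); // False - same instance: cycle detected
    ```

    On older target frameworks the same effect can be had with a small custom IEqualityComparer<object> built on ReferenceEquals and RuntimeHelpers.GetHashCode.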

    Another point worth investigating is the number of times Equals is invoked, though that would require a full reproducer. O(1) is only achieved with a "good" hash function that makes collisions rare, which may not be the case for your data (or your data may simply contain many duplicate items).
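    To see how a poor hash function inflates the number of Equals calls, here is an illustrative sketch (BadKey is a made-up type): every key hashes to the same bucket, so each Add must call Equals against every item already stored, degrading the whole loop to O(n^2):

    ```csharp
    using System;
    using System.Collections.Generic;

    var set = new HashSet<BadKey>();
    for (int i = 0; i < 1000; i++)
        set.Add(new BadKey { Id = i });

    // With 1000 distinct keys in one collision chain, Equals runs
    // 0 + 1 + ... + 999 = 499500 times instead of ~0 times.
    Console.WriteLine(BadKey.EqualsCalls);

    // Hypothetical worst-case key: every instance collides.
    class BadKey
    {
        public static int EqualsCalls;
        public int Id;
        public override int GetHashCode() => 0; // all keys land in one bucket
        public override bool Equals(object obj)
        {
            EqualsCalls++;
            return obj is BadKey other && other.Id == Id;
        }
    }
    ```

    The same slowdown appears with a correct but low-entropy GetHashCode, which is why profiling the call count of Equals is worth the effort.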