Tags: c#, performance, time-complexity, hashset, premature-optimization

Is it a good practice to instantiate a HashSet from an IEnumerable before using Contains()?


The piece of code below filters one IEnumerable<T> with another, used as a blacklist. The collection being filtered iterates over content fetched remotely (lazy loading, YouTube API).

IEnumerable<CustomType> contentThatCanBeHuge = this.FetchContentThatCanBeHuge();
IEnumerable<string> blackListContent = this.FetchBlackListContent();
return contentThatCanBeHuge.Where(x => !blackListContent.Contains(x.Id));

The Enumerable.Contains method is O(n) in time complexity, so enumerating the result of the Enumerable.Where call could take a while.

On the other hand, HashSet<T>.Contains is O(1). Instantiating a HashSet<T> from an IEnumerable<T> seems to be O(n).

If the blacklist is going to be used multiple times, and without taking space complexity into account, is turning it into a HashSet<T> before using it a good approach, or is this just premature optimization?


Solution

  • Let the size of blackListContent be m and the size of contentThatCanBeHuge be n.

    If we don't use a HashSet, the time complexity is n * O(m) = O(n * m) and the space complexity is O(1): for each item in contentThatCanBeHuge we may have to scan the entire blackListContent.

    If we use a HashSet, the time complexity is O(m) + n * O(1) = O(n + m) and the space complexity is O(m):

    1. We create the HashSet - O(m) time complexity, O(m) space complexity.
    2. For each item in contentThatCanBeHuge we check whether its Id is in the HashSet - n * O(1) = O(n) time complexity.

    So far so good: the HashSet makes the code faster, but consumes more memory.
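
    A minimal sketch of that approach, reusing the identifiers from the question (lookups are O(1) on average, assuming the default string comparer and reasonable hashing), could look like this:

    IEnumerable<CustomType> contentThatCanBeHuge = this.FetchContentThatCanBeHuge();

    // Materialize the blacklist once: O(m) time, O(m) extra memory.
    HashSet<string> blackListSet = new HashSet<string>(this.FetchBlackListContent());

    // Each Contains lookup is now O(1) on average, so enumerating the result is O(n + m) overall.
    return contentThatCanBeHuge.Where(x => !blackListSet.Contains(x.Id));

    Because Where is lazy, the HashSet is built up front but the expensive remote content is still enumerated only on demand.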