The piece of code below filters an IEnumerable<T> with another one used as a blacklist. The collection being filtered is lazily loaded from a remote source (the YouTube API).
IEnumerable<CustomType> contentThatCanBeHuge = this.FetchContentThatCanBeHuge();
IEnumerable<string> blackListContent = this.FetchBlackListContent();
return contentThatCanBeHuge.Where(x => !blackListContent.Contains(x.Id));
The Enumerable.Contains method is O(n) in time complexity, so the Enumerable.Where call could take a while.
On the other hand, HashSet<T>.Contains is O(1), and instantiating a HashSet<T> from an IEnumerable<T> seems to be O(n).
If the blacklist is going to be used multiple times, and without taking space complexity into account, is it a good approach to turn it into a HashSet<T> before using it, or is this just premature optimization?
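Concretely, the HashSet<T> variant I have in mind would look roughly like this (same fetch methods as above; blackList is just a local name I picked for the sketch):

IEnumerable<CustomType> contentThatCanBeHuge = this.FetchContentThatCanBeHuge();
HashSet<string> blackList = new HashSet<string>(this.FetchBlackListContent());
return contentThatCanBeHuge.Where(x => !blackList.Contains(x.Id));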
Let the size of blackListContent be m and the size of contentThatCanBeHuge be n.
If we don't use a HashSet, the time complexity is O(n * O(m)) = O(n * m) and the space complexity is O(1): for each item in contentThatCanBeHuge we have to scan the entire blackListContent.
If we use a HashSet, the time complexity is O(m) + O(n * O(1)) = O(n + m) and the space complexity is O(m):

- Building the HashSet takes O(m) time and O(m) space.
- Checking each item of contentThatCanBeHuge against the HashSet takes O(n * O(1)) = O(n) time.

So far so good: the HashSet makes the code faster, but it consumes more memory.
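To make the trade-off concrete, here is a minimal, self-contained sketch of the HashSet approach; the Video class and the sample data are made-up stand-ins for the question's CustomType and fetch methods. The set is built once for O(m), and every pass over the content then pays only an O(1) average-cost lookup per item:

using System;
using System.Collections.Generic;
using System.Linq;

class BlackListDemo
{
    // Hypothetical stand-in for the question's CustomType.
    class Video
    {
        public string Id { get; }
        public string Title { get; }
        public Video(string id, string title) { Id = id; Title = title; }
    }

    static void Main()
    {
        // Made-up data in place of FetchBlackListContent() / FetchContentThatCanBeHuge().
        IEnumerable<string> blackListContent = new[] { "b", "d" };
        IEnumerable<Video> contentThatCanBeHuge = new[]
        {
            new Video("a", "keep me"),
            new Video("b", "drop me"),
            new Video("c", "keep me"),
            new Video("d", "drop me"),
        };

        // Build the set once: O(m) time, O(m) extra space.
        var blackList = new HashSet<string>(blackListContent);

        // Each Contains lookup is O(1) on average, so one full pass is O(n),
        // and the same set can be reused for any further passes.
        IEnumerable<Video> filtered = contentThatCanBeHuge.Where(x => !blackList.Contains(x.Id));

        foreach (var video in filtered)
            Console.WriteLine($"{video.Id}: {video.Title}");
    }
}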