Search code examples
c#collectionsparameters

Best practice for parameter: IEnumerable vs. IList vs. IReadOnlyCollection


I get when one would return an IEnumerable from a method—when there's value in deferred execution. And returning a List or IList should pretty much be only when the result is going to be modified, otherwise I'd return an IReadOnlyCollection, so the caller knows what's being returned isn't intended for modification (and this lets the method even emit reused objects from other callers).

However, on the parameter input side, I'm a little less clear. I could take an IEnumerable, but what if I need to enumerate more than once?

The saying "Be conservative in what you send, be liberal in what you accept" suggests taking an IEnumerable is good, but I'm not really sure.

For example, if there are no elements in the following IEnumerable parameter, a significant amount of work can be saved in this method by checking .Any() first, which requires ToList() before that to avoid enumerating twice.

public IEnumerable<Data> RemoveHandledForDate(IEnumerable<Data> data, DateTime dateTime) {
   var dataList = data.ToList();

   if (!dataList.Any()) {
      return dataList;
   }

   var handledDataIds = new HashSet<int>(
      GetHandledDataForDate(dateTime) // Expensive database operation
         .Select(d => d.DataId)
   );

   return dataList.Where(d => !handledDataIds.Contains(d.DataId));
}

So I'm wondering what is the best signature, here? One possibility is IList<Data> data, but accepting a list suggests that you plan to modify it, which is not correct—this method doesn't touch the original list, so IReadOnlyCollection<Data> seems better.

But IReadOnlyCollection forces callers to do ToList().AsReadOnly() every time which gets a bit ugly, even with a custom extension method .AsReadOnlyCollection. And that's not being liberal in what is accepted.

What is best practice in this situation?

This method is not returning an IReadOnlyCollection because there may be value in the final Where using deferred execution as the whole list is not required to be enumerated. However, the Select is required to be enumerated because the cost of doing .Contains would be horrible without the HashSet.

I don't have a problem with calling ToList, it just occurred to me that if I need a List to avoid multiple enumeration, why do I not just ask for one in the parameter? So the question here is, if I don't want an IEnumerable in my method, should I really accept one in order to be liberal (and ToList it myself), or should I put the burden on the caller to ToList().AsReadOnly()?

Further Information for those unfamiliar with IEnumerables

The real problem here is not the cost of Any() vs. ToList(). I understand that enumerating the entire list costs more than doing Any(). However, assume the case that the caller will consume all items in the return IEnumerable from the above method, and assume that the source IEnumerable<Data> data parameter comes from the result of this method:

public IEnumerable<Data> GetVeryExpensiveDataForDate(DateTime dateTime) {
    // This query is very expensive no matter how many rows are returned.
    // It costs 5 seconds on each `.GetEnumerator` call to get 1 value or 1000
    return MyDataProvider.Where(d => d.DataDate == dateTime);
}

Now if you do this:

var myData = GetVeryExpensiveDataForDate(todayDate);
var unhandledData = RemoveHandledForDate(myData, todayDate);
foreach (var data in unhandledData) {
   messageBus.Dispatch(data); // fully enumerate
)

And if RemovedHandledForDate does Any and does Where, you'll incur the 5 second cost twice, instead of once. This is why you should always take extreme pains to avoid enumerating an IEnumerable more than once. Do not rely on your knowledge that in fact it's harmless, because some future hapless developer may call your method some day with a newly implemented IEnumerable you never thought of, which has different characteristics.

The contract for an IEnumerable says that you can enumerate it. It does NOT promise anything about the performance characteristics of doing so more than once.

In fact, some IEnumerables are volatile and won't return any data upon a subsequent enumeration! Switching to one would be a totally breaking change if combined with multiple enumeration (and a very hard to diagnose one if the multiple enumeration was added later).

Don't do multiple enumeration of an IEnumerable.

If you accept an IEnumerable parameter, you are in effect promising to enumerate it exactly 0 or 1 times.


Solution

  • You can take an IEnumerable<T> in the method, and use a CachedEnumerable similar to the one here to wrap it.

    This class wraps an IEnumerable<T> and makes sure that it is only enumerated once. If you try to enumerate it again, it yield items from the cache.

    Please note that such wrapper does not read all items from the wrapped enumerable immediately. It only enumerates individual items from the wrapped enumerable as you enumerate individual items from the wrapper, and it caches the individual items along the way.

    This means that if you call Any on the wrapper, only a single item will be enumerated from the wrapped enumerable, and then such item will be cached.

    If you then use the enumerable again, it will first yield the first item from the cache, and then continue enumerating the original enumerator from where it left.

    You can do something like this to use it:

    public IEnumerable<Data> RemoveHandledForDate(IEnumerable<Data> data, DateTime dateTime)
    {
        var dataWrapper = new CachedEnumerable(data);
        ...
    }
    

    Notice here that the method itself is wrapping the parameter data. This way, you don't force consumers of your method to do anything.