c# linq .net-core ienumerable yield-return

Understanding behaviour of custom Linq Chunk and IEnumerable<IEnumerable<T>>


I tried to implement a custom LINQ Chunk function and found this code example. The function should split an IEnumerable into an IEnumerable of batches of a given size.

public static class EnumerableExtentions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        using (var enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Batch()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }
                yield return Batch();
            }
        }
    }
}

So, I have a question: why are the results incorrect when I run LINQ operations on the output? For example:

IEnumerable<int> list = Enumerable.Range(0, 10);
Console.WriteLine(list.Batch(2).Count()); // 10 instead of 5

I have an assumption, that it happens because inner IEnumerable Batch() is only triggered when Count() is called, and something goes wrong there, but I don't know what exactly.


Solution

  • I have an assumption, that it happens because inner IEnumerable Batch() is only triggered when Count() is called

    It's the opposite. The inner IEnumerable is not consumed when you call Count. Count only consumes the outer IEnumerable, which is this one:

    while (enumerator.MoveNext())
    {
        int i = 0;
        IEnumerable<T> Batch()
        {
            // the below is not executed by Count!
            // do yield return enumerator.Current;
            // while (++i < size && enumerator.MoveNext());
        }
        yield return Batch();
    }
    

    So all Count does is move the enumerator to the end and count how many times it moved, which is 10.

    Compare that to how the author of this code likely intended it to be used:

    foreach (var batch in someEnumerable.Batch(2)) {
        foreach(var thing in batch) {
            // ...
        }
    }
    

    Here I'm also consuming the inner IEnumerables with an inner loop, which runs the code inside the inner Batch. With a size of 2, this yields the current element, then moves the source enumerator forward and yields the new current element, after which the ++i < size check fails. The outer loop then moves the enumerator forward again for the next iteration. That is how you get a "batch" of two elements.

    Notice that the enumerator (which came from someEnumerable) in the previous paragraph is shared between the inner and outer IEnumerables. Consuming either one moves it. The sequence of steps in the previous paragraph only happens when you consume both the inner and outer IEnumerables in that specific order, and only then do you get batches.
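
    To make the difference between the two consumption patterns visible, here is a minimal sketch. It assumes an instrumented copy of the Batch method from the question; the InnerRuns counter, the Inner local function name, and the OuterOnly and Nested helpers are hypothetical names added only for this demo.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchDemo
{
    // Hypothetical counter for this demo: how many inner batches started running.
    public static int InnerRuns;

    // Same logic as Batch in the question, plus the counter.
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        using (var enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Inner()
                {
                    InnerRuns++; // reached only when the inner sequence is enumerated
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }
                yield return Inner();
            }
        }
    }

    // Consumes only the outer sequence, as Count() does.
    public static (int Outer, int Inner) OuterOnly()
    {
        InnerRuns = 0;
        int outer = Enumerable.Range(0, 10).Batch(2).Count();
        return (outer, InnerRuns);
    }

    // Consumes each inner sequence fully before asking for the next batch.
    public static (int Outer, int Inner) Nested()
    {
        InnerRuns = 0;
        int outer = 0;
        foreach (var batch in Enumerable.Range(0, 10).Batch(2))
        {
            outer++;
            foreach (var item in batch) { }
        }
        return (outer, InnerRuns);
    }

    public static void Main()
    {
        Console.WriteLine(OuterOnly()); // (10, 0)
        Console.WriteLine(Nested());    // (5, 5)
    }
}
```

    With outer-only consumption the inner body never runs, so the outer loop is the only thing advancing the shared enumerator and it yields once per element. With nested consumption each inner sequence advances the enumerator by one extra step, so the outer loop yields five times.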

    In your case, you can consume the inner IEnumerables by calling ToList:

    Console.WriteLine(list.Batch(2).Select(x => x.ToList()).Count()); // 5
    

    While sharing the enumerator here allows the batches to be lazily consumed, it restricts client code to consuming the result in very specific ways. In the .NET 6 implementation of Chunk, the batches (chunks) are eagerly computed as arrays:

    public static IEnumerable<TSource[]> Chunk<TSource>(this IEnumerable<TSource> source, int size)
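
    As a quick illustration, the sketch below calls the built-in Chunk on the same range as in the question. It assumes a project targeting .NET 6 or later; the ChunkDemo class and CountChunks helper are hypothetical names for this example.

```csharp
using System;
using System.Linq;

public static class ChunkDemo
{
    // Counts the chunks without touching their contents.
    public static int CountChunks() => Enumerable.Range(0, 10).Chunk(2).Count();

    public static void Main()
    {
        Console.WriteLine(CountChunks()); // 5

        // Each chunk is already a materialized int[].
        foreach (int[] chunk in Enumerable.Range(0, 10).Chunk(2))
            Console.WriteLine(string.Join(", ", chunk)); // first line: 0, 1
    }
}
```

    Because every chunk is an array, Count gives 5 here even though nothing reads the chunk contents.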
    

    You can do a similar thing in your Batch by calling ToArray() here:

    yield return Batch().ToArray();
    

    so that the inner IEnumerables are always consumed.
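
    Put together, the change could look like the sketch below. It keeps the logic from the question and only materializes each batch; the BatchEager name, the T[] element type, and the EagerBatchExtensions and Program classes are my own choices for this example, not something from the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EagerBatchExtensions
{
    // Hypothetical name; same logic as Batch in the question, but every batch
    // is copied into an array before it is handed to the caller.
    public static IEnumerable<T[]> BatchEager<T>(this IEnumerable<T> source, int size)
    {
        using (var enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Batch()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }
                // ToArray() runs the inner iterator right away, so the shared
                // enumerator is advanced past the whole batch before yielding.
                yield return Batch().ToArray();
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var batches = Enumerable.Range(0, 10).BatchEager(2);
        Console.WriteLine(batches.Count()); // 5
        Console.WriteLine(string.Join(" | ", batches.Select(b => string.Join(",", b))));
        // 0,1 | 2,3 | 4,5 | 6,7 | 8,9
    }
}
```

    The trade-off is that each batch is buffered in memory, but the result no longer depends on how the caller consumes it.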