c# linq ienumerable morelinq

IEnumerable batching discrepancy in Count() value


I am trying to batch an IEnumerable<T> into equal-sized subsets and came across the following solutions:

  1. The Batch method from the MoreLinq NuGet library, whose implementation is detailed here:

    MoreLinq - Batch (source code pasted below):

    public static IEnumerable<TResult> Batch<TSource, TResult>(
        this IEnumerable<TSource> source, int size,
        Func<IEnumerable<TSource>, TResult> resultSelector)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        if (resultSelector == null) throw new ArgumentNullException(nameof(resultSelector));
        return BatchImpl(source, size, resultSelector);
    }

    private static IEnumerable<TResult> BatchImpl<TSource, TResult>(
        this IEnumerable<TSource> source, int size,
        Func<IEnumerable<TSource>, TResult> resultSelector)
    {
        Debug.Assert(source != null);
        Debug.Assert(size > 0);
        Debug.Assert(resultSelector != null);

        TSource[] bucket = null;
        var count = 0;

        foreach (var item in source)
        {
            if (bucket == null)
            {
                bucket = new TSource[size];
            }

            bucket[count++] = item;

            // The bucket is fully buffered before it's yielded
            if (count != size)
            {
                continue;
            }

            // Select is necessary so bucket contents are streamed too
            yield return resultSelector(bucket);

            bucket = null;
            count = 0;
        }

        // Return the last bucket with all remaining elements
        if (bucket != null && count > 0)
        {
            Array.Resize(ref bucket, count);
            yield return resultSelector(bucket);
        }
    }
    
  2. Another solution, which is more memory efficient, is available at the following link:

    IEnumerable Batching (source code pasted below):

    public static class BatchLinq
    {
        public static IEnumerable<IEnumerable<T>> CustomBatch<T>(this IEnumerable<T> source, int size)
        {
            if (size <= 0)
                throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");

            using (IEnumerator<T> enumerator = source.GetEnumerator())
                while (enumerator.MoveNext())
                    yield return TakeIEnumerator(enumerator, size);
        }

        private static IEnumerable<T> TakeIEnumerator<T>(IEnumerator<T> source, int size)
        {
            int i = 0;
            do
                yield return source.Current;
            while (++i < size && source.MoveNext());
        }
    }
    

Both solutions return the end result as IEnumerable<IEnumerable<T>>.

I find the discrepancy in the following piece of code:

var result = // IEnumerable<IEnumerable<T>> obtained from either method above

result.Count() returns a different value for each method: it is correct for MoreLinq's Batch but incorrect for the other one, even though the batches themselves are correct and identical for both.

Consider the following example:

IEnumerable<int> arr = new int[10] {1,2,3,4,5,6,7,8,9,10};

For a partition size of 3:

arr.Batch(3).Count() returns 4, which is correct.

arr.CustomBatch(3).Count() returns 10, which is incorrect.

The batching result itself is correct when we call ToList(). Is the error caused by the second method still working on a lazy stream for which no memory has been allocated? Even so, it should not produce an incorrect count. Any views or suggestions?


Solution

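  • The second method returns Count = 10 because Count() enumerates only the outer sequence and never consumes the inner ones. Each inner enumerable returned by TakeIEnumerator is lazy and shares the same enumerator, so if it is not enumerated, nothing advances that enumerator past the current item. The outer while (enumerator.MoveNext()) loop therefore advances one element at a time and yields 10 times, so the result contains 10 enumerables instead of 4. When each batch is fully enumerated before the next one is requested (as in a nested foreach), the inner loop advances the shared enumerator and the batches come out correctly.

    The higher-scored answer to the referenced question, https://stackoverflow.com/a/13731854/2138959, provides a reasonable solution to this problem as well.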
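    The behaviour described above can be reproduced with a small sketch. CustomBatch is copied from the question; BufferedBatch and the CountAfterDraining helper are hypothetical names introduced here for illustration only (they are not taken from MoreLinq or from the linked answer). BufferedBatch shows one possible fix, assuming it is acceptable to hold one batch in memory at a time:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchDemo
{
    // Lazy version from the question: every inner sequence shares one enumerator.
    public static IEnumerable<IEnumerable<T>> CustomBatch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException(nameof(size), "Must be greater than zero.");

        using (IEnumerator<T> enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return TakeIEnumerator(enumerator, size);
    }

    private static IEnumerable<T> TakeIEnumerator<T>(IEnumerator<T> source, int size)
    {
        int i = 0;
        do
            yield return source.Current;
        while (++i < size && source.MoveNext());
    }

    // Hypothetical fix: fill a list before yielding it, so the number of
    // batches no longer depends on whether the caller enumerates them.
    // Note: as an iterator, the argument check runs on first enumeration.
    public static IEnumerable<List<T>> BufferedBatch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException(nameof(size), "Must be greater than zero.");

        var bucket = new List<T>(size);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket;
                bucket = new List<T>(size);
            }
        }

        // Yield the last, possibly shorter, batch.
        if (bucket.Count > 0)
            yield return bucket;
    }

    // Hypothetical helper: counts the batches while fully draining each one,
    // which is what a nested foreach over the result would do.
    public static int CountAfterDraining<T>(IEnumerable<IEnumerable<T>> batches)
    {
        int n = 0;
        foreach (var batch in batches)
        {
            foreach (var unused in batch) { }
            n++;
        }
        return n;
    }

    public static void Demo()
    {
        IEnumerable<int> arr = Enumerable.Range(1, 10);

        Console.WriteLine(arr.CustomBatch(3).Count());                 // 10: inner sequences never consumed
        Console.WriteLine(CountAfterDraining(arr.CustomBatch(3)));     // 4: each batch drained in order
        Console.WriteLine(arr.BufferedBatch(3).Count());               // 4: batches are materialized
    }
}
```

    The trade-off in this sketch is the same one MoreLinq's Batch makes: each batch is buffered, which costs up to size elements of memory but makes the outer sequence independent of how (or whether) the inner sequences are consumed.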