Search code examples
c#performancelinqplinqbatching

How to segmentate an IList<T> to segments of N size, without creating copies and without memory allocations?


I have a very large collection that implements the generic IList<T> interface and contains tens of millions of elements, and I would like to process them in parallel using PLINQ. I noticed that the overhead of parallelism is quite significant because processing each individual element is very lightweight, so I am searching for ways to chunkify the processing by splitting the IList<T> into small segments. My goal is to have finally something like this:

IList<Person> source = GetAllPersons();

double averageAge = source
    .Segmentate(1000) // Hypothetical operator that segmentates the source
    .AsParallel()
    .Select(segment => segment.Select(person => (double)person.CalculateAge()).Sum())
    .Sum() / source.Count;

I could use the Batch operator from the MoreLinq library, or any of the answers from many related questions, but all of these solutions are allocating multiple small arrays (or lists), and are copying the source into these containers, and I don't want that. In my case I have the additional requirement of keeping the garbage collector idle as much as possible.

I noticed that the .NET has the ArraySegment type, that seems to fit perfectly with my requirements:

// Delimits a section of a one-dimensional array.
public readonly struct ArraySegment<T> : ICollection<T>, IEnumerable<T>,
    IEnumerable, IList<T>, IReadOnlyCollection<T>, IReadOnlyList<T>

I could use this type to implement the allocation-free Segmentate operator like this:

/// <summary>Segmentates the source array into sized segments.</summary>
public static IEnumerable<ArraySegment<T>> Segmentate<T>(this T[] source, int size)
{
    for (int offset = 0; offset < source.Length; offset += size)
    {
        yield return new ArraySegment<T>(source, offset,
            Math.Min(size, source.Length - offset));
    }
}

But I can't use this type because my source is an IList<T> and not an array. And copying it to an array is not really an option, because the source is mutated frequently. And creating new array copies all the time is against my requirements.

So I am searching for a ListSegment<T> type, but as far as I can see it doesn't exist in .NET. Do I have to implement it myself? And if so, how? Or is any other way to segmentate an IList<T> without causing allocations?


Clarification: My source collection is not a List<T>. It is a custom class that implements the IList<T> interface.


Solution

  • You need to implement an ArraySegment<T> equivalent for IList<T>. See implementation below. For optimal performance, consider using spans instead.

    ListSegment<T> Struct

    public readonly struct ListSegment<T> : IList<T>
    {
        public List<T> Items { get; }
        public int Offset { get; }
        public int Count { get; }
    
        public ListSegment(List<T> items, int offset, int count)
        {
            Items = items ?? throw new ArgumentNullException(nameof(items));
            Offset = offset;
            Count = count;
    
            if (items.Count < offset + count)
            {
                throw new ArgumentException("List segment out of range.", nameof(count));
            }
        }
    
        public void CopyTo(T[] array, int index)
        {
            if (Count > 0)
            {
                Items.CopyTo(Offset, array, index, Count);
            }
        }
    
        public bool Contains(T item) => IndexOf(item) != -1;
    
        public int IndexOf(T item)
        {
            for (var i = Offset; i < Offset + Count; i++)
            {
                if (Items[i].Equals(item))
                {
                    return i;
                }
            }
    
            return -1;
        }
    
        private T ElementAt(int index)
        {
            if (Count > 0)
            {
                return Items[Offset + index];
            }
    
            throw new ArgumentOutOfRangeException(nameof(index));
        }
    
        public ListSegmentEnumerator GetEnumerator() => new ListSegmentEnumerator(this);
    
        #region IEnumerable<T> interface
        IEnumerator<T> IEnumerable<T>.GetEnumerator() => GetEnumerator();
        IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
        #endregion
    
        #region ICollection<T> interface
        bool ICollection<T>.IsReadOnly => true;
    
        void ICollection<T>.Add(T item) => throw new NotImplementedException();
        bool ICollection<T>.Remove(T item) => throw new NotImplementedException();
        void ICollection<T>.Clear() => throw new NotImplementedException();
        #endregion
    
        #region IList<T> interface
        void IList<T>.Insert(int index, T item) => throw new NotImplementedException();
        void IList<T>.RemoveAt(int index) => throw new NotImplementedException();
        T IList<T>.this[int index]
        {
            get => ElementAt(index);
            set => throw new NotImplementedException();
        }
        #endregion
    
        public struct ListSegmentEnumerator : IEnumerator<T>
        {
            private readonly List<T> items;
            private readonly int start;
            private readonly int end;
            private int current;
    
            public ListSegmentEnumerator(ListSegment<T> segment)
            {
                items = segment.Items;
                start = segment.Offset;
                end = start + segment.Count;
                current = start - 1;
            }
    
            public bool MoveNext()
            {
                if (current < end)
                {
                    current++;
    
                    return current < end;
                }
                return false;
            }
    
            public T Current => items[current];
            object IEnumerator.Current => Current;
    
            void IEnumerator.Reset() => current = start - 1;
            void IDisposable.Dispose() { }
        }
    }