While looking at System.Linq.Enumerable through Reflector, I noticed that the default iterator used for the Select and Where extension methods, WhereSelectArrayIterator, does not implement the ICollection interface. If I read the code properly, this causes some other extension methods, such as Count() and ToList(), to perform slower:
public static IEnumerable<TResult> Select<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> selector)
{
    // code above snipped
    if (source is List<TSource>)
    {
        return new WhereSelectListIterator<TSource, TResult>((List<TSource>) source, null, selector);
    }
    // code below snipped
}

private class WhereSelectListIterator<TSource, TResult> : Enumerable.Iterator<TResult>
{
    // Fields
    private List<TSource> source; // class has access to List source so can implement ICollection
    // code below snipped
}

public class List<T> : IList<T>, ICollection<T>, IEnumerable<T>, IList, ICollection, IEnumerable
{
    public List(IEnumerable<T> collection)
    {
        ICollection<T> is2 = collection as ICollection<T>;
        if (is2 != null)
        {
            int count = is2.Count;
            this._items = new T[count];
            is2.CopyTo(this._items, 0); // FAST
            this._size = count;
        }
        else
        {
            this._size = 0;
            this._items = new T[4];
            using (IEnumerator<T> enumerator = collection.GetEnumerator())
            {
                while (enumerator.MoveNext())
                {
                    this.Add(enumerator.Current); // SLOW, CAUSES ARRAY EXPANSION
                }
            }
        }
    }
}
I've tested this with results confirming my suspicion:
ICollection: 2388.5222 ms
IEnumerable: 3308.3382 ms
Here's the test code:
// prepare source
var n = 10000;
var source = new List<int>(n);
for (int i = 0; i < n; i++) source.Add(i);

// Test List creation using ICollection
var startTime = DateTime.Now;
for (int i = 0; i < n; i++)
{
    foreach (int l in source.Select(k => k)) { } // iterate to make the comparison fair
    new List<int>(source);
}
var finishTime = DateTime.Now;
Response.Write("ICollection: " + (finishTime - startTime).TotalMilliseconds + " ms <br />");

// Test List creation using IEnumerable
startTime = DateTime.Now;
for (int i = 0; i < n; i++) new List<int>(source.Select(k => k));
finishTime = DateTime.Now;
Response.Write("IEnumerable: " + (finishTime - startTime).TotalMilliseconds + " ms");
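For anyone who wants to repeat the measurement outside a web page, here is a minimal sketch of the same comparison as a console program. It assumes System.Diagnostics.Stopwatch in place of DateTime.Now (whose resolution is fairly coarse) and adds a short warm-up pass; the class name ListCreationBenchmark and the Measure helper are my own, not anything from the framework. Timings will vary by machine and runtime version.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class ListCreationBenchmark
{
    // Times `iterations` constructions of a List<int> from whatever sequence the factory returns.
    public static TimeSpan Measure(Func<IEnumerable<int>> makeSequence, int iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            List<int> copy = new List<int>(makeSequence());
            GC.KeepAlive(copy); // keep the allocation observable
        }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        const int n = 10000;
        List<int> source = Enumerable.Range(0, n).ToList();

        // ICollection path: iterate the projection first (as in the original test), then copy the list itself.
        Func<IEnumerable<int>> viaCollection = () =>
        {
            foreach (int ignored in source.Select(k => k)) { }
            return source;
        };
        // IEnumerable path: hand the projection straight to the List<int> constructor.
        Func<IEnumerable<int>> viaEnumerable = () => source.Select(k => k);

        // Warm up so JIT compilation is not part of the measurement.
        Measure(viaCollection, 10);
        Measure(viaEnumerable, 10);

        Console.WriteLine("ICollection: " + Measure(viaCollection, n).TotalMilliseconds + " ms");
        Console.WriteLine("IEnumerable: " + Measure(viaEnumerable, n).TotalMilliseconds + " ms");
    }
}
```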
Am I missing something, or will this be fixed in future versions of the framework?
Thank you for your thoughts.
LINQ to Objects uses some tricks to optimize certain operations. For example, if you chain two .Where statements together, the predicates will be combined into a single WhereArrayIterator, so the previous ones can be garbage collected. Likewise, a Where followed by a Select will create a WhereSelectArrayIterator, passing the combined predicates as an argument so that the original WhereArrayIterator can be garbage collected. So the WhereSelectArrayIterator is responsible for tracking not only the selector, but also the combined predicate that it may or may not be based on.
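To make the idea of combining predicates concrete, here is a rough sketch. The Combine helper is a hypothetical stand-in for whatever the framework does internally, not the actual System.Linq code; the point is only that two chained filters behave the same as one filter that applies both tests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PredicateCombiningSketch
{
    // Hypothetical helper: merges two filters into a single delegate.
    public static Func<T, bool> Combine<T>(Func<T, bool> first, Func<T, bool> second)
    {
        return x => first(x) && second(x);
    }

    public static void Main()
    {
        int[] numbers = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        Func<int, bool> isEven = x => x % 2 == 0;
        Func<int, bool> isLarge = x => x > 4;

        // Two chained Where calls followed by a Select...
        IEnumerable<int> chained = numbers.Where(isEven).Where(isLarge).Select(x => x * 10);
        // ...yield the same items as one Where with the combined predicate.
        IEnumerable<int> combined = numbers.Where(Combine(isEven, isLarge)).Select(x => x * 10);

        Console.WriteLine(chained.SequenceEqual(combined)); // True
        Console.WriteLine(string.Join(", ", combined.Select(v => v.ToString()).ToArray())); // 60, 80, 100
    }
}
```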
The source field only keeps track of the initial list that was given. Because of the predicate, the iteration result will not always have the same number of items as source does. Since LINQ is intended to be lazily evaluated, it shouldn't evaluate the source against the predicate ahead of time just so that it can potentially save time if someone ends up calling .Count(). That would cause just as much of a performance hit as calling .ToList() on it manually, and if the user ran it through multiple Where and Select clauses, you'd end up constructing multiple lists unnecessarily.
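A small demonstration of why the count cannot be known up front, and of the laziness involved. This is only an illustrative sketch; the class name DeferredCountDemo and the call counter are mine. The predicate is not invoked until Count() enumerates the query, and the result tracks later changes to the source list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredCountDemo
{
    // Returns { predicate calls before Count(), first Count(), predicate calls after Count(), Count() after Add }.
    public static int[] Run()
    {
        int predicateCalls = 0;
        List<int> source = new List<int> { 1, 2, 3, 4, 5, 6 };

        IEnumerable<int> query = source
            .Where(x => { predicateCalls++; return x % 2 == 0; })
            .Select(x => x * x);

        int callsBefore = predicateCalls; // building the query evaluates nothing
        int firstCount = query.Count();   // enumerates the source and runs the predicate on each item
        int callsAfter = predicateCalls;

        source.Add(8);                    // the query sees this on its next enumeration
        int secondCount = query.Count();

        return new[] { callsBefore, firstCount, callsAfter, secondCount };
    }

    public static void Main()
    {
        int[] r = Run();
        Console.WriteLine("Predicate calls before Count(): " + r[0]); // 0
        Console.WriteLine("First Count(): " + r[1]);                  // 3 (source has 6 items)
        Console.WriteLine("Predicate calls after Count(): " + r[2]);  // 6
        Console.WriteLine("Count() after Add(8): " + r[3]);           // 4
    }
}
```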
Could LINQ to Objects be refactored to create a SelectArrayIterator that it uses when Select gets called directly on an array? Sure. Would it enhance performance? A little bit. At what cost? Less code reuse means additional code to maintain and test moving forward.
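As a rough idea of what such a class might look like, here is a sketch of a select-only wrapper over an array that implements ICollection<TResult>. The type SelectArrayCollection is entirely hypothetical and is not part of System.Linq. It can report Count without running the selector only because a Select with no predicate always yields exactly one result per source element.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical type, not part of System.Linq: a lazy, read-only projection of an array.
public sealed class SelectArrayCollection<TSource, TResult> : ICollection<TResult>
{
    private readonly TSource[] source;
    private readonly Func<TSource, TResult> selector;

    public SelectArrayCollection(TSource[] source, Func<TSource, TResult> selector)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (selector == null) throw new ArgumentNullException("selector");
        this.source = source;
        this.selector = selector;
    }

    // Known without invoking the selector: one output per input.
    public int Count { get { return source.Length; } }

    public bool IsReadOnly { get { return true; } }

    // Lets List<T>'s constructor allocate once and fill the array directly.
    public void CopyTo(TResult[] array, int arrayIndex)
    {
        if (array == null) throw new ArgumentNullException("array");
        if (arrayIndex < 0 || array.Length - arrayIndex < source.Length)
            throw new ArgumentException("Destination array is too small.");
        for (int i = 0; i < source.Length; i++)
            array[arrayIndex + i] = selector(source[i]);
    }

    public bool Contains(TResult item)
    {
        EqualityComparer<TResult> comparer = EqualityComparer<TResult>.Default;
        foreach (TResult value in this)
            if (comparer.Equals(value, item)) return true;
        return false;
    }

    // Still lazy: the selector runs only as items are requested.
    public IEnumerator<TResult> GetEnumerator()
    {
        foreach (TSource item in source)
            yield return selector(item);
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // A projection cannot be modified.
    public void Add(TResult item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public bool Remove(TResult item) { throw new NotSupportedException(); }
}

public static class SelectArrayCollectionDemo
{
    public static void Main()
    {
        var view = new SelectArrayCollection<int, string>(new[] { 1, 2, 3 }, x => "#" + x);
        List<string> list = new List<string>(view); // takes the ICollection<T> branch of the constructor
        Console.WriteLine(view.Count);              // 3
        Console.WriteLine(string.Join(", ", list.ToArray())); // #1, #2, #3
    }
}
```

Even this small version needs argument checks, read-only members, and its own tests, which is the maintenance cost in question.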
And thus we get to the crux of the vast majority of "Why doesn't language/platform X have feature Y" questions: every feature and optimization has some cost associated with it, and even Microsoft doesn't have unlimited resources. Just like every other company out there, they make judgment calls to determine how often code will be run that performs a Select on an array and then calls .ToList() on it, and whether making that run a little faster is worth writing and maintaining another class in the LINQ package.