I have a case where I want to iterate all but the last 2 elements of a collection.
Let's say I do it a weird way like x.Reverse().Skip(2).Reverse()
.
Will each LINQ operation generate effectively a nested iterator or cause enumeration, etc or is it smarter than that? What happens under the hood in this sort of case?
Clarification: this is just one example of chained LINQ statements you might see, where a developer favours short powerful code without thinking too hard about performance - maybe they are a Computer Science student and this seems the 'cleverest' solution. I am not asking how to solve this particular example
First off yes that is creating a "iterator" and does not actually do any iterating until you materialized the query in a foreach
or by calling ToList
on it. When you do that the number of iterations that occur are based on the underlying type. Reverse
will create a buffer array for whatever source you give it and iterate over it backwards. If the source is ICollection<T>
then it will use its CopyTo
method to populate the array which should usually result in a single bulk copy of contiguous data in constant time. If it's not a ICollection<T>
then it will iterate the source into the buffer and then iterate over it backwards. With that in mind here's what happens for your specific query when you iterate.
First the last Reverse
will start iterating its source (which is not an ICollection<T>
).
Then the Skip
will start iterating its source
Then the first Reverse will either do a CopyTo
if its source is a ICollection<T>
or it will iterate the source into a buffer array that it resizes as needed.
Then the first Reverse will iterate over its buffer array backwards
Then the Skip will take the results skipping the first two and yielding the rest
Then the last Reverse will take the result and add them to its buffer array and resize it as needed.
Finally the last Reverse will iterate the buffer array backwards.
So if you're dealing with a ICollecion<T>
that's one CopyTo
and then 1 iteration of all of the values and then 1 iteration of all but 2 of the values. If it's not a ICollection<T>
that's basically 3 iterations of the values (really the last iteration is of all but 2). And either way it's also using two intermediate arrays in the process.
And to prove that the query does no iterations until you materialize it you can check out this example
void Main()
{
var query = MyValues().Reverse().Skip(2).Reverse();
Console.WriteLine($"After query before materialization");
var results = query.ToList();
Console.WriteLine(string.Join(",", results));
}
public IEnumerable<int> MyValues()
{
for(int i = 0; i < 10; i ++)
{
Console.WriteLine($"yielding {i}");
yield return i;
}
}
Which produces the output
After query before materialization
yielding 0
yielding 1
yielding 2
yielding 3
yielding 4
yielding 5
yielding 6
yielding 7
yielding 8
yielding 9
0,1,2,3,4,5,6,7
When compared to the other example you had x.Take(x.Count() - 2)
, that will iterate the source before you materialize it once for the Count
(unless it's ICollection
or ICollection<T>
in which case it will just use the Count
property) then it will iterate it again when you materialize it.
Here's the same example with the different code and the resulting output.
void Main()
{
var x = MyValues();
var query = x.Take(x.Count() - 2);
Console.WriteLine($"After query before materialization");
var results = query.ToList();
Console.WriteLine(string.Join(",", results));
}
public IEnumerable<int> MyValues()
{
for(int i = 0; i < 10; i ++)
{
Console.WriteLine($"yielding {i}");
yield return i;
}
}
With the output
yielding 0
yielding 1
yielding 2
yielding 3
yielding 4
yielding 5
yielding 6
yielding 7
yielding 8
yielding 9
After query before materialization
yielding 0
yielding 1
yielding 2
yielding 3
yielding 4
yielding 5
yielding 6
yielding 7
0,1,2,3,4,5,6,7
So which one is better is completely dependent on the source. For a ICollection<T>
or ICollection
the Take
and Count
would be preferred (unless the source might change between when the query is created and when it is materialized), but if it's neither of those you might prefer the double Reverse
to avoid iterating the source twice (especially if the source can change between when you create the query and actually materialize it as the size might change as well), but that has to be weighted against the increase in total iterations done and memory usage.