Search code examples
c#.netperformancelinq

How efficient are chained LINQ statements?


I have a case where I want to iterate all but the last 2 elements of a collection.

Let's say I do it a weird way like x.Reverse().Skip(2).Reverse().

Will each LINQ operation generate effectively a nested iterator or cause enumeration, etc or is it smarter than that? What happens under the hood in this sort of case?


Clarification: this is just one example of chained LINQ statements you might see, where a developer favours short powerful code without thinking too hard about performance - maybe they are a Computer Science student and this seems the 'cleverest' solution. I am not asking how to solve this particular example


Solution

  • First off yes that is creating a "iterator" and does not actually do any iterating until you materialized the query in a foreach or by calling ToList on it. When you do that the number of iterations that occur are based on the underlying type. Reverse will create a buffer array for whatever source you give it and iterate over it backwards. If the source is ICollection<T> then it will use its CopyTo method to populate the array which should usually result in a single bulk copy of contiguous data in constant time. If it's not a ICollection<T> then it will iterate the source into the buffer and then iterate over it backwards. With that in mind here's what happens for your specific query when you iterate.

    First the last Reverse will start iterating its source (which is not an ICollection<T>).

    Then the Skip will start iterating its source

    Then the first Reverse will either do a CopyTo if its source is a ICollection<T> or it will iterate the source into a buffer array that it resizes as needed.

    Then the first Reverse will iterate over its buffer array backwards

    Then the Skip will take the results skipping the first two and yielding the rest

    Then the last Reverse will take the result and add them to its buffer array and resize it as needed.

    Finally the last Reverse will iterate the buffer array backwards.

    So if you're dealing with a ICollecion<T> that's one CopyTo and then 1 iteration of all of the values and then 1 iteration of all but 2 of the values. If it's not a ICollection<T> that's basically 3 iterations of the values (really the last iteration is of all but 2). And either way it's also using two intermediate arrays in the process.

    And to prove that the query does no iterations until you materialize it you can check out this example

    void Main()
    {
        var query = MyValues().Reverse().Skip(2).Reverse();
        Console.WriteLine($"After query before materialization");
        var results = query.ToList();
        Console.WriteLine(string.Join(",", results));
    }
    
    public IEnumerable<int> MyValues()
    {
        for(int i = 0; i < 10; i ++)
        {
            Console.WriteLine($"yielding {i}");
            yield return i;
        }
    }
    

    Which produces the output

    After query before materialization
    yielding 0
    yielding 1
    yielding 2
    yielding 3
    yielding 4
    yielding 5
    yielding 6
    yielding 7
    yielding 8
    yielding 9
    0,1,2,3,4,5,6,7
    

    When compared to the other example you had x.Take(x.Count() - 2), that will iterate the source before you materialize it once for the Count (unless it's ICollection or ICollection<T> in which case it will just use the Count property) then it will iterate it again when you materialize it.

    Here's the same example with the different code and the resulting output.

    void Main()
    {
        var x = MyValues();
        var query = x.Take(x.Count() - 2);
        Console.WriteLine($"After query before materialization");
        var results = query.ToList();
        Console.WriteLine(string.Join(",", results));
    }
    
    public IEnumerable<int> MyValues()
    {
        for(int i = 0; i < 10; i ++)
        {
            Console.WriteLine($"yielding {i}");
            yield return i;
        }
    }
    

    With the output

    yielding 0
    yielding 1
    yielding 2
    yielding 3
    yielding 4
    yielding 5
    yielding 6
    yielding 7
    yielding 8
    yielding 9
    After query before materialization
    yielding 0
    yielding 1
    yielding 2
    yielding 3
    yielding 4
    yielding 5
    yielding 6
    yielding 7
    0,1,2,3,4,5,6,7
    

    So which one is better is completely dependent on the source. For a ICollection<T> or ICollection the Take and Count would be preferred (unless the source might change between when the query is created and when it is materialized), but if it's neither of those you might prefer the double Reverse to avoid iterating the source twice (especially if the source can change between when you create the query and actually materialize it as the size might change as well), but that has to be weighted against the increase in total iterations done and memory usage.