Search code examples
c#web-crawlertask-parallel-librarylinq-to-objectsplinq

ParallelEnumerable.GroupBy: how lazy it is?


MSDN says that ParallelEnumerable.GroupBy groups in parallel the elements of a sequence according to a specified key selector function. So my question is: How lazy it is ?

It's clear that ParallelQuery<IGrouping<,>> is lazy. But what about IGrouping<> itself, is it lazy as well ?

So, if I do the following:

var entities = sites.AsParallel()
                         .Select(x => GetDataItemsFromWebsiteLazy(x))
                         .SelectMany(x => x)
                         .GroupBy(dataItem => dataItem.Url.Host)
                         .AsParallel()
                             .SelectMany(x => TransformToEntity(x));

Will TransformToEntity be called first time after all sites will fetch results?
Or as soon as first GetDataItemsFromWebsiteLazy() method will yield return an element?

The point of all that is to fire requests to different hosts in parallel.
Data processing goes as follows. For every website in a set:

  1. Request website
  2. Parse response and extract another site url
  3. Request site by extracted url
  4. Parse response and create entity from obtained data

Solution

  • The GroupBy extension is, in fact, not lazy at all (or, more accurately, not deferred at all), as can be easily demonstrated with the following test program:

    void Main()
    {
        var source = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }.AsParallel();
        var groupEven = GetEvenNumbersUsingGroupBy(source);
        // foreach (int num in groupEven) { }
    }
    
    IEnumerable<int> GetEvenNumbersUsingGroupBy(IEnumerable<int> source)
    {
        Console.WriteLine("Method called: GetEvenNumbersUsingGroupBy");
        var grouped = source.GroupBy(i => i % 2);
        return grouped.Where(g => g.Key == 0).Single();
    }
    

    This program outputs the following:

    Method called: GetEvenNumbersUsingGroupBy

    Meaning that even though we never actually iterate the result of the GetEvenNumbersUsingGroupBy method, it still gets executed.

    This is in contrast to a normal deferred enumerable using the yield statement, as in:

    void Main()
    {
        var yieldEven = GetEvenNumbersUsingYield(source);
        foreach (int num in yieldEven) { }
        foreach (int num in yieldEven) { }
    }
    
    IEnumerable<int> GetEvenNumbersUsingYield(IEnumerable<int> source)
    {
        Console.WriteLine("Method called: GetEvenNumbersUsingYield");
        foreach (int i in source)
            if ((i % 2) == 0)   
                yield return i;
    }
    

    This prints the following:

    Method called: GetEvenNumbersUsingYield
    Method called: GetEvenNumbersUsingYield

    In other words, each time you iterate the results, they are re-evaluated, which is a typical characteristic of deferred evaluation (as opposed to straight-up lazy loading which caches the result after the first evaluation).

    Note that this is the same whether you use AsParallel or not; it's a characteristic of the GroupBy extension (which by definition needs to build a hash table or other kind of lookup in order to store the individual groups) and wholly independent of concurrency.

    It's easy to see why this is the case if you think about how you would implement a deferred grouping function; in order to iterate all of the elements of a single group, you would have to iterate the entire sequence to be sure that you've actually covered all of the elements of that group. So while it might technically be possible to defer this one-time iteration of the entire sequence, it's probably not worth it in most cases, since it's going to have the exact same memory and CPU characteristics as the eagerly-loaded version.