Search code examples
c#linqplinqmorelinq

Select Distinct Count is really slow


I have a loop of about 7000 objects and within the loop I need to get a distinct count of a list of structs. Currently I am using -

foreach (var product in productsToSearch)
{
    Console.WriteLine("Time elapsed: {0} start", stopwatch.Elapsed);
    var cumulativeCount = 0;
    productStore.Add(product);
    var orderLinesList = totalOrderLines
        .Where(myRows => productStore.Contains(myRows.Sku))
        .Select(myRows => new OrderLineStruct
        {
            OrderId = myRows.OrderId,
            Sku = myRows.Sku
        });
    var differences = totalOrderLines.Except(orderLinesList);
    cumulativeCount = totalOrderLinsCount - differences.Select(x => x.OrderId).Distinct().Count();
    cumulativeStoreTable.Rows.Add(product, cumulativeCount);      
    Console.WriteLine("Time elapsed: {0} end", stopwatch.Elapsed);
}

public struct OrderLineStruct
{
    public string OrderId { get; set; }
    public string Sku { get; set; }
}

This is extremely slow when getting the distinct count. Anyone know of a more efficient way of doing this? I have tried using MoreLinq which has a DisctintBy method for Linq but it's not more efficient as I have timed it. I have played around with PLinq but I am a little unsure of where to parallelize this query.

So each iteration of the loop is timed at -
Time elapsed: 00:00:37.1142047 start
Time elapsed: 00:00:37.8310148 end

= 0.7168101 secs * 7000 = 5017.6707 (83.627845 minutes)

Its the Distinct() Count() line which is taking the most time to process (around 0.5 secs). The variable differences has a few hundred thousand OrderLineStruct's so doing any linq queries on this is slow.

UPDATE

I have modified the loop a bit and now it runs in around 10 minutes rather that over 1 hour

foreach (var product in productsToSearch)
{
    var cumulativeCount = 0;
    productStore.Add(product);
    var orderLinesList = totalOrderLines
        .Join(productStore, myRows => myRows.Sku, p => p, (myRows, p) => myRows)
        .Select(myRows => new OrderLineStruct
        {
            OrderId = myRows.OrderId,
            Sku = myRows.Sku
        });
    totalOrderLines = totalOrderLines.Except(orderLinesList).ToList();
    cumulativeCount = totalOrderLinesCount - totalOrderLines.Select(x => x.OrderId).Distinct().Count();
    cumulativeStoreTable.Rows.Add(product, cumulativeCount);
}

Having a .ToList() on the Except seems to make a difference and now I am removing the already processed orders after each iteration which is increasing performance for each iteration.


Solution

  • You are seeking the problem at the wrong place.

    The orderLinesList, differences and differences.Select(x => x.OrderId).Distinct() are just LINQ to Objects chained query methods with deferred execution, and the Count() method is executing them all.

    Your processing algorithm is highly inefficient. The bottleneck is the orderLinesList query which iterates the whole totalOrderLines list for each product, and this is chained (included) in the Except, Distinct etc. - again, inside the loop, i.e 7000+ times.

    Here is a sample efficient algorithm that IMO does the same:

    Console.WriteLine("Time elapsed: {0} start", stopwatch.Elapsed);
    var productInfo =
    (
        from product in productsToSearch
        join line in totalOrderLines on product equals line.Sku into orderLines
        select new { Product = product, OrderLines = orderLines }
    ).ToList();
    var lastIndexByOrderId = new Dictionary<string, int>();
    for (int i = 0; i < productInfo.Count; i++)
    {
        foreach (var line in productInfo[i].OrderLines)
            lastIndexByOrderId[line.OrderId] = i; // Last wins
    }
    int cumulativeCount = 0;
    for (int i = 0; i < productInfo.Count; i++)
    {
        var product = productInfo[i].Product;
        foreach (var line in productInfo[i].OrderLines)
        {
            int lastIndex;
            if (lastIndexByOrderId.TryGetValue(line.OrderId, out lastIndex) && lastIndex == i)
            {
                cumulativeCount++;
                lastIndexByOrderId.Remove(line.OrderId);
            }
        }
        cumulativeStoreTable.Rows.Add(item.Product, cumulativeCount);
        // Remove the next if it was just to support your processing
        productStore.Add(item.Product);
    }
    Console.WriteLine("Time elapsed: {0} end", stopwatch.Elapsed);