Search code examples
c#linqmorelinq

Looping through a List<T> in batches while ensuring items per batch are unique


Context: I have an application that allows the user to process all the mailed-in payments received that day. Sometimes an envelope may include multiple checks for the same account (think two roommates each paying their portion of a utility bill).

Restrictions: Process all the payments in batches of 10 but the Account ID must be unique per batch.

Very simplified Payment class:

public class Payment
{
    public int AccountId { get; set; }
    // ... other properties not important
}

A hypothetical collection of payments received in the mail today. Notice that the last two AccountId values are acceptable duplicates:

List<Payment> payments = new List<Payment>()
{
    new Payment() {AccountId = 1 },
    new Payment() {AccountId = 2 },
    new Payment() {AccountId = 3 },
    new Payment() {AccountId = 4 },
    new Payment() {AccountId = 5 },
    new Payment() {AccountId = 1 }, // Duplicate Account
    new Payment() {AccountId = 2 }  // Duplicate Account

    // likely hundreds more unique accounts, possibly even some more duplicates...
};

I'm using MoreLinq to try to select distinct accounts per batch but this code below is clearly not going to work. I feel like I'm close but have been unable to find a working solution. Again, the goal is to split all the payments into batches of N without duplicating the AccountId in that batch. Duplicate AccountIds must be spread across other batches so they don't cause a race condition when trying to update the customer's balance.

Edited code comments for clarity.

int batchSize = 10;
var paymentTasks = new List<Task>(batchSize);

// This linq expression is the heart of my question: How to divide the payments
// into batches while ensuring uniqueness of a particular key(s). This expression
// is close, but the DistinctBy() is obviously excluding the duplicates that
// I just intend to be distinct for that Batch(batchSize).
foreach (IEnumerable<Payment> batchOfPayments in payments.DistinctBy(a => a.AccountId).Batch(batchSize))
{
    // The rest of this method is for context only

    paymentTasks.Clear();

    foreach (Payment payment in batchOfPayments)
    {
        // Async method implementation not important
        Task paymentTask = ProcessPaymentAsync(payment);
        paymentTasks.Add(paymentTask);
    }

    // Await all the tasks in this batch to complete before starting the next batch
    await Task.WhenAll(paymentTasks);
}

Thank you for your time and for looking at my question.


Solution

  • If I completely understand the problem, then there are many ways to do this and the best solution would depend on your actual needs.

    The assumptions are :

    1. What you have described is an in memory approach
    2. It doesn't need to hit a database
    3. It doesn't need to be producer consumer.

    Then a very simple (yet efficient) batch and queue pattern can be used with minimal allocations.

    Given

    public class Payment
    {
       public int AccountId { get; set; }
       public Payment(int accountId) => AccountId = accountId;
    }
    

    And

    public static IEnumerable<Payment[]> GetBatches(IEnumerable<Payment> source, int count)
    {
       var hashset = new HashSet<int>(count);
       var batch = new List<Payment>(count);
       var leftOvers = new Queue<Payment>();
    
       while (true)
       {
          foreach (var item in source)
          {
             // check if batched
             if (hashset.Add(item.AccountId))
                batch.Add(item); // add to batch
             else
                leftOvers.Enqueue(item); // add to left overs
    
             // if we are at the batch size start a loop
             while (batch.Count == count)
             {
                yield return batch.ToArray(); // return the batch
    
                batch.Clear();
                hashset.Clear();
    
                // check the left overs
                while (leftOvers.Any() && batch.Count != count)
                   if (hashset.Add(leftOvers.Peek().AccountId)) // check the batch
                      batch.Add(leftOvers.Dequeue());
                   else break; // we still have a duplicate bail
             }
          }
    
          if(batch.Any()) yield return batch.ToArray();
    
          if (!leftOvers.Any()) break;
    
          source = leftOvers.ToList(); // allocation  :(
          hashset.Clear();
          batch.Clear();
          leftOvers.Clear();
       }
    }
    

    Note : This is fairly resource efficient, though it does probably have an extra unnecessary small allocation when dealing with pure leftovers, I am sure this could be removed, though I'll leave that up to you. There are also many efficiencies you could add with the use of a channel could easily be turned into a consumer

    Test

    var list = new List<Payment>() {new(1), new(2), new(3), new(4), new(4), new(5), new(6), new(4), new(4), new(6), new(4)};
    
    var batches = GetBatches(list, 3);
    
    foreach (var batch in batches)
       Console.WriteLine(string.Join(", ",batch.Select(x => x.AccountId)));
    

    Output

    1, 2, 3
    4, 5, 6
    4, 6
    4
    4
    4
    

    Full demo here to Play with