Context: I have an application that allows the user to process all the mailed-in payments received that day. Sometimes an envelope may include multiple checks for the same account (think two roommates each paying their portion of a utility bill).
Restrictions: Process all the payments in batches of 10 but the Account ID must be unique per batch.
Very simplified Payment class:
public class Payment
{
public int AccountId { get; set; }
// ... other properties not important
}
A hypothetical collection of payments received in the mail today. Notice that the last two AccountId values are acceptable duplicates:
List<Payment> payments = new List<Payment>()
{
new Payment() {AccountId = 1 },
new Payment() {AccountId = 2 },
new Payment() {AccountId = 3 },
new Payment() {AccountId = 4 },
new Payment() {AccountId = 5 },
new Payment() {AccountId = 1 }, // Duplicate Account
new Payment() {AccountId = 2 } // Duplicate Account
// likely hundreds more unique accounts, possibly even some more duplicates...
};
I'm using MoreLinq to try to select distinct accounts per batch but this code below is clearly not going to work. I feel like I'm close but have been unable to find a working solution. Again, the goal is to split all the payments into batches of N without duplicating the AccountId in that batch. Duplicate AccountIds must be spread across other batches so they don't cause a race condition when trying to update the customer's balance.
Edited code comments for clarity.
int batchSize = 10;
var paymentTasks = new List<Task>(batchSize);
// This linq expression is the heart of my question: How to divide the payments
// into batches while ensuring uniqueness of a particular key(s). This expression
// is close, but the DistinctBy() is obviously excluding the duplicates that
// I just intend to be distinct for that Batch(batchSize).
foreach (IEnumerable<Payment> batchOfPayments in payments.DistinctBy(a => a.AccountId).Batch(batchSize))
{
// The rest of this method is for context only
paymentTasks.Clear();
foreach (Payment payment in batchOfPayments)
{
// Async method implementation not important
Task paymentTask = ProcessPaymentAsync(payment);
paymentTasks.Add(paymentTask);
}
// Await all the tasks in this batch to complete before starting the next batch
await Task.WhenAll(paymentTasks);
}
Thank you for your time and for looking at my question.
If I completely understand the problem, then there are many ways to do this and the best solution would depend on your actual needs.
The assumptions are :
Then a very simple (yet efficient) batch and queue pattern can be used with minimal allocations.
Given
public class Payment
{
public int AccountId { get; set; }
public Payment(int accountId) => AccountId = accountId;
}
And
public static IEnumerable<Payment[]> GetBatches(IEnumerable<Payment> source, int count)
{
var hashset = new HashSet<int>(count);
var batch = new List<Payment>(count);
var leftOvers = new Queue<Payment>();
while (true)
{
foreach (var item in source)
{
// check if batched
if (hashset.Add(item.AccountId))
batch.Add(item); // add to batch
else
leftOvers.Enqueue(item); // add to left overs
// if we are at the batch size start a loop
while (batch.Count == count)
{
yield return batch.ToArray(); // return the batch
batch.Clear();
hashset.Clear();
// check the left overs
while (leftOvers.Any() && batch.Count != count)
if (hashset.Add(leftOvers.Peek().AccountId)) // check the batch
batch.Add(leftOvers.Dequeue());
else break; // we still have a duplicate bail
}
}
if(batch.Any()) yield return batch.ToArray();
if (!leftOvers.Any()) break;
source = leftOvers.ToList(); // allocation :(
hashset.Clear();
batch.Clear();
leftOvers.Clear();
}
}
Note : This is fairly resource efficient, though it does probably have an extra unnecessary small allocation when dealing with pure leftovers, I am sure this could be removed, though I'll leave that up to you. There are also many efficiencies you could add with the use of a channel could easily be turned into a consumer
Test
var list = new List<Payment>() {new(1), new(2), new(3), new(4), new(4), new(5), new(6), new(4), new(4), new(6), new(4)};
var batches = GetBatches(list, 3);
foreach (var batch in batches)
Console.WriteLine(string.Join(", ",batch.Select(x => x.AccountId)));
Output
1, 2, 3
4, 5, 6
4, 6
4
4
4