Search code examples
sqlsql-serverpartitioningcumulative-sumsql-server-2022

How to partition into batch of records with max batch size limit


I want to batch the students into maximum batch size of 100, partitioned by School. Each teacher can have 12 students. Students of any teacher should not be split into different batches. No. of students in a batch can be less than 100 but not more.

Fiddle URL

With the query used in the fiddle, for the second batch the number students are more than 100 (RowNum 17-32). How can I ensure the students are not split into different batches and there are no more than 100 students in a batch.


Solution

  • With the fiddle (and a better explanation) available, it becomes clear that what you have is a (somewhat) simplified version of the Knapsack Problem.

    As such, it cannot be solved without recursion, unfortunately. This means that, on larger datasets, you might run into resource exhaustion issues.

    The following query builds upon your fiddle:

    declare @BatchSize int = 100;
    
    with src as (
        -- Source query
        select t.School, t.TeacherName,
            count(*) as [GroupSize],
            dense_rank() over(partition by t.School order by t.TeacherName) as [RN]
        from #mytable t
        group by t.School, t.TeacherName
    ),
    cte as (
        -- Anchor part - start of the first batch for each school
        select s.School, s.TeacherName, s.GroupSize, s.RN,
            s.GroupSize as [BatchTotal], 1 as [IsBatchStart]
        from src s
        where s.RN = 1
        union all
        -- Recursive part - deciding if current batch can be continued or not
        select s.School, s.TeacherName, s.GroupSize, s.RN,
            case
                when c.BatchTotal + s.GroupSize > @BatchSize then s.GroupSize
                else c.BatchTotal + s.GroupSize
            end as [BatchTotal],
            case
                when c.BatchTotal + s.GroupSize > @BatchSize then 1
                else 0
            end as [IsBatchStart]
        from src s
            inner join cte c on c.School = s.School
                and s.RN = c.RN + 1
    )
    select c.*,
        sum(c.IsBatchStart) over(partition by c.School order by c.RN) as [BatchNumber]
    from cte c
    order by c.School, c.RN
    -- Infinite iteration - might be needed on large datasets
    -- option (maxrecursion 0)
    

    Also note that dense_rank() is a rather expensive function, performance-wise. On a larger scale I'd suggest to cache the src subquery into a temporary table with suitable indices, and then use something more lightweight, like row_number(), to number the rows.