Search code examples
sqlprestorankingdense-rankranking-functions

How to group into batches after assigning rank


I have a table that I'm first trying to group based on unique column values (using dense_rank) and then further group those items into batches of 5. Below is my table:

video_id frame_id verb
video_a frame_1 walk
video_a frame_2 run
video_a frame_3 sit
video_a frame_4 walk
video_a frame_5 walk
video_a frame_6 walk
video_b frame_7 stand
video_b frame_8 stand
video_b frame_9 run
video_b frame_10 run
video_b frame_11 sit
video_b frame_12 run
video_b frame_13 run

And below is what I'm trying to get:

video_id frame_id verb batch_of_five
video_a frame_1 walk 1
video_a frame_2 run 1
video_a frame_3 sit 1
video_a frame_4 walk 1
video_a frame_5 walk 1
video_a frame_6 walk 2
video_b frame_7 stand 3
video_b frame_8 stand 3
video_b frame_9 run 3
video_b frame_10 run 3
video_b frame_11 sit 3
video_b frame_12 run 4
video_b frame_13 run 4

Where each video_id has a unique rank and each batch of 10 within each ranked video_id has its own unique rank (and each batch of 10 overall has a unique id regardless of whether they belong to the same video_id or not).

I'm able to group based on the video_id column but am having trouble grouping those items further so that they are both in batches of 10 and unique across all video_ids. I thought about using a group by clause but I'm trying to keep the other columns intact as well (verb column).

Here is my presto query so far:

SELECT
    *
FROM (
    SELECT
        *,
        -- Give each unique video_id a unique rank
        DENSE_RANK() OVER (ORDER BY video_id) AS video_batch
    FROM videos
)


Solution

  • Calculate frame rank (partition by video_id), divide by 6 (integer division) to get batch number in video_id partition. Then rank again to get absolute batch number:

    with sample_data as(        
    select 'video_a' as video_id, 'frame_1' as frame_id , 'walk' as verb union all 
    select 'video_a', 'frame_2' , 'run'   union all
    select 'video_a', 'frame_3' , 'sit'   union all
    select 'video_a', 'frame_4' , 'walk'  union all
    select 'video_a', 'frame_5' , 'walk'  union all
    select 'video_a', 'frame_6' , 'walk'  union all
    select 'video_b', 'frame_7' , 'stand' union all
    select 'video_b', 'frame_8' , 'stand' union all
    select 'video_b', 'frame_9' , 'run'   union all
    select 'video_b', 'frame_10', 'run'   union all
    select 'video_b', 'frame_11', 'sit'   union all
    select 'video_b', 'frame_12', 'run'   union all
    select 'video_b', 'frame_13', 'run'
    )
    
    select s.*, 
           dense_rank() over(order by video_id, rnk_frame / 6) batch_of_five
    from
    (
    select video_id, frame_id, verb, 
           CAST(regexp_extract(frame_id,'_(\d*)$',1) AS INT) frame_number,
           dense_rank() over(partition by video_id order by CAST(regexp_extract(frame_id,'_(\d*)$',1) AS INT)) rnk_frame
      from sample_data
    )s
    order by video_id, frame_number;
    

    Result:

    video_id    frame_id    verb    frame_number    rnk_frame   batch_of_five
    video_a     frame_1     walk    1                1           1
    video_a     frame_2     run     2                2           1
    video_a     frame_3     sit     3                3           1
    video_a     frame_4     walk    4                4           1
    video_a     frame_5     walk    5                5           1
    video_a     frame_6     walk    6                6           2
    video_b     frame_7     stand   7                1           3
    video_b     frame_8     stand   8                2           3
    video_b     frame_9     run     9                3           3
    video_b     frame_10    run     10               4           3
    video_b     frame_11    sit     11               5           3
    video_b     frame_12    run     12               6           4
    video_b     frame_13    run     13               7           4
    

    I extracted frame_number to sort as integer, not as string, to get the same sort order as in your question (some sorting column is absolutely necessary), if you already have rank as you mentioned in question, you can use it instead.