I have a table that I'm first trying to group based on unique column values (using dense_rank
) and then further group those items into batches of 5. Below is my table:
video_id | frame_id | verb |
---|---|---|
video_a | frame_1 | walk |
video_a | frame_2 | run |
video_a | frame_3 | sit |
video_a | frame_4 | walk |
video_a | frame_5 | walk |
video_a | frame_6 | walk |
video_b | frame_7 | stand |
video_b | frame_8 | stand |
video_b | frame_9 | run |
video_b | frame_10 | run |
video_b | frame_11 | sit |
video_b | frame_12 | run |
video_b | frame_13 | run |
And below is what I'm trying to get:
video_id | frame_id | verb | batch_of_five |
---|---|---|---|
video_a | frame_1 | walk | 1 |
video_a | frame_2 | run | 1 |
video_a | frame_3 | sit | 1 |
video_a | frame_4 | walk | 1 |
video_a | frame_5 | walk | 1 |
video_a | frame_6 | walk | 2 |
video_b | frame_7 | stand | 3 |
video_b | frame_8 | stand | 3 |
video_b | frame_9 | run | 3 |
video_b | frame_10 | run | 3 |
video_b | frame_11 | sit | 3 |
video_b | frame_12 | run | 4 |
video_b | frame_13 | run | 4 |
Where each video_id
has a unique rank and each batch of 10 within each ranked video_id
has its own unique rank (and each batch of 10 overall has a unique id regardless of whether they belong to the same video_id
or not).
I'm able to group based on the video_id
column but am having trouble grouping those items further so that they are both in batches of 10 and unique across all video_ids
. I thought about using a group by clause but I'm trying to keep the other columns intact as well (verb
column).
Here is my presto query so far:
SELECT
*
FROM (
SELECT
*,
-- Give each unique video_id a unique rank
DENSE_RANK() OVER (ORDER BY video_id) AS video_batch
FROM videos
)
Calculate frame rank (partition by video_id), divide by 6 (integer division) to get batch number in video_id partition. Then rank again to get absolute batch number:
with sample_data as(
select 'video_a' as video_id, 'frame_1' as frame_id , 'walk' as verb union all
select 'video_a', 'frame_2' , 'run' union all
select 'video_a', 'frame_3' , 'sit' union all
select 'video_a', 'frame_4' , 'walk' union all
select 'video_a', 'frame_5' , 'walk' union all
select 'video_a', 'frame_6' , 'walk' union all
select 'video_b', 'frame_7' , 'stand' union all
select 'video_b', 'frame_8' , 'stand' union all
select 'video_b', 'frame_9' , 'run' union all
select 'video_b', 'frame_10', 'run' union all
select 'video_b', 'frame_11', 'sit' union all
select 'video_b', 'frame_12', 'run' union all
select 'video_b', 'frame_13', 'run'
)
select s.*,
dense_rank() over(order by video_id, rnk_frame / 6) batch_of_five
from
(
select video_id, frame_id, verb,
CAST(regexp_extract(frame_id,'_(\d*)$',1) AS INT) frame_number,
dense_rank() over(partition by video_id order by CAST(regexp_extract(frame_id,'_(\d*)$',1) AS INT)) rnk_frame
from sample_data
)s
order by video_id, frame_number;
Result:
video_id frame_id verb frame_number rnk_frame batch_of_five
video_a frame_1 walk 1 1 1
video_a frame_2 run 2 2 1
video_a frame_3 sit 3 3 1
video_a frame_4 walk 4 4 1
video_a frame_5 walk 5 5 1
video_a frame_6 walk 6 6 2
video_b frame_7 stand 7 1 3
video_b frame_8 stand 8 2 3
video_b frame_9 run 9 3 3
video_b frame_10 run 10 4 3
video_b frame_11 sit 11 5 3
video_b frame_12 run 12 6 4
video_b frame_13 run 13 7 4
I extracted frame_number to sort as integer, not as string, to get the same sort order as in your question (some sorting column is absolutely necessary), if you already have rank as you mentioned in question, you can use it instead.