Search code examples
sqlprestorow-number

row_number() expression question in SQL Presto


Sample table:

object_id event_time event_type event_subtype stage
1 2022-10-01 create name, stage A
1 2022-10-02 update stage B
1 2022-10-03 update stage C
1 2022-10-04 update stage A
2 2022-10-01 create name, stage A
2 2022-10-02 update stage C
2 2022-10-03 update stage A
2 2022-10-04 update stage B
2 2022-10-05 update stage C
2 2022-10-06 update stage A

So what I need is a column that numbers the rows based on the stage - after an object_id reaches stage C, the row number of the same object_id should be incremented. It'd look like this:

object_id event_time event_type event_subtype stage row_number
1 2022-10-01 create name, stage A 1
1 2022-10-02 update stage B 1
1 2022-10-03 update stage C 1
1 2022-10-04 update stage A 2
2 2022-10-01 create name, stage A 1
2 2022-10-02 update stage C 1
2 2022-10-03 update stage A 2
2 2022-10-04 update stage B 2
2 2022-10-05 update stage C 2
2 2022-10-06 update stage A 3

The table must be ordered by object_id, event_time. I'm having troublle writing the window function that does this, this is what I've tried:

row_number() over (partition by object_id, stage order by event_time)

It just doesn't work for all cases. Also I'm having a hard time understanding how this would work when I'm not defining stage = C as the delimiter anywhere. Any ideas?

Thanks!


Solution

  • I would suggest using rolling sum based on previous value of stage:

    -- sample data
    with dataset(object_id, event_time, event_type, event_subtype, stage) as (
        values    (1, '2022-10-01', 'create',   'name, stage', 'A'),
        (1, '2022-10-02', 'update', 'stage', 'B'),
        (1, '2022-10-03', 'update', 'stage', 'C'),
        (1, '2022-10-04', 'update', 'stage', 'A'),
        (2, '2022-10-01', 'create', 'name, stage',' A'),
        (2, '2022-10-02', 'update', 'stage', 'C'),
        (2, '2022-10-03', 'update', 'stage', 'A'),
        (2, '2022-10-04', 'update', 'stage', 'B'),
        (2, '2022-10-05', 'update', 'stage', 'C'),
        (2, '2022-10-06', 'update', 'stage', 'A')
    )
    
    -- query
    select object_id,
           event_time,
           event_type,
           event_subtype,
           stage,
           1 + sum(counter) over (partition by object_id order by event_time) as num
    from (select *,
                 if(lag(stage) over (partition by object_id order by event_time) = 'C', 1, 0) counter
          from dataset);
    

    Output:

    object_id event_time event_type event_subtype stage num
    1 2022-10-01 create name, stage A 1
    1 2022-10-02 update stage B 1
    1 2022-10-03 update stage C 1
    1 2022-10-04 update stage A 2
    2 2022-10-01 create name, stage A 1
    2 2022-10-02 update stage C 1
    2 2022-10-03 update stage A 2
    2 2022-10-04 update stage B 2
    2 2022-10-05 update stage C 2
    2 2022-10-06 update stage A 3