Search code examples
sqlamazon-redshiftwindow-functions

Redshift: Find preceding rows that satisfy condition to constitute a sequence


I've been trying to find a solution to the following Redshift riddle for a week now (think I am becoming obsessed with it):

There is a table of events in Redshift ("event_user_item"), with users triggering events for certain items by entering item's code that appears in event_value column.

Failed submission consists of event_type sequence PageLoad-ItemCode-ErrorResponse, but such event types are not necessarily consecutive, meaning there can be a number of other event types between each of them per user_id.

I am posting a small excerpt based on 3 different user_ids that should illustrate relevant scenarios focusing on Failed submissions.

ord_num event_type          event_value     user_id     event_datetime
1       PageLoad                            124         03/09/2018 21:48:39
2       ItemCode            LG56731         124         03/09/2018 21:48:53
4       Details1PageLoad                    124         03/09/2018 21:48:56
8       PageLoad                            124         03/09/2018 22:02:23
9       ItemCode            GU07019         124         03/09/2018 22:02:32
10      ErrorResponse       Some message    124         03/09/2018 22:02:32
51      PageLoad                            228         04/09/2018 12:38:30
52      ItemCode            EQ23487         228         04/09/2018 12:38:33
53      ErrorResponse       Some message    228         04/09/2018 12:38:34
54      PageLoad                            304         04/09/2018 15:43:14
55      ItemCode            OB68102         304         04/09/2018 15:43:57
56      ErrorResponse       Some message    304         04/09/2018 15:43:58
57      ItemCode            PB68102         304         04/09/2018 15:44:21
58      ErrorResponse       Some message    304         04/09/2018 15:44:22
59      PageLoad                            304         05/09/2018 11:19:37
60      ItemCode            OB68102         304         05/09/2018 11:20:17
62      Details1PageLoad                    304         05/09/2018 11:20:20

THE OBJECTIVE: find the number of Failed submissions per user_id per ItemCode. What is important is not to mix-up item codes from Failed submissions and Successful submissions. Also, there might be multiple Failure entries of the same item code as well.

I am not an expert in Redshift, especially with its window-functions, but the first idea I tried to stick to was a LAG function. In order to do that, I intended to identify sequences of ord_nums that would qualify to be counted, such as

ord_num event_type          event_value     user_id event_datetime           error?     sequence
1       PageLoad                            124     03/09/2018 21:48:39     
2       ItemCode            LG56731         124     03/09/2018 21:48:53     
4       Details1PageLoad                    124     03/09/2018 21:48:56     
8       PageLoad                            124     03/09/2018 22:02:23     
9       ItemCode            GU07019         124     03/09/2018 22:02:32     
10      ErrorResponse       Some message    124     03/09/2018 22:02:32     1       8-9-10
51      PageLoad                            228     04/09/2018 12:38:30     
52      ItemCode            EQ23487         228     04/09/2018 12:38:33     
53      ErrorResponse       Some message    228     04/09/2018 12:38:34     1       51-52-53
54      PageLoad                            304     04/09/2018 15:43:14     
55      ItemCode            OB68102         304     04/09/2018 15:43:57     
56      ErrorResponse       Some message    304     04/09/2018 15:43:58     1       54-55-56
57      ItemCode            PB68102         304     04/09/2018 15:44:21     
58      ErrorResponse       Some message    304     04/09/2018 15:44:22     1       54-57-58
59      PageLoad                            304     05/09/2018 11:19:37     
60      ItemCode            OB68102         304     05/09/2018 11:20:17     
62      Details1PageLoad                    304     05/09/2018 11:20:20     

So by user_id there should be following counts:

user_id     nr_failed_submissions   
124         1   
228         1   
304         2

However, as it is visible from the above data set and the expected outcome, it is not predictable how many records to move backwards, I need an additional condition that can't be put inside a LAG...

I've tried many options, but none of them fits.

Very useful and insightful posts have been

but until now, I haven't managed to fusion them all into solution that would work. There must be a way to do this in Redshift?


Solution

  • Following approach and queries, based on Jason Rosendale's Answer 1, work for me as they are supposed to:

    create temporary table items_per_pageload as 
    with timeranges as (
      select A.user_id
        ,A.event_datetime as time1
        ,nvl(max(B.event_datetime), '2099-01-01') as time2
        ,LEAD(A.event_datetime,1) over (partition by A.user_id order by A.event_datetime) as next_load_time 
      from event_user_item as A 
      left join event_user_item as B on A.user_id=B.user_id and A.event_datetime < B.event_datetime and A.event_type=B.event_type
      where A.event_type='PageLoad' 
      group by A.user_id, A.event_datetime
      )
    select timeranges.time1 as pageloadtime, event_user_item.* 
    from event_user_item left join timeranges on event_user_item.event_datetime>=timeranges.time1 and event_user_item.event_datetime<nvl(timeranges.next_load_time,timeranges.time2) 
    where event_user_item.event_type='ItemCode';
    
    create temporary table pageloads_with_errors as 
    with timeranges as (
      select A.user_id
        ,A.event_datetime as time1
        ,nvl(max(B.event_datetime), '2099-01-01') as time2
        ,LEAD(A.event_datetime,1) over (partition by A.user_id order by A.event_datetime) as next_load_time 
      from event_user_item as A left join event_user_item as B on A.user_id=B.user_id and A.event_datetime < B.event_datetime and A.event_type=B.event_type
      where A.event_type='PageLoad' 
      group by A.user_id, A.event_datetime
      )
    select timeranges.time1 as pageloadtime,timeranges.user_id,bool_or(event_user_item.event_type='ErrorResponse') as has_error 
    from timeranges 
    left join event_user_item on event_datetime > time1 and event_datetime < nvl(next_load_time,time2)
    group by timeranges.time1,timeranges.user_id 
    having has_error;
    
    /* final counts */
    select count(1), user_id, event_value from (
        select items_per_pageload.* 
        from items_per_pageload 
        join pageloads_with_errors on items_per_pageload.user_id = pageloads_with_errors.user_id and items_per_pageload.pageloadtime = pageloads_with_errors.pageloadtime 
    ) 
    group by user_id, event_value;