I've been trying to find a solution to the following Redshift riddle for a week now (I think I'm becoming obsessed with it):
There is a table of events in Redshift ("event_user_item") in which users trigger events for certain items by entering an item's code, which appears in the event_value column.
A Failed submission consists of the event_type sequence PageLoad-ItemCode-ErrorResponse, but these event types are not necessarily consecutive: any number of other event types can occur between each of them for a given user_id.
Below is a small excerpt covering 3 different user_ids that should illustrate the relevant scenarios, focusing on Failed submissions.
ord_num  event_type        event_value   user_id  event_datetime
1        PageLoad                        124      03/09/2018 21:48:39
2        ItemCode          LG56731       124      03/09/2018 21:48:53
4        Details1PageLoad                124      03/09/2018 21:48:56
8        PageLoad                        124      03/09/2018 22:02:23
9        ItemCode          GU07019       124      03/09/2018 22:02:32
10       ErrorResponse     Some message  124      03/09/2018 22:02:32
51       PageLoad                        228      04/09/2018 12:38:30
52       ItemCode          EQ23487       228      04/09/2018 12:38:33
53       ErrorResponse     Some message  228      04/09/2018 12:38:34
54       PageLoad                        304      04/09/2018 15:43:14
55       ItemCode          OB68102       304      04/09/2018 15:43:57
56       ErrorResponse     Some message  304      04/09/2018 15:43:58
57       ItemCode          PB68102       304      04/09/2018 15:44:21
58       ErrorResponse     Some message  304      04/09/2018 15:44:22
59       PageLoad                        304      05/09/2018 11:19:37
60       ItemCode          OB68102       304      05/09/2018 11:20:17
62       Details1PageLoad                304      05/09/2018 11:20:20
THE OBJECTIVE: find the number of Failed submissions per user_id per ItemCode. It is important not to mix up item codes from Failed submissions with those from Successful submissions. Also, there may be multiple Failed entries for the same item code.
I am not an expert in Redshift, especially when it comes to its window functions, but the first idea I tried was the LAG function. To use it, I intended to identify the sequences of ord_nums that should be counted, such as:
ord_num  event_type        event_value   user_id  event_datetime       error?  sequence
1        PageLoad                        124      03/09/2018 21:48:39
2        ItemCode          LG56731       124      03/09/2018 21:48:53
4        Details1PageLoad                124      03/09/2018 21:48:56
8        PageLoad                        124      03/09/2018 22:02:23
9        ItemCode          GU07019       124      03/09/2018 22:02:32
10       ErrorResponse     Some message  124      03/09/2018 22:02:32  1       8-9-10
51       PageLoad                        228      04/09/2018 12:38:30
52       ItemCode          EQ23487       228      04/09/2018 12:38:33
53       ErrorResponse     Some message  228      04/09/2018 12:38:34  1       51-52-53
54       PageLoad                        304      04/09/2018 15:43:14
55       ItemCode          OB68102       304      04/09/2018 15:43:57
56       ErrorResponse     Some message  304      04/09/2018 15:43:58  1       54-55-56
57       ItemCode          PB68102       304      04/09/2018 15:44:21
58       ErrorResponse     Some message  304      04/09/2018 15:44:22  1       54-57-58
59       PageLoad                        304      05/09/2018 11:19:37
60       ItemCode          OB68102       304      05/09/2018 11:20:17
62       Details1PageLoad                304      05/09/2018 11:20:20
So, per user_id, the counts should be:
user_id  nr_failed_submissions
124      1
228      1
304      2
However, as the data set and the expected outcome above show, it is not predictable how many records to move backwards; I would need an additional condition that can't be put inside a LAG...
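One way around the variable LAG offset is to avoid stepping backwards a fixed number of rows at all: a running MAX(CASE ...) window can carry the ord_num of the most recent ItemCode forward to every later row, so each ErrorResponse can be joined back to the ItemCode it belongs to. The sketch below runs the idea against the sample data; SQLite stands in for Redshift purely so it is runnable here, and the window frame syntax used is the same in Redshift:

```python
import sqlite3

# In-memory copy of the sample data from the question
# (ord_num, event_type, event_value, user_id).
rows = [
    (1,  'PageLoad',         None,           124),
    (2,  'ItemCode',         'LG56731',      124),
    (4,  'Details1PageLoad', None,           124),
    (8,  'PageLoad',         None,           124),
    (9,  'ItemCode',         'GU07019',      124),
    (10, 'ErrorResponse',    'Some message', 124),
    (51, 'PageLoad',         None,           228),
    (52, 'ItemCode',         'EQ23487',      228),
    (53, 'ErrorResponse',    'Some message', 228),
    (54, 'PageLoad',         None,           304),
    (55, 'ItemCode',         'OB68102',      304),
    (56, 'ErrorResponse',    'Some message', 304),
    (57, 'ItemCode',         'PB68102',      304),
    (58, 'ErrorResponse',    'Some message', 304),
    (59, 'PageLoad',         None,           304),
    (60, 'ItemCode',         'OB68102',      304),
    (62, 'Details1PageLoad', None,           304),
]

con = sqlite3.connect(':memory:')
con.execute("""create table event_user_item
               (ord_num int, event_type text, event_value text, user_id int)""")
con.executemany('insert into event_user_item values (?,?,?,?)', rows)

# The running MAX(CASE ...) carries the ord_num of the most recent
# ItemCode forward, so every ErrorResponse row knows which ItemCode
# preceded it -- no fixed LAG offset needed.
failed = con.execute("""
with last_item as (
    select user_id, ord_num, event_type,
           max(case when event_type = 'ItemCode' then ord_num end)
             over (partition by user_id order by ord_num
                   rows between unbounded preceding and current row) as item_ord
    from event_user_item
)
select e.user_id, i.event_value, count(*) as nr_failed
from last_item e
join event_user_item i
  on i.user_id = e.user_id and i.ord_num = e.item_ord
where e.event_type = 'ErrorResponse'
group by e.user_id, i.event_value
order by e.user_id, i.event_value
""").fetchall()

print(failed)
# -> [(124, 'GU07019', 1), (228, 'EQ23487', 1), (304, 'OB68102', 1), (304, 'PB68102', 1)]
```

Summed per user_id this gives 124 -> 1, 228 -> 1, 304 -> 2, matching the expected counts. Note this variant pairs each ErrorResponse with the last preceding ItemCode and ignores the PageLoad, which is sufficient for the sample data but is an assumption about the general event stream.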
I've tried many options, but none of them fits.
Very useful and insightful posts have been
but so far I haven't managed to fuse them all into a solution that works. There must be a way to do this in Redshift?
The following approach and queries, based on Jason Rosendale's Answer 1, work for me as intended:
create temporary table items_per_pageload as
with timeranges as (
    -- one row per PageLoad: its time, the next PageLoad time for the
    -- same user, and a far-future fallback when there is none
    select A.user_id,
           A.event_datetime as time1,
           nvl(max(B.event_datetime), '2099-01-01') as time2,
           lead(A.event_datetime, 1) over (partition by A.user_id
                                           order by A.event_datetime) as next_load_time
    from event_user_item as A
    left join event_user_item as B
           on A.user_id = B.user_id
          and A.event_datetime < B.event_datetime
          and A.event_type = B.event_type
    where A.event_type = 'PageLoad'
    group by A.user_id, A.event_datetime
)
-- attach each ItemCode event to the PageLoad whose time range contains it
select timeranges.time1 as pageloadtime, event_user_item.*
from event_user_item
left join timeranges
       on event_user_item.event_datetime >= timeranges.time1
      and event_user_item.event_datetime < nvl(timeranges.next_load_time, timeranges.time2)
where event_user_item.event_type = 'ItemCode';
create temporary table pageloads_with_errors as
with timeranges as (
    -- same per-PageLoad time ranges as above
    select A.user_id,
           A.event_datetime as time1,
           nvl(max(B.event_datetime), '2099-01-01') as time2,
           lead(A.event_datetime, 1) over (partition by A.user_id
                                           order by A.event_datetime) as next_load_time
    from event_user_item as A
    left join event_user_item as B
           on A.user_id = B.user_id
          and A.event_datetime < B.event_datetime
          and A.event_type = B.event_type
    where A.event_type = 'PageLoad'
    group by A.user_id, A.event_datetime
)
-- keep only the PageLoads whose time range contains an ErrorResponse
select timeranges.time1 as pageloadtime,
       timeranges.user_id,
       bool_or(event_user_item.event_type = 'ErrorResponse') as has_error
from timeranges
left join event_user_item
       on event_datetime > time1
      and event_datetime < nvl(next_load_time, time2)
group by timeranges.time1, timeranges.user_id
having has_error;
/* final counts: Failed submissions per user_id per item code */
select count(1) as nr_failed_submissions, user_id, event_value
from (
    select items_per_pageload.*
    from items_per_pageload
    join pageloads_with_errors
      on items_per_pageload.user_id = pageloads_with_errors.user_id
     and items_per_pageload.pageloadtime = pageloads_with_errors.pageloadtime
) as t
group by user_id, event_value;
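For what it's worth, the three steps above can also be collapsed into a single pass: a running MAX(CASE ...) window assigns every event to its most recent PageLoad, which replaces the self-joined timeranges entirely. This is a sketch of that idea, not the answer's exact queries; SQLite is used here only as a runnable stand-in for Redshift (the window frame syntax is the same in both):

```python
import sqlite3

# Same sample data as in the question
# (ord_num, event_type, event_value, user_id).
rows = [
    (1,  'PageLoad',         None,           124),
    (2,  'ItemCode',         'LG56731',      124),
    (4,  'Details1PageLoad', None,           124),
    (8,  'PageLoad',         None,           124),
    (9,  'ItemCode',         'GU07019',      124),
    (10, 'ErrorResponse',    'Some message', 124),
    (51, 'PageLoad',         None,           228),
    (52, 'ItemCode',         'EQ23487',      228),
    (53, 'ErrorResponse',    'Some message', 228),
    (54, 'PageLoad',         None,           304),
    (55, 'ItemCode',         'OB68102',      304),
    (56, 'ErrorResponse',    'Some message', 304),
    (57, 'ItemCode',         'PB68102',      304),
    (58, 'ErrorResponse',    'Some message', 304),
    (59, 'PageLoad',         None,           304),
    (60, 'ItemCode',         'OB68102',      304),
    (62, 'Details1PageLoad', None,           304),
]

con = sqlite3.connect(':memory:')
con.execute("""create table event_user_item
               (ord_num int, event_type text, event_value text, user_id int)""")
con.executemany('insert into event_user_item values (?,?,?,?)', rows)

# Tag every event with the ord_num of its most recent PageLoad; an
# ItemCode counts as failed when an ErrorResponse shares its PageLoad
# group -- the same semantics as the temp-table pipeline above.
counts = con.execute("""
with grouped as (
    select user_id, ord_num, event_type, event_value,
           max(case when event_type = 'PageLoad' then ord_num end)
             over (partition by user_id order by ord_num
                   rows between unbounded preceding and current row) as pageload_ord
    from event_user_item
)
select g.user_id, g.event_value, count(*) as nr_failed
from grouped g
where g.event_type = 'ItemCode'
  and exists (select 1
              from grouped e
              where e.user_id = g.user_id
                and e.pageload_ord = g.pageload_ord
                and e.event_type = 'ErrorResponse')
group by g.user_id, g.event_value
order by g.user_id, g.event_value
""").fetchall()

print(counts)
# -> [(124, 'GU07019', 1), (228, 'EQ23487', 1), (304, 'OB68102', 1), (304, 'PB68102', 1)]
```

Per user_id this again sums to 124 -> 1, 228 -> 1, 304 -> 2. The single window pass also avoids the inequality self-joins of the timeranges CTE, which can get expensive on a large event table.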