Search code examples
amazon-web-servicesamazon-redshiftrow-number

Exclude duplicates records that follows in Redshift


I have a simple SQL question that I couldn't solve (I'm using Amazon Redshift).

Let's say I have the following example:

id,  type,  channel, date, column1, column2, column3, column4
1,   visit, seo,  07/08/2017: 11:11:22
1,   hit, seo,  07/08/2017: 11:12:34
1,   hit, seo,  07/08/2017: 11:13:22
1,   visit, sem,   07/08/2017: 11:15:11
1,   scarf, display,   07/08/2017: 11:15:45
1,   hit, display,   07/08/2017: 11:15:37
1,   hit, seo,  07/08/2017: 11:18:22
1,   hit, display  07/08/2017: 11:18:23
1,   hit, referal  07/08/2017: 11:19:55

I'd like to select all visits (which in my logic table corresponds to the beginning of each row related to a specific ID, and also exclude 'channel' duplicates that comes ones after each other, my example should return :

1,   visit, seo,  07/08/2017: 11:11:22
**1,   hit, seo,  07/08/2017: 11:12:34** (exclude because it follows seo and it's not a visit)
**1,   hit, seo,  07/08/2017: 11:13:22** (exclude because it follows seo and it's not a visit)
1,   visit, sem,   07/08/2017: 11:15:11 (include, new channel)
1,   scarf, display,   07/08/2017: 11:15:45 (include, new channel)
**1,   hit, display,   07/08/2017: 11:15:37** (exclude because it follows display and it's not a visit)
1,   hit, seo,  07/08/2017: 11:18:22 (include because it doesn't follow seo directly, even if seo is already present) 
1,   hit, display  07/08/2017: 11:18:23 ((include because it doesn't follow display directly, even if display is already present) 
1,   hit, referal  07/08/2017: 11:19:55 (include, new channel)

I've tried using row number (as I'm working with Redshift) :

select type, date, id, ROW_NUMBER() OVER (PARTITION BY id, channel ORDER BY date) as rn

and then add a filter:

Where type='visit' or rn=1

But this doesn't solve the problem as it will not return the 7th and 8th rows:

1, hit, seo, 07/08/2017: 11:18:22 (will be rn=4 for 'id=1, channel=seo' combination)
1, hit, display 07/08/2017: 11:18:23 (will be rn=3 for 'id=1, channel=display' combination)

Could anyone give me please an indication so I could solve the problem?


Solution

  • You can use lag to only select rows where the previous channel is different or the type is a visit

    select * from (
        select * , 
            lag(channel) over (partition by id, order by date) prev_channel
        from mytable
    ) t where prev_channel <> channel or type = 'visit' or prev_channel is null