Search code examples
sqlgoogle-bigquerydata-manipulationdata-cleaninggaps-and-islands

Removing consecutive duplicate rows bigquery


I have some event data that are grouped into sessions. In some cases there are consecutive duplicate events within those sessions which I'd like to remove. Is there a way to do this in bigquery?

E.g.

Initial table:

session   eventType   eventOrder
1         a           1
1         a           2     
1         b           3
2         b           1
2         a           2
2         b           3     

Desired result:

session   eventType   eventOrder
1         a           1
1         b           3
2         b           1
2         a           2
2         b           3    

Solution

  • You could use lag(). The idea is to compare the current event type to the previous value for the same session, and eliminate records where they match:

    select session, eventType, eventOrder
    from (
        select 
            t.*,
            lag(eventType) over(partition by session order by eventOrder) lag_eventType
        from mytable t
    ) t
    where lag_eventType is null or eventType != lag_eventType