Search code examples
duplicatesgroup-concatlistaggsnowflake-cloud-data-platform

listagg: remove adjacent duplicates


I have timestamped data and want to create a list from one column, with adjacent duplicates (but not all duplicates) collapsed into one.

For example, given the following data:

'2001-01-01 00:00:01' 'a'
'2001-01-01 00:00:02' 'a'
'2001-01-01 00:00:03' 'b'
'2001-01-01 00:00:04' 'b'
'2001-01-01 00:00:05' 'b'
'2001-01-01 00:00:06' 'a'
'2001-01-01 00:00:07' 'a'
'2001-01-01 00:00:08' 'c'
'2001-01-01 00:00:09' 'a'

— I would want the result to be 'a','b','a','c','a'.

I am using Snowflake, which has listagg(distinct foo) and listagg(distinct foo) within group(order by bar) and even listagg(distinct foo) within group(order by bar) over(partition by baz), but I don't see a way to do what I need (nor is Google helping). I would really, really like to avoid a join.

If you know a solution in another dialect that has listagg or group_concat, please post it, and I'll attempt to translate it to Snowflake for my use. Many thanks.


Things that don't work:

  • I tried trim(regexp_replace('~' || listagg(foo, '~') || '~', '~([^~]+~)\\1', '~\\1'), '~'), but Snowflake doesn't allow \1 in the match pattern: I get the error Invalid regular expression: '~([^~]+~)\1', invalid escape sequence: \1.
  • I tried listagg(iff(lag(foo) ignore nulls over(partition by baz order by bar)=foo, null, foo), ',') within group(order by bar) over(partition by baz) but got the error Window function [LAG(...)] may not be nested inside another window function.

Solution

  • I don't think Snowflake supports backreferences in regexp patterns, unfortunately.

    Possible solutions:

    • eliminate duplicates in an input stream using LAG, e.g.

      with sub as (select foo, bar, lag(bar) over (order by foo) barlag)
      select listagg(foo) within group order by (bar) from foo 
      where barlag is null or barlag <> lag;
      
    • use LISTAGG, but write a JavaScript UDF that splits a result of LISTAGG and eliminate duplicates there

    • write a JavaScript UDTF (table function) that performs LISTAGG with duplicate elimination