Search code examples
sqlstringsnowflake-cloud-data-platformwindow-functionspartition

Selecting a row in SQL (Snowflake) based on String


So I have a toughie here I've been wracking my brain on for a while.

Let's say I have a table as follows:

ID      Group                 Timestamp   Data
001         A   2021-04-13 12:51:12.063   content121
001  A-Direct   2021-04-13 12:52:13.063   content121
002  A-Direct   2021-04-13 12:50:14.063   content133
003  B-Direct   2021-04-13 12:55:12.063   content132
003         B   2021-04-13 12:56:11.063   content142
003        BA   2021-04-13 12:57:22.063   content153
004         D   2021-04-13 12:10:23.063   content113
004         C   2021-04-13 12:11:43.063   content144
005         C   2021-04-13 12:12:12.063   content111
005         A   2021-04-13 12:13:23.063   content100
005  D-Direct   2021-04-13 12:15:23.063   content121
006         A   2021-04-13 12:51:12.063   content121
006  B-Direct   2021-04-13 12:52:13.063   content121
007  A-Direct   2021-04-13 12:51:12.063   content121
007         A   2021-04-13 12:52:13.063   content121
008  B-Direct   2021-04-13 12:55:12.063   content132
008         B   2021-04-13 12:56:11.063   content142
008  B-Direct   2021-04-13 12:57:22.063   content153
009  B-Direct   2021-04-13 12:55:12.063   content132
009  C-Direct   2021-04-13 12:56:11.063   content142
009  D-Direct   2021-04-13 12:57:22.063   content153

So I need a table which contains one distinct ID as each row. But the selection criteria on which ID makes it is a little complicated.

The default selection should be the most recent entry, selected via TIMESTAMP.

But the complexity comes from any ID's that have rows with -Direct. Specifically, if a row has more than one entry and one is (for example) A and the other is A-Direct, we need the A. This is only the case when the letter matches. As seen in the case for ID = 006, we want B-Direct since its counterpart is A.

So at it's core the logic I am looking for is

If an ID has rows beginning with the same string, and one of them ends in -Direct, substitute it with the -Direct removed.

Final Output:

ID      Group
001         A
002  A-Direct
003        BA
004         C
005  D-Direct
006  B-Direct
007         A
008         B
009  D-Direct

For added clarity, here is an outline of what happened to each ID:

  • ID 001: A is followed by A-Direct so we sub A-Direct for A
  • ID 002: A-Direct is the only result, easy!
  • ID 003: BA,B, B-Direct are distinct, therefore we stick with the most recent, BA.
  • ID 004: No direct, so we just take the most recent, C
  • ID 005: D-Direct is the most recent, but because there is no D , we stick with D-Direct
  • ID 006: B-Direct is the most recent, but because there is no B , we stick with B-Direct
  • ID 007: A-Direct is followed by A so we simply take the most recent one, no problem.
  • ID 008: Band B-Direct (x2) appear here, therefore we can use B.
  • ID 009: All options are Direct, so we go with the most recent, D-Direct

I can figure out how to get the most recent, but with the above criteria, I am unsure how to adjust

WITH data AS (
    select d.*,
        rank() over (
            partition by ID
            order by TIMESTAMP DESC
        ) as num
    FROM table d
)
select ID, TIMESTAMP
    from data
    where num = 1

Solution

  • I might start with something like the following. It isn't super-pretty so there might be a better solution, but I think it does what you want.

    WITH data AS (
        select d.*,
            rank() over (
                partition by ID
                order by TIMESTAMP DESC
            ) as num
        FROM table d
    )
    select ID, 
     CASE 
      WHEN EXISTS (SELECT * FROM table t WHERE t.id = d.id AND t.group || '-Direct' = d.group) 
       THEN replace(d.group, '-Direct') 
       ELSE d.group 
     END group
        from data d
        where num = 1
    

    This gets the most recent one for each id (using your current code), but the case/exists statement in the select clause checks whether there is a match without "-Direct" and, if so, we remove the "-Direct" from the string.