Search code examples
sqlgroup-bygoogle-bigquery

How to create an ordered rank/group based on consecutive values?


I have the following table schema:

id text persons timestamp
001 hi p1 2022-08-26 21:03:16.697000
001 how are you doing? p1 2022-08-26 21:03:17.528000
001 i'm doing fine p2 2022-08-26 21:03:18.234000
001 hey guys! p3 2022-08-26 21:03:25.349000
001 sorry p3 2022-08-26 21:03:25.723000
001 i'm late p3 2022-08-26 21:03:26.129000
002 good morning bois p1 2023-01-13 18:37:16.697000
002 time for some ranked games eh? p1 2023-01-13 18:37:17.528000
002 i've been waiting p2 2023-01-13 18:37:18.234000
002 let's go p2 2023-01-13 18:37:18.234000
002 hey guys! p3 2023-01-13 18:37:25.349000
002 sorry p3 2023-01-13 18:37:25.723000
002 i'm late p3 2023-01-13 18:37:26.129000

Desired Output:

id text
001 p1: hi how are you doing? p2: i'm doing fine p3: hey guys! sorry i'm late
002 p1: good morning bois time for some ranked games eh? p2: i've been waiting let's go! p3: hey guys! sorry i'm late

What is a good, efficient way to get the desired output using SQL? OR.. How do I create an intermediate table with column "C" below?

I've been able to achieve this using Python + Pandas , but I'd like to do this operation using SQL. So this is what I've done using Python + Pandas:

First I create an intermediate table with column 'C' that groups each consecutive text by persons as a unique 'C' group. Then I do a group by the id, text and persons columns and aggregate on the text to achieve the desired results above.

Intermediate Table:

id text persons C timestamp
001 hi p1 1 2022-08-26 21:03:16.697000
001 how are you doing? p1 1 2022-08-26 21:03:17.528000
001 im doing fine p2 2 2022-08-26 21:03:18.234000
001 hey guys! p3 3 2022-08-26 21:03:25.349000
001 sorry p3 3 2022-08-26 21:03:25.723000
001 i'm late p3 3 2022-08-26 21:03:26.129000
002 good morning bois p1 1 2023-01-13 18:37:16.697000
002 time for some ranked games eh? p1 1 2023-01-13 18:37:17.528000
002 i've been waiting p2 2 2023-01-13 18:37:18.234000
002 let's go p2 2 2023-01-13 18:37:18.234000
002 hey guys! p3 3 2023-01-13 18:37:25.349000
002 sorry p3 3 2023-01-13 18:37:25.723000
002 i'm late p3 3 2023-01-13 18:37:26.129000

Here's the command I used in Python + Pandas to get the desired output:

df['C'] = (df.persons.ne(df.persons.shift())).astype(int).cumsum()
df.groupby(['C','persons','id'], as_index=False)['text'].agg(' '.join)

I attempted this using SQL as well, but couldn't get the C column to work:

SELECT text, 
       persons, 
       row_number() over (partition by id, persons order by timestamp) as C
FROM table
ORDER BY timestamp

Solution

  • You could apply STRING_AGG twice:

    • first to aggregate on persons, and add "p:"
    • then to aggregate all your partial texts

    In order to ensure the ordering of your messages, you need to carry timestamp in the aggregate operations.

    WITH cte AS (
        SELECT id,
               MIN(timestamp) AS timestamp,
               CONCAT(persons, ': ', STRING_AGG(text, ' ' ORDER BY timestamp), ' ') AS text,
        FROM tab
        GROUP BY id, 
                 persons
    )
    SELECT id, 
           GROUP_CONCAT(text, ' ' ORDER BY timestamp) AS text
    FROM cte
    GROUP BY id