I have the following table schema:
id | text | persons | timestamp |
---|---|---|---|
001 | hi | p1 | 2022-08-26 21:03:16.697000 |
001 | how are you doing? | p1 | 2022-08-26 21:03:17.528000 |
001 | i'm doing fine | p2 | 2022-08-26 21:03:18.234000 |
001 | hey guys! | p3 | 2022-08-26 21:03:25.349000 |
001 | sorry | p3 | 2022-08-26 21:03:25.723000 |
001 | i'm late | p3 | 2022-08-26 21:03:26.129000 |
002 | good morning bois | p1 | 2023-01-13 18:37:16.697000 |
002 | time for some ranked games eh? | p1 | 2023-01-13 18:37:17.528000 |
002 | i've been waiting | p2 | 2023-01-13 18:37:18.234000 |
002 | let's go | p2 | 2023-01-13 18:37:18.234000 |
002 | hey guys! | p3 | 2023-01-13 18:37:25.349000 |
002 | sorry | p3 | 2023-01-13 18:37:25.723000 |
002 | i'm late | p3 | 2023-01-13 18:37:26.129000 |
Desired Output:
id | text |
---|---|
001 | p1: hi how are you doing? p2: i'm doing fine p3: hey guys! sorry i'm late |
002 | p1: good morning bois time for some ranked games eh? p2: i've been waiting let's go p3: hey guys! sorry i'm late |
What is a good, efficient way to get the desired output using SQL? Or, alternatively, how do I create an intermediate table with the "C" column shown below?
I've been able to achieve this using Python + Pandas, but I'd like to do the same operation in SQL. Here's what I've done with Python + Pandas:
First I create an intermediate table with a column 'C' that assigns each consecutive run of texts by the same person to a unique 'C' group. Then I group by the C, persons and id columns and aggregate the text to get the desired result above.
Intermediate Table:
id | text | persons | C | timestamp |
---|---|---|---|---|
001 | hi | p1 | 1 | 2022-08-26 21:03:16.697000 |
001 | how are you doing? | p1 | 1 | 2022-08-26 21:03:17.528000 |
001 | i'm doing fine | p2 | 2 | 2022-08-26 21:03:18.234000 |
001 | hey guys! | p3 | 3 | 2022-08-26 21:03:25.349000 |
001 | sorry | p3 | 3 | 2022-08-26 21:03:25.723000 |
001 | i'm late | p3 | 3 | 2022-08-26 21:03:26.129000 |
002 | good morning bois | p1 | 1 | 2023-01-13 18:37:16.697000 |
002 | time for some ranked games eh? | p1 | 1 | 2023-01-13 18:37:17.528000 |
002 | i've been waiting | p2 | 2 | 2023-01-13 18:37:18.234000 |
002 | let's go | p2 | 2 | 2023-01-13 18:37:18.234000 |
002 | hey guys! | p3 | 3 | 2023-01-13 18:37:25.349000 |
002 | sorry | p3 | 3 | 2023-01-13 18:37:25.723000 |
002 | i'm late | p3 | 3 | 2023-01-13 18:37:26.129000 |
Here's the command I used in Python + Pandas to get the desired output:
df['C'] = (df.persons.ne(df.persons.shift())).astype(int).cumsum()
df.groupby(['C','persons','id'], as_index=False)['text'].agg(' '.join)
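For reference, the two lines above run end-to-end like this (a minimal reproduction on a four-row sample of the data):

```python
import pandas as pd

# Minimal sample of the conversation table
df = pd.DataFrame({
    "id": ["001"] * 4,
    "text": ["hi", "how are you doing?", "i'm doing fine", "hey guys!"],
    "persons": ["p1", "p1", "p2", "p3"],
})

# A new group C starts whenever persons differs from the previous row,
# so each consecutive run by the same speaker gets one C value
df["C"] = df.persons.ne(df.persons.shift()).astype(int).cumsum()

out = df.groupby(["C", "persons", "id"], as_index=False)["text"].agg(" ".join)
print(out)
```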
I attempted this using SQL as well, but couldn't get the C column to work:
SELECT text,
persons,
row_number() over (partition by id, persons order by timestamp) as C
FROM table
ORDER BY timestamp
You could apply STRING_AGG twice. To guarantee the ordering of your messages, you need to carry timestamp through both aggregation steps.
WITH cte AS (
    SELECT id,
           MIN(timestamp) AS timestamp,
           CONCAT(persons, ': ', STRING_AGG(text, ' ' ORDER BY timestamp)) AS text
    FROM tab
    GROUP BY id,
             persons
)
SELECT id,
       STRING_AGG(text, ' ' ORDER BY timestamp) AS text
FROM cte
GROUP BY id
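Note that grouping by id, persons merges all of a person's messages, which happens to match your sample because each person speaks in a single consecutive block. If you also want the intermediate C column in SQL (so non-consecutive runs of the same person stay separate), the usual gaps-and-islands trick is a LAG comparison followed by a running SUM, rather than ROW_NUMBER. A sketch, assuming the table and column names from the question, demonstrated here via SQLite (which supports window functions since 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab (id TEXT, text TEXT, persons TEXT, timestamp TEXT);
INSERT INTO tab VALUES
  ('001', 'hi',                 'p1', '2022-08-26 21:03:16.697'),
  ('001', 'how are you doing?', 'p1', '2022-08-26 21:03:17.528'),
  ('001', 'i''m doing fine',    'p2', '2022-08-26 21:03:18.234'),
  ('001', 'hey guys!',          'p3', '2022-08-26 21:03:25.349');
""")

# Flag each row where the speaker changes, then running-sum the flags:
# the running sum is exactly the C column from the pandas approach
rows = conn.execute("""
WITH flagged AS (
  SELECT id, text, persons, timestamp,
         CASE WHEN persons = LAG(persons) OVER (PARTITION BY id ORDER BY timestamp)
              THEN 0 ELSE 1 END AS chg
  FROM tab
)
SELECT id, text, persons,
       SUM(chg) OVER (PARTITION BY id ORDER BY timestamp) AS C
FROM flagged
ORDER BY id, timestamp
""").fetchall()

for r in rows:
    print(r)
```

From there you can aggregate on (id, C) exactly as in the pandas version.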