Search code examples
sqlamazon-athenaprestotrino

In Trino/Presto SQL: Create a new column that accounts (enumerates) for change in sequence per group


I want to calculate a new column that assigns an ID based on change in sequence per group and within that group.

Consider the following table:

# | user_id | my_date    | color  |
# |---------|------------|--------|
# | a       | 2023-02-01 | red    | ----
# | a       | 2023-03-22 | red    |    |
# | a       | 2023-03-30 | red    |    | this is period *1* for user_id = a
# | a       | 2023-06-10 | red    |    |
# | a       | 2023-06-11 | red    | ----
# | a       | 2023-07-03 | green  |
# | a       | 2023-07-09 | green  |
# | a       | 2024-01-11 | green  |
# | a       | 2024-02-11 | yellow |
# | a       | 2024-02-12 | yellow |
# | a       | 2024-02-13 | yellow |
# | a       | 2024-02-14 | yellow |
# | b       | 2022-10-20 | blue   |
# | b       | 2022-10-21 | blue   |
# | b       | 2022-10-22 | blue   |
# | b       | 2022-10-23 | brown  | ----
# | b       | 2022-10-24 | brown  |    | this is period *2* for user_id = b
# | b       | 2022-10-25 | brown  | ----
# | b       | 2022-10-26 | blue   |
# | b       | 2022-10-27 | blue   |

For each user_id, if we follow along my_date by order (ascending), we can see that there are "periods" or sequences that are characterized by the same color.

I want to create a new column that accounts for the period/sequence.

Expected output

# | user_id | my_date    | color  | period_number |
# |---------|------------|--------|---------------|
# | a       | 2023-02-01 | red    | 1             |
# | a       | 2023-03-22 | red    | 1             |
# | a       | 2023-03-30 | red    | 1             |
# | a       | 2023-06-10 | red    | 1             |
# | a       | 2023-06-11 | red    | 1             |
# | a       | 2023-07-03 | green  | 2             |
# | a       | 2023-07-09 | green  | 2             |
# | a       | 2024-01-11 | green  | 2             |
# | a       | 2024-02-11 | yellow | 3             |
# | a       | 2024-02-12 | yellow | 3             |
# | a       | 2024-02-13 | yellow | 3             |
# | a       | 2024-02-14 | yellow | 3             |
# | b       | 2022-10-20 | blue   | 1             |
# | b       | 2022-10-21 | blue   | 1             |
# | b       | 2022-10-22 | blue   | 1             |
# | b       | 2022-10-23 | brown  | 2             |
# | b       | 2022-10-24 | brown  | 2             |
# | b       | 2022-10-25 | brown  | 2             |
# | b       | 2022-10-26 | blue   | 3             |
# | b       | 2022-10-27 | blue   | 3             |

SQL Dialect

I use AWS Athena, based on Trino SQL.

Reproducible data

WITH my_table AS (
    SELECT *
    FROM (VALUES
        ('a', DATE '2023-02-01', 'red'),
        ('a', DATE '2023-03-22', 'red'),
        ('a', DATE '2023-03-30', 'red'),
        ('a', DATE '2023-06-10', 'red'),
        ('a', DATE '2023-06-11', 'red'),
        ('a', DATE '2023-07-03', 'green'),
        ('a', DATE '2023-07-09', 'green'),
        ('a', DATE '2024-01-11', 'green'),
        ('a', DATE '2024-02-11', 'yellow'),
        ('a', DATE '2024-02-12', 'yellow'),
        ('a', DATE '2024-02-13', 'yellow'),
        ('a', DATE '2024-02-14', 'yellow'),
        ('b', DATE '2022-10-20', 'blue'),
        ('b', DATE '2022-10-21', 'blue'),
        ('b', DATE '2022-10-22', 'blue'),
        ('b', DATE '2022-10-23', 'brown'),
        ('b', DATE '2022-10-24', 'brown'),
        ('b', DATE '2022-10-25', 'brown'),
        ('b', DATE '2022-10-26', 'blue'),
        ('b', DATE '2022-10-27', 'blue')
    ) AS t(user_id, my_date, color)
)

SELECT *
FROM my_table; 

UPDATE – One possible direction

From this answer I learned that I could mark the rows in which the change in color happens, per user_id:

breaking_points_marked AS (
SELECT *, 
       lag(color) OVER (PARTITION BY user_id ORDER BY my_date) IS DISTINCT FROM color AS is_starting_a_new_period
FROM   my_table
)

SELECT *
FROM breaking_points_marked 
ORDER BY user_id, 
         my_date; 

-- user_id|my_date   |color |is_starting_a_new_period|
-- -------+----------+------+------------------------+
-- a      |2023-02-01|red   |true                    |
-- a      |2023-03-22|red   |false                   |
-- a      |2023-03-30|red   |false                   |
-- a      |2023-06-10|red   |false                   |
-- a      |2023-06-11|red   |false                   |
-- a      |2023-07-03|green |true                    |
-- a      |2023-07-09|green |false                   |
-- a      |2024-01-11|green |false                   |
-- a      |2024-02-11|yellow|true                    |
-- a      |2024-02-12|yellow|false                   |
-- a      |2024-02-13|yellow|false                   |
-- a      |2024-02-14|yellow|false                   |
-- b      |2022-10-20|blue  |true                    |
-- b      |2022-10-21|blue  |false                   |
-- b      |2022-10-22|blue  |false                   |
-- b      |2022-10-23|brown |true                    |
-- b      |2022-10-24|brown |false                   |
-- b      |2022-10-25|brown |false                   |
-- b      |2022-10-26|blue  |true                    |
-- b      |2022-10-27|blue  |false                   |

Based on is_starting_a_new_period, the solution seems closer. I need to find a way to add another column that starts at 1 per user_id, and increments at each true (while it carries over the value along the falses).

I thought I had found the solution in this comment, but unfortunately I don't know how to adjust it to my situation.

SELECT *,
       1 + COUNT(CASE WHEN is_starting_a_new_period = TRUE THEN 1 END) OVER (PARTITION BY user_id 
                                                                             ORDER BY my_date 
                                                                             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS period_number
FROM breaking_points_marked 
ORDER BY user_id, 
         my_date;

-- not the desired output :/

-- user_id|my_date   |color |is_starting_a_new_period|period_number|
-- -------+----------+------+------------------------+-------------+
-- a      |2023-02-01|red   |true                    |            1|
-- a      |2023-03-22|red   |false                   |            2|
-- a      |2023-03-30|red   |false                   |            2|
-- a      |2023-06-10|red   |false                   |            2|
-- a      |2023-06-11|red   |false                   |            2|
-- a      |2023-07-03|green |true                    |            2|
-- a      |2023-07-09|green |false                   |            3|
-- a      |2024-01-11|green |false                   |            3|
-- a      |2024-02-11|yellow|true                    |            3|
-- a      |2024-02-12|yellow|false                   |            4|
-- a      |2024-02-13|yellow|false                   |            4|
-- a      |2024-02-14|yellow|false                   |            4|
-- b      |2022-10-20|blue  |true                    |            1|
-- b      |2022-10-21|blue  |false                   |            2|
-- b      |2022-10-22|blue  |false                   |            2|
-- b      |2022-10-23|brown |true                    |            2|
-- b      |2022-10-24|brown |false                   |            3|
-- b      |2022-10-25|brown |false                   |            3|
-- b      |2022-10-26|blue  |true                    |            3|
-- b      |2022-10-27|blue  |false                   |            4|

Solution

  • See example

    with my_table AS (
        SELECT *
        FROM (VALUES
            ('a',  '2023-02-01', 'red'),
            ('a',  '2023-03-30', 'red'),
            ('a',  '2023-06-10', 'red'),
            ('a',  '2023-06-11', 'red'),
            ('a',  '2023-07-03', 'green'),
            ('a',  '2023-07-09', 'green'),
            ('a',  '2024-01-11', 'green'),
            ('a',  '2024-02-11', 'yellow'),
            ('a',  '2024-02-12', 'yellow'),
            ('a',  '2024-02-13', 'yellow'),
            ('a',  '2024-02-14', 'yellow'),
            ('b',  '2022-10-20', 'blue'),
            ('b',  '2022-10-21', 'blue'),
            ('b',  '2022-10-22', 'blue'),
            ('b',  '2022-10-23', 'brown'),
            ('b',  '2022-10-24', 'brown'),
            ('b',  '2022-10-25', 'brown'),
            ('b',  '2022-10-26', 'blue'),
            ('b',  '2022-10-27', 'blue')
        ) AS t(user_id, my_date, color)
    )
    

    query

    ,periods as(select user_id,color,my_date
      ,(select min(my_date) from my_table t2 
        where t2.user_id=t.user_id and t2.my_date>t.my_date 
         and t2.color<>t.color) next_date
    from my_table t
    )
    ,gr as(
    select user_id,color,next_date
      ,min(my_date)mindt 
    from periods
    group by user_id,color,next_date
    )
    ,res as(
    select t.* 
    from my_table t
    left join gr g on g.user_id=t.user_id
      and g.mindt<=t.my_date
    )
    select user_id,color,my_date,count(*) period_number
    from res
    group by user_id,color,my_date
    order by user_id,my_date;
    

    Update1.

    If possible in your DBMS to use windows functions, the task becomes one of the gap and islands type tasks.
    See example

    with cte1 as(
    select *
      ,case when color=lag(color,1,color)over(partition by user_id order by my_date)
         then 0
       else 1
       end new_period
    from my_table
    )
    select * 
      ,sum(new_period)over(partition by user_id order by my_date)+1 period_number
    from cte1
    

    Update2.
    There compared current color and previous color (within a sequence of rows by user_Id and date).

    color=lag(color,1,color)over(partition by user_id order by my_date)
    

    If color is changed - so, that is start of new period.
    For first row for (user_Id, date) lag(color) returns null - previous row absent.
    We use lag(color,1,color) - return previous color or return current color if no previous row.
    So, first row of group has new_period=1.

    Further,

     sum(new_period)over(partition by user_id order by my_date)+1 period_number
    

    count of start new period is a number of period. +1 for numbering from 1, else period_number will 0,1,2....

    Output

    user_id my_date color new_period period_number
    a 2023-02-01 red 0 1
    a 2023-03-30 red 0 1
    a 2023-06-10 red 0 1
    a 2023-06-11 red 0 1
    a 2023-07-03 green 1 2
    a 2023-07-09 green 0 2
    a 2024-01-11 green 0 2
    a 2024-02-11 yellow 1 3
    a 2024-02-12 yellow 0 3
    a 2024-02-13 yellow 0 3
    a 2024-02-14 yellow 0 3
    b 2022-10-20 blue 0 1
    b 2022-10-21 blue 0 1
    b 2022-10-22 blue 0 1
    b 2022-10-23 brown 1 2
    b 2022-10-24 brown 0 2
    b 2022-10-25 brown 0 2
    b 2022-10-26 blue 1 3
    b 2022-10-27 blue 0 3