I want to calculate a new column that assigns an ID based on change in sequence per group and within that group.
Consider the following table:
# | user_id | my_date | color |
# |---------|------------|--------|
# | a | 2023-02-01 | red | ----
# | a | 2023-03-22 | red | |
# | a | 2023-03-30 | red | | this is period *1* for user_id = a
# | a | 2023-06-10 | red | |
# | a | 2023-06-11 | red | ----
# | a | 2023-07-03 | green |
# | a | 2023-07-09 | green |
# | a | 2024-01-11 | green |
# | a | 2024-02-11 | yellow |
# | a | 2024-02-12 | yellow |
# | a | 2024-02-13 | yellow |
# | a | 2024-02-14 | yellow |
# | b | 2022-10-20 | blue |
# | b | 2022-10-21 | blue |
# | b | 2022-10-22 | blue |
# | b | 2022-10-23 | brown | ----
# | b | 2022-10-24 | brown | | this is period *2* for user_id = b
# | b | 2022-10-25 | brown | ----
# | b | 2022-10-26 | blue |
# | b | 2022-10-27 | blue |
For each user_id
, if we follow along my_date
by order (ascending), we can see that there are "periods" or sequences that are characterized by the same color
.
I want to create a new column that accounts for the period/sequence.
# | user_id | my_date | color | period_number |
# |---------|------------|--------|---------------|
# | a | 2023-02-01 | red | 1 |
# | a | 2023-03-22 | red | 1 |
# | a | 2023-03-30 | red | 1 |
# | a | 2023-06-10 | red | 1 |
# | a | 2023-06-11 | red | 1 |
# | a | 2023-07-03 | green | 2 |
# | a | 2023-07-09 | green | 2 |
# | a | 2024-01-11 | green | 2 |
# | a | 2024-02-11 | yellow | 3 |
# | a | 2024-02-12 | yellow | 3 |
# | a | 2024-02-13 | yellow | 3 |
# | a | 2024-02-14 | yellow | 3 |
# | b | 2022-10-20 | blue | 1 |
# | b | 2022-10-21 | blue | 1 |
# | b | 2022-10-22 | blue | 1 |
# | b | 2022-10-23 | brown | 2 |
# | b | 2022-10-24 | brown | 2 |
# | b | 2022-10-25 | brown | 2 |
# | b | 2022-10-26 | blue | 3 |
# | b | 2022-10-27 | blue | 3 |
I use AWS Athena, based on Trino SQL.
WITH my_table AS (
SELECT *
FROM (VALUES
('a', DATE '2023-02-01', 'red'),
('a', DATE '2023-03-22', 'red'),
('a', DATE '2023-03-30', 'red'),
('a', DATE '2023-06-10', 'red'),
('a', DATE '2023-06-11', 'red'),
('a', DATE '2023-07-03', 'green'),
('a', DATE '2023-07-09', 'green'),
('a', DATE '2024-01-11', 'green'),
('a', DATE '2024-02-11', 'yellow'),
('a', DATE '2024-02-12', 'yellow'),
('a', DATE '2024-02-13', 'yellow'),
('a', DATE '2024-02-14', 'yellow'),
('b', DATE '2022-10-20', 'blue'),
('b', DATE '2022-10-21', 'blue'),
('b', DATE '2022-10-22', 'blue'),
('b', DATE '2022-10-23', 'brown'),
('b', DATE '2022-10-24', 'brown'),
('b', DATE '2022-10-25', 'brown'),
('b', DATE '2022-10-26', 'blue'),
('b', DATE '2022-10-27', 'blue')
) AS t(user_id, my_date, color)
)
SELECT *
FROM my_table;
From this answer I learned that I could mark the rows in which the change in color
happens, per user_id
:
breaking_points_marked AS (
SELECT *,
lag(color) OVER (PARTITION BY user_id ORDER BY my_date) IS DISTINCT FROM color AS is_starting_a_new_period
FROM my_table
)
SELECT *
FROM breaking_points_marked
ORDER BY user_id,
my_date;
-- user_id|my_date |color |is_starting_a_new_period|
-- -------+----------+------+------------------------+
-- a |2023-02-01|red |true |
-- a |2023-03-22|red |false |
-- a |2023-03-30|red |false |
-- a |2023-06-10|red |false |
-- a |2023-06-11|red |false |
-- a |2023-07-03|green |true |
-- a |2023-07-09|green |false |
-- a |2024-01-11|green |false |
-- a |2024-02-11|yellow|true |
-- a |2024-02-12|yellow|false |
-- a |2024-02-13|yellow|false |
-- a |2024-02-14|yellow|false |
-- b |2022-10-20|blue |true |
-- b |2022-10-21|blue |false |
-- b |2022-10-22|blue |false |
-- b |2022-10-23|brown |true |
-- b |2022-10-24|brown |false |
-- b |2022-10-25|brown |false |
-- b |2022-10-26|blue |true |
-- b |2022-10-27|blue |false |
Based on is_starting_a_new_period
, the solution seems closer. I need to find a way to add another column that starts at 1 per user_id
, and increments at each true
(while it carries over the value along the false
s).
I thought I had found the solution in this comment, but unfortunately I don't know how to adjust it to my situation.
SELECT *,
1 + COUNT(CASE WHEN is_starting_a_new_period = TRUE THEN 1 END) OVER (PARTITION BY user_id
ORDER BY my_date
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS period_number
FROM breaking_points_marked
ORDER BY user_id,
my_date;
-- not the desired output :/
-- user_id|my_date |color |is_starting_a_new_period|period_number|
-- -------+----------+------+------------------------+-------------+
-- a |2023-02-01|red |true | 1|
-- a |2023-03-22|red |false | 2|
-- a |2023-03-30|red |false | 2|
-- a |2023-06-10|red |false | 2|
-- a |2023-06-11|red |false | 2|
-- a |2023-07-03|green |true | 2|
-- a |2023-07-09|green |false | 3|
-- a |2024-01-11|green |false | 3|
-- a |2024-02-11|yellow|true | 3|
-- a |2024-02-12|yellow|false | 4|
-- a |2024-02-13|yellow|false | 4|
-- a |2024-02-14|yellow|false | 4|
-- b |2022-10-20|blue |true | 1|
-- b |2022-10-21|blue |false | 2|
-- b |2022-10-22|blue |false | 2|
-- b |2022-10-23|brown |true | 2|
-- b |2022-10-24|brown |false | 3|
-- b |2022-10-25|brown |false | 3|
-- b |2022-10-26|blue |true | 3|
-- b |2022-10-27|blue |false | 4|
See example
with my_table AS (
SELECT *
FROM (VALUES
('a', '2023-02-01', 'red'),
('a', '2023-03-30', 'red'),
('a', '2023-06-10', 'red'),
('a', '2023-06-11', 'red'),
('a', '2023-07-03', 'green'),
('a', '2023-07-09', 'green'),
('a', '2024-01-11', 'green'),
('a', '2024-02-11', 'yellow'),
('a', '2024-02-12', 'yellow'),
('a', '2024-02-13', 'yellow'),
('a', '2024-02-14', 'yellow'),
('b', '2022-10-20', 'blue'),
('b', '2022-10-21', 'blue'),
('b', '2022-10-22', 'blue'),
('b', '2022-10-23', 'brown'),
('b', '2022-10-24', 'brown'),
('b', '2022-10-25', 'brown'),
('b', '2022-10-26', 'blue'),
('b', '2022-10-27', 'blue')
) AS t(user_id, my_date, color)
)
query
,periods as(select user_id,color,my_date
,(select min(my_date) from my_table t2
where t2.user_id=t.user_id and t2.my_date>t.my_date
and t2.color<>t.color) next_date
from my_table t
)
,gr as(
select user_id,color,next_date
,min(my_date)mindt
from periods
group by user_id,color,next_date
)
,res as(
select t.*
from my_table t
left join gr g on g.user_id=t.user_id
and g.mindt<=t.my_date
)
select user_id,color,my_date,count(*) period_number
from res
group by user_id,color,my_date
order by user_id,my_date;
Update1.
If possible in your DBMS to use windows functions, the task becomes one of the gap and islands
type tasks.
See example
with cte1 as(
select *
,case when color=lag(color,1,color)over(partition by user_id order by my_date)
then 0
else 1
end new_period
from my_table
)
select *
,sum(new_period)over(partition by user_id order by my_date)+1 period_number
from cte1
Update2.
There compared current color and previous color (within a sequence of rows by user_Id and date).
color=lag(color,1,color)over(partition by user_id order by my_date)
If color is changed - so, that is start of new period.
For first row for (user_Id, date) lag(color) returns null - previous row absent.
We use lag(color,1,color) - return previous color or return current color if no previous row.
So, first row of group has new_period=1
.
Further,
sum(new_period)over(partition by user_id order by my_date)+1 period_number
count of start new period
is a number of period. +1
for numbering from 1, else period_number will 0,1,2...
.
Output
user_id | my_date | color | new_period | period_number |
---|---|---|---|---|
a | 2023-02-01 | red | 0 | 1 |
a | 2023-03-30 | red | 0 | 1 |
a | 2023-06-10 | red | 0 | 1 |
a | 2023-06-11 | red | 0 | 1 |
a | 2023-07-03 | green | 1 | 2 |
a | 2023-07-09 | green | 0 | 2 |
a | 2024-01-11 | green | 0 | 2 |
a | 2024-02-11 | yellow | 1 | 3 |
a | 2024-02-12 | yellow | 0 | 3 |
a | 2024-02-13 | yellow | 0 | 3 |
a | 2024-02-14 | yellow | 0 | 3 |
b | 2022-10-20 | blue | 0 | 1 |
b | 2022-10-21 | blue | 0 | 1 |
b | 2022-10-22 | blue | 0 | 1 |
b | 2022-10-23 | brown | 1 | 2 |
b | 2022-10-24 | brown | 0 | 2 |
b | 2022-10-25 | brown | 0 | 2 |
b | 2022-10-26 | blue | 1 | 3 |
b | 2022-10-27 | blue | 0 | 3 |