let's say I have a table with 3 columns: id, date_time, color. The data looks like this:
id, date_time, color
1, 2023-10-01 12:15, green
1, 2023-10-01 12:16, yellow
1, 2023-10-01 12:17, yellow
1, 2023-10-01 12:18, red
1, 2023-10-01 12:19, yellow
1, 2023-10-01 12:20, yellow
1, 2023-10-01 12:21, red
1, 2023-10-01 12:22, red
1, 2023-10-01 12:23, green
1, 2023-10-01 12:24, yellow
1, 2023-10-01 12:25, yellow
1, 2023-10-01 12:26, red
2, 2023-10-01 12:27, red
2, 2023-10-01 12:28, green
2, 2023-10-01 12:29, green
2, 2023-10-01 12:30, yellow
I need to count how often the value "yellow" appears in the "color" column , grouped by the "id" column, ordered by date_time. However, I have specific conditions:
It looks like subwindows in a group.
I'm using SQL Presto in AWS Athena, and I believe I should use window functions, but I'm unsure how to specify these conditions.
Thanks in advance for the tips
So, the expected results should be:
For id=1: Count "yellow" = 4
For id=2: Count "yellow" = 1
I tried this, but there is a lack of a cumulative counter for repeating yellow within conditions.
`with testdata(id, date_time, color) as (
VALUES
(1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
)
,t1 as (
SELECT id,
date_time,
color,
LAG(color) OVER (
PARTITION BY id
ORDER BY date_time
) AS prev_color,
LEAD(color) OVER (
PARTITION BY id
ORDER BY date_time
) AS next_color
FROM testdata
)
select id,
SUM(
CASE
WHEN color = 'yellow'
AND (
prev_color = 'green'
and (
next_color IS NULL
OR next_color = 'red'
OR next_color = 'yellow'
)
) THEN 1 ELSE 0
END
) AS yellow_count
FROM t1
group by id`
I got values
You have a gaps and islands problem, you could use the difference between two row_numbers to give a unique id to each successive group of rows :
with data (id, date_time, color) as (
VALUES
(1, cast('2023-10-01 12:15:00' as timestamp), 'green'),(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
),
cte AS (
SELECT id, date_time, color,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time)
- ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS rn
FROM data
),
cte2 AS (
SELECT id, rn,
MAX(color) AS color,
MAX(date_time) AS date_time,
SUM(CASE WHEN color = 'yellow' THEN 1 END) AS total
FROM cte c1
GROUP BY id, rn
ORDER BY date_time
),
cte3 AS (
SELECT *, LAG(color) OVER ( PARTITION BY id ORDER BY date_time) AS prev_color,
LEAD(color) OVER ( PARTITION BY id ORDER BY date_time) AS next_color
FROM cte2
)
SELECT id, SUM(total) AS total
FROM cte3
WHERE color = 'yellow' AND prev_color = 'green' AND ( next_color = 'red' OR next_color is null)
GROUP BY id
Explanations :
Use the difference between two row_numbers to give a unique id to each successive group of rows.
Group data by id and rn so it can be easy to get previous and next value for each color.
Then apply your conditions on the result from cte3 to get the expected data.
Result :
id | total |
---|---|
1 | 4 |
2 | 1 |