Search code examples
sqlwindow-functionsamazon-athenaprestogaps-and-islands

Perform Value Count with Specific Conditions inside subwindows SQL Presto


let's say I have a table with 3 columns: id, date_time, color. The data looks like this:

id, date_time, color
1, 2023-10-01 12:15, green
1, 2023-10-01 12:16, yellow
1, 2023-10-01 12:17, yellow
1, 2023-10-01 12:18, red
1, 2023-10-01 12:19, yellow
1, 2023-10-01 12:20, yellow
1, 2023-10-01 12:21, red
1, 2023-10-01 12:22, red
1, 2023-10-01 12:23, green
1, 2023-10-01 12:24, yellow
1, 2023-10-01 12:25, yellow
1, 2023-10-01 12:26, red
2, 2023-10-01 12:27, red
2, 2023-10-01 12:28, green
2, 2023-10-01 12:29, green
2, 2023-10-01 12:30, yellow

I need to count how often the value "yellow" appears in the "color" column , grouped by the "id" column, ordered by date_time. However, I have specific conditions:

  1. I want to count "yellow" only if it appears after "green".
  2. I want to count "yellow" only if it's followed by either the first "red" or if it's the last value in the group defined by "id".

It looks like subwindows in a group.

I'm using SQL Presto in AWS Athena, and I believe I should use window functions, but I'm unsure how to specify these conditions.

Thanks in advance for the tips

So, the expected results should be:

For id=1: Count "yellow" = 4
For id=2: Count "yellow" = 1

I tried this, but there is a lack of a cumulative counter for repeating yellow within conditions.

`with testdata(id, date_time, color) as (
  VALUES 
  (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:18:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:21:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:22:00' as timestamp), 'red'),
  (1, cast('2023-10-01 12:23:00' as timestamp), 'green'),
  (1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),
  (1, cast('2023-10-01 12:26:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:27:00' as timestamp), 'red'),
  (2, cast('2023-10-01 12:28:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:29:00' as timestamp), 'green'),
  (2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
)
,t1 as (
  SELECT id,
    date_time,
    color,
    LAG(color) OVER (
      PARTITION BY id
      ORDER BY date_time
    ) AS prev_color,
    LEAD(color) OVER (
      PARTITION BY id
      ORDER BY date_time
    ) AS next_color
  FROM testdata
)
select id,
  SUM(
    CASE
      WHEN color = 'yellow'
      AND (
        prev_color = 'green'
        and (
          next_color IS NULL
          OR next_color = 'red'
          OR next_color = 'yellow'
        )
      ) THEN 1 ELSE 0
    END
  )  AS yellow_count
FROM t1
group by id`

I got values

  • For id=1: Count "yellow" = 2 (incorrect)
  • For id=2: Count "yellow" = 1 (correct)

Solution

  • You have a gaps and islands problem, you could use the difference between two row_numbers to give a unique id to each successive group of rows :

    with data (id, date_time, color) as (
      VALUES 
      (1, cast('2023-10-01 12:15:00' as timestamp), 'green'),(1, cast('2023-10-01 12:16:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:17:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:18:00' as timestamp), 'red'),(1, cast('2023-10-01 12:19:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:20:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:21:00' as timestamp), 'red'),(1, cast('2023-10-01 12:22:00' as timestamp), 'red'),(1, cast('2023-10-01 12:23:00' as timestamp), 'green'),(1, cast('2023-10-01 12:24:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:25:00' as timestamp), 'yellow'),(1, cast('2023-10-01 12:26:00' as timestamp), 'red'),(2, cast('2023-10-01 12:27:00' as timestamp), 'red'),(2, cast('2023-10-01 12:28:00' as timestamp), 'green'),(2, cast('2023-10-01 12:29:00' as timestamp), 'green'),(2, cast('2023-10-01 12:30:00' as timestamp), 'yellow')
    ),
    cte AS (
      SELECT id, date_time, color, 
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_time)
             - ROW_NUMBER() OVER (PARTITION BY id, color ORDER BY date_time) AS rn
      FROM data
    ),
    cte2 AS (
      SELECT id, rn, 
             MAX(color) AS color,
             MAX(date_time) AS date_time,
             SUM(CASE WHEN color = 'yellow' THEN 1 END) AS total
      FROM cte c1
      GROUP BY id, rn
      ORDER BY date_time
    ),
    cte3 AS (
      SELECT *, LAG(color) OVER ( PARTITION BY id ORDER BY date_time) AS prev_color,
              LEAD(color) OVER ( PARTITION BY id ORDER BY date_time) AS next_color 
      FROM cte2
    )
    SELECT id, SUM(total) AS total
    FROM cte3
    WHERE color = 'yellow' AND prev_color = 'green' AND ( next_color = 'red' OR next_color is null)
    GROUP BY id
    

    Explanations :

    1. Use the difference between two row_numbers to give a unique id to each successive group of rows.

    2. Group data by id and rn so it can be easy to get previous and next value for each color.

    3. Then apply your conditions on the result from cte3 to get the expected data.

    Result :

    id total
    1 4
    2 1

    Demo on postgresql