Search code examples
sqlwindow-functionsprestoamazon-athena

Get running count and sums of value of all rows that are started but not yet finished for each second


I have some event data that looks like this:

| time                    | id | status   | value |
|-------------------------|----|----------|-------|
| 2020-08-26T21:29:01.000 | 2  | started  | 8     |
| 2020-08-26T21:29:01.000 | 3  | started  | 4     |
| 2020-08-26T21:29:02.000 | 2  | finished | 8     |
| 2020-08-26T21:29:03.000 | 4  | started  | 12    |
| 2020-08-26T21:29:04.000 | 5  | started  | 2     |
| 2020-08-26T21:29:05.000 | 6  | started  | 24    |
| 2020-08-26T21:29:06.000 | 4  | finished | 12    |
| 2020-08-26T21:29:06.000 | 3  | finished | 4     |
| 2020-08-26T21:29:07.000 | 1  | finished | 1     |
| 2020-08-26T21:29:10.000 | 7  | started  | 4     |

Note that the event data was started recording after things had started and there are events that haven't yet finished.

I'm trying to then get a running count of the rows and running sum of the values per second.

As soon as I think running count I think of window queries but I'm struggling to work out how I get from this data to the output I expect.

I'd ideally then expect to get the following results:

| time                    | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 1     | 1          |
| 2020-08-26T21:29:01.000 | 3     | 13         |
| 2020-08-26T21:29:02.000 | 2     | 5          |
| 2020-08-26T21:29:03.000 | 3     | 17         |
| 2020-08-26T21:29:04.000 | 4     | 19         |
| 2020-08-26T21:29:05.000 | 5     | 43         |
| 2020-08-26T21:29:06.000 | 3     | 29         |
| 2020-08-26T21:29:07.000 | 2     | 28         |
| 2020-08-26T21:29:08.000 | 2     | 28         |
| 2020-08-26T21:29:09.000 | 2     | 28         |
| 2020-08-26T21:29:10.000 | 3     | 32         |
| 2020-08-26T21:29:11.000 | 3     | 32         |

I'd also be happy enough with an answer that didn't take into account the 1 id record that was running before events began recording which would then have the following results:

| time                    | count | sum_values |
|-------------------------|-------|------------|
| 2020-08-26T21:29:00.000 | 0     | 0          |
| 2020-08-26T21:29:01.000 | 2     | 12         |
| 2020-08-26T21:29:02.000 | 1     | 4          |
| 2020-08-26T21:29:03.000 | 2     | 16         |
| 2020-08-26T21:29:04.000 | 3     | 18         |
| 2020-08-26T21:29:05.000 | 4     | 42         |
| 2020-08-26T21:29:06.000 | 2     | 28         |
| 2020-08-26T21:29:07.000 | 2     | 28         |
| 2020-08-26T21:29:08.000 | 2     | 28         |
| 2020-08-26T21:29:09.000 | 2     | 28         |
| 2020-08-26T21:29:10.000 | 3     | 32         |
| 2020-08-26T21:29:11.000 | 3     | 32         |

As Athena/Presto doesn't support full joins I was able to get the start and stop times for each id with the following query (also on SQL Fiddle):

WITH started AS (
  SELECT *
  FROM foo
  WHERE status = 'started'
), finished AS (
  SELECT *
  FROM foo
  WHERE status = 'finished'
)
SELECT started.time AS started_time, finished.time AS finished_time, started.id, started.value
FROM started LEFT JOIN finished ON started.id = finished.id

Solution

  • I think you want a cumulative conditional sum:

    select time,
           sum(sum(case when status = 'started' then 1
                        when status = 'finished' then -1
                   end)
              ) over (order by time) as running_count,
           sum(sum(case when status = 'started' then value
                        when status = 'finished' then - value
                    end)
              ) over (order by time) as running_value
    from foo
    group by time
    

    The sum()s need to be nested because one is needed for the window function and the other is for the aggregation.