Search code examples
sqltime-seriescumulative-sum

SQL: How can I generate a time series from timestamp data and calculate cumulative sums across different event types?


Building a timeseries with cumulative sums based on singe grouped events

To begin I need to highlight I am quite new to SQL and the current state was achieved with the help of stack overflow.

I have a table containing events of different kinds as they occur. I would like to transform this data into a time series filling the time gaps between the events and cumulating the numbers across the different kind of events.

So it is all about creating a timeseries from single events and calculate the running/cumulative sum across groups of events

Here the example of source data:

Source:

event_timestamp    type  value
01.01.2023 10:00   1     10
03.01.2023 10:00   2     10
05.01.2023 10:00   2     10
07.01.2023 10:00   1     10

Desired outcome:

event_timestamp    type value cumulative_sum
01.01.2023 10:00   1    10    10
02.01.2023 10:00   1    0     10
03.01.2023 10:00   1    0     10
03.01.2023 10:00   2    10    10
04.01.2023 10:00   1    0     10
04.01.2023 10:00   2    0     10
05.01.2023 10:00   1    0     10
05.01.2023 10:00   2    10    20
06.01.2023 10:00   1    0     10
06.01.2023 10:00   2    0     20
07.01.2023 10:00   1    10    20
07.01.2023 10:00   2    0     20

I got to a point, where I can create this (limited to a single event type):

time               type value cumulative_sum
01.01.2023 10:00   1    10    10
02.01.2023 10:00   1    0     10
03.01.2023 10:00   1    10    20
04.01.2023 10:00   1    0     20
05.01.2023 10:00   1    0     20
06.01.2023 10:00   1    0     20
07.01.2023 10:00   1    0     20

with the following sql statement (postgresql):

SELECT  
generate_series AS timestamp,
-- hard coded event type below
COALESCE(events.type, 1) AS type,
COALESCE(events.value, 0) AS value,
COALESCE(SUM(td.value) OVER (ORDER BY generate_series), 0) AS cumulative_sum  
FROM  
generate_series('2023-01-01'::timestamp, '2023-01-07'::timestamp, '1 day') AS generate_series  
LEFT JOIN  
-- hard coded event type below
events ON generate_series = events.event_timestamp AND event.type = 1 
ORDER BY  
generate_series;

Now I am questioning if I should try to find the SQL statement doing all in a better way (no idea how) or if I should use a python script to just run this statement per event type and sequentally , insert the data into the table.

Questions summarized:

  • Would it be advisable to do such a calculation with a combination of SQL and e.g. python?
  • Would it be better to split the creation of the timeseries and the cumulative sum?
  • If the pure SQL way is recommended, how can this done considering the groups

Solution

  • You're very close. All you need is to create the series for each type. A cross join can do that easily.

    Check this out:

    SELECT generate_series AS timestamp
         , etypes.type
         , COALESCE(events.value, 0) AS value
         , COALESCE(SUM(events.value) OVER (PARTITION BY etypes.type ORDER BY generate_series), 0) AS cumulative_sum
      FROM generate_series('2023-01-01'::timestamp, '2023-01-07'::timestamp, '1 day') AS generate_series
      CROSS JOIN (SELECT DISTINCT type FROM events) AS etypes
      LEFT JOIN events ON generate_series = events.event_timestamp AND events.type = etypes.type
     ORDER BY generate_series, etypes.type
    ;
    

    With no data in the given range, the result is this:

    timestamp type value cumulative_sum
    2023-01-01 00:00:00 1 0 0
    2023-01-01 00:00:00 3 0 0
    2023-01-02 00:00:00 1 0 0
    2023-01-02 00:00:00 3 0 0
    2023-01-03 00:00:00 1 0 0
    2023-01-03 00:00:00 3 0 0
    2023-01-04 00:00:00 1 0 0
    2023-01-04 00:00:00 3 0 0
    2023-01-05 00:00:00 1 0 0
    2023-01-05 00:00:00 3 0 0
    2023-01-06 00:00:00 1 0 0
    2023-01-06 00:00:00 3 0 0
    2023-01-07 00:00:00 1 0 0
    2023-01-07 00:00:00 3 0 0

    Test case (with a little data added)

    You could also create the series with different range for each type, based on the start and end for that type.

    Here's an example of that:

    SELECT generate_series AS timestamp
         , etypes.type
         , COALESCE(events.value, 0) AS value
         , COALESCE(SUM(events.value) OVER (PARTITION BY etypes.type ORDER BY generate_series), 0) AS cumulative_sum
      FROM (SELECT type
                 , MIN(event_timestamp) AS mints
                 , MAX(event_timestamp) AS maxts
              FROM events
             GROUP BY type
           ) AS etypes
      JOIN LATERAL (SELECT generate_series(mints, maxts, '1 day') AS generate_series) AS gs ON 1 = 1
      LEFT JOIN events ON generate_series = events.event_timestamp AND events.type = etypes.type
     ORDER BY generate_series, etypes.type
    ;
    

    Result with your data:

    timestamp type value cumulative_sum
    2023-01-01 00:00:00 1 10 10
    2023-01-02 00:00:00 1 0 10
    2023-01-03 00:00:00 1 0 10
    2023-01-03 00:00:00 2 10 10
    2023-01-04 00:00:00 1 0 10
    2023-01-04 00:00:00 2 0 10
    2023-01-05 00:00:00 1 0 10
    2023-01-05 00:00:00 2 10 20
    2023-01-06 00:00:00 1 0 10
    2023-01-07 00:00:00 1 10 20