Search code examples
postgresqltime-serieswindow-functionstimescaledb

Explode (duplicate) records based on timestamp range (PostgreSQL)


I'm trying to convert a timeseries recordset into something a bit more suitable for data analysis. Consider this following contiguous recordset (From, To, Value)

"2019-10-03 03:58:21+00"    "2019-10-03 03:59:04+00"    10
"2019-10-03 03:59:04+00"    "2019-10-03 03:59:54+00"    15
"2019-10-03 03:59:54+00"    "2019-10-03 04:02:00+00"    20
"2019-10-03 04:02:00+00"    "2019-10-03 04:02:10+00"    25

I want one record at the start of each minute, at the cost of "duplicated" rows.

"2019-10-03 03:58:21+00"    "2019-10-03 03:59:00+00"    10
"2019-10-03 03:59:00+00"    "2019-10-03 03:59:04+00"    10
"2019-10-03 03:59:04+00"    "2019-10-03 03:59:54+00"    15
"2019-10-03 03:59:54+00"    "2019-10-03 04:00:00+00"    20
"2019-10-03 04:00:00+00"    "2019-10-03 04:01:00+00"    20
"2019-10-03 04:01:00+00"    "2019-10-03 04:02:00+00"    20
"2019-10-03 04:02:00+00"    "2019-10-03 04:02:10+00"    25

First row is duplicated once because it overlaps a minute. Second is not duplicated because it's within a minute. Third is duplicated twice because it overlaps 3 minutes. Last is not exploded because it's within a minute (but it also starts at a minute). The value itself remains the same across exploded rows.

I have the feeling that my solution is a combination of Window functions (lead/lag?), generate_series() and various date_part / date_trunc calls. Possibly a cross or lateral self-join...

Currently using PostgreSQL 13.4 so I should have access to the latest and greatest API. I'm also using timescaledb 2.4.2 and hypertables if any of this helps, though their time_bucket function seems more about reducing row count, not increasing it.

Would love a push in the right direction!


Solution

  • As I expected, it was a combination of generate_series and window functions. But I didn't expect to have to create my own locf function, I thought LEAD/LAG had options to remember the last known / non-null values.

    The following code takes a few known records, and unions them against a generated series of timestamps.

    I needed to use DISTINCT ON to purge the generated records that already had known equivalents.

    Then I can finally use LEAD for the "next date" and locf_any for the value carried forward.

    --https://www.joyofdata.de/blog/locf-linear-imputation-postgresql-tutorial/
    
    DROP FUNCTION locf_s(ANYELEMENT, ANYELEMENT) CASCADE;
    CREATE OR REPLACE FUNCTION locf_s(a ANYELEMENT, b ANYELEMENT)
    RETURNS ANYELEMENT
    LANGUAGE sql
    AS '
      SELECT COALESCE(b, a)
    ';
    
    DROP AGGREGATE IF EXISTS locf_any(ANYELEMENT);
    CREATE AGGREGATE locf_any(ANYELEMENT) (
      SFUNC = locf_s,
      STYPE = ANYELEMENT
    );
    
    SELECT from_time, LEAD(from_time) OVER W, locf_any(reading) OVER W
    FROM
    (
      SELECT DISTINCT ON (from_time) from_time, reading FROM
      (
        WITH readings (from_time, reading) AS (VALUES
         ('2019-10-03 03:58:21+00', 10),
         ('2019-10-03 03:59:04+00', 15),
         ('2019-10-03 03:59:54+00', 20),
         ('2019-10-03 04:02:00+00', 25)
        )
        (
          SELECT from_time::TIMESTAMPTZ, reading::INTEGER FROM readings
          UNION ALL
          SELECT generate_series('2019-10-03 03:59:00+00', '2019-10-03 04:04:00+00', '1 minute'::INTERVAL)::TIMESTAMPTZ, NULL
        )
      ) X
      ORDER BY from_time, reading NULLS LAST
    ) Y
    WINDOW W AS (ORDER BY from_time ASC);