I'm trying to convert a time-series recordset into something a bit more suitable for data analysis. Consider the following contiguous recordset (From, To, Value):
"2019-10-03 03:58:21+00" "2019-10-03 03:59:04+00" 10
"2019-10-03 03:59:04+00" "2019-10-03 03:59:54+00" 15
"2019-10-03 03:59:54+00" "2019-10-03 04:02:00+00" 20
"2019-10-03 04:02:00+00" "2019-10-03 04:02:10+00" 25
I want one record at the start of each minute, at the cost of "duplicated" rows.
"2019-10-03 03:58:21+00" "2019-10-03 03:59:00+00" 10
"2019-10-03 03:59:00+00" "2019-10-03 03:59:04+00" 10
"2019-10-03 03:59:04+00" "2019-10-03 03:59:54+00" 15
"2019-10-03 03:59:54+00" "2019-10-03 04:00:00+00" 20
"2019-10-03 04:00:00+00" "2019-10-03 04:01:00+00" 20
"2019-10-03 04:01:00+00" "2019-10-03 04:02:00+00" 20
"2019-10-03 04:02:00+00" "2019-10-03 04:02:10+00" 25
The first row is duplicated once because it crosses a minute boundary. The second is not duplicated because it falls within a single minute. The third is duplicated twice because it spans 3 minutes. The last is not exploded because it's within a minute (it also happens to start exactly on one). The value itself remains the same across exploded rows.
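The duplication rule above can be sanity-checked with generate_series: the number of pieces a row explodes into is the number of distinct minute buckets touched by the half-open interval [from, to). A sketch using the third row's timestamps (the '1 microsecond' nudge is my own way of making the upper bound exclusive):

```sql
-- Count the minute buckets touched by [03:59:54, 04:02:00):
-- the series yields 03:59, 04:00, 04:01, so the row appears 3 times.
SELECT COUNT(*)
FROM generate_series(
       date_trunc('minute', '2019-10-03 03:59:54+00'::TIMESTAMPTZ),
       '2019-10-03 04:02:00+00'::TIMESTAMPTZ - '1 microsecond'::INTERVAL,
       '1 minute'::INTERVAL) AS g;
-- → 3
```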
I have the feeling that my solution is a combination of window functions (lead/lag?), generate_series(), and various date_part / date_trunc calls. Possibly a cross or lateral self-join...
Currently using PostgreSQL 13.4, so I should have access to the latest and greatest API. I'm also using timescaledb 2.4.2 and hypertables if any of this helps, though its time_bucket function seems to be about reducing row count, not increasing it.
Would love a push in the right direction!
As I expected, it was a combination of generate_series and window functions. But I didn't expect to have to create my own locf function; I thought LEAD/LAG had options to remember the last known non-null value.
The following code takes a few known records, and unions them against a generated series of timestamps.
I needed to use DISTINCT ON to purge the generated records that already had known equivalents. Then I can finally use LEAD for the "next date" and locf_any for the value carried forward.
--https://www.joyofdata.de/blog/locf-linear-imputation-postgresql-tutorial/
DROP FUNCTION IF EXISTS locf_s(ANYELEMENT, ANYELEMENT) CASCADE;
CREATE OR REPLACE FUNCTION locf_s(a ANYELEMENT, b ANYELEMENT)
RETURNS ANYELEMENT
LANGUAGE sql
AS '
  -- Keep b unless it is NULL, in which case carry the previous value a forward.
  SELECT COALESCE(b, a)
';
DROP AGGREGATE IF EXISTS locf_any(ANYELEMENT);
CREATE AGGREGATE locf_any(ANYELEMENT) (
SFUNC = locf_s,
STYPE = ANYELEMENT
);
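As a quick sanity check on made-up data (the (i, v) values below are hypothetical), the aggregate carries the last non-NULL value forward when used as a window function:

```sql
-- With the default frame (start of partition up to the current row),
-- locf_any returns the most recent non-NULL value seen so far.
SELECT v, locf_any(v) OVER (ORDER BY i) AS carried
FROM (VALUES (1, 10), (2, NULL), (3, NULL), (4, 20)) AS t (i, v);
-- carried: 10, 10, 10, 20
```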
SELECT from_time, LEAD(from_time) OVER W AS to_time, locf_any(reading) OVER W AS reading
FROM
(
    SELECT DISTINCT ON (from_time) from_time, reading
    FROM
    (
        WITH readings (from_time, reading) AS (VALUES
            ('2019-10-03 03:58:21+00', 10),
            ('2019-10-03 03:59:04+00', 15),
            ('2019-10-03 03:59:54+00', 20),
            ('2019-10-03 04:02:00+00', 25)
        )
        SELECT from_time::TIMESTAMPTZ, reading::INTEGER FROM readings
        UNION ALL
        SELECT generate_series('2019-10-03 03:59:00+00', '2019-10-03 04:04:00+00', '1 minute'::INTERVAL)::TIMESTAMPTZ, NULL
    ) X
    -- Known readings sort before generated NULLs, so DISTINCT ON keeps them.
    ORDER BY from_time, reading NULLS LAST
) Y
WINDOW W AS (ORDER BY from_time ASC);
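One refinement I may make later: derive the series bounds from the data instead of hard-coding them. A sketch, assuming a persistent readings table shaped like the CTE above (the table name is my own):

```sql
-- Generate one timestamp per minute covering the recorded range,
-- truncating MIN/MAX to whole-minute boundaries.
SELECT generate_series(
         date_trunc('minute', MIN(from_time)),
         date_trunc('minute', MAX(from_time)),
         '1 minute'::INTERVAL) AS minute_start
FROM readings;
```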