sql postgresql gaps-and-islands postgresql-12

Generate new UUID per row depending on gaps in existing time series

I have a table1 with timestamps, grouped by an added “Period ID”:

Period ID (uuid)    Created At (timestamptz)
  A                 2017-04-11 11:13:47.997+00
  A                 2017-04-11 14:42:51.843+00
  B                 2017-05-21 15:18:12.973+00
  B                 2017-05-21 15:28:41.054+00
  B                 2017-05-21 15:28:57.577+00
  C                 2017-06-11 22:48:00.637+00
  D                 2017-07-15 14:45:52.343+00
  D                 2017-07-15 14:47:53.343+00
  E                 2017-08-25 16:22:57.612+00
  F                 2017-09-16 00:10:18.577+00

I need to create a new ID “Period ID 2” as a subset of Period ID, such that:

A UUID is populated in the first row in the new column “Period ID 2”.
A new Period ID 2 is generated for each different Period ID value (Ex: All timestamps with Period ID “A” can have same “Period ID 2”, timestamps for Period ID “B” can all have the same “Period ID 2”. BUT the “Period ID 2” for A and B cannot be the same )
For records with the same “Period ID”, if the difference between consecutive timestamps is more than 10 mins, then a new “Period ID 2” should be generated. (Example : for the records with “Period ID” = B, the interval between the first and second timestamp is more than 10 minutes, so a new “Period ID 2” will be generated for the second value. The interval between 2nd and 3rd is less than 10 mins, so the 2nd and 3rd records will have the same “Period ID 2”).

So, the output should be like this:

Period ID (uuid)    Created At (timestamptz)    Period2 (uuid)
  A                 2017-04-11 11:13:47.997+00      1
  A                 2017-04-11 14:42:51.843+00      2
  B                 2017-05-21 15:18:12.973+00      3
  B                 2017-05-21 15:28:41.054+00      4
  B                 2017-05-21 15:28:57.577+00      4
  C                 2017-06-11 22:48:00.637+00      5
  D                 2017-07-15 14:45:52.343+00      6
  D                 2017-07-15 14:47:53.343+00      6
  E                 2017-08-25 16:22:57.612+00      7
  F                 2017-09-16 00:10:18.577+00      8

I can add the column, but I am not sure how to populate it. I looked into the following flow: ALTER TABLE, CREATE FUNCTION (for trigger function), CREATE TRIGGER, ... But I wasn’t able to query it properly. Is there any other way to approach this problem?

Solution

This does what you ask:

SELECT period_id, created_at, md5(period_id::text || grp)::uuid AS period_id2
FROM  (
   SELECT *, count(*) FILTER (WHERE step) OVER (PARTITION BY period_id ORDER BY created_at) AS grp
   FROM  (
      SELECT *, (created_at - lag(created_at) OVER (PARTITION BY period_id ORDER BY created_at)) > interval '10 min' AS step
      FROM  tbl
      ) sub1
   ) sub2;

Or with just a single subquery - albeit less readable:

SELECT period_id, created_at
     , md5(period_id::text || count(*) FILTER (WHERE step) OVER (PARTITION BY period_id ORDER BY created_at))::uuid AS period_id2
FROM  (
   SELECT *, (created_at - lag(created_at) OVER (PARTITION BY period_id ORDER BY created_at)) > interval '10 min' AS step
   FROM  tbl
   ) sub;

db<>fiddle here

I chose a deterministic way to generate UUIDs. So the Nth group within the same period_id always gets the same, reproducible UUID.

Basic explanation for this query technique:

Select longest continuous sequence

About md5() and UUID:

Maybe you don't actually need UUIDs to begin with? I see them used a lot without need. See:

PostgreSQL using UUID vs Text as primary key

Persist as table?

If you are at liberty to replace the existing table, just creating a new one will be fastest:

CREATE TABLE tbl2 AS
SELECT period_id, created_at
     , md5(period_id::text || count(*) FILTER (WHERE step) OVER (PARTITION BY period_id ORDER BY created_at))::uuid AS period_id2
FROM  (
   SELECT *, (created_at - lag(created_at) OVER (PARTITION BY period_id ORDER BY created_at)) > interval '10 min' AS step
   FROM  tbl
   ) sub;

If you write to the table, period_id2 for all involved period_id have to be recalculated. In case of INSERT only the ones later than the new created_at (plus the new row itself). With all the added bloat and vacuum cost to the table, this gets expensive quickly.

Maybe consider a MATERIALIZED VIEW. But at some point, if the write costs outweigh read cost, it will be cheaper not to persist period_id2 at all and compute it on the fly. Heavily depends on actual read and write patterns.