Search code examples
sqlpostgresqltimestamp

Select * but group by similar timestamps (not in fixed intervals)


I've seen a lot of grouping by timestamps with fixed intervals (e.g. 0-10, 11-20 mins) etc. However I want to ignore duplicate date in a database which have Similar timestamps, that in theory could span fixed width intervals e.g. 10:09 and 10:11. So I want the intervals to be dynamic and relative to each row.

What's the best approach to show all the "unique" entries, grabbing only the first entry where the timestamps are +/- 5 minutes on the others.

Thanks


Solution

  • This is a gaps and islands problem. Let's start by creating a table of test values and populating it:

    CREATE TABLE test_values AS (
    SELECT id, started_at::TIMESTAMP AS started_at
          FROM (VALUES (1, '2024-07-01 00:00:00'),
                       (2, '2024-07-01 00:10:00'),
                       (3, '2024-07-01 00:15:00'),
                       (4, '2024-07-01 00:19:00'),
                       (5, '2024-07-01 00:25:00')) v (id, started_at));
    

    The following query demonstrates an approach to finding the starting times of each cluster:

    WITH
      params AS (SELECT INTERVAL '5' MINUTE AS spread),
      ng AS (
        SELECT tv.id,
               tv.started_at,
               tv.started_at - LAG(tv.started_at) OVER (ORDER BY tv.started_at) > params.spread IS NOT FALSE AS new_group
          FROM params
          CROSS JOIN test_values tv)
    SELECT ng.*
      FROM ng
      WHERE ng.new_group
      ORDER BY ng.started_at;
    

    The query works by identifying each started_at that neither overlaps nor is contiguous with the most recent earlier time plus spread. The comparison is checked with IS NOT FALSE so that the first time will be identified as starting a group:

    id started_at new_group
    1 2024-07-01 00:00:00 true
    2 2024-07-01 00:10:00 true
    5 2024-07-01 00:25:00 true

    If > is changed to >=, then clusters will consist of times that are within spread of adjacent times; i.e, adjacent times that are exactly spread apart will be in different clusters:

    id started_at new_group
    1 2024-07-01 00:00:00 true
    2 2024-07-01 00:10:00 true
    3 2024-07-01 00:15:00 true
    5 2024-07-01 00:25:00 true