Search code examples
sqlpostgresqlplpgsqlwindow-functionsgaps-and-islands

Optimising function which extracts records with a minimum gap in timestamps


I have a big table of timestamps in Postgres 9.4.5:

CREATE TABLE vessel_position (
  posid serial NOT NULL,
  mmsi integer NOT NULL,
  "timestamp" timestamp with time zone,
  the_geom geometry(PointZ,4326),
  CONSTRAINT "PK_posid_mmsi" PRIMARY KEY (posid, mmsi)
);

Additional index:

CREATE INDEX vessel_position_timestamp_idx ON vessel_position ("timestamp");

I want to extract every row where the timestamp is at least x minutes after the previous row. I've tried a few different SELECT statements using LAG() which all kind of worked, but didn't give me the exact result I require. The below functions gives me what I need, but I feel it could be quicker:

CREATE OR REPLACE FUNCTION _getVesslTrackWithInterval(mmsi integer, startTime character varying (25) ,endTime character varying (25), interval_min integer)
RETURNS SETOF vessel_position AS
$func$
DECLARE
    count integer DEFAULT 0;
    posids varchar DEFAULT '';
    tbl CURSOR FOR
    SELECT
      posID
      ,EXTRACT(EPOCH FROM (timestamp -  lag(timestamp) OVER (ORDER BY posid asc)))::int as diff
    FROM vessel_position vp WHERE vp.mmsi = $1  AND vp.timestamp BETWEEN $2::timestamp AND $3::timestamp;
BEGIN
FOR row IN tbl
LOOP
    count := coalesce(row.diff,0) + count;
    IF count >= $4*60 OR count = 0 THEN
            posids:= posids || row.posid || ',';
            count:= 0;
     END IF;
END LOOP;
RETURN QUERY EXECUTE 'SELECT * from vessel_position where posid in (' || TRIM(TRAILING ',' FROM posids) || ')';
END
$func$ LANGUAGE plpgsql;

I can't help thinking getting all the posids as a string and then selecting them all again at the very end is slowing things down. Within the IF statement, I already have access to each row I want to keep, so could potentially store them in a temp table and then return temp table at the end of the loop.

Can this function be optimised - to improve performance in particular?


Solution

  • Query

    Your function has all kinds of expensive, unnecessary overhead. A single query should be many times faster, doing the same:

    CREATE OR REPLACE FUNCTION _get_vessel_track_with_interval
     (mmsi int, starttime timestamptz, endtime timestamptz, min_interval interval)
      RETURNS SETOF vessel_position AS
    $func$
    BEGIN
       SELECT (vp).*  -- parentheses required for decomposing row type
       FROM  (
          SELECT vp   -- whole row (!)
               , timestamp - lag(timestamp) OVER (ORDER BY posid) AS diff
          FROM   vessel_position vp
          WHERE  vp.mmsi = $1
          AND    vp.timestamp >= $2     -- typically you'd include the lower bound
          AND    vp.timestamp <  $3;    -- ... and exlude the upper
          ORDER  BY posid
          ) sub
       WHERE  diff >= $4;
    END
    $func$  LANGUAGE plpgsql STABLE;
    

    Could also just be an SQL function or the bare SELECT without any wrapper (Maybe as prepared statement? Example.)

    Note how starttime and endtime are passed as timestamp. (Makes no sense to pass as text and cast.) And the minimum interval min_interval is an actual interval. Pass any interval of your choosing.

    Index

    If the predicate on mmsi is in any way selective, the two indexes you currently have (PK ON (posid, mmsi) and idx on (timestamp)) are not very useful. If you reverse the column order of your PK to (mmsi, posid), it becomes far more useful for the query at hand. See:

    The optimal index for this would typically be on vessel_position(mmsi, timestamp). Related:

    Aside: Avoid keywords as identifiers. That's asking for trouble. Plus, a column timestamp that actually holds timestamptz is misleading.