sql postgresql plpgsql window-functions gaps-and-islands

Optimising function which extracts records with a minimum gap in timestamps

I have a big table of timestamps in Postgres 9.4.5:

CREATE TABLE vessel_position (
  posid serial NOT NULL,
  mmsi integer NOT NULL,
  "timestamp" timestamp with time zone,
  the_geom geometry(PointZ,4326),
  CONSTRAINT "PK_posid_mmsi" PRIMARY KEY (posid, mmsi)
);

Additional index:

CREATE INDEX vessel_position_timestamp_idx ON vessel_position ("timestamp");

I want to extract every row where the timestamp is at least x minutes after the previous row. I've tried a few different SELECT statements using LAG() which all kind of worked, but didn't give me the exact result I require. The below functions gives me what I need, but I feel it could be quicker:

CREATE OR REPLACE FUNCTION _getVesslTrackWithInterval(mmsi integer, startTime character varying (25) ,endTime character varying (25), interval_min integer)
RETURNS SETOF vessel_position AS
$func$
DECLARE
    count integer DEFAULT 0;
    posids varchar DEFAULT '';
    tbl CURSOR FOR
    SELECT
      posID
      ,EXTRACT(EPOCH FROM (timestamp -  lag(timestamp) OVER (ORDER BY posid asc)))::int as diff
    FROM vessel_position vp WHERE vp.mmsi = $1  AND vp.timestamp BETWEEN $2::timestamp AND $3::timestamp;
BEGIN
FOR row IN tbl
LOOP
    count := coalesce(row.diff,0) + count;
    IF count >= $4*60 OR count = 0 THEN
            posids:= posids || row.posid || ',';
            count:= 0;
     END IF;
END LOOP;
RETURN QUERY EXECUTE 'SELECT * from vessel_position where posid in (' || TRIM(TRAILING ',' FROM posids) || ')';
END
$func$ LANGUAGE plpgsql;

I can't help thinking getting all the posids as a string and then selecting them all again at the very end is slowing things down. Within the IF statement, I already have access to each row I want to keep, so could potentially store them in a temp table and then return temp table at the end of the loop.

Can this function be optimised - to improve performance in particular?

Solution

Query

Your function has all kinds of expensive, unnecessary overhead. A single query should be many times faster, doing the same:

CREATE OR REPLACE FUNCTION _get_vessel_track_with_interval
 (mmsi int, starttime timestamptz, endtime timestamptz, min_interval interval)
  RETURNS SETOF vessel_position AS
$func$
BEGIN
   SELECT (vp).*  -- parentheses required for decomposing row type
   FROM  (
      SELECT vp   -- whole row (!)
           , timestamp - lag(timestamp) OVER (ORDER BY posid) AS diff
      FROM   vessel_position vp
      WHERE  vp.mmsi = $1
      AND    vp.timestamp >= $2     -- typically you'd include the lower bound
      AND    vp.timestamp <  $3;    -- ... and exlude the upper
      ORDER  BY posid
      ) sub
   WHERE  diff >= $4;
END
$func$  LANGUAGE plpgsql STABLE;

Could also just be an SQL function or the bare SELECT without any wrapper (Maybe as prepared statement? Example.)

Note how starttime and endtime are passed as timestamp. (Makes no sense to pass as text and cast.) And the minimum interval min_interval is an actual interval. Pass any interval of your choosing.

Index

If the predicate on mmsi is in any way selective, the two indexes you currently have (PK ON (posid, mmsi) and idx on (timestamp)) are not very useful. If you reverse the column order of your PK to (mmsi, posid), it becomes far more useful for the query at hand. See:

Is a composite index also good for queries on the first field?

The optimal index for this would typically be on vessel_position(mmsi, timestamp). Related:

Aside: Avoid keywords as identifiers. That's asking for trouble. Plus, a column timestamp that actually holds timestamptz is misleading.