I have a big table of timestamps in Postgres 9.4.5:
CREATE TABLE vessel_position (
posid serial NOT NULL,
mmsi integer NOT NULL,
"timestamp" timestamp with time zone,
the_geom geometry(PointZ,4326),
CONSTRAINT "PK_posid_mmsi" PRIMARY KEY (posid, mmsi)
);
Additional index:
CREATE INDEX vessel_position_timestamp_idx ON vessel_position ("timestamp");
I want to extract every row where the timestamp is at least x minutes after the previous row. I've tried a few different SELECT
statements using LAG()
which all kind of worked, but didn't give me the exact result I require. The below functions gives me what I need, but I feel it could be quicker:
CREATE OR REPLACE FUNCTION _getVesslTrackWithInterval(mmsi integer, startTime character varying (25) ,endTime character varying (25), interval_min integer)
RETURNS SETOF vessel_position AS
$func$
DECLARE
count integer DEFAULT 0;
posids varchar DEFAULT '';
tbl CURSOR FOR
SELECT
posID
,EXTRACT(EPOCH FROM (timestamp - lag(timestamp) OVER (ORDER BY posid asc)))::int as diff
FROM vessel_position vp WHERE vp.mmsi = $1 AND vp.timestamp BETWEEN $2::timestamp AND $3::timestamp;
BEGIN
FOR row IN tbl
LOOP
count := coalesce(row.diff,0) + count;
IF count >= $4*60 OR count = 0 THEN
posids:= posids || row.posid || ',';
count:= 0;
END IF;
END LOOP;
RETURN QUERY EXECUTE 'SELECT * from vessel_position where posid in (' || TRIM(TRAILING ',' FROM posids) || ')';
END
$func$ LANGUAGE plpgsql;
I can't help thinking getting all the posids
as a string and then selecting them all again at the very end is slowing things down.
Within the IF
statement, I already have access to each row I want to keep, so could potentially store them in a temp table and then return temp table at the end of the loop.
Can this function be optimised - to improve performance in particular?
Your function has all kinds of expensive, unnecessary overhead. A single query should be many times faster, doing the same:
CREATE OR REPLACE FUNCTION _get_vessel_track_with_interval
(mmsi int, starttime timestamptz, endtime timestamptz, min_interval interval)
RETURNS SETOF vessel_position AS
$func$
BEGIN
SELECT (vp).* -- parentheses required for decomposing row type
FROM (
SELECT vp -- whole row (!)
, timestamp - lag(timestamp) OVER (ORDER BY posid) AS diff
FROM vessel_position vp
WHERE vp.mmsi = $1
AND vp.timestamp >= $2 -- typically you'd include the lower bound
AND vp.timestamp < $3; -- ... and exlude the upper
ORDER BY posid
) sub
WHERE diff >= $4;
END
$func$ LANGUAGE plpgsql STABLE;
Could also just be an SQL function or the bare SELECT
without any wrapper (Maybe as prepared statement? Example.)
Note how starttime
and endtime
are passed as timestamp
. (Makes no sense to pass as text
and cast.) And the minimum interval min_interval
is an actual interval
. Pass any interval of your choosing.
If the predicate on mmsi
is in any way selective, the two indexes you currently have (PK ON (posid, mmsi)
and idx on (timestamp)
) are not very useful. If you reverse the column order of your PK to (mmsi, posid)
, it becomes far more useful for the query at hand. See:
The optimal index for this would typically be on vessel_position(mmsi, timestamp)
. Related:
Aside: Avoid keywords as identifiers. That's asking for trouble. Plus, a column timestamp
that actually holds timestamptz
is misleading.