I have a table tab1 in timescale db which has 3 columns tag, time, value. time and tag make up the pk for the table: (time, tag).
There are more than 500 lakhs (50 000 000) of rows. I need to find the latest timestamp or the max(time) for each of the N tags.
There are few things that I tried, I'll share my experience with each:
SELECT "time", "tag", "value"
FROM tab1
WHERE ("tag","time") IN
(SELECT "tag", MAX("time") FROM tab1 WHERE "tag" IN(tag1,tag2) GROUP BY "tag" );
this is giving result but taking around 19 sec to execute which is beyond the acceptable limit
SELECT tag, last(time, time), last(value,time)
FROM tab1
WHERE "tag" IN (tag1,tag2) GROUP BY "tag" ;
this is giving output within 10 sec.
I need to find another feasible solution similar to 2nd one that is perhaps performing better. I tried few other things like LATERAL JOIN (3), WINDOW FUNCTIONS (ROW_NUMBER, PARTITION) (4) but the solution is not as expected.
SELECT table1."tag", table1."time",table1."value" from tab1 as table1
join lateral (
SELECT table2 ."tag",table2 ."time" from tab1 as table2
where table2."tag" = table1."tag"
order by table2."time" desc limit 1
) p on true
where table1."tag" in (tag1,tag2)
SELECT * from
( SELECT *, row_number() over (partition by tag order by time desc) as rownum
from tab1) a
where tag in (tag1,tag2)
Can anyone suggest what's wrong with the queries in 3,4 or suggest if there is another alternative.
There are a couple things that are going to help this and make this query easier and more performant. The first, and probably most important one is the index on the table / hypertable - it will need to be a multi-column index on tag, time desc - the order of the time doesn't matter all that much, but the order of columns in the index matters a lot. tag
has to be the first column here because we need to first search by tag, then get the latest time, if we have either separate indexes or if we're ordered by time first, this will be highly inefficient.
You can create this index with a call like this:
CREATE INDEX ON tab1 (tag, "time" DESC);
The next thing is the formulation of the query. The simplest way to get this for each tag is to write a DISTINCT ON
query. In Timescale we've optimized this sort of query. It's a bit of a weird way to formulate it so can be a bit hard to find though.
Basically you'd write it something like:
SELECT DISTINCT ON (tag) tag, "time" FROM tab1 ORDER BY tag, "time" DESC;
And that should give you what you want. It's a bit weird, but it will work!
I'm not going to go through the other approaches, but most of them will be improved dramatically with the index, but this is still likely going to be the most performant approach.
If you want to, please post a comment with how this worked and if it sped things up for you!