I have this table in PostgreSQL 13:
CREATE TABLE candles (
id serial primary key,
day integer,
minute integer,
"open" integer,
high integer,
low integer,
"close" integer,
volume integer,
id_d1 integer
);
CREATE INDEX candles_idx1 ON public.candles (day,minute);
I am trying to update the field id_d1, which should hold the id of the row for the previous day at the same minute:
UPDATE candles s2
SET id_d1 = (SELECT id FROM candles s
WHERE s.id<s2.id
AND s.day<s2.day
AND s.minute=s2.minute
ORDER BY s.id DESC
LIMIT 1);
For smaller amounts of data it works well, but for 80k records it runs endlessly.
EXPLAIN output for the query:
Update on candles s2 (cost=0.00..744027.57 rows=80240 width=68)
-> Seq Scan on candles s2 (cost=0.00..744027.57 rows=80240 width=68)
SubPlan 1
-> Limit (cost=0.29..9.25 rows=1 width=4)
-> Index Scan Backward using candles_pkey on candles s (cost=0.29..2347.34 rows=262 width=4)
Index Cond: (id < s2.id)
Filter: ((day < s2.day) AND (minute = s2.minute))
I also tried without the id comparison in the WHERE clause:
EXPLAIN
UPDATE candles s2
SET id_d1 = (SELECT id FROM candles s
WHERE s.day<s2.day
AND s.minute=s2.minute
ORDER BY s.id DESC
LIMIT 1);
Result:
Update on candles s2 (cost=0.00..513040.75 rows=80240 width=68)
-> Seq Scan on candles s2 (cost=0.00..513040.75 rows=80240 width=68)
SubPlan 1
-> Limit (cost=0.29..6.37 rows=1 width=4)
-> Index Scan Backward using candles_pkey on candles s (cost=0.29..4784.85 rows=787 width=4)
Filter: ((day < s2.day) AND (minute = s2.minute))
How should I modify the query or the schema to make it run in reasonable time?
The key to better performance (especially for your original query) is an index with the column order inverted. While being at it, make it UNIQUE:
CREATE UNIQUE INDEX candles_minute_day_idx ON public.candles (minute, day);
Equality column first. (Note: the new index needs a name different from your existing candles_idx1.)
If the index cannot be UNIQUE, you have to tell us more about possible duplicates and how you intend to break ties. If it can, consider using it as the PK to replace the id column (completely). You may want an additional index on (day, minute) ...
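To illustrate why the equality column goes first: with (minute, day), Postgres can descend to one minute's slice of the index, scan it backwards by day, and stop at the first hit. A sketch of the lookup shape the index serves (the literal values are made up for illustration):

```sql
-- Finds the previous day's row for one given minute.
-- The index on (minute, day) matches: equality on the leading
-- column, range + ORDER BY on the second column.
SELECT id
FROM   candles s
WHERE  s.minute = 570        -- equality on the leading index column
AND    s.day    < 1234       -- range on the second index column
ORDER  BY s.day DESC
LIMIT  1;
```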
While updating all rows, it should be (much) faster to join to a single subquery with the window function lag() in a FROM clause to compute all target values, instead of running a correlated subquery for every row:
UPDATE candles c
SET id_d1 = c2.prev_id
FROM (
SELECT id, lag(id) OVER (PARTITION BY minute ORDER BY day) AS prev_id
FROM candles
) c2
WHERE c.id = c2.id;
If some rows can already have a correct id_d1, add this line to avoid costly empty updates:
AND id_d1 IS DISTINCT FROM c2.prev_id
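Put together, the full statement with that guard clause looks like this (the same query as above, just with the extra predicate added):

```sql
UPDATE candles c
SET    id_d1 = c2.prev_id
FROM  (
   SELECT id, lag(id) OVER (PARTITION BY minute ORDER BY day) AS prev_id
   FROM   candles
   ) c2
WHERE  c.id = c2.id
AND    c.id_d1 IS DISTINCT FROM c2.prev_id;  -- skip rows already correct
```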
While updating all rows, the index will probably not even be used with the new query.
With the index in place, consider dropping id_d1 from the table completely: storing functionally dependent values tends to be a bad idea. Computing the value on the fly with lag() should be cheap, and it is then always up to date automatically. Otherwise you have to think about how to keep the column up to date, which may be tricky.
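If you drop the column, the previous day's id can be derived on demand, for instance in a view (a sketch; the view name is made up):

```sql
-- Derives id_d1 on the fly instead of storing it.
CREATE VIEW candles_with_prev AS
SELECT c.*,
       lag(id) OVER (PARTITION BY minute ORDER BY day) AS id_d1
FROM   candles c;
```

Queries against the view always see the current value, with no UPDATE needed after inserting new rows.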