Tags: sql, postgresql, sql-update, greatest-n-per-group, postgresql-13

UPDATE table column with latest related id from the same table


I have this table in PostgreSQL 13:

CREATE TABLE candles (
    id serial primary key,
    day integer,
    minute integer,
    "open" integer,
    high integer,
    low integer,
    "close" integer,
    volume integer,
    id_d1 integer
);
CREATE INDEX candles_idx1 ON public.candles (day,minute);

I am trying to update the column id_d1, which should hold the id of the row from the previous day at the same minute:

UPDATE candles s2
SET id_d1 = (SELECT id FROM candles s
             WHERE s.id<s2.id
             AND s.day<s2.day
             AND s.minute=s2.minute
             ORDER BY s.id DESC
             LIMIT 1);

For a smaller amount of data it works well, but for 80k records it runs endlessly.

EXPLAIN output for the query:

Update on candles s2  (cost=0.00..744027.57 rows=80240 width=68)
   ->  Seq Scan on candles s2  (cost=0.00..744027.57 rows=80240 width=68)
         SubPlan 1
           ->  Limit  (cost=0.29..9.25 rows=1 width=4)
                 ->  Index Scan Backward using candles_pkey on candles s  (cost=0.29..2347.34 rows=262 width=4)
                       Index Cond: (id < s2.id)
                       Filter: ((day < s2.day) AND (minute = s2.minute))

I also tried this variant (no id in the WHERE clause):

EXPLAIN
UPDATE candles s2
SET id_d1 = (SELECT id FROM candles s
             WHERE s.day<s2.day
             AND s.minute=s2.minute
             ORDER BY s.id DESC
             LIMIT 1);

Result:

Update on candles s2  (cost=0.00..513040.75 rows=80240 width=68)
   ->  Seq Scan on candles s2  (cost=0.00..513040.75 rows=80240 width=68)
         SubPlan 1
           ->  Limit  (cost=0.29..6.37 rows=1 width=4)
                 ->  Index Scan Backward using candles_pkey on candles s  (cost=0.29..4784.85 rows=787 width=4)
                       Filter: ((day < s2.day) AND (minute = s2.minute))

How should I modify the query or the schema to make it run in reasonable time?


Solution

  • The key to better performance (especially for your original query) is an index with the column order inverted. While you're at it, make it UNIQUE:

    CREATE UNIQUE INDEX candles_idx1 ON public.candles (minute, day);
    

    Equality column first.

    If the index cannot be UNIQUE, you have to tell us more about possible duplicates and how you intend to break ties.

    If it can, consider using it as the PK to replace the id column completely. You may want an additional index on (day, minute) ...
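    With (minute, day) in place, the original correlated subquery can also be rewritten to match the index exactly. A sketch, ordering by day instead of id (this assumes at most one row per (minute, day), i.e. the UNIQUE index above holds):

    ```sql
    UPDATE candles s2
    SET id_d1 = (SELECT s.id
                 FROM   candles s
                 WHERE  s.minute = s2.minute   -- equality on the leading index column
                 AND    s.day    < s2.day      -- range condition on the second column
                 ORDER  BY s.day DESC          -- latest earlier day
                 LIMIT  1);
    ```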

  • While updating all rows, it should be (much) faster to join to a single subquery with the window function lag() in a FROM clause to compute all target values (instead of running a correlated subquery for every row):

    UPDATE candles c
    SET    id_d1 = c2.prev_id
    FROM  (
       SELECT id, lag(id) OVER (PARTITION BY minute ORDER BY day) AS prev_id
       FROM   candles
       ) c2
    WHERE  c.id = c2.id
    

    If some rows can already have a correct id_d1, add this line to avoid costly empty updates:

    AND    id_d1 IS DISTINCT FROM c2.prev_id
    
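    Combined, the full statement then reads:

    ```sql
    UPDATE candles c
    SET    id_d1 = c2.prev_id
    FROM  (
       SELECT id, lag(id) OVER (PARTITION BY minute ORDER BY day) AS prev_id
       FROM   candles
       ) c2
    WHERE  c.id = c2.id
    AND    c.id_d1 IS DISTINCT FROM c2.prev_id;  -- skip rows that are already correct
    ```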


    While updating all rows, the index will probably not even be used with the new query.

  • With the index in place, consider dropping id_d1 from the table completely. Storing functionally dependent values tends to be a bad idea. Computing the value on the fly with lag() should be cheap, and it is always up to date automatically. Otherwise you have to think about how to keep the column up to date - which may be tricky.
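    For illustration, computing the value on the fly could look like this (a sketch reusing the lag() expression from above in a plain SELECT, no stored column needed):

    ```sql
    SELECT id, day, minute,
           lag(id) OVER (PARTITION BY minute ORDER BY day) AS id_d1  -- id from previous day, same minute
    FROM   candles
    ORDER  BY day, minute;
    ```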