Search code examples
sqlpostgresqlsql-update

Update with limit and offset


Simplified scenario:

I want to update one table A (10 million rows) with a value from another table B (20 mil rows). The two tables are linked with ids.

It took more than 7hrs to update the whole thing in one go. (I don't know exactly how long as I stopped the script in the middle.)
So my idea is to update table A in batches using OFFSET and LIMIT clauses. So far with no luck.

Wrapped up in a procedure, the code looks like below:

DECLARE 
    offset_number integer := 0;
    batch_size integer := 1000; 

BEGIN
    LOOP 
        UPDATE TableA temp1
        SET TableA_column_value_to_be_updated = (
            SELECT 
                tableB_column_value
            FROM 
                TableB temp2
            WHERE 
                temp2.id = temp1.id 
                AND some_other_conditions_in_TableB    
            ) 
        WHERE    
             some_other_conditions_in_Table
        OFFSET offset_number 
        LIMIT batch_size ; 
            
        COMMIT; 
    
        offset_number := offset_number + batch_size;
        EXIT WHEN NOT FOUND; 
    END LOOP;
END;

The engine reports an error with exception:

org.jkiss.dbeaver.model.sql.DBSQLException:
SQL Error [42601]: ERROR: syntax error at or near "OFFSET"

I have no idea what it is. Notably, it seems to work without OFFSET and LIMIT.

Any ideas why this would happen? Should I use other loop statement?


Solution

  • LIMIT and OFFSET are not in the syntax of an SQL UPDATE statement. You need SELECT for that.
    Also, OFFSET scales poorly to "paginate" a big table. Remember the upper bound from the last iteration instead.

    Something like this could work:

    CREATE OR REPLACE PROCEDURE upd_in_batches(_batch_size int = 1000)
      LANGUAGE plpgsql AS
    $proc$
    DECLARE
       _id_bound int = 0;  -- or whatever?
    BEGIN
       LOOP
          WITH sel AS (
             SELECT a.id  -- id = PK!
             FROM   tablea a
             WHERE  a.id > _id_bound
          -- AND    <some other conditions in Table A>
             ORDER  BY a.id
             LIMIT  _batch_size
             FOR    UPDATE
             )
          , upd AS (
             UPDATE tablea a
             SET    target_col = b.b_source_col
             FROM   sel s
             JOIN   tableb b USING (id)
             WHERE  a.id = s.id
             AND    a.target_col IS DISTINCT FROM b.b_source_col
             )
          SELECT max(id)  -- always returns a row
          FROM   sel
          INTO   _id_bound;
    
          IF _id_bound IS NULL THEN
              EXIT;  -- no more rows found; we're done, exit loop
          ELSE
             COMMIT;
          END IF;
       END LOOP;
    END
    $proc$
    

    Use a SELECT statement instead to apply your LIMIT. To avoid race conditions with concurrent writes, throw in a locking clause (FOR UPDATE). You may or may not need that.

    You might be able to iUPDATE directly and just increment lower & upper bound for the filter on id instead, which is cheaper. Depends on the details of your setup and requirements. Each has its caveats.

    See: