Search code examples
sqlperformancepostgresqldatabase-designrails-postgresql

best temp table strategy for update/insert operation


I'm given a list of transaction records from a remote server, some of which already exist in our database and some of which are new. My task is to update the ones that already exist and insert the ones that don't. Assume the transactions have remote IDs that aren't dependent on my local database. The size of the list can be anywhere from 1 to ~500.

Database is postgresql.

My initial thought was something like this:

BEGIN
  CREATE TEMP TABLE temp_transactions (LIKE transactions) ON COMMIT DROP;
  INSERT INTO temp_transactions(...) VALUES (...);
  WITH updated_transactions AS (...update statement...) 
    DELETE FROM temp_transactions USING updated_transactions 
    WHERE temp_transactions.external_id = updated_transactions.external_id;
  INSERT INTO transactions SELECT ... FROM temp_transactions;
COMMIT;

In other words:

  1. Create a temp table that exists only for the life of the transaction.
  2. Dump all my records into the temp table.
  3. Do all the updates in a single statement that also deletes the updated records from the temp table.
  4. Insert anything remaining in the temp table in to the permanent table because it wasn't an update.

But then I began to wonder whether it might be more efficient to use a per-session temp table and not wrap all the operations in a single transaction. My database sessions are only ever going to be used by a single thread, so this should be possible:

CREATE TEMP TABLE temp_transactions IF NOT EXISTS (LIKE transactions);
INSERT INTO temp_transactions(...) VALUES (...);
WITH updated_transactions AS (...update statement...) 
  DELETE FROM temp_transactions USING updated_transactions 
  WHERE temp_transactions.external_id = updated_transactions.external_id;
INSERT INTO transactions SELECT ... FROM temp_transactions;
TRUNCATE temp_transactions;

My thinking:

  1. This avoids having to create the temp table each time a new batch of records is received. Instead, if a batch has already been processed using this database session (which is likely) the table will already exist.

  2. This saves rollback space since I'm not stringing together multiple operations within a single transaction. It isn't a requirement that the entire update/insert operation be atomic; the only reason I was using a transaction is so the temp table would be automatically dropped upon commit.

Is the latter method likely to be superior to the former? Does either method have any special "gotchas" I should be aware of?


Solution

  • What you're describing is commonly known as upsert. Even the official documentation mentions it, here: http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE

    The biggest problem with upserts are concurrency problems, as described here: http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/ and here: http://johtopg.blogspot.com.br/2014/04/upsertisms-in-postgres.html

    I think your approach is good, although I wouldn't use a temporary table at all, and put the VALUES part into the UPDATE part, to make the whole thing a single statement.

    Like this:

    CREATE TABLE test (id int, data int);
    CREATE TABLE
    
    WITH new_data (id, data) AS (
        VALUES (1, 2), (2, 6), (3, 10)
    ),
    updated AS (
        UPDATE test t
        SET data = v.data
        FROM new_data v
        WHERE v.id = t.id
        RETURNING t.id
    )
    INSERT INTO test
    SELECT *
    FROM new_data v
    WHERE NOT EXISTS (
        SELECT 1
        FROM updated u
        WHERE u.id = v.id
    );
    INSERT 0 3
    
    
    SELECT * FROM test;
     id | data 
    ----+------
      1 |    2
      2 |    6
      3 |   10
    (3 rows)
    
    WITH new_data (id, data) AS (
        VALUES (1, 20), (2, 60), (4, 111)
    ),
    updated AS (
        UPDATE test t
        SET data = v.data
        FROM new_data v
        WHERE v.id = t.id
        RETURNING t.id
    )
    INSERT INTO test
    SELECT *
    FROM new_data v
    WHERE NOT EXISTS (
        SELECT 1
        FROM updated u
        WHERE u.id = v.id
    );
    INSERT 0 1
    
    SELECT * FROM test;
     id | data 
    ----+------
      3 |   10
      1 |   20
      2 |   60
      4 |  111
    (4 rows)
    

    PG 9.5+ will support concurrent upserts out of the box, with the INSERT ... ON CONFLICT DO NOTHING/UPDATE syntax.