I'm given a list of transaction records from a remote server, some of which already exist in our database and some of which are new. My task is to update the ones that already exist and insert the ones that don't. Assume the transactions have remote IDs that aren't dependent on my local database. The size of the list can be anywhere from 1 to ~500.
Database is postgresql.
My initial thought was something like this:
BEGIN
CREATE TEMP TABLE temp_transactions (LIKE transactions) ON COMMIT DROP;
INSERT INTO temp_transactions(...) VALUES (...);
WITH updated_transactions AS (...update statement...)
DELETE FROM temp_transactions USING updated_transactions
WHERE temp_transactions.external_id = updated_transactions.external_id;
INSERT INTO transactions SELECT ... FROM temp_transactions;
COMMIT;
In other words:
But then I began to wonder whether it might be more efficient to use a per-session temp table and not wrap all the operations in a single transaction. My database sessions are only ever going to be used by a single thread, so this should be possible:
CREATE TEMP TABLE temp_transactions IF NOT EXISTS (LIKE transactions);
INSERT INTO temp_transactions(...) VALUES (...);
WITH updated_transactions AS (...update statement...)
DELETE FROM temp_transactions USING updated_transactions
WHERE temp_transactions.external_id = updated_transactions.external_id;
INSERT INTO transactions SELECT ... FROM temp_transactions;
TRUNCATE temp_transactions;
My thinking:
This avoids having to create the temp table each time a new batch of records is received. Instead, if a batch has already been processed using this database session (which is likely) the table will already exist.
This saves rollback space since I'm not stringing together multiple operations within a single transaction. It isn't a requirement that the entire update/insert operation be atomic; the only reason I was using a transaction is so the temp table would be automatically dropped upon commit.
Is the latter method likely to be superior to the former? Does either method have any special "gotchas" I should be aware of?
What you're describing is commonly known as upsert. Even the official documentation mentions it, here: http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE
The biggest problem with upserts are concurrency problems, as described here: http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/ and here: http://johtopg.blogspot.com.br/2014/04/upsertisms-in-postgres.html
I think your approach is good, although I wouldn't use a temporary table at all, and put the VALUES part into the UPDATE part, to make the whole thing a single statement.
Like this:
CREATE TABLE test (id int, data int);
CREATE TABLE
WITH new_data (id, data) AS (
VALUES (1, 2), (2, 6), (3, 10)
),
updated AS (
UPDATE test t
SET data = v.data
FROM new_data v
WHERE v.id = t.id
RETURNING t.id
)
INSERT INTO test
SELECT *
FROM new_data v
WHERE NOT EXISTS (
SELECT 1
FROM updated u
WHERE u.id = v.id
);
INSERT 0 3
SELECT * FROM test;
id | data
----+------
1 | 2
2 | 6
3 | 10
(3 rows)
WITH new_data (id, data) AS (
VALUES (1, 20), (2, 60), (4, 111)
),
updated AS (
UPDATE test t
SET data = v.data
FROM new_data v
WHERE v.id = t.id
RETURNING t.id
)
INSERT INTO test
SELECT *
FROM new_data v
WHERE NOT EXISTS (
SELECT 1
FROM updated u
WHERE u.id = v.id
);
INSERT 0 1
SELECT * FROM test;
id | data
----+------
3 | 10
1 | 20
2 | 60
4 | 111
(4 rows)
PG 9.5+ will support concurrent upserts out of the box, with the INSERT ... ON CONFLICT DO NOTHING/UPDATE syntax.