How to "merge" two tables efficiently in SQL?

I have two tables (e.g. 'foo' and 'bar') with identical structure and a primary key (e.g. 'a'):

CREATE TABLE foo(a INT PRIMARY KEY, b VARCHAR(10));
INSERT INTO foo(a, b) VALUES (1, 'foo'), (2, 'foo');
CREATE TABLE bar(a INT PRIMARY KEY, b VARCHAR(10));
INSERT INTO bar(a, b) VALUES (2, 'bar'), (3, 'bar');

Now I want to update the rows of the table 'foo' with the values of the rows in the table 'bar' where the primary key matches and I want to insert the rows from the table 'bar' into the table 'foo' if the primary key doesn't already exist in table 'foo':

UPDATE foo SET b = bar.b FROM bar WHERE foo.a = bar.a;
INSERT INTO foo SELECT bar.* FROM bar LEFT JOIN foo USING (a) WHERE foo.a IS NULL;

This does what I want but I wonder if there is a more efficient way to do this?

Solution

If your tables are not accessed by others concurrently (yet), you could use a FULL [OUTER] JOIN query to create a new, merged table, with values from bar getting priority.

If you have concurrent access, but can afford to lock both tables, that works, too:

BEGIN;
LOCK foo, bar;                     -- if you need it

CREATE TABLE baz AS
SELECT a, COALESCE(b.b, a.b) AS b  -- bar gets priority
FROM   foo f
FULL   JOIN bar b USING (a)
ORDER  BY a;                       -- optional order by

-- you need table name foo?
DROP  TABLE foo, bar;
ALTER TABLE baz RENAME TO foo;
ALTER TABLE foo ADD CONSTRAINT foo_a_pkey PRIMARY KEY (a);
-- do more?

COMMIT;

If you have substantial overlap, it's more efficient to write a new, pristine (clustered) table without dead rows, than to update much of the old. If the overlap is not big, update / insert may be more efficient. If both tables are small, don't bother and go with the simple solution.

The new table does not have any of the indexes or constraint of the old ones obviously. Recreate what you need.

If you have a lot of depending objects (views, functions), you may want to keep your old table. Create a temp table instead, TRUNCATE foo and write back data into the same table. This also won't kill concurrent transactions waiting for the existing table.

BEGIN;
LOCK foo, bar;                     -- if you need it

SET temp_buffers = 500MB;          -- if you need it

CREATE TEMP TABLE tmp AS
SELECT a, COALESCE(b.b, a.b) AS b  -- bar gets priority
FROM   foo f
FULL   JOIN bar b USING (a);

-- for performance, you might want to drop indexes and constraints here ...
TRUNCATE foo;
INSERT INTO foo
SELECT * FROM tmp
ORDER BY a;                         -- optional

DROP  TABLE  bar;                   -- optional
-- ... and recreate previously dropped indexes and constraints here

COMMIT;

About temp_buffers:

How to update selected rows with values from a CSV file in Postgres?

What happens to concurrent writes?

Best way to populate a new column in a large table?