Search code examples
sqlpostgresqlpostgresql-9.3

How to "merge" two tables efficiently in SQL?


I have two tables (e.g. 'foo' and 'bar') with identical structure and a primary key (e.g. 'a'):

CREATE TABLE foo(a INT PRIMARY KEY, b VARCHAR(10));
INSERT INTO foo(a, b) VALUES (1, 'foo'), (2, 'foo');
CREATE TABLE bar(a INT PRIMARY KEY, b VARCHAR(10));
INSERT INTO bar(a, b) VALUES (2, 'bar'), (3, 'bar');

Now I want to update the rows of the table 'foo' with the values of the rows in the table 'bar' where the primary key matches and I want to insert the rows from the table 'bar' into the table 'foo' if the primary key doesn't already exist in table 'foo':

UPDATE foo SET b = bar.b FROM bar WHERE foo.a = bar.a;
INSERT INTO foo SELECT bar.* FROM bar LEFT JOIN foo USING (a) WHERE foo.a IS NULL;

This does what I want but I wonder if there is a more efficient way to do this?


Solution

  • If your tables are not accessed by others concurrently (yet), you could use a FULL [OUTER] JOIN query to create a new, merged table, with values from bar getting priority.

    If you have concurrent access, but can afford to lock both tables, that works, too:

    BEGIN;
    LOCK foo, bar;                     -- if you need it
    
    CREATE TABLE baz AS
    SELECT a, COALESCE(b.b, a.b) AS b  -- bar gets priority
    FROM   foo f
    FULL   JOIN bar b USING (a)
    ORDER  BY a;                       -- optional order by
    
    -- you need table name foo?
    DROP  TABLE foo, bar;
    ALTER TABLE baz RENAME TO foo;
    ALTER TABLE foo ADD CONSTRAINT foo_a_pkey PRIMARY KEY (a);
    -- do more?
    
    COMMIT;
    

    If you have substantial overlap, it's more efficient to write a new, pristine (clustered) table without dead rows, than to update much of the old. If the overlap is not big, update / insert may be more efficient. If both tables are small, don't bother and go with the simple solution.

    The new table does not have any of the indexes or constraint of the old ones obviously. Recreate what you need.

    If you have a lot of depending objects (views, functions), you may want to keep your old table. Create a temp table instead, TRUNCATE foo and write back data into the same table. This also won't kill concurrent transactions waiting for the existing table.

    BEGIN;
    LOCK foo, bar;                     -- if you need it
    
    SET temp_buffers = 500MB;          -- if you need it
    
    CREATE TEMP TABLE tmp AS
    SELECT a, COALESCE(b.b, a.b) AS b  -- bar gets priority
    FROM   foo f
    FULL   JOIN bar b USING (a);
    
    -- for performance, you might want to drop indexes and constraints here ...
    TRUNCATE foo;
    INSERT INTO foo
    SELECT * FROM tmp
    ORDER BY a;                         -- optional
    
    DROP  TABLE  bar;                   -- optional
    -- ... and recreate previously dropped indexes and constraints here
    
    COMMIT;
    

    About temp_buffers:

    What happens to concurrent writes?