Search code examples
pythonpostgresqlpsycopg2

Batch upsert multiple rows with psycopg2


I need to upsert (INSERT ... ON CONFLICT DO UPDATE) multiple rows at once into a postgreSQL database using psycopg2. Essentially, I have a list of tuples representing "rows", and I need to insert them into the database, or update the database if there is a conflict. I need (possibly) every column to be updated (if not inserted), along with every row.

I've tried two main approaches, using psycopg2's cursor.execute() function and execute_many() function. First, I did the following:

upsert_statement = 'INSERT INTO table (col1, col2, col3) VALUES %s ON CONFLICT (col1) DO UPDATE SET (col1, col2, col3) = ROW (excluded.*) WHERE table IS DISTINCT FROM excluded'

psycopg2.extras.execute_values(cursor, upsert_statement, values)

I create an SQL statement that inserts the values using execute_many() (where values passed to it is a list of tuples), and on a conflict the column values should be updated to excluded. However, I get the error SyntaxError: number of columns does not match number of values sometimes, even though I know for a fact that the number of columns and values are the same.

So, I tried using only execute():

upsert_statement = f'INSERT INTO table (col1, col2, col3) VALUES (value1, value2, value3), (value4, value5, value6)... ON CONFLICT (col1) DO UPDATE SET (col1, col2, col3) = (value1, value2, value3), (value4, value5, value6)...'

cursor.execute(upsert_statement)

Here, I do the batch upsert as part of the SQL, and so don't have to use execute_values(). However, I get a SyntaxError after the DO UPDATE SET, because I don't think it's valid to have (col1, col2, col3) = (value1, value2, value3), (value4, value5, value6)....

What am I doing wrong? How can I bulk upsert multiple rows using psycopg2?

(I should note that in reality, (col1, col2, col3) and (value1, value2, value3) are dynamic, and change frequently)


Solution

  • You need to use table EXCLUDED instead of value literals in your ON CONFLICT statement:

    The SET and WHERE clauses in ON CONFLICT DO UPDATE have access to the existing row using the table's name (or an alias), and to the row proposed for insertion using the special excluded table.

    You also don't need to re-set the conflicting values, only the rest.

    INSERT INTO table (col1, col2, col3) 
    VALUES 
        (value1, value2, value3), 
        (value4, value5, value6)
    ON CONFLICT (col1) DO UPDATE 
    SET (col2, col3) = (EXCLUDED.col2, EXCLUDED.col3);
    

    For readability, you can format your in-line SQLs if you triple-quote your f-strings. I'm not sure if and which IDEs can detect it's an in-line SQL in Python and switch syntax highlighting, but I find indentation helpful enough.

    upsert_statement = f"""
        INSERT INTO table (col1, col2, col3) 
        VALUES 
            ({value1}, {value2}, {value3}), 
            ({value4}, {value5}, {value6})
        ON CONFLICT (col1) DO UPDATE 
        SET (col2, col3) = (EXCLUDED.col2, EXCLUDED.col3)"""
    

    Here's a test at db<>fiddle:

    drop table if exists test_70066823 cascade;
    create table test_70066823 (
        id integer primary key, 
        text_column_1 text, 
        text_column_2 text);
    insert into test_70066823 values
      (1,'first','first')
     ,(2,'second','second') returning *;
    
    id text_column_1 text_column_2
    1 first first
    2 second second
    insert into test_70066823
    values  (1, 'third','first'),
            (3, 'fourth','third'),
            (4, 'fifth','fourth'),
            (2, 'sixth','second')
    on conflict (id) do update 
    set text_column_1=EXCLUDED.text_column_1,
        text_column_2=EXCLUDED.text_column_2
    returning *;
    
    id text_column_1 text_column_2
    1 third first
    3 fourth third
    4 fifth fourth
    2 sixth second

    You can refer to this for improved insert performance. Inserts with a simple string-based execute or execute_many are the top 2 slowest approaches mentioned there.