Tags: python, mysql, google-app-engine, sqlalchemy

Optimizing an Update statement with many records in SQLAlchemy


I am trying to update many records at a time using SQLAlchemy, but it is proving very slow. Is there a more efficient way to do this?

For reference, updating 40,000 records took about an hour.

Below is the code I am using. table_name refers to the table to load, column is the single column to update, and pairs is a list of (primary key, new value) tuples for that column.

from sqlalchemy import Table, bindparam

def update_records(table_name, column, pairs):
    table = Table(table_name, db.MetaData, autoload=True,
                  autoload_with=db.engine)
    conn = db.engine.connect()

    values = []

    for id, value in pairs:
        values.append({'row_id': id, 'match_value': str(value)})

    stmt = (table.update()
            .where(table.c.id == bindparam('row_id'))
            .values({column: bindparam('match_value')}))
    conn.execute(stmt, values)

Solution

  • Passing a list of parameter dictionaries to execute() essentially issues 40k individual UPDATE statements, each with its own parsing and round-trip overhead. The solution is to touch more rows per statement. For MySQL, this means inserting the new values into a temp table and then updating the target table with a join:

    # assuming temp table already created
    conn.execute(temp_table.insert().values(values))
    conn.execute(table.update().values({column: temp_table.c.match_value})
                               .where(table.c.id == temp_table.c.row_id))
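
    The staging table itself can be declared in SQLAlchemy as well. A minimal sketch, assuming the row_id/match_value column names from the original snippet (the table name update_staging is illustrative); prefixes=['TEMPORARY'] makes SQLAlchemy emit CREATE TEMPORARY TABLE:

    ```python
    # Sketch of the temp table the snippet above assumes already exists.
    from sqlalchemy import Column, Integer, MetaData, String, Table

    metadata = MetaData()
    temp_table = Table(
        'update_staging', metadata,            # illustrative name
        Column('row_id', Integer, primary_key=True),
        Column('match_value', String(255)),
        prefixes=['TEMPORARY'],                # CREATE TEMPORARY TABLE
    )

    # With a connection in hand (e.g. conn = db.engine.connect()):
    # temp_table.create(bind=conn)                # once per connection
    # conn.execute(temp_table.insert(), values)   # bulk-load the pairs
    ```

    A TEMPORARY table is scoped to the connection, so it is dropped automatically when the connection closes and does not collide with other sessions.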
    

    Or, alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table at all. Older SQLAlchemy releases do not support it natively, so on those you would need a custom compiled construct (e.g. this gist).
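
    Since SQLAlchemy 1.2, however, the MySQL dialect exposes this directly via on_duplicate_key_update(). A minimal sketch, with illustrative table and column names mirroring the row_id/match_value pairs above:

    ```python
    # Sketch assuming SQLAlchemy 1.2+; the MySQL dialect's insert()
    # supports INSERT ... ON DUPLICATE KEY UPDATE natively.
    from sqlalchemy import Column, Integer, MetaData, String, Table
    from sqlalchemy.dialects.mysql import insert

    metadata = MetaData()
    target = Table(
        'target_table', metadata,              # illustrative name
        Column('id', Integer, primary_key=True),
        Column('match_value', String(255)),
    )

    stmt = insert(target)
    # On a primary-key collision, overwrite match_value with the
    # value from the row being inserted.
    stmt = stmt.on_duplicate_key_update(match_value=stmt.inserted.match_value)
    # conn.execute(stmt, values)  # pass the list of dicts as one batch
    ```

    This keeps everything in one statement per batch, with no staging table to create or clean up.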