Tags: pandas, postgresql, sql-update

Update Postgres Database with changes made to Pandas Dataframe


I have a Postgres database containing a table I want to make some changes to. I've selected 1,000 records from the table, which contains over 500,000 records, and processed them, adding/updating columns C and D.

db_table

    pk  | col c   | col d
    ----|---------|--------
    One | updated | updated
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(f'postgresql://{db_usr}:{db_pwd}@{db_host}:{db_port}/{db_name}')
df = pd.read_sql('SELECT * FROM db_table LIMIT 1000', engine)

Next I make some changes to columns C and D in the pandas dataframe, and now I want to write these columns back to the Postgres database (no changes are made to column B, and I don't want to write it back).
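To make the question concrete, here is a minimal stand-in for that step; the column names (`col_b`, `col_c`, `col_d`) and the values assigned are illustrative assumptions, not the real schema:

```python
import pandas as pd

# Hypothetical stand-in for the 1,000 rows read with read_sql;
# column names and values are assumptions for illustration only.
df = pd.DataFrame({'pk': [1, 2], 'col_b': ['x', 'y'],
                   'col_c': [None, None], 'col_d': [None, None]})

# Only columns C and D are changed; col_b is left untouched.
df['col_c'] = 'updated'
df['col_d'] = df['pk'].apply(lambda pk: f'score-{pk}')
```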

I now have the following code. It runs without error, but no changes are made to the database. The first time, it ran to the point of creating a temporary table, which I deleted manually; since then, no changes have been made to the database at all.

import psycopg2

conn = psycopg2.connect(f"dbname='{cfg.db_name}' user='{cfg.db_usr}' host='{cfg.db_host}' password='{cfg.db_pwd}'")
cur = conn.cursor()

rows = zip(df_properties.index, df_properties.score)

cur.execute("""CREATE TEMP TABLE listing_scores(pk INTEGER, score INTEGER) ON COMMIT DROP""")
cur.executemany("""INSERT INTO listing_scores (pk, score) VALUES (%s, %s)""", rows)
cur.execute(f"""
    UPDATE {cfg.db_rm_table}
    SET score = listing_scores.score
    FROM listing_scores
    WHERE listing_scores.pk = {cfg.db_rm_table}.pk;
    """)

print(cur.rowcount)
conn.commit()
cur.close()
conn.close()

According to cur.rowcount, all the updates are completed, but this is not showing up in the database. I'm able to create a new table via pgAdmin 4, but not in the terminal or in Python using the same credentials.


Solution

  • It may be as simple as committing after each INSERT or UPDATE statement (depending on the approach), but I've presented a longer response in case it is useful.

    I use pyodbc and SQL Server, but I think you were onto a solution when you mentioned a temporary table. If you create a dataframe from a SQL query, you can insert those rows into a temporary table in the database and then run your update statement, as presented in your pseudo code. This example is for SQL Server, so you may need to modify it for Postgres. So, use your cursor to set up the temp table:

    cur.execute(f"""
      USE [database_name];
      CREATE TABLE #ttable (
      column [datatype] (size) NULL ,
      ...
      lastcolumn [datatype] (size) NULL);
      """)
    

    Next, iterate your dataframe to insert rows into the temp table. This could take time, depending on the number of columns, data volume, database configuration, etc.

    First, you'll need to set up a string for your INSERT statement.

    # assume you have a list of column names
    colname = ['column',...'lastcolumn']
    
    # put these in a string for use later
    strcolname =  (", ".join(colname)) 
    

    Pyodbc allows you to insert values where there is a question mark ('?'). You need one for each column.

    # length of the column name list
    clen = len(colname)
    
    # build the question mark list
    ql = []
    for i in range(clen):
      ql.append('?')
    
    # now make it a string
    qstr = ", ".join(ql)
    

    With this in place, you can set up a string that will contain the INSERT statement but not the values - that comes later.

    # insert query for temp table
    intt = ('INSERT INTO #ttable (' + strcolname + ') values (' +
                qstr + ')')
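    As an aside, the placeholder list and the INSERT string above can also be assembled in one pass with join; this plain-Python sketch uses the same hypothetical names (`colname`, `#ttable`) as above:

    ```python
    colname = ['column', 'lastcolumn']  # hypothetical column names from above

    strcolname = ", ".join(colname)
    qstr = ", ".join("?" * len(colname))  # one '?' per column

    intt = f"INSERT INTO #ttable ({strcolname}) values ({qstr})"
    print(intt)  # INSERT INTO #ttable (column, lastcolumn) values (?, ?)
    ```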
    

    Assuming you've created your dataframe (df), now iterate the rows and execute the INSERT statement using the values from the dataframe row. Row-by-agonizing-row (rbar) can be slow and with pyodbc you need a commit() for each execution. That is a potential drawback of this approach.

    for index, row in df.iterrows():
      cur.execute(intt, tuple(row))
      cur.commit()
    

    Now, you have a temporary table and you can use it to execute an UPDATE statement, along the lines provided in your pseudo code. Note: you only have to run this once, since all the rows you want are in the temporary table.

    cur.execute(f"""
        UPDATE {cfg.db_rm_table}
        SET score = #ttable.score
        FROM #ttable
        WHERE #ttable.pk = {cfg.db_rm_table}.pk;
        """)
    

    You seem to be editing as quickly as I am writing, so there may be some overlap, but the gist is: iterate your dataframe rows to insert into the temporary table, then run a single UPDATE statement using the values from the temporary table.

    With pyodbc, as long as everything is run on the same connection, it works smoothly. I am not sure about the toolset you are using, but it looks similar in construct.
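    For completeness, here is a sketch of the same flow adapted back to psycopg2/Postgres (untested against a live database; the table and column names are the hypothetical ones from the question). The key point is that with `ON COMMIT DROP` the temp table only survives until the commit, so the inserts and the UPDATE must all run on the same connection before a single `commit()`:

    ```python
    def build_statements(table):
        """Build the three SQL statements used by the temp-table approach."""
        create = "CREATE TEMP TABLE listing_scores(pk INTEGER, score INTEGER) ON COMMIT DROP"
        insert = "INSERT INTO listing_scores (pk, score) VALUES (%s, %s)"
        update = (
            f"UPDATE {table} "
            "SET score = listing_scores.score "
            "FROM listing_scores "
            f"WHERE listing_scores.pk = {table}.pk"
        )
        return create, insert, update

    def update_scores(conn, df, table):
        """Run all three statements in ONE transaction, then commit once."""
        create, insert, update = build_statements(table)
        with conn.cursor() as cur:
            cur.execute(create)
            cur.executemany(insert, zip(df.index, df.score))
            cur.execute(update)
        conn.commit()  # the temp table is dropped here, after the UPDATE has run
    ```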

    I hope that this is somehow useful.