Search code examples
pythonsql-serverpandasdataframesqlalchemy

SQLAlchemy insert and update rows based on dataframe


PROBLEM

I have the following dataframe (I am working with confidential data, so I will provide a sample):

import pandas as pd
df = pd.read_csv('file.csv')
df

    name    weight  height  smoke_or_not
0   Ajitesh 84       183          no
1   Shail   79       186          yes
2   Seema   67       158          no
3   Nidhi   52       155          no

I am saving this dataframe as a sql table using the following code. It is not included here, but I have already made de connection to the SQL Server database.

df.to_sql('table name', engine, if_exists='replace', index=False)

This dataframe is coming from a file that is beging updated daily. This update may contain new rows or even modifications of the existing rows.

For example, here I have modifications in the first row and a new row:

    name    weight  height  smoke_or_not
0   John    84       183          yes
1   Shail   79       186          yes
2   Seema   67       158          no
3   Nidhi   52       155          no
4   Mary    70       162          no

QUESTION

How could I identify the changes in existing rows and the new rows to run INSERT and UPDATE queries on SQLAlchemy?

I have my df coming from the file that is being updated daily.

I also have a df2 that is coming from the SQL database with a SELECT * from table query.

WHAT I HAVE TRIED

I have tried using:

changes = pd.merge(df, df2, how='outer', indicator=True)
right_only = changes[changes['_merge'] == 'right_only']
right_only

    name    weight  height  smoke_or_not    _merge
4   John    84        183     yes           right_only
5   Mary    70        162     no            right_only

With this I can get a new dataframe with all the rows that were changed + the new row. However, there are no distinction about which rows were changed and which ones are the new ones. So I can't separate which rows I am going to UPDATE or INSERT. Is there any easier way to do that?


Solution

  • Insert the pandas data frame into a temporary table or staging table, and then upsert the data in TSQL using MERGE or UPDATE and INSERT.

    One simply way to get the pandas dataframe into SQL where you can reference the data in a SQL query is to send it as JSON and parse it on the server. See eg: Trying to insert pandas dataframe to temporary table