PROBLEM
I have the following dataframe (I am working with confidential data, so I will provide a sample):
import pandas as pd
df = pd.read_csv('file.csv')
df
name weight height smoke_or_not
0 Ajitesh 84 183 no
1 Shail 79 186 yes
2 Seema 67 158 no
3 Nidhi 52 155 no
I am saving this dataframe as a sql table using the following code. It is not included here, but I have already made de connection to the SQL Server database.
df.to_sql('table name', engine, if_exists='replace', index=False)
This dataframe is coming from a file that is beging updated daily. This update may contain new rows or even modifications of the existing rows.
For example, here I have modifications in the first row and a new row:
name weight height smoke_or_not
0 John 84 183 yes
1 Shail 79 186 yes
2 Seema 67 158 no
3 Nidhi 52 155 no
4 Mary 70 162 no
QUESTION
How could I identify the changes in existing rows and the new rows to run INSERT and UPDATE queries on SQLAlchemy?
I have my df
coming from the file that is being updated daily.
I also have a df2
that is coming from the SQL database with a SELECT * from table query.
WHAT I HAVE TRIED
I have tried using:
changes = pd.merge(df, df2, how='outer', indicator=True)
right_only = changes[changes['_merge'] == 'right_only']
right_only
name weight height smoke_or_not _merge
4 John 84 183 yes right_only
5 Mary 70 162 no right_only
With this I can get a new dataframe with all the rows that were changed + the new row. However, there are no distinction about which rows were changed and which ones are the new ones. So I can't separate which rows I am going to UPDATE or INSERT. Is there any easier way to do that?
Insert the pandas data frame into a temporary table or staging table, and then upsert the data in TSQL using MERGE or UPDATE and INSERT.
One simply way to get the pandas dataframe into SQL where you can reference the data in a SQL query is to send it as JSON and parse it on the server. See eg: Trying to insert pandas dataframe to temporary table