Search code examples
pythonpandasdataframecomparison

Python, Pandas: Compare dataframes and keep old, updated and new ones separately


Good evening,

let's say I have two dataframes:

Dataframe 1:

id    |    first_name    |    last_name    |    age    |    personnel_number
1     |    Jane          |    Doe          |    37     |    0045ac
2     |    John          |    Doe          |    35     |    0102ha
3     |    Sarah         |    Smith        |    28     |    1003px
17    |    Michael       |    Mueller      |    61     |    0800pw

Dataframe 2:

id    |    first_name    |    last_name    |    age    |    personnel_number
1     |    Jane          |    Doe          |    37     |    0045ac
2     |    John          |    Doe          |    35     |    0102ha
3     |    Sarah         |    Smith        |    41     |    1003px
4     |    Sam           |    Smith        |    24     |    0017ix

I know, that with the following code I'm getting a new dataframe, where the existing rows are getting updated and the new ones are added...

df_comp = df2.set_index('personnel_number').combine_first(df1.set_index('personnel_number')).reset_index()

...to achieve this:

Combinded Dataframe:

id    |    first_name    |    last_name    |    age    |    personnel_number
1     |    Jane          |    Doe          |    37     |    0045ac
2     |    John          |    Doe          |    35     |    0102ha
3     |    Sarah         |    Smith        |    41     |    1003px
17    |    Michael       |    Mueller      |    61     |    0800pw
4     |    Sam           |    Smith        |    24     |    0017ix

My question: Is there a way to achieve three instead of one combined dataframe with the following data:

  • Dataframe with the already existing data that hasn't changed
  • Dataframe with rows that got updated
  • Dataframe with the new rows

Notes

  • there is always a column with unique data ('personnel_number' in this example)

Thanks for your help and advice and have a nice weekend!


Solution

  • You can try outer merge with indicator and then some conditions followed by a groupby then store in a dictionary:

    out = df2.merge(df1,how='outer',indicator='group')
    
    c = out.groupby("personnel_number",sort=False).transform('nunique').gt(1).any(1)
    
    out['group'] = (np.select([out['group'].eq("both"),out['group'].ne("both") & c,
                               out['group'].isin(['both','left_only']) & ~c],
                              ['Already_exists','Updated','New']))
    
    d = dict(iter(out.groupby("group")))
    

    Output:

    enter image description here