Search code examples
pythonpandasdataframeaws-data-wrangler

Pandas merge two DF with rows replacement


I faced with an issue to merge two DF into one and save all duplicate rows by id value from the second DF. Example:

df1 = pd.DataFrame({
    'id': ['id1', 'id2', 'id3', 'id4'],
    'com': [134.6, 223, 0, 123],
    'malicious': [False, False, True, False]
})

df2 = pd.DataFrame({
    'id': ['id7', 'id2', 'id5', 'id6'],
    'com': [134.6, 27.6, 0, 123],
    'malicious': [False, False, False, False]
})

df1
    id    com  malicious
0  id1  134.6      False
1  id2  223.0      False
2  id3    0.0       True
3  id4  123.0      False

df2
    id    com  malicious        date
0  id7  134.6      False  2021-01-01
1  id2   27.6      False  2021-01-01
2  id5    0.0      False  2021-01-01
3  id6  123.0      False  2021-01-01

I'm. expecting the output to be:

    id    com  malicious        date
1  id1  134.6      False  null
2  id3    0.0       True  null
3  id4  123.0      False  null
4  id7  134.6      False  2021-01-01
5  id2   27.6      False  2021-01-01
6  id5    0.0      False  2021-01-01
7  id6  123.0      False  2021-01-01

As you can see we added a new column and all rows of df1 a null there now and row with id2 is replaced with all values from df2(amount of updated columns can be different, so it's not about updating specific columns values, but about replacing the whole row by id) I dont care about indexes and sort

Looking for efficient solution as I have huge amount of files, which I should merge this way into main DF


Solution

  • If need unique id with remove values from df1 if exist also in df2 use:

    df = pd.concat([df1, df2]).drop_duplicates('id', keep='last')