pandas Dataframe Concat/Upsert on a Combination Key

Let's say I have the following dataset

ID | Name | balance | Year
112 Johnstown $321 2020
321 Oregon $214 2020
121 Jackson $254 2020

and the following incoming dataset

112 Johnstown $321 2021
321 Oregon $216 2020
121 Jackson $254 2020

What I want to do is combine these two datasets while retaining a concept of seasonality. Basically, if a record has any field other than the year changed, update that record. However, if the record had the year changed, then make a new record.

So in our case, the result dataset would look like this

112 Johnstown $321 2021
112 Johnstown $321 2020
321 Oregon $216 2020
121 Jackson $254 2020

This is essentially an upsert operation, the way I'm thinking about it is as an upsert on a combination key of ID and season. Basically, if the ID and season are the same update the existing record, and if they're different add a new record. In other words

If a record is exactly the same do nothing
If a record is different but year/id are the same take the record from the new dataset

3.If a record is different and year/id are different make a new record

Is this possible with dataframes? If not, is there another structure I should look at for implementing this? Our datasets are just parquet files, so we're free to manipulate them however we like

Solution

Is this expected output? Based on your

Basically, if the ID and season are the same update the existing record, and if they're different add a new record.

We concatenate two dataframes, group by ID and Year and leave the last (thus, coming from df2) element in each group.

>>> pd.concat([df1, df2]).groupby(["ID", "Year"], as_index=False).last()
    ID  Year       Name balance
0  112  2020  Johnstown    $321
1  112  2021  Johnstown    $321
2  121  2020    Jackson    $254
3  321  2020     Oregon    $216