Let's say I have the following dataset
ID | Name | balance | Year
112 Johnstown $321 2020
321 Oregon $214 2020
121 Jackson $254 2020
and the following incoming dataset
112 Johnstown $321 2021
321 Oregon $216 2020
121 Jackson $254 2020
What I want to do is combine these two datasets while retaining a concept of seasonality. Basically, if a record has any field other than the year changed, update that record. However, if the record had the year changed, then make a new record.
So in our case, the result dataset would look like this
112 Johnstown $321 2021
112 Johnstown $321 2020
321 Oregon $216 2020
121 Jackson $254 2020
This is essentially an upsert operation, the way I'm thinking about it is as an upsert on a combination key of ID and season. Basically, if the ID and season are the same update the existing record, and if they're different add a new record. In other words
3.If a record is different and year/id are different make a new record
Is this possible with dataframes? If not, is there another structure I should look at for implementing this? Our datasets are just parquet files, so we're free to manipulate them however we like
Is this expected output? Based on your
Basically, if the ID and season are the same update the existing record, and if they're different add a new record.
We concatenate two dataframes, group by ID
and Year
and leave the last (thus, coming from df2
) element in each group.
>>> pd.concat([df1, df2]).groupby(["ID", "Year"], as_index=False).last()
ID Year Name balance
0 112 2020 Johnstown $321
1 112 2021 Johnstown $321
2 121 2020 Jackson $254
3 321 2020 Oregon $216