Search code examples
pythonpandasdataframeparquet

pandas Dataframe Concat/Upsert on a Combination Key


Let's say I have the following dataset

ID | Name | balance | Year
112 Johnstown $321 2020
321 Oregon $214 2020
121 Jackson $254 2020

and the following incoming dataset

112 Johnstown $321 2021
321 Oregon $216 2020
121 Jackson $254 2020

What I want to do is combine these two datasets while retaining a concept of seasonality. Basically, if a record has any field other than the year changed, update that record. However, if the record had the year changed, then make a new record.

So in our case, the result dataset would look like this

112 Johnstown $321 2021
112 Johnstown $321 2020
321 Oregon $216 2020
121 Jackson $254 2020

This is essentially an upsert operation, the way I'm thinking about it is as an upsert on a combination key of ID and season. Basically, if the ID and season are the same update the existing record, and if they're different add a new record. In other words

  1. If a record is exactly the same do nothing
  2. If a record is different but year/id are the same take the record from the new dataset

3.If a record is different and year/id are different make a new record

Is this possible with dataframes? If not, is there another structure I should look at for implementing this? Our datasets are just parquet files, so we're free to manipulate them however we like


Solution

  • Is this expected output? Based on your

    Basically, if the ID and season are the same update the existing record, and if they're different add a new record.

    We concatenate two dataframes, group by ID and Year and leave the last (thus, coming from df2) element in each group.

    >>> pd.concat([df1, df2]).groupby(["ID", "Year"], as_index=False).last()
        ID  Year       Name balance
    0  112  2020  Johnstown    $321
    1  112  2021  Johnstown    $321
    2  121  2020    Jackson    $254
    3  321  2020     Oregon    $216