Search code examples
pythonpandasdataframedata-manipulation

Pandas: Fixing end dates in a changelog


I have a dataframe representing all changes that have been made to a record over time. Among other things, this dataframe contains a record id (in this case not unique and not meant to be as it tracks multiple changes to the same record on a different table), startdate and enddate. Enddate is only included if it is know/preset, often it is not. I would like to map the enddate of each change record to the startdate of the next record in the dataframe with the same id.

>>> thing = pd.DataFrame([                               
...     {'id':1,'startdate':date(2021,1,1),'enddate':date(2022,1,1)},
...     {'id':1,'startdate':date(2021,3,24),'enddate':None},
...     {'id':1,'startdate':date(2021,5,26),'enddate':None},
...     {'id':2,'startdate':date(2021,2,2),'enddate':None},
...     {'id':2,'startdate':date(2021,11,26),'enddate':None}
... ])
>>> thing
   id   startdate     enddate
0   1  2021-01-01  2022-01-01
1   1  2021-03-24        None
2   1  2021-05-26        None
3   2  2021-02-02        None
4   2  2021-11-26        None

The dataframe is already sorted by the creation timestamp of the record and the id. I tried this:

thing['enddate'] = thing.groupby('id')['startdate'].apply(lambda x: x.shift())

However the above code only maps this to around 10,000 of my 120,000 rows, the majority of which would have an enddate if I were to do this comparison by hand. Can anyone think of a better way to perform this kind of manipulation? For reference, give the dataframe above I'd like to create this one:

    >>> thing
   id   startdate     enddate
0   1  2021-01-01  2021-03-24
1   1  2021-03-24  2021-05-26
2   1  2021-05-26        None
3   2  2021-02-02  2021-11-26
4   2  2021-11-26        None

The idea is that once this transformation is done, I'll have a timeframe between which the configurations stored in the other columns (not impportant for this) were in place


Solution

  • here is one way to do it

    use transform with the groupby to assign back the values to the rows comprising the group

    df['enddate']=df.groupby(['id'])['startdate'].transform(lambda x: x.shift(-1))
    df
    
        id  startdate   enddate
    0   1   2021-01-01  2021-03-24
    1   1   2021-03-24  2021-05-26
    2   1   2021-05-26  NaT
    3   2   2021-02-02  2021-11-26
    4   2   2021-11-26  NaT