Search code examples
pythonpandastimedeltacumsum

Correcting Pandas Cumulative Sum on a Timedelta Column


I'm currently have a line of code I'm using to try to create a column that is based on a Cumulative sum of timedelta data between dates. How ever its not correctly performing the Cumulative sum everywhere, and I was also given a warning that my line of python code wont work in the future.

The original dataset is below:

ID       CREATION_DATE             TIMEDIFF              EDITNUMB       
8211    11/26/2019 13:00                                    1        
8211    1/3/2020 9:11      37 days 20:11:09.000000000       1       
8211    2/3/2020 14:52     31 days 05:40:57.000000000       1       
8211    3/27/2020 15:00    53 days 00:07:49.000000000       1       
8211    4/29/2020 12:07    32 days 21:07:23.000000000       1

Here is my line of python code:

df['RECUR'] = df.groupby(['ID']).TIMEDIFF.apply(lambda x: x.shift().fillna(1).cumsum())

Which produces the new column 'RECUR' that is not summing cumulatively correctly from the data in the 'TIMEDIFF' column:

ID       CREATION_DATE             TIMEDIFF              EDITNUMB          RECUR
8211    11/26/2019 13:00                                    1       0 days 00:00:01.000000000
8211    1/3/2020 9:11      37 days 20:11:09.000000000       1       0 days 00:00:02.000000000
8211    2/3/2020 14:52     31 days 05:40:57.000000000       1       37 days 20:11:11.000000000
8211    3/27/2020 15:00    53 days 00:07:49.000000000       1       69 days 01:52:08.000000000
8211    4/29/2020 12:07    32 days 21:07:23.000000000       1       122 days 01:59:57.000000000

Which also produces this warning:

FutureWarning: Passing integers to fillna is deprecated, will raise a TypeError in a future version.  To retain the old behavior, pass pd.Timedelta(seconds=n) instead.

Any help on this will be greatly appreciated, the sum total should be 153 days starting from 11/26/19, and correctly displayed cumulatively in the 'RECUR' column.


Solution

  • You can fillna with a timedelta of 0 seconds and do the cumsum

    df['RECUR'] = df.groupby('ID').TIMEDIFF.apply(
        lambda x: x.fillna(pd.Timedelta(seconds=0)).cumsum())
    
    df['RECUR']
    # 0 0 days 00:00:00
    # 1 37 days 20:11:09
    # 2 69 days 01:52:06
    # 3 122 days 01:59:55
    # 4 154 days 23:07:18