python pandas group-by pandas-groupby crosstab

Finding a timedelta in pandas dataframe based upon specific values in one column


I have a dataframe in which I wish to compute the number of days (a timedelta) that each unique asset remained installed. The sample input dataframe is as follows:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2007-11-01', '2012-03-02', '2012-03-02', '2013-04-01', '2013-04-01', '2017-11-15', '2017-11-15'],
    'action': ['installed', 'installed', 'removed', 'installed', 'removed', 'installed', 'removed'],
    'asset_alphnum': ['A-3724', 'A-3534', 'A-3724', 'A2732', 'A-3534', 'A-2007', 'A2732'],
})


I have tried pd.crosstab:

pd.crosstab(df.asset_alphnum, [df.Date, df.action])


However, I cannot work out how to take it from here. I somehow need to collapse the hierarchical index and subtract one date from the next.

Any guidance much appreciated.


Solution

  • Assuming each asset only ever has one install date and one removal date, you can do it with a pivot table, using max (or min, or any other aggregating function that returns a single date unchanged) on the date values.

    df["Date"] = pd.to_datetime(df["Date"])  # Converting strings to datetimes
    df = df.pivot_table(index="asset_alphnum", columns="action", values="Date", aggfunc="max")  # "max" is just an example; most aggfuncs will work if there's only ever one date
    df["time_installed"] = df["removed"] - df["installed"]  # timedelta column
    

    Which outputs this:

    asset_alphnum  installed            removed              time_installed
    A-2007         2017-11-15 00:00:00  NaT                  NaT
    A-3534         2012-03-02 00:00:00  2013-04-01 00:00:00  395 days 00:00:00
    A-3724         2007-11-01 00:00:00  2012-03-02 00:00:00  1583 days 00:00:00
    A2732          2013-04-01 00:00:00  2017-11-15 00:00:00  1689 days 00:00:00
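
The question asks for a number of days, and the pivot above relies on each asset being installed and removed at most once. The following is a minimal sketch, not part of the original answer, that relaxes that assumption and converts the result to days. The `cycle` and `days_installed` column names and the `spans` variable are hypothetical, introduced only for illustration. It assumes that an asset's events alternate (install, remove, install, ...), so that the n-th install pairs with the n-th removal.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2007-11-01", "2012-03-02", "2012-03-02", "2013-04-01",
             "2013-04-01", "2017-11-15", "2017-11-15"],
    "action": ["installed", "installed", "removed", "installed",
               "removed", "installed", "removed"],
    "asset_alphnum": ["A-3724", "A-3534", "A-3724", "A2732",
                      "A-3534", "A-2007", "A2732"],
})
df["Date"] = pd.to_datetime(df["Date"])

# Put events in chronological order; a stable sort keeps same-day rows in input order
df = df.sort_values("Date", kind="stable")

# Hypothetical helper column: 0 for an asset's first install (or removal), 1 for its second, ...
df["cycle"] = df.groupby(["asset_alphnum", "action"]).cumcount()

# One row per (asset, cycle), one column per action; missing events become NaT
spans = df.set_index(["asset_alphnum", "cycle", "action"])["Date"].unstack("action")

# Timedelta converted to a number of days; NaN where the asset has not been removed yet
spans["days_installed"] = (spans["removed"] - spans["installed"]).dt.days

print(spans)
```

With the sample data every asset has a single cycle, so the day counts match the `time_installed` column above. An asset that is reinstalled later would simply get an extra row with `cycle` equal to 1.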