Search code examples
pythonpandasdataframedata-analysismulti-index

How to subtract values within a hierarchical/multi-level index in Pandas within the same column


I am attempting to find the change in time within a specific session (my index)--my data frame looks like this:

                        time
sess_id     vis_id      

id1         vis_id1      t_0
            vis_id1      t_1
            vis_id1      t_2

id2         vis_id2      t_0
            vis_id2      t_1
            vis_id2      t_2

I want to create a column called delta_t (change in time) which subtracts the time stamps recursively--where the last time for every session contains filler character like a dash or something

                        time      delta_t
sess_id     vis_id      

id1         vis_id1      t_0     (t_1 - t_0) 
            vis_id1      t_1     (t_2 - t_1)
            vis_id1      t_2         - 

id2         vis_id2      t_3     (t_4 - t_3)
            vis_id2      t_4     (t_5 - t_4)
            vis_id2      t_5         -

    
    

Solution

  • We can groupby shift relative to level=0 or level="sess_id" to get the next row's value, and subtract from time:

    df['delta_t'] = df.groupby(level='sess_id')['time'].shift(-1) - df['time']
    

    Sample DataFrame and output:

                                   time         delta_t
    sess_id vis_id                                     
    id1     vis_id1 2021-01-11 00:00:00 6 days 04:27:31
            vis_id1 2021-01-17 04:27:31 4 days 03:45:26
            vis_id1 2021-01-21 08:12:57             NaT
    id2     vis_id2 2021-01-28 15:18:32 7 days 17:57:56
            vis_id2 2021-02-05 09:16:28 4 days 01:41:58
            vis_id2 2021-02-09 10:58:26             NaT
    

    We could groupby diff then groupby shift but this involves 2 groupbys:

    df['delta_t'] = (
        df.groupby(level='sess_id')['time'].diff()
            .groupby(level='sess_id').shift(-1)
    )
    

    If needing '-' over NaT np.where can be used to covert Timedelta to string and fillna with '-':

    # Calculate Delta
    df['delta_t'] = df.groupby(level='sess_id')['time'].shift(-1) - df['time']
    # Change dtype and add in '-'
    df['delta_t'] = np.where(df['delta_t'].notna(), df['time'].astype(str), '-')
    

    Alternatively can convert to str and replace "NaT" with "-":

    # Calculate Delta, convert to String, replace "NaT" with "-"
    df['delta_t'] = (
            df.groupby(level='sess_id')['time'].shift(-1) - df['time']
    ).astype(str).replace('NaT', '-')
    

    df:

                                   time          delta_t
    sess_id vis_id                                      
    id1     vis_id1 2021-01-11 00:00:00  6 days 04:27:31
            vis_id1 2021-01-17 04:27:31  4 days 03:45:26
            vis_id1 2021-01-21 08:12:57                -
    id2     vis_id2 2021-01-28 15:18:32  7 days 17:57:56
            vis_id2 2021-02-05 09:16:28  4 days 01:41:58
            vis_id2 2021-02-09 10:58:26                -
    

    DataFrame constructor and imports:

    import pandas as pd
    
    df = pd.DataFrame(
        {'time': pd.to_datetime(['2021-01-11 00:00:00', '2021-01-17 04:27:31',
                                 '2021-01-21 08:12:57', '2021-01-28 15:18:32',
                                 '2021-02-05 09:16:28', '2021-02-09 10:58:26'])},
        index=pd.MultiIndex.from_arrays((['id1'] * 3 + ['id2'] * 3,
                                         ['vis_id1'] * 3 + ['vis_id2'] * 3),
                                        names=['sess_id', 'vis_id'])
    )