Search code examples
pythonpandaslambdaapplyshift

comparing rows data frame | shift and apply functions throwing exception


I am trying to derive a mean value for the average duration spent in a specific status by ID. For this I first sort my data frame by ID and date, and with the apply and shift function trying to deduct the date of row[i+1] - row[i] - given row[i+1] - row[i] are for the same ID.

I get the following exception: AttributeError: 'int' object has no attribute 'shift'

Below a code for simulation:

import datetime
from datetime import datetime
today =  datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['duration'] = frame_ordered.apply(lambda x: x['date'].shift(-1) - x['date'] if x['id'] == x['id'].shift(-1) else today - x['date'], axis=1)

Can anyone please advise how to solve the last line with the lambda function?


Solution

  • I was not able to get it done with lambda. You can try like this:

    import datetime
    today =  datetime.datetime.today() # you want it as real date, not string
    frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
    frame['date'] = pd.to_datetime(frame['date']) #convert date column to datetime
    frame_ordered = frame.sort_values(['id','date'], ascending=True)
    
    #add column with shifted date values
    frame_ordered['shifted'] = frame_ordered['date'].shift(-1)
    
    # mask where the next row has same id as current one
    mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
    print(mask)
    
    # subtract date and shifted date if mask is true, otherwise subtract date from today. ".dt.days" only displays the days, not necessary 
    frame_ordered['duration'] = np.where(mask, (frame_ordered['shifted']-frame_ordered['date']).dt.days, (today-frame_ordered['date']).dt.days)
    
    #delete shifted date column if you want
    frame_ordered = frame_ordered.drop('shifted', axis=1)
    
    print(frame_ordered)
    

    Output:

    #mask
    0    False
    4    False
    2    False
    3     True
    1    False
    Name: id, dtype: bool
    
    #frame_ordered
         id  status       date  duration
    0  1245       1 2022-07-01      25.0
    4  1248       6 2022-01-03     204.0
    2  2345       4 2022-04-20      97.0
    3  4556       5 2022-02-02      38.0
    1  4556       2 2022-03-12     136.0