Search code examples
pandasgroup-bysplit-apply-combine

Pandas calculate elapsed time


data = [    
            {'DateTime': pd.to_datetime('2023-09-26 09:36:47'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER01'},
            {'DateTime': pd.to_datetime('2023-09-26 11:01:55'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER01'},
            {'DateTime': pd.to_datetime('2023-09-27 10:01:20'), 'Action': 'OUT', 'LicenseCode': 'APP02', 'User': 'USER02'},
            {'DateTime': pd.to_datetime('2023-09-27 14:46:09'), 'Action': 'IN', 'LicenseCode': 'APP02', 'User': 'USER02'},
            {'DateTime': pd.to_datetime('2023-09-27 13:27:57'), 'Action': 'OUT', 'LicenseCode': 'APP02', 'User': 'USER01'},
            {'DateTime': pd.to_datetime('2023-09-27 18:03:09'), 'Action': 'IN', 'LicenseCode': 'APP02', 'User': 'USER01'}
]
df = pd.DataFrame(data)
df.groupby([pd.Grouper(key='DateTime', freq='D'),'User', 'LicenseCode'])

Hi guys, I'd like to create a new elapsed time column (in hours) for licenses from the sample data of license checkout. This is just a snippet of 500k+ lines.

Conditions for new column:

  • elapsed time will be grouped by day, user & license code as only elapsed time for each day will be tabulated for each unique user & licensecode

  • if df['Action'] == OUT, elapsed time = 0

  • if (df['Action'] == OUT) & (df['Action'].shift() == OUT), elapsed time = df[DateTime].diff()

  • if 'Action' == IN, elapsed time = calculate cumulative time diff of previous 'OUT' line (they work in pairs OUT then IN)

  • there are instances of OUT,OUT,IN,IN (order of 'Action' rows), most ideal will be OUT,IN,OUT,IN

I've hit a wall when I try to do a groupby agg with a custom function as I can't reference the columns by e.g. df['Action'].

Example of Desired Output :

DateTime Action LicenseCode User Elapsed_Time
2023-09-26 09:36:47 OUT APP01 USER01 0.00
2023-09-26 11:36:47 IN APP01 USER01 2.00
2023-09-27 09:30:00 OUT APP02 USER02 0.00
2023-09-27 14:30:00 IN APP02 USER02 5.00
2023-09-27 15:30:00 OUT APP02 USER02 5.00
2023-09-27 15:45:00 IN APP02 USER02 5.25
2023-09-27 16:10:00 OUT APP01 USER02 0.00
2023-09-27 16:40:00 IN APP01 USER02 0.50
2023-09-27 17:00:00 OUT APP01 USER02 0.50
2023-09-27 17:12:00 OUT APP01 USER02 0.70
2023-09-27 17:42:00 IN APP01 USER02 1.20
2023-09-27 17:52:00 IN APP01 USER02 1.37

Any thoughts will be greatly appreciated!


Solution

  • Thanks to Panda Kim, I've managed to retrieve the desired output. Here is the code modification from @Panda Kim's input.

    Code:

    data = [    
                {'DateTime': pd.to_datetime('2023-09-26 09:36:47'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER01'},
                {'DateTime': pd.to_datetime('2023-09-26 11:36:47'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER01'},
                {'DateTime': pd.to_datetime('2023-09-27 09:30:00'), 'Action': 'OUT', 'LicenseCode': 'APP02', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 14:30:00'), 'Action': 'IN', 'LicenseCode': 'APP02', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 15:30:00'), 'Action': 'OUT', 'LicenseCode': 'APP02', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 15:45:00'), 'Action': 'IN', 'LicenseCode': 'APP02', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 16:10:00'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 16:40:00'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 17:00:00'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 17:12:00'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 17:42:00'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER02'},
                {'DateTime': pd.to_datetime('2023-09-27 17:52:00'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER02'}
    ]
    
    df = pd.DataFrame(data)
    cond = (df['Action'].eq('OUT')) & (df['Action'].shift(+1).eq('IN'))
    grp = cond.groupby([df['LicenseCode'],df['User']]).cumsum()
    df['elapsed_time'] = df.groupby([pd.Grouper(key='DateTime', freq='D'), grp, 'User', 'LicenseCode'])['DateTime'].diff()
    df['elapsed_time'] = df['elapsed_time'].fillna(pd.Timedelta(seconds=0))
    df['elapsed_time'] = (df.groupby([pd.Grouper(key='DateTime', freq='D'),'User', 'LicenseCode']).elapsed_time.cumsum() \
        .dt.total_seconds()/3600)  \
        .round(2)
    
    print(df)
    

    Output: enter image description here