python, pandas, timedelta

Pandas: Calculate total timedelta of intermediate time entries


I have a dataframe that looks like this:

Date Name In/Out Time
2024-01-01 Homer IN 07:10
2024-01-01 Homer OUT 09:30
2024-01-01 Homer IN 10:00
2024-01-01 Homer OUT 16:00
2024-01-01 Marge In 07:15
2024-01-01 Marge Out 16:10
2024-01-01 Bart In 07:14
2024-01-01 Bart Out 10:00
2024-01-01 Bart In 10:15
2024-01-01 Bart Out 12:00
2024-01-01 Bart In 12:30
2024-01-01 Bart Out 17:00

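My end goal is to calculate the total timedelta (total_seconds()) of the intermediate OUT/IN gaps only, i.e. the time between each OUT and the following IN for the same person on the same day (for example, Homer's 09:30 to 10:00). Expected output below.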

Date Name TimeDelta
2024-01-01 Homer 1800
2024-01-01 Bart 2700
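
For reference, the arithmetic behind these numbers can be sketched as follows. This is only an illustration: gap_seconds is a hypothetical helper (not a pandas function), and the times are copied from the table above. Marge has a single IN/OUT pair and therefore no intermediate gap, which is why she is absent from the expected output.

```python
from datetime import datetime


def gap_seconds(out_time: str, in_time: str) -> float:
    """Seconds between an OUT entry and the following IN entry (HH:MM strings)."""
    fmt = "%H:%M"
    delta = datetime.strptime(in_time, fmt) - datetime.strptime(out_time, fmt)
    return delta.total_seconds()


# Homer: OUT 09:30, back IN 10:00
homer = gap_seconds("09:30", "10:00")
# Bart: OUT 10:00 / IN 10:15, then OUT 12:00 / IN 12:30
bart = gap_seconds("10:00", "10:15") + gap_seconds("12:00", "12:30")

print(homer, bart)
```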

I haven't been too sure where to start with this one and cannot find any examples anywhere else.

Example dataframe:

import pandas as pd

example_df = pd.DataFrame([
    ['2024-01-01', 'Homer', 'in', '07:30'],
    ['2024-01-01', 'Homer', 'out' ,'09:00'],
    ['2024-01-01', 'Homer', 'in' ,'09:30'],
    ['2024-01-01', 'Homer', 'out' ,'16:00'],
    ['2024-01-01', 'Marge', 'in' , '06:20'],
    ['2024-01-01', 'Marge', 'out' ,'16:00'],
    ['2024-01-01', 'Bart', 'in' ,'07:10'],
    ['2024-01-01', 'Bart', 'out' ,'08:00'],
    ['2024-01-01', 'Bart', 'in' ,'08:20'],
    ['2024-01-01', 'Bart', 'out' ,'17:00'],
    ['2024-01-01', 'Barney', 'in' ,'08:10'],
    ['2024-01-01', 'Lisa', 'in' ,'08:05'],
    ['2024-01-01', 'Lisa', 'out' ,'14:00'],
    ['2024-01-01', 'Lisa', 'in' ,'14:15'],
    ['2024-01-01', 'Lisa', 'out' ,'18:10'],
    ['2024-01-01', 'Millhouse', 'out' ,'19:10'],
    ['2024-02-01', 'Homer', 'in', '07:30'],
    ['2024-02-01', 'Homer', 'out' ,'09:00'],
    ['2024-02-01', 'Marge', 'in' , '06:30'],
    ['2024-02-01', 'Marge', 'out' ,'09:10'],
    ['2024-02-01', 'Marge', 'in' ,'10:10'],
    ['2024-02-01', 'Marge', 'out' ,'16:10'],
    ['2024-02-01', 'Bart', 'in' ,'07:10'],
    ['2024-02-01', 'Bart', 'out' ,'15:00'],
    ['2024-02-01', 'Barney', 'in' ,'08:10'],
    ['2024-02-01', 'Lisa', 'in' ,'08:05'],
    ['2024-02-01', 'Lisa', 'out' ,'16:00'],
    ['2024-02-01', 'Millhouse', 'in' ,'08:10'],
    ['2024-02-01', 'Millhouse', 'in' ,'08:10'],
    ['2024-02-01', 'Millhouse', 'in' ,'16:15']],
    columns=['Date', 'Name', 'In/Out', 'Time'])

Solution

  • This assumes that Time is sorted within each group, that the first In/Out entry is always In, and that In and Out entries always alternate.

    You could convert the times with pd.to_datetime, then use groupby.apply to compute the diff between consecutive rows, drop the first and last values (with iloc), and sum the timedeltas on the "IN" rows (each one is the time elapsed since the previous "OUT") before converting with total_seconds:

    # cleanup IN/OUT format
    df['In/Out'] = df['In/Out'].str.upper()
    
    out = (df
      .assign(dt=pd.to_datetime(df['Time'], format='%H:%M'))
      .groupby(['Date', 'Name'])
      .apply(lambda g: 
             g['dt'].diff().iloc[1:-1]
             [g['In/Out'].eq('IN')]
             .sum().total_seconds())
      .reset_index(name='TimeDelta')
      .query('TimeDelta>0') # optional: remove rows with a zero TimeDelta
    )
    

    Output:

             Date   Name  TimeDelta
    0  2024-01-01   Bart     2700.0
    1  2024-01-01  Homer     1800.0
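
To make the lambda easier to follow, here is a minimal step-by-step sketch of what it computes for a single group, using Homer's rows from the question's table. The variable names (g, diffs, inner, mask, gaps) are illustrative only, and the mask is aligned explicitly with .loc, which spells out the index alignment the one-liner relies on.

```python
import pandas as pd

# One (Date, Name) group, already sorted by time and alternating IN/OUT
g = pd.DataFrame({
    "In/Out": ["IN", "OUT", "IN", "OUT"],
    "Time": ["07:10", "09:30", "10:00", "16:00"],
})
g["dt"] = pd.to_datetime(g["Time"], format="%H:%M")

diffs = g["dt"].diff()      # time since the previous row; the first value is NaT
inner = diffs.iloc[1:-1]    # drop the first (NaT) and the last difference

# Keep only the differences that sit on an IN row,
# i.e. the time elapsed since the previous OUT
mask = g["In/Out"].eq("IN").loc[inner.index]
gaps = inner[mask]

total = gaps.sum().total_seconds()
print(total)
```

For a group with only one IN/OUT pair, inner is empty, so the sum is a zero timedelta and the group is later dropped by the TimeDelta>0 filter.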
    

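    NB. If any of the initial assumptions is incorrect, you need to pre-process the data first to sort it and remove invalid rows.

    Example, keeping a row only when the next entry for the same Date and Name has a different In/Out value (so only the last of several consecutive identical entries is kept):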

    # cleanup IN/OUT format
    df['In/Out'] = df['In/Out'].str.upper()
    
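    # True when the next entry in the group differs from the current one (or there is none)
    m1 = df.sort_values(by='Time').groupby(['Date', 'Name'])['In/Out'].shift(-1).ne(df['In/Out'])
    
    out = (df[m1]
      .sort_values(by='Time') # make sure diff sees each group's rows in time order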
      .assign(dt=pd.to_datetime(df.loc[m1, 'Time'], format='%H:%M'))
      .groupby(['Date', 'Name'])
      .apply(lambda g: 
             g['dt'].diff().iloc[1:-1]
             [g['In/Out'].eq('IN')]
             .sum().total_seconds())
      .reset_index(name='TimeDelta')
      .query('TimeDelta>0') # optional: remove rows with a zero TimeDelta
    )
    

    Output:

              Date   Name  TimeDelta
    1   2024-01-01   Bart     1200.0
    2   2024-01-01  Homer     1800.0
    3   2024-01-01   Lisa      900.0
    10  2024-02-01  Marge     3600.0
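
As a possible alternative (a sketch, not the approach shown above), the same gaps can be computed without groupby.apply by comparing each row with the previous row of its group via groupby.shift. The function name intermediate_gaps and the helper columns io and dt are hypothetical names chosen for illustration. Note one behavioural difference: a trailing IN with no closing OUT is counted as a gap here, whereas the iloc[1:-1] version ignores it.

```python
import pandas as pd


def intermediate_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Sum the seconds between each OUT and the next IN, per Date and Name."""
    d = df.assign(
        io=df["In/Out"].str.upper(),
        dt=pd.to_datetime(df["Time"], format="%H:%M"),
    ).sort_values(["Date", "Name", "dt"])

    grouped = d.groupby(["Date", "Name"])
    prev_io = grouped["io"].shift()   # previous entry type within the group
    prev_dt = grouped["dt"].shift()   # previous timestamp within the group

    # A gap is an IN row whose previous row in the same group is an OUT
    is_gap = d["io"].eq("IN") & prev_io.eq("OUT")
    seconds = (d["dt"] - prev_dt).where(is_gap).dt.total_seconds()

    out = (seconds.groupby([d["Date"], d["Name"]]).sum()
                  .reset_index(name="TimeDelta"))
    # Drop groups without any intermediate gap
    return out[out["TimeDelta"] > 0].reset_index(drop=True)


sample = pd.DataFrame(
    [["2024-01-01", "Homer", "in", "07:30"],
     ["2024-01-01", "Homer", "out", "09:00"],
     ["2024-01-01", "Homer", "in", "09:30"],
     ["2024-01-01", "Homer", "out", "16:00"],
     ["2024-01-01", "Marge", "in", "06:20"],
     ["2024-01-01", "Marge", "out", "16:00"]],
    columns=["Date", "Name", "In/Out", "Time"],
)
result = intermediate_gaps(sample)
print(result)
```

Because repeated IN or OUT entries never form an OUT-then-IN pair, this version needs no separate filtering step for them, but it still relies on the times within each group being meaningful when sorted.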