Search code examples
pythonpandasdataframenumpypython-datetime

Lengths of overlapping time ranges listed by rows


I am using pandas version 1.0.5

The example dataframe below lists time intervals, recorded over three days, and I seek where some time intervals overlap every day.

Interim- for illustration

For example, one of the overlapping time across all the three dates (yellow highlighted) is 1:16 - 2:13. The other (blue highlighted) would be 18:45 - 19:00

So my expected output would be like: [57,15] because

  • 57 - Minutes between 1:16 - 2:13.
  • 15 - Minutes between 18:45 - 19:00

Please use this generator of the input dataframe:

import pandas as pd
dat1 = [
    ['2023-12-27','2023-12-27 00:00:00','2023-12-27 02:14:00'],
    ['2023-12-27','2023-12-27 03:16:00','2023-12-27 04:19:00'],
    ['2023-12-27','2023-12-27 18:11:00','2023-12-27 20:13:00'],
    ['2023-12-28','2023-12-28 01:16:00','2023-12-28 02:14:00'],
    ['2023-12-28','2023-12-28 02:16:00','2023-12-28 02:28:00'],
    ['2023-12-28','2023-12-28 02:30:00','2023-12-28 02:56:00'],
    ['2023-12-28','2023-12-28 18:45:00','2023-12-28 19:00:00'],
    ['2023-12-29','2023-12-29 01:16:00','2023-12-29 02:13:00'],
    ['2023-12-29','2023-12-29 04:16:00','2023-12-29 05:09:00'],
    ['2023-12-29','2023-12-29 05:11:00','2023-12-29 05:14:00'],
    ['2023-12-29','2023-12-29 18:00:00','2023-12-29 19:00:00']
       ]
df = pd.DataFrame(dat1,columns = ['date','Start_tmp','End_tmp'])
df["Start_tmp"] = pd.to_datetime(df["Start_tmp"])
df["End_tmp"] = pd.to_datetime(df["End_tmp"])

Solution

  • This solution uses:

    • numpy, no uncommon Python modules, so using Python 1.0.5 you should, hopefully, be in the clear,
    • no nested loops to care for speed issues with growing dataset,

    Method:

    • Draw the landscape of overlaps
    • Then select the overlaps corresponding to the number of documented days,
    • Finally describe the overlaps in terms of their lengths

    Number of documented days: (as in Python: Convert timedelta to int in a dataframe)

    n = 1 + ( max(df['End_tmp']) - min(df['Start_tmp']) ).days
    n
    3
    

    Additive landscape:

    # initial flat whole-day landcape (height: 0)
    L = np.zeros(24*60, dtype='int')
    # add up ranges: (reused @sammywemmy's perfect formula for time of day in minutes)
    for start, end in zip(df['Start_tmp'].dt.hour.mul(60) + df['Start_tmp'].dt.minute,  # Start_tmp timestamps expressed in minutes
                          df['End_tmp'].dt.hour.mul(60)   + df['End_tmp'].dt.minute):   # End_tmp timestamps expressed in minutes
        L[start:end+1] += 1
    
    plt.plot(L)
    plt.hlines(y=[2,3],xmin=0,xmax=1400,colors=['green','red'], linestyles='dashed')
    plt.xlabel('time of day (minutes)')
    plt.ylabel('time range overlaps')
    

    additive landscape (Please excuse the typo: these are obviously minutes, not seconds)

    Keep only overlaps over all days: (red line, n=3)

    # Reduce heights <n to 0 because not overlaping every day
    L[L<n]=0
    # Simplify all greater values to 1 because only their presence matters
    L[L>0]=1
    # Now only overlaps are highlighted
    # (Actually this latest line is disposable, provided we filter all but the overlaps of rank n. Useful only if you were to include lower overlaps)
    

    Extract overlap ranges and their lengths

    # Highlight edges of overlaping intervals
    D = np.diff(L)
    # Describe overlaps as ranges
    R = list(zip([a[0]   for a in np.argwhere(D>0)],  # indices where overlaps *begin*, with scalar indices instead of arrays
                 [a[0]-1 for a in np.argwhere(D<0)])) # indices where overlaps *end*, with scalar indices instead of arrays
    R
    [(75, 132), (1124, 1139)]
    
    # Finally their lengths
    [b-a for a,b in R]
    

    Final output: [57, 15]