Search code examples
pythonpandaspython-datetime

Keep rows with true overlaps time period in pandas using specific columns values


I am working with a data frame shown below. Within there are 2 columns that I would like to find out true time overlap rows based on values_1 column. The values_1 column has multiple item that repeats and from that I would only like to extract rows where the start date_time and end date_time (both datetime) overlap for that specific value within values_1 column.

Below is data frame I have:

values_1 uniqId DeptId Start DateTime End DateTime
VGM 1278 BKNG 2023-09-05 18:35:28 2023-09-05 20:05:28
VGM 1259 BKNG 2023-09-05 18:55:18 2023-09-05 19:25:38
VGM 2567 BKNG 2023-09-05 14:29:38 2023-09-05 17:35:28
WEH 9854 ASKG 2023-09-05 13:45:58 2023-09-05 15:00:00
WEH 9124 ASKG 2023-09-05 17:25:28 2023-09-05 19:43:13
WEH 9785 ASKG 2023-09-05 16:15:21 2023-09-05 18:24:02
POH 3479 ASKG 2023-09-05 15:35:29 2023-09-05 17:25:22
POH 3449 ASKG 2023-09-05 17:35:28 2023-09-05 18:35:19

And the output table I would like to achieve is:

values_1 uniqId DeptId Start DateTime End DateTime
VGM 1278 BKNG 2023-09-05 18:35:28 2023-09-05 20:05:28
VGM 1259 BKNG 2023-09-05 18:55:18 2023-09-05 19:25:38
WEH 9124 ASKG 2023-09-05 17:25:28 2023-09-05 19:43:13
WEH 9785 ASKG 2023-09-05 16:15:21 2023-09-05 18:24:02

So from above output table, POH is completely removed because no time overlaps. Along with VGM and WEH 2 rows are dropped because their times did not overlap with among values_1 items.

Can someone please help me with this?

Thanks in advance.

I looked at different overlapping time periods example with no luck that I can apply to my data frame.

Thanks!


Solution

  • You can sort the times, then use a groupby.shift to compare the successive rows for boolean indexing:

    # ensure datetime
    df[['Start DateTime', 'End DateTime']] = df[['Start DateTime', 'End DateTime']].apply(pd.to_datetime)
    
    # sort values
    tmp = df.sort_values(by=['values_1', 'Start DateTime', 'End DateTime'])
    
    # are consecutive intervals overlapping?
    m = tmp['Start DateTime'].lt(tmp.groupby('values_1')['End DateTime'].shift())
    
    # flag rows and previous ones
    out = df.loc[m|m.groupby(tmp['values_1']).shift(-1)]
    

    Output:

      values_1  uniqId DeptId      Start DateTime        End DateTime
    0      VGM    1278   BKNG 2023-09-05 18:35:28 2023-09-05 20:05:28
    1      VGM    1259   BKNG 2023-09-05 18:55:18 2023-09-05 19:25:38
    4      WEH    9124   ASKG 2023-09-05 17:25:28 2023-09-05 19:43:13
    5      WEH    9785   ASKG 2023-09-05 16:15:21 2023-09-05 18:24:02
    

    Intermediates:

      values_1  uniqId DeptId      Start DateTime        End DateTime           end_shift      m m_shift  m|m_shift
    6      POH    3479   ASKG 2023-09-05 15:35:29 2023-09-05 17:25:22                 NaT  False   False      False
    7      POH    3449   ASKG 2023-09-05 17:35:28 2023-09-05 18:35:19 2023-09-05 17:25:22  False     NaN      False
    2      VGM    2567   BKNG 2023-09-05 14:29:38 2023-09-05 17:35:28                 NaT  False   False      False
    0      VGM    1278   BKNG 2023-09-05 18:35:28 2023-09-05 20:05:28 2023-09-05 17:35:28  False    True       True
    1      VGM    1259   BKNG 2023-09-05 18:55:18 2023-09-05 19:25:38 2023-09-05 20:05:28   True     NaN       True
    3      WEH    9854   ASKG 2023-09-05 13:45:58 2023-09-05 15:00:00                 NaT  False   False      False
    5      WEH    9785   ASKG 2023-09-05 16:15:21 2023-09-05 18:24:02 2023-09-05 15:00:00  False    True       True
    4      WEH    9124   ASKG 2023-09-05 17:25:28 2023-09-05 19:43:13 2023-09-05 18:24:02   True     NaN       True