Search code examples
pythonpandasgroup-bypython-itertools

How to find out overlap and combine them into one row?


Below dataset is my dataset and want expected output (Shown below) using overlaps function:

Input dataset:

values_1 Id DeptId Start_DateTime End_DateTime
PNM 4579 FHDU 2023-09-04 14:15:29 2023-09-04 18:25:22
PNM 1278 FHDU 2023-09-04 14:45:28 2023-09-04 18:35:19
POH 8579 ASKG 2023-09-04 15:35:29 2023-09-04 17:25:22
POH 3449 ASKG 2023-09-04 15:45:28 2023-09-04 18:35:19
WEH 9124 ASKG 2023-09-04 17:25:28 2023-09-04 19:43:13
WEH 4579 ASKG 2023-09-04 16:15:21 2023-09-04 18:24:02

Expected Output:

values_1 Id DeptId Start_DateTime End_DateTime
PNM: WEH 4579 FHDU: ASKG 2023-09-04 14:15:29: 2023-09-04 16:15:21 2023-09-04 18:25:22: 2023-09-04 18:24:0

For this code below, we know that it provides overlapping for 2 intervals (Start_DateTime and End_DateTime) but my question is: How do I incorporate "Id" column to be included so the output is showing same as Expected Output:.

Thank you for your time and appreciate your help in advance!


Solution

  • IIUC, you want to get the first and last item per group. You can use a custom function for groupby.aggregate:

    tmp = df.sort_values(['Start_DateTime', 'End_DateTime'])
    group = tmp['Start_DateTime'].gt(tmp['End_DateTime'].shift()).cumsum()
    
    def first_last(x):
        return f'{x.iloc[0]}:{x.iloc[-1]}'
        
    out = (df
      .groupby(group)
      .agg({'values_1': first_last,
            'Id': 'first',
            'DeptId': first_last,
            'Start_DateTime': first_last,
            'End_DateTime': first_last
           })
    )
    

    Output:

      values_1    Id     DeptId                           Start_DateTime                             End_DateTime
    0  PNM:WEH  4579  FHDU:ASKG  2023-09-04 14:15:29:2023-09-04 16:15:21  2023-09-04 18:25:22:2023-09-04 18:24:02