Below dataset is my dataset and want expected output (Shown below) using overlaps function:
Input dataset:
values_1 | Id | DeptId | Start_DateTime | End_DateTime |
---|---|---|---|---|
PNM | 4579 | FHDU | 2023-09-04 14:15:29 | 2023-09-04 18:25:22 |
PNM | 1278 | FHDU | 2023-09-04 14:45:28 | 2023-09-04 18:35:19 |
POH | 8579 | ASKG | 2023-09-04 15:35:29 | 2023-09-04 17:25:22 |
POH | 3449 | ASKG | 2023-09-04 15:45:28 | 2023-09-04 18:35:19 |
WEH | 9124 | ASKG | 2023-09-04 17:25:28 | 2023-09-04 19:43:13 |
WEH | 4579 | ASKG | 2023-09-04 16:15:21 | 2023-09-04 18:24:02 |
Expected Output:
values_1 | Id | DeptId | Start_DateTime | End_DateTime |
---|---|---|---|---|
PNM: WEH | 4579 | FHDU: ASKG | 2023-09-04 14:15:29: 2023-09-04 16:15:21 | 2023-09-04 18:25:22: 2023-09-04 18:24:0 |
For this code below, we know that it provides overlapping for 2 intervals (Start_DateTime and End_DateTime) but my question is: How do I incorporate "Id" column to be included so the output is showing same as Expected Output:.
Thank you for your time and appreciate your help in advance!
IIUC, you want to get the first and last item per group. You can use a custom function for groupby.aggregate
:
tmp = df.sort_values(['Start_DateTime', 'End_DateTime'])
group = tmp['Start_DateTime'].gt(tmp['End_DateTime'].shift()).cumsum()
def first_last(x):
return f'{x.iloc[0]}:{x.iloc[-1]}'
out = (df
.groupby(group)
.agg({'values_1': first_last,
'Id': 'first',
'DeptId': first_last,
'Start_DateTime': first_last,
'End_DateTime': first_last
})
)
Output:
values_1 Id DeptId Start_DateTime End_DateTime
0 PNM:WEH 4579 FHDU:ASKG 2023-09-04 14:15:29:2023-09-04 16:15:21 2023-09-04 18:25:22:2023-09-04 18:24:02