I have an extract where I'm required to identify a certain type of surgery X
seen in the Surg Type
column.
I need to retain medical appointments seen as distinct rows over a window/period of time, 3 appointment's prior (-3, -2, -1) and 3 appointment's post (+1, +2, +3)
I must include this order as an additional column.
On top of this I need to exclude any appointments outside of the window and any other type of Surg Type
, in this example any surgery's denoted Z.
In this example 7/9 rows/record's I want retained and an additional column Prior Post
*** UPDATED EXAMPLE ***
Original Df
| Patient ID | Surg ID | Surg Type | Surg Date | Medical Appt Date |
|------------|---------|-----------|------------|-------------------|
| 1 | 1 | X | 2022-09-03 | 2022-01-01 |
| 1 | 1 | X | 2022-09-03 | 2022-03-04 |
| 1 | 1 | X | 2022-09-03 | 2022-05-04 |
| 1 | 1 | X | 2022-09-03 | 2022-06-04 |
| 1 | 1 | X | 2022-09-03 | 2022-11-04 |
| 1 | 1 | X | 2022-09-03 | 2022-11-29 |
| 1 | 2 | Z | 2022-12-01 | 2022-11-01 |
| 1 | 1 | X | 2022-09-03 | 2023-01-02 |
| 1 | 1 | X | 2022-09-03 | 2023-01-13 |
Desired Df
| Patient ID | Surg ID | Surg Type | Surg Date | Medical Appt Date | Inclusion |
|------------|---------|-----------|------------|-------------------|-----------|
| 1 | 1 | X | 2022-09-03 | 2022-01-01 | Exclude |
| 1 | 1 | X | 2022-09-03 | 2022-03-04 | -3 |
| 1 | 1 | X | 2022-09-03 | 2022-05-04 | -2 |
| 1 | 1 | X | 2022-09-03 | 2022-06-04 | -1 |
| 1 | 1 | X | 2022-09-03 | 2022-11-04 | 1 |
| 1 | 1 | X | 2022-09-03 | 2022-11-29 | 2 |
| 1 | 2 | Z | 2022-12-01 | 2022-11-01 | -1 |
| 1 | 1 | X | 2022-09-03 | 2023-01-02 | 3 |
| 1 | 1 | X | 2022-09-03 | 2023-01-13 | Exclude |
You can filter the surgeries X
, then compute the time difference between the appointment and the surgery and compute the rank
for negative and positive numbers per group with groupby.apply
. Finally, mask the ranks outside of threshold.
# number of medical appointments to keep before/after a surgery
N = 3
# columns to use a grouper
group_cols = ['Patient ID', 'Surg ID']
# ensure datetime
df[['Surg Date', 'Medical Appt Date']] = df[['Surg Date', 'Medical Appt Date']].apply(pd.to_datetime)
# custom function to compute the rank of the Appt Date
def flag_nearest(g, N):
diff = g['Medical Appt Date'].sub(g['Surg Date'])
m = diff>'0'
ranks = (diff.where(m).rank()
.fillna(-diff.mask(m).rank(ascending=False))
# optional, to have integers in the output
.astype(int).astype(object)
)
return ranks.where(ranks.abs()<=N, 'Excluded').to_frame()
# identify non-X rows
m = df['Surg Type'].eq('X')
# compute the rank and flag columns outside of threshold
df['Inclusion'] = (df[m].groupby(group_cols)
.apply(flag_nearest, N)
.droplevel(group_cols)
# optional, to fill the non-X
.reindex(df.index, fill_value='non-X')
)
Output:
Patient ID Surg ID Surg Type Surg Date Medical Appt Date Inclusion
0 1 1 X 2022-09-03 2022-01-01 Excluded
1 1 1 X 2022-09-03 2022-03-04 -3
2 1 1 X 2022-09-03 2022-05-04 -2
3 1 1 X 2022-09-03 2022-06-04 -1
4 1 1 X 2022-09-03 2022-11-04 1
5 1 1 X 2022-09-03 2022-11-29 2
6 1 2 Z 2022-12-01 2022-11-01 non-X
7 1 1 X 2022-09-03 2023-01-02 3
8 1 1 X 2022-09-03 2023-01-13 Excluded
Extended example:
Patient ID Surg ID Surg Type Surg Date Medical Appt Date Inclusion
0 1 1 X 2022-09-03 2022-01-01 Excluded
1 1 1 X 2022-09-03 2022-03-04 -3
2 1 1 X 2022-09-03 2022-05-04 -2
3 1 1 X 2022-09-03 2022-06-04 -1
4 1 1 X 2022-09-03 2022-11-04 1
5 1 1 X 2022-09-03 2022-11-29 2
6 1 2 Z 2022-12-01 2022-11-01 non-X
7 1 1 X 2022-09-03 2023-01-02 3
8 1 1 X 2022-09-03 2023-01-13 Excluded
9 2 1 X 2022-09-05 2022-01-01 Excluded
10 2 1 X 2022-09-05 2022-01-02 Excluded
11 2 1 X 2022-09-05 2022-03-04 -3
12 2 1 X 2022-09-05 2022-05-04 -2
13 2 1 X 2022-09-05 2022-06-04 -1
14 2 1 X 2022-09-05 2022-11-04 1
15 2 1 X 2022-09-05 2022-11-29 2
16 2 2 Z 2022-12-02 2022-11-01 non-X
17 2 1 X 2022-09-05 2023-01-02 3
18 2 1 X 2022-09-05 2023-01-13 Excluded