I am still pretty new to Python. I am trying to write Python code that records the extraction date on which a gap first occurred as the start_date. When the Condition becomes "Y", I would like to create an end_date and populate it with the extraction date on which the Condition became "Y". At the same time, I want to check against previous dates before assigning an end date, because a gap can reopen: for example, a gap opened in January and closed in February, but reopened in March and closed in April. Overall, I want a continuous check in case I add more data in the future.
Sample Data table:
ID | sub_id | Condition | extraction_date |
---|---|---|---|
100 | x | N | 2024-01-15 |
100 | x | Y | 2024-02-01 |
100 | y | Y | 2024-02-01 |
101 | z | N | 2024-02-01 |
101 | z | N | 2024-03-16 |
100 | x | N | 2024-03-16 |
100 | x | Y | 2024-04-10 |
101 | z | N | 2024-04-10 |
101 | z | Y | 2024-05-15 |
102 | w | N | 2024-05-15 |
102 | w | Y | 2024-06-15 |
Overall, this is what I am expecting:
An example of a gap reopening occurs for ID 100 with sub_id of x, while a continuous gap occurs for ID 101 with sub_id of z.
ID | sub_id | Condition | extraction_date | start_date | end_date |
---|---|---|---|---|---|
100 | x | N | 2024-01-15 | 2024-01-15 | N/A |
100 | x | Y | 2024-02-01 | 2024-01-15 | 2024-02-01 |
100 | y | Y | 2024-02-01 | 2024-02-01 | 2024-02-01 |
101 | z | N | 2024-02-01 | 2024-02-01 | N/A |
101 | z | N | 2024-03-16 | 2024-02-01 | N/A |
100 | x | N | 2024-03-16 | 2024-03-16 | N/A |
100 | x | Y | 2024-04-10 | 2024-03-16 | 2024-04-10 |
101 | z | N | 2024-04-10 | 2024-02-01 | N/A |
101 | z | Y | 2024-05-15 | 2024-02-01 | 2024-05-15 |
102 | w | N | 2024-05-15 | 2024-05-15 | N/A |
102 | w | Y | 2024-06-15 | 2024-05-15 | 2024-06-15 |
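For reference, here is one way to reconstruct the sample data as a DataFrame (the exact row order and datetime dtype are my assumptions from the table above):

```python
import pandas as pd

# Sample data from the table, in the same row order
df = pd.DataFrame({
    'ID': [100, 100, 100, 101, 101, 100, 100, 101, 101, 102, 102],
    'sub_id': ['x', 'x', 'y', 'z', 'z', 'x', 'x', 'z', 'z', 'w', 'w'],
    'Condition': ['N', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
    'extraction_date': pd.to_datetime([
        '2024-01-15', '2024-02-01', '2024-02-01', '2024-02-01', '2024-03-16',
        '2024-03-16', '2024-04-10', '2024-04-10', '2024-05-15', '2024-05-15',
        '2024-06-15']),
})
print(df)
```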
Sort the values by ID and sub_id and use the condition `Condition == 'Y'` to create the groups: a reversed cumulative sum of the 'Y' flags gives every run of rows that ends in a closing 'Y' its own group number. Then group by ID, sub_id, and that key, and use transform to get the min/max dates for the start_date/end_date columns. Finally, use `where` with the same mask (`Condition == 'Y'`) to blank out end_date on rows where the gap has not closed yet.
m = df.sort_values(by=['ID', 'sub_id'])['Condition'].eq('Y')  # closing rows
g = m[::-1].cumsum().sort_index()  # reversed cumsum: one group id per gap
grp = df.groupby(['ID', 'sub_id', g])['extraction_date']
df['start_date'] = grp.transform('min')
df['end_date'] = grp.transform('max').where(m)  # NaT until the gap closes
End result:
ID sub_id Condition extraction_date start_date end_date
100 x N 2024-01-15 2024-01-15 NaT
100 x Y 2024-02-01 2024-01-15 2024-02-01
100 y Y 2024-02-01 2024-02-01 2024-02-01
101 z N 2024-02-01 2024-02-01 NaT
101 z N 2024-03-16 2024-02-01 NaT
100 x N 2024-03-16 2024-03-16 NaT
100 x Y 2024-04-10 2024-03-16 2024-04-10
101 z N 2024-04-10 2024-02-01 NaT
101 z Y 2024-05-15 2024-02-01 2024-05-15
102 w N 2024-05-15 2024-05-15 NaT
102 w Y 2024-06-15 2024-05-15 2024-06-15
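Putting it together as a self-contained, runnable sketch (the DataFrame literal is my reconstruction of the question's sample table):

```python
import pandas as pd

# Reconstruction of the question's sample data
df = pd.DataFrame({
    'ID': [100, 100, 100, 101, 101, 100, 100, 101, 101, 102, 102],
    'sub_id': ['x', 'x', 'y', 'z', 'z', 'x', 'x', 'z', 'z', 'w', 'w'],
    'Condition': ['N', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
    'extraction_date': pd.to_datetime([
        '2024-01-15', '2024-02-01', '2024-02-01', '2024-02-01', '2024-03-16',
        '2024-03-16', '2024-04-10', '2024-04-10', '2024-05-15', '2024-05-15',
        '2024-06-15']),
})

# Mask of closing rows, evaluated in (ID, sub_id) order
m = df.sort_values(by=['ID', 'sub_id'])['Condition'].eq('Y')
# Reversed cumsum: each run of rows ending in a 'Y' gets its own group id
g = m[::-1].cumsum().sort_index()

grp = df.groupby(['ID', 'sub_id', g])['extraction_date']
df['start_date'] = grp.transform('min')
df['end_date'] = grp.transform('max').where(m)

print(df)
```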
Edit: a slightly modified version, in case the groups don't form correctly with the first solution:
m = df['Condition'].eq('Y')
g = (df.groupby(['ID', 'sub_id'])['Condition']
       .transform(lambda s: s[::-1].eq('Y').cumsum())  # reversed cumsum per group
       .to_numpy()  # or .reset_index(drop=True)
    )
grps = df.groupby(['ID', 'sub_id', g])['extraction_date']
df['start_date'] = grps.transform('min')
df['end_date'] = grps.transform('max').where(m)
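A quick sanity check (mine, not part of the answer) that this variant produces the expected start/end dates on the sample data; the DataFrame literal is again my reconstruction of the question's table:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'ID': [100, 100, 100, 101, 101, 100, 100, 101, 101, 102, 102],
    'sub_id': ['x', 'x', 'y', 'z', 'z', 'x', 'x', 'z', 'z', 'w', 'w'],
    'Condition': ['N', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
    'extraction_date': pd.to_datetime([
        '2024-01-15', '2024-02-01', '2024-02-01', '2024-02-01', '2024-03-16',
        '2024-03-16', '2024-04-10', '2024-04-10', '2024-05-15', '2024-05-15',
        '2024-06-15']),
})

m = df['Condition'].eq('Y')
# Per-(ID, sub_id) reversed cumsum; to_numpy() makes the key positional
g = (df.groupby(['ID', 'sub_id'])['Condition']
       .transform(lambda s: s[::-1].eq('Y').cumsum())
       .to_numpy())

grps = df.groupby(['ID', 'sub_id', g])['extraction_date']
df['start_date'] = grps.transform('min')
df['end_date'] = grps.transform('max').where(m)

print(df[['ID', 'sub_id', 'start_date', 'end_date']])
```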