python, pandas, dataframe, numpy

Tracking gap changes over time (start and end date)


I am still pretty new to Python. I am trying to write code that pulls the extraction date on which a gap first occurred as the start_date. When the Condition becomes "Y", I would like to populate an end_date with the extraction date on which it became "Y". At the same time, I want to check against previous dates before assigning an end date: for example, a gap opened in January and closed in February, but it reopened in March and closed in April. Overall, I want this to be a continuous check, in case I add more data in the future.

Sample Data table:

 ID sub_id Condition extraction_date
100      x         N      2024-01-15
100      x         Y      2024-02-01
100      y         Y      2024-02-01
101      z         N      2024-02-01
101      z         N      2024-03-16
100      x         N      2024-03-16
100      x         Y      2024-04-10
101      z         N      2024-04-10
101      z         Y      2024-05-15
102      w         N      2024-05-15
102      w         Y      2024-06-15
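
For reference, the sample data above can be built with something like this (a sketch; the pd.to_datetime conversion is my assumption, so the dates behave as real datetimes rather than strings):

    import pandas as pd

    # Sample rows from the table above, in the same order
    df = pd.DataFrame({
        'ID': [100, 100, 100, 101, 101, 100, 100, 101, 101, 102, 102],
        'sub_id': ['x', 'x', 'y', 'z', 'z', 'x', 'x', 'z', 'z', 'w', 'w'],
        'Condition': ['N', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
        'extraction_date': pd.to_datetime([
            '2024-01-15', '2024-02-01', '2024-02-01', '2024-02-01',
            '2024-03-16', '2024-03-16', '2024-04-10', '2024-04-10',
            '2024-05-15', '2024-05-15', '2024-06-15',
        ]),
    })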

Overall, this is what I am expecting:

An example of a gap reopening occurs for ID 100 with sub_id of x, while a continuous gap occurs for ID 101 with sub_id of z.

 ID sub_id Condition extraction_date start_date   end_date
100      x         N      2024-01-15 2024-01-15        N/A
100      x         Y      2024-02-01 2024-01-15 2024-02-01
100      y         Y      2024-02-01 2024-02-01 2024-02-01
101      z         N      2024-02-01 2024-02-01        N/A
101      z         N      2024-03-16 2024-02-01        N/A
100      x         N      2024-03-16 2024-03-16        N/A
100      x         Y      2024-04-10 2024-03-16 2024-04-10
101      z         N      2024-04-10 2024-02-01        N/A
101      z         Y      2024-05-15 2024-02-01 2024-05-15
102      w         N      2024-05-15 2024-05-15        N/A
102      w         Y      2024-06-15 2024-05-15 2024-06-15

Solution

  • Sort the values by ID and sub_id and use the condition Condition == 'Y' to create the groups. After that, group by ID, sub_id and those group labels, and use transform to get the min/max dates for the start_date and end_date columns. Finally, use pandas where with the same condition (Condition == 'Y') to remove the end dates that are not needed.

    # Mask of rows where the gap closes ('Y'), from the frame sorted by ID/sub_id
    m = df.sort_values(by=['ID', 'sub_id'])['Condition'].eq('Y')
    # Reversed cumulative sum: every row up to and including its closing 'Y'
    # gets the same group label, so a reopened gap starts a new group
    g = m[::-1].cumsum().sort_index()
    grp = df.groupby(['ID', 'sub_id', g])['extraction_date']
    
    # Earliest date per group is the start; the latest is kept as the end
    # only on rows where the gap actually closed
    df['start_date'] = grp.transform('min')
    df['end_date'] = grp.transform('max').where(m)
    

    End result:

     ID sub_id Condition extraction_date start_date   end_date
    100      x         N      2024-01-15 2024-01-15        NaT
    100      x         Y      2024-02-01 2024-01-15 2024-02-01
    100      y         Y      2024-02-01 2024-02-01 2024-02-01
    101      z         N      2024-02-01 2024-02-01        NaT
    101      z         N      2024-03-16 2024-02-01        NaT
    100      x         N      2024-03-16 2024-03-16        NaT
    100      x         Y      2024-04-10 2024-03-16 2024-04-10
    101      z         N      2024-04-10 2024-02-01        NaT
    101      z         Y      2024-05-15 2024-02-01 2024-05-15
    102      w         N      2024-05-15 2024-05-15        NaT
    102      w         Y      2024-06-15 2024-05-15 2024-06-15
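
    To see why the reversed cumulative sum forms the groups, here is a small standalone sketch of mine (not part of the solution) for one ID/sub_id that opens, closes, reopens and closes again:

    import pandas as pd

    s = pd.Series(['N', 'Y', 'N', 'Y'])
    # Count 'Y' from the bottom up, then restore the original row order:
    # each row up to and including its closing 'Y' shares the same label
    print(s[::-1].eq('Y').cumsum().sort_index().tolist())  # [2, 2, 1, 1]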
    

    Edit: a slightly modified version, in case the groups don't form correctly with the first solution:

    # Mask of rows where the gap closes ('Y')
    m = df['Condition'].eq('Y')
    # Per-(ID, sub_id) reversed cumulative count of 'Y': rows belonging to the
    # same open/close segment share a label, even across interleaved IDs
    g = (df.groupby(['ID', 'sub_id'])['Condition']
         .transform(lambda s: s[::-1].eq('Y').cumsum())
         .to_numpy()  # or reset_index(drop=True)
        )
    
    grps = df.groupby(['ID', 'sub_id', g])['extraction_date']
    
    # Start is the earliest date per segment; the end is kept only where
    # the gap actually closed
    df['start_date'] = grps.transform('min')
    df['end_date'] = grps.transform('max').where(m)
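
    Since the goal is a continuous check as more data gets added, one option is to wrap the modified version above in a small helper and rerun it whenever the frame is refreshed. The name add_gap_dates is mine, and the sketch assumes rows within each ID/sub_id are already in extraction_date order, as in the sample:

    import pandas as pd

    def add_gap_dates(df):
        """Recompute start_date/end_date for every (ID, sub_id) gap segment."""
        out = df.copy()
        m = out['Condition'].eq('Y')
        # Same per-group reversed cumulative count as above
        g = (out.groupby(['ID', 'sub_id'])['Condition']
                .transform(lambda s: s[::-1].eq('Y').cumsum())
                .to_numpy())
        grps = out.groupby(['ID', 'sub_id', g])['extraction_date']
        out['start_date'] = grps.transform('min')
        out['end_date'] = grps.transform('max').where(m)
        return out

    # e.g. after appending newly extracted rows:
    # df = add_gap_dates(df)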