python, pandas, dataframe, numpy

Tracking gap changes over time (start and end date)


I am still pretty new to Python. I am trying to write code that pulls the extraction date on which a gap first occurred as the start_date. When the Condition becomes "Y", I would like to populate an end_date with the extraction date on which it became "Y". At the same time, I want to check against previous dates before assigning an end date: for example, a gap opened in January and closed in February, but it reopened in March and closed in April. Overall, I want this to be a continuous check, in case I add more data in the future.

Sample Data table:

 ID sub_id Condition extraction_date
100      x         N      2024-01-15
100      x         Y      2024-02-01
100      y         Y      2024-02-01
101      z         N      2024-02-01
101      z         N      2024-03-16
100      x         N      2024-03-16
100      x         Y      2024-04-10
101      z         N      2024-04-10
101      z         Y      2024-05-15
102      w         N      2024-05-15
102      w         Y      2024-06-15
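
For reference, the sample data above can be built with something like this (a sketch; the pd.to_datetime conversion is my assumption, so the dates behave as real datetimes rather than strings):

    import pandas as pd

    # Sample rows from the table above, in the same order
    df = pd.DataFrame({
        'ID': [100, 100, 100, 101, 101, 100, 100, 101, 101, 102, 102],
        'sub_id': ['x', 'x', 'y', 'z', 'z', 'x', 'x', 'z', 'z', 'w', 'w'],
        'Condition': ['N', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y'],
        'extraction_date': pd.to_datetime([
            '2024-01-15', '2024-02-01', '2024-02-01', '2024-02-01',
            '2024-03-16', '2024-03-16', '2024-04-10', '2024-04-10',
            '2024-05-15', '2024-05-15', '2024-06-15',
        ]),
    })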

Overall, this is what I am expecting:

An example of a gap reopening occurs for ID 100 with sub_id of x, while a continuous gap occurs for ID 101 with sub_id of z.

 ID sub_id Condition extraction_date start_date   end_date
100      x         N      2024-01-15 2024-01-15        N/A
100      x         Y      2024-02-01 2024-01-15 2024-02-01
100      y         Y      2024-02-01 2024-02-01 2024-02-01
101      z         N      2024-02-01 2024-02-01        N/A
101      z         N      2024-03-16 2024-02-01        N/A
100      x         N      2024-03-16 2024-03-16        N/A
100      x         Y      2024-04-10 2024-03-16 2024-04-10
101      z         N      2024-04-10 2024-02-01        N/A
101      z         Y      2024-05-15 2024-02-01 2024-05-15
102      w         N      2024-05-15 2024-05-15        N/A
102      w         Y      2024-06-15 2024-05-15 2024-06-15

Solution

  • Sort the values by ID and sub_id and use the condition Condition == 'Y' to create the groups. After that, group by ID, sub_id and those group labels, and use transform to get the min/max dates for the start_date and end_date columns. Finally, use pandas where with the same condition (Condition == 'Y') to remove the end dates that are not needed.

    # Mask of rows where the gap closes ('Y'), from the frame sorted by ID/sub_id
    m = df.sort_values(by=['ID', 'sub_id'])['Condition'].eq('Y')
    # Reversed cumulative sum: every row up to and including its closing 'Y'
    # gets the same group label, so a reopened gap starts a new group
    g = m[::-1].cumsum().sort_index()
    grp = df.groupby(['ID', 'sub_id', g])['extraction_date']
    
    # Earliest date per group is the start; the latest is kept as the end
    # only on rows where the gap actually closed
    df['start_date'] = grp.transform('min')
    df['end_date'] = grp.transform('max').where(m)
    

    End result:

     ID sub_id Condition extraction_date start_date   end_date
    100      x         N      2024-01-15 2024-01-15        NaT
    100      x         Y      2024-02-01 2024-01-15 2024-02-01
    100      y         Y      2024-02-01 2024-02-01 2024-02-01
    101      z         N      2024-02-01 2024-02-01        NaT
    101      z         N      2024-03-16 2024-02-01        NaT
    100      x         N      2024-03-16 2024-03-16        NaT
    100      x         Y      2024-04-10 2024-03-16 2024-04-10
    101      z         N      2024-04-10 2024-02-01        NaT
    101      z         Y      2024-05-15 2024-02-01 2024-05-15
    102      w         N      2024-05-15 2024-05-15        NaT
    102      w         Y      2024-06-15 2024-05-15 2024-06-15
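
    To see why the reversed cumulative sum forms the groups, here is a small standalone sketch of mine (not part of the solution) for one ID/sub_id that opens, closes, reopens and closes again:

    import pandas as pd

    s = pd.Series(['N', 'Y', 'N', 'Y'])
    # Count 'Y' from the bottom up, then restore the original row order:
    # each row up to and including its closing 'Y' shares the same label
    print(s[::-1].eq('Y').cumsum().sort_index().tolist())  # [2, 2, 1, 1]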
    

    Edit: a slightly modified version, in case the groups don't form correctly with the first solution:

    # Mask of rows where the gap closes ('Y')
    m = df['Condition'].eq('Y')
    # Per-(ID, sub_id) reversed cumulative count of 'Y': rows belonging to the
    # same open/close segment share a label, even across interleaved IDs
    g = (df.groupby(['ID', 'sub_id'])['Condition']
         .transform(lambda s: s[::-1].eq('Y').cumsum())
         .to_numpy()  # or reset_index(drop=True)
        )
    
    grps = df.groupby(['ID', 'sub_id', g])['extraction_date']
    
    # Start is the earliest date per segment; the end is kept only where
    # the gap actually closed
    df['start_date'] = grps.transform('min')
    df['end_date'] = grps.transform('max').where(m)
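
    Since the goal is a continuous check as more data gets added, one option is to wrap the modified version above in a small helper and rerun it whenever the frame is refreshed. The name add_gap_dates is mine, and the sketch assumes rows within each ID/sub_id are already in extraction_date order, as in the sample:

    import pandas as pd

    def add_gap_dates(df):
        """Recompute start_date/end_date for every (ID, sub_id) gap segment."""
        out = df.copy()
        m = out['Condition'].eq('Y')
        # Same per-group reversed cumulative count as above
        g = (out.groupby(['ID', 'sub_id'])['Condition']
                .transform(lambda s: s[::-1].eq('Y').cumsum())
                .to_numpy())
        grps = out.groupby(['ID', 'sub_id', g])['extraction_date']
        out['start_date'] = grps.transform('min')
        out['end_date'] = grps.transform('max').where(m)
        return out

    # e.g. after appending newly extracted rows:
    # df = add_gap_dates(df)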