Search code examples
pandasdataframedate

Flag consecutive dates Python pandas


I have the dataframe

    data = {
    'MSISDN': [x12, x12, x12, x23, x23],
    'SERVICE_FROM': ['2023-02-27 15:50:29', '2023-07-19 08:33:19', '2023-10-09 07:45:49', '2023-06-01 09:07:37', '2023-08-22 15:54:25'],
    'SERVICE_TO': ['2023-03-07 16:00:18', '2023-10-09 07:45:49', '2024-01-09 07:45:49', '2023-08-22 15:54:25', '2023-11-12 17:00:50'],
    'SERVICE_CODE': ['TC', 'TC', 'TC', 'TC', 'TC'],
    'SERVICE_VALUE': [10.0, 35.0, 35.0, 15.0, 15.0],
    'TC_TIMES': [1, 2, 3, 1, 2],
    'TC_TRIM': [float('nan'), 25.0, 0.0, float('nan'), 0.0]
}

I want to create a new column called TC_CONSECUTIVE that flags whether for a specific MSISDN, there are consecutive occurrences of TC - i.e. when the SERVICE_TO from one row equals SERVICE_FROM of the next row (for the same MSISDN, not based on time just looking at dates).

I've tried using

valid_msisdns.sort_values(by=['MSISDN', 'SERVICE_FROM'], inplace=True)

# Create a new column 'TC_CONSECUTIVE'
valid_msisdns['TC_CONSECUTIVE'] = (valid_msisdns['SERVICE_FROM'] == valid_msisdns['SERVICE_TO']\
                                   .shift(1)).replace({True: 'Yes', False: 'No'})

# Handle the first row for each MSISDN
valid_msisdns.loc[valid_msisdns.duplicated(subset='MSISDN', keep='first'), 'TC_CONSECUTIVE'] = 'No'

But this doesn't work as expected. The expected output is

    data = {
    'MSISDN': [x12, x12, x12, x23, x23],
    'SERVICE_FROM': ['2023-02-27 15:50:29', '2023-07-19 08:33:19', '2023-10-09 07:45:49', '2023-06-01 09:07:37', '2023-08-22 15:54:25'],
    'SERVICE_TO': ['2023-03-07 16:00:18', '2023-10-09 07:45:49', '2024-01-09 07:45:49', '2023-08-22 15:54:25', '2023-11-12 17:00:50'],
    'SERVICE_CODE': ['TC', 'TC', 'TC', 'TC', 'TC'],
    'SERVICE_VALUE': [10.0, 35.0, 35.0, 15.0, 15.0],
    'TC_TIMES': [1, 2, 3, 1, 2],
    'TC_TRIM': [float('nan'), 25.0, 0.0, float('nan'), 0.0],
    'TC_CONSECUTIVE': ['No','Yes','Yes','Yes','Yes']
}

Because where it is 'Yes' there are consecutive occurrences of TC.

Thanks in advance!!


Solution

  • If I understood your question correctly, you can use groupby with an apply and shift to do your task:

    Assuming the last entry of a group which has no next entry to be compared against is "YES" by default.

    def group_func(g, col = 'TC_CONSECUTIVE'):
      g[col]= g['SERVICE_FROM'].shift(-1) == g['SERVICE_TO']
      g[col].iloc[-1] = True
      g[col] = g[col].replace({True: 'Yes', False: 'No'})
      return g
    
    print(df.groupby('MSISDN', group_keys=True).apply(lambda g: group_func(g)).reset_index(drop=True))
    
    

    Output:

      MSISDN         SERVICE_FROM           SERVICE_TO SERVICE_CODE  \
    0    x12  2023-02-27 15:50:29  2023-03-07 16:00:18           TC   
    1    x12  2023-07-19 08:33:19  2023-10-09 07:45:49           TC   
    2    x12  2023-10-09 07:45:49  2024-01-09 07:45:49           TC   
    3    x23  2023-06-01 09:07:37  2023-08-22 15:54:25           TC   
    4    x23  2023-08-22 15:54:25  2023-11-12 17:00:50           TC   
    
       SERVICE_VALUE  TC_TIMES  TC_TRIM TC_CONSECUTIVE  
    0           10.0         1      NaN             No  
    1           35.0         2     25.0            Yes  
    2           35.0         3      0.0            Yes  
    3           15.0         1      NaN            Yes  
    4           15.0         2      0.0            Yes