I have the dataframe
data = {
'MSISDN': [x12, x12, x12, x23, x23],
'SERVICE_FROM': ['2023-02-27 15:50:29', '2023-07-19 08:33:19', '2023-10-09 07:45:49', '2023-06-01 09:07:37', '2023-08-22 15:54:25'],
'SERVICE_TO': ['2023-03-07 16:00:18', '2023-10-09 07:45:49', '2024-01-09 07:45:49', '2023-08-22 15:54:25', '2023-11-12 17:00:50'],
'SERVICE_CODE': ['TC', 'TC', 'TC', 'TC', 'TC'],
'SERVICE_VALUE': [10.0, 35.0, 35.0, 15.0, 15.0],
'TC_TIMES': [1, 2, 3, 1, 2],
'TC_TRIM': [float('nan'), 25.0, 0.0, float('nan'), 0.0]
}
I want to create a new column called TC_CONSECUTIVE that flags whether for a specific MSISDN, there are consecutive occurrences of TC - i.e. when the SERVICE_TO from one row equals SERVICE_FROM of the next row (for the same MSISDN, not based on time just looking at dates).
I've tried using
valid_msisdns.sort_values(by=['MSISDN', 'SERVICE_FROM'], inplace=True)
# Create a new column 'TC_CONSECUTIVE'
valid_msisdns['TC_CONSECUTIVE'] = (valid_msisdns['SERVICE_FROM'] == valid_msisdns['SERVICE_TO']\
.shift(1)).replace({True: 'Yes', False: 'No'})
# Handle the first row for each MSISDN
valid_msisdns.loc[valid_msisdns.duplicated(subset='MSISDN', keep='first'), 'TC_CONSECUTIVE'] = 'No'
But this doesn't work as expected. The expected output is
data = {
'MSISDN': [x12, x12, x12, x23, x23],
'SERVICE_FROM': ['2023-02-27 15:50:29', '2023-07-19 08:33:19', '2023-10-09 07:45:49', '2023-06-01 09:07:37', '2023-08-22 15:54:25'],
'SERVICE_TO': ['2023-03-07 16:00:18', '2023-10-09 07:45:49', '2024-01-09 07:45:49', '2023-08-22 15:54:25', '2023-11-12 17:00:50'],
'SERVICE_CODE': ['TC', 'TC', 'TC', 'TC', 'TC'],
'SERVICE_VALUE': [10.0, 35.0, 35.0, 15.0, 15.0],
'TC_TIMES': [1, 2, 3, 1, 2],
'TC_TRIM': [float('nan'), 25.0, 0.0, float('nan'), 0.0],
'TC_CONSECUTIVE': ['No','Yes','Yes','Yes','Yes']
}
Because where it is 'Yes' there are consecutive occurrences of TC.
Thanks in advance!!
If I understood your question correctly, you can use groupby
with an apply
and shift
to do your task:
Assuming the last entry of a group which has no next entry to be compared against is "YES" by default.
def group_func(g, col = 'TC_CONSECUTIVE'):
g[col]= g['SERVICE_FROM'].shift(-1) == g['SERVICE_TO']
g[col].iloc[-1] = True
g[col] = g[col].replace({True: 'Yes', False: 'No'})
return g
print(df.groupby('MSISDN', group_keys=True).apply(lambda g: group_func(g)).reset_index(drop=True))
Output:
MSISDN SERVICE_FROM SERVICE_TO SERVICE_CODE \
0 x12 2023-02-27 15:50:29 2023-03-07 16:00:18 TC
1 x12 2023-07-19 08:33:19 2023-10-09 07:45:49 TC
2 x12 2023-10-09 07:45:49 2024-01-09 07:45:49 TC
3 x23 2023-06-01 09:07:37 2023-08-22 15:54:25 TC
4 x23 2023-08-22 15:54:25 2023-11-12 17:00:50 TC
SERVICE_VALUE TC_TIMES TC_TRIM TC_CONSECUTIVE
0 10.0 1 NaN No
1 35.0 2 25.0 Yes
2 35.0 3 0.0 Yes
3 15.0 1 NaN Yes
4 15.0 2 0.0 Yes