In python I would like to modify a dataset. I have a transaction based data set as per the code below:
data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]}
WINDOW groups the transactions on each card by 48 hours. E.g. A transaction is attributed WINDOW = 1 if it falls in the the first 48 hours of transactions (i.e. the first transaction on a card is always WINDOW=1). WINDOW = 2 will include transactions in the second 48 hours (relative to the first transaction on that card).
COUNT_CURRENT_WINDOW is a cumulative count of transactions occuring in the current row's WINDOW group.
Using pandas (and preferably without loops), I'd like to code 3 new columns that display what the count of transactions was in the previous 3 WINDOW groups per card. I.e.
COUNT_PREV_WINDOW = total count of transactions on the card occuring in (current row's (WINDOW value) minus 1 COUNT_ 2WINDOWS_AGO = total count of transactions on the card occuring in (current row's WINDOW value) minus 2 COUNT_3WINDOWS_AGO = total count of transactions on the card occuring in (current row's WINDOW value) minus 3
There will be occasions where there were no transactions in the previous WINDOW group. E.g. When the current row's WINDOW =1. Another example is if the current row's WINDOW = 4 but no transactions have occured since WINDOW = 1. The desired output below demonstrates this for card=234).
The desired output should looks like the following with no changes to sorting:
data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2],
'COUNT_PREV_WINDOW ': [0, 0, 2, 2, 2, 3, 0, 0, 0, 0],
'COUNT_2WINDOWS_AGO ': [0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
'COUNT_3WINDOWS_AGO': [0, 0, 0, 0, 0, 0, 0, 0, 2, 2]}
I would prefer not to use loops due to efficiency as it as the real dataset is very large.
I have tried the following and similar variations but it just counts the current group and outputs the max.:
df['COUNT_PREV_WINDOW'] = df.groupby(['CARD_NUM',df['WINDOW']-1])['COUNT_CURRENT_WINDOW'].max().fillna(0)
Code
df1 = (pd.crosstab(df['CARD_NUM'], df['WINDOW'])
.stack().groupby(level=0).shift(1).fillna(0).astype('int')
.reset_index(name='COUNT_PREV_WINDOW')
.assign(COUNT_2WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(1).fillna(0).astype('int'))
.assign(COUNT_3WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(2).fillna(0).astype('int')))
df1
CARD_NUM WINDOW COUNT_PREV_WINDOW COUNT_2WINDOWS_AGO COUNT_3WINDOWS_AGO
0 123 1 0 0 0
1 123 2 2 0 0
2 123 3 3 2 0
3 123 4 1 3 2
4 234 1 0 0 0
5 234 2 2 0 0
6 234 3 0 2 0
7 234 4 0 0 2
merge df and df1
out = df.merge(df1, how='left')
out
(image)
I also don't want to attach the output as image, but your column names are too long. It's inconvenient even when solve problem, please keep it brief from next time.