Search code examples
pythonpandastransactions

How do I assign a count to a new column, counting the rows in a groupby when the current row doesn't belong in the group being counted?


In python I would like to modify a dataset. I have a transaction based data set as per the code below:

data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2]}

WINDOW groups the transactions on each card by 48 hours. E.g. A transaction is attributed WINDOW = 1 if it falls in the the first 48 hours of transactions (i.e. the first transaction on a card is always WINDOW=1). WINDOW = 2 will include transactions in the second 48 hours (relative to the first transaction on that card).

COUNT_CURRENT_WINDOW is a cumulative count of transactions occuring in the current row's WINDOW group.

Using pandas (and preferably without loops), I'd like to code 3 new columns that display what the count of transactions was in the previous 3 WINDOW groups per card. I.e.

COUNT_PREV_WINDOW = total count of transactions on the card occuring in (current row's (WINDOW value) minus 1 COUNT_ 2WINDOWS_AGO = total count of transactions on the card occuring in (current row's WINDOW value) minus 2 COUNT_3WINDOWS_AGO = total count of transactions on the card occuring in (current row's WINDOW value) minus 3

There will be occasions where there were no transactions in the previous WINDOW group. E.g. When the current row's WINDOW =1. Another example is if the current row's WINDOW = 4 but no transactions have occured since WINDOW = 1. The desired output below demonstrates this for card=234).

The desired output should looks like the following with no changes to sorting:

data = {
'TRANSACTIONDATETIME': ['1-Jan-23', '2-Jan-23', '3-Jan-23', '4-Jan-23', '4-Jan-23', '5-Jan-23', '2-Jan-23', '3-Jan-23', '8-Jan-23', '9-Jan-23'],
'CARD_NUM': [123, 123, 123, 123, 123, 123, 234, 234, 234, 234],
'WINDOW': [1, 1, 2, 2, 2, 3, 1, 1, 4, 4],
'COUNT_CURRENT_WINDOW': [1, 2, 1, 2, 3, 1, 1, 2, 1, 2],
'COUNT_PREV_WINDOW ': [0, 0, 2, 2, 2, 3, 0, 0, 0, 0],
'COUNT_2WINDOWS_AGO ': [0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
'COUNT_3WINDOWS_AGO': [0, 0, 0, 0, 0, 0, 0, 0, 2, 2]}

I would prefer not to use loops due to efficiency as it as the real dataset is very large.

I have tried the following and similar variations but it just counts the current group and outputs the max.:

df['COUNT_PREV_WINDOW'] = df.groupby(['CARD_NUM',df['WINDOW']-1])['COUNT_CURRENT_WINDOW'].max().fillna(0)

Solution

  • Code

    df1 = (pd.crosstab(df['CARD_NUM'], df['WINDOW'])
             .stack().groupby(level=0).shift(1).fillna(0).astype('int')
             .reset_index(name='COUNT_PREV_WINDOW')
             .assign(COUNT_2WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(1).fillna(0).astype('int'))
             .assign(COUNT_3WINDOWS_AGO=lambda x: x['COUNT_PREV_WINDOW'].groupby(x['CARD_NUM']).shift(2).fillna(0).astype('int')))
    

    df1

    CARD_NUM    WINDOW  COUNT_PREV_WINDOW   COUNT_2WINDOWS_AGO  COUNT_3WINDOWS_AGO
    0   123     1       0                   0                   0
    1   123     2       2                   0                   0
    2   123     3       3                   2                   0
    3   123     4       1                   3                   2
    4   234     1       0                   0                   0
    5   234     2       2                   0                   0
    6   234     3       0                   2                   0
    7   234     4       0                   0                   2
    

    merge df and df1

    out = df.merge(df1, how='left')
    

    out(image)

    out

    I also don't want to attach the output as image, but your column names are too long. It's inconvenient even when solve problem, please keep it brief from next time.