Tags: python, pandas, data-science, feature-engineering

Speeding up for-loops using pandas for feature engineering


I have a dataframe with the following headings:

  • payer
  • recipient_country
  • date of payment

Each row shows a transaction; for example, the row (Bob, UK, 1st January 2023) shows that the payer Bob sent a payment to the UK on 1st January 2023.

For each row in this table I need to find the number of times that the payer for that row has sent a payment to the country for that row in the past. So for the row above I would want to find the number of times that Bob has sent money to the UK prior to 1st January 2023.

This is for feature engineering purposes.

I have done this with a for loop: I iterate through the rows and make a pandas .loc call for each row to find rows with an earlier date and the same payer and country. However, this is far too slow for the number of rows I have to process.
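
For reference, my loop looks roughly like this (simplified; I've shortened the payment date column to date, and the output column name is just illustrative):

    import pandas as pd

    # Current approach: one .loc scan of the whole frame per row,
    # which is quadratic in the number of rows.
    prior_counts = []
    for _, row in df.iterrows():
        earlier = df.loc[
            (df['payer'] == row['payer'])
            & (df['recipient_country'] == row['recipient_country'])
            & (df['date'] < row['date'])
        ]
        prior_counts.append(len(earlier))
    df['times_before'] = prior_counts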

Can anyone think of a way to speed up this process using some fast pandas functions?

Thanks!


Solution

  • Testing on this toy data frame:

    import pandas as pd
    from pandas import Timestamp

    df = pd.DataFrame(
        [{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
         {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
         {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
         {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
         {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
         {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
         {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
    )
    

    Just group by and cumulatively count:

    >>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
    >>> df
      name country       date  trns_bf
    0  Bob      UK 2023-01-01        0
    1  Bob      UK 2023-01-02        1
    2  Bob      UK 2023-01-03        2
    3  Cob      UK 2023-01-04        0
    4  Cob      UK 2023-01-05        1
    5  Cob      UK 2023-01-06        2
    6  Cob      UK 2023-01-07        3
    

    You need to sort by date first, so that earlier transactions are not confused with later ones. I interpreted "prior" in your question literally: e.g. there are no transactions before Bob's payment to the UK on 1 January 2023, so that row's count is 0.

    Each row gets its own count of transactions with that name to that country before that date. If there can be multiple transactions on one day, decide how you want to handle them. I would probably group again and take the maximum value for each day, df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max(), and then merge the result back, since the grouped result has a different index and cannot be attached directly as above. See the sketch below.
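
    A minimal sketch of that last step, reusing the toy frame from above:

    # Collapse same-day rows to the maximum count per (name, country, date),
    # then merge that back onto the original rows on the same three keys.
    daily = df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max()
    df = df.drop(columns='trns_bf').merge(daily, on=['name', 'country', 'date'])

    After the merge, every transaction on the same day shares the same trns_bf value.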