Search code examples
pythonpandasdataframefeature-engineering

Pandas - most recent match relative to current row


I would like to add a new column to my dataframe that contains the most recent 'revenue' value where 'promotion' == 1, excluding the current row. The dataframe will always be sorted by 'day' in descending order. For rows near the bottom of the dataframe where there is not a previous row with 'promotion' == 1, it should ideally return nothing (but this is not a hard requirement as we will generally be most interested in recent records).

So if this is my dataframe:

import pandas as pd
import numpy as np
df = pd.DataFrame({'day':[10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
                   'revenue':[12000, 13000, 17000, 14000, 16000,
                              15000, 19000, 17000, 12000, 11000],
                  'promotion':[0, 0, 1, 1, 0, 0, 1, 0, 1, 0]})

The new column would look like:

df['last_promo_rev'] = [17000, 17000, 14000, 19000, 19000, 19000, 12000, 12000, np.nan, np.nan]

Solution

  • This might work. We can use shift. This starts with your basic DataFrame example.

    # Create column with valid promotion values set
    df.loc[:, 'last_promo_rev'] = df.loc[df["promotion"] == 1, "revenue"]
    
    # Shift the column by -1 to move the row up one, then backfill with revenue values.
    df.loc[:, 'last_promo_rev'] = df.loc[:, "last_promo_rev"].shift(-1).bfill(axis="rows")
    

    Output:

       day  revenue  promotion  last_promo_rev
    0   10    12000          0         17000.0
    1    9    13000          0         17000.0
    2    8    17000          1         14000.0
    3    7    14000          1         19000.0
    4    6    16000          0         19000.0
    5    5    15000          0         19000.0
    6    4    19000          1         12000.0
    7    3    17000          0         12000.0
    8    2    12000          1             NaN
    9    1    11000          0             NaN