Pandas fill two columns row by row recursively


I have a column (P0) with missing values that tracks the initial value of a metric, and a column (CHG) that tracks the percentage change. The goal is to create a new column (P1) that holds the new metric after applying CHG, and to fill P0 with P1 wherever P0 is missing.

import numpy as np
import pandas as pd
df = pd.DataFrame()
df['P0'] = [50, np.nan, 60, np.nan]
df['CHG'] = [0, 0.5, -0.1, 0.2]

So in this example, the recursive calculation would be:

df.loc[1, 'P1'] = df.loc[0, 'P0'] * (1 + df.loc[1, 'CHG']) = 50 * 1.5 = 75

Then because df.loc[1, 'P0'] is missing, we fill it with df.loc[1, 'P0'] = df.loc[1, 'P1']

Next,

df.loc[2, 'P1'] = df.loc[1, 'P0'] * (1 + df.loc[2, 'CHG']) = 75 * 0.9 = 67.5

Because df.loc[2, 'P0'] is not missing, it does not need to be filled.

Finally, df.loc[3, 'P1'] = df.loc[2, 'P0'] * (1 + df.loc[3, 'CHG']) = 60 * 1.2 = 72

Once again, we need to fill df.loc[3, 'P0'] = df.loc[3, 'P1'] = 72

I could write this logic as a loop that goes through the rows one by one, but I am wondering whether there is anything faster, given that the data has 1 million rows. Or could I use numba to speed up the loop?


Solution

  • A vectorized approach would be:

    df['P1'] = (df
       # starting value for each row: the last known P0 from the row above
       # (row 0 keeps its own P0)
       .assign(val=df['P0'].ffill().shift(fill_value=df['P0'].iloc[0]))
       # group by stretches that start right after a non-NA P0
       .groupby(df['P0'].notna().cumsum().shift().bfill(),
                group_keys=False)
       # cumulative product of (1 + CHG), scaled by the stretch's starting value;
       # relies on CHG being 0 in the first row, as in the example
       .apply(lambda g: g['CHG'].add(1).cumprod().mul(g['val'].iloc[0]))
    )

    # restore initial P1 as NaN (.loc avoids chained assignment)
    df.loc[df.index[0], 'P1'] = np.nan

    # fill P0 with P1 where NaN
    df['P0'] = df['P0'].fillna(df['P1'])
    

    output:

         P0  CHG    P1
    0  50.0  0.0   NaN
    1  75.0  0.5  75.0
    2  60.0 -0.1  67.5
    3  72.0  0.2  72.0
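
Since the question also asks about a loop and numba, here is a minimal sketch of the row-by-row rule written directly over NumPy arrays. The function name `fill_recursive` is hypothetical, and numba is treated as optional: if it is not installed, the same loop runs as plain Python. This is one possible way to express the recursion, not a benchmarked recommendation.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # optional: compiles the loop below
except ImportError:
    def njit(func):  # fallback: run the same loop as plain Python
        return func


@njit
def fill_recursive(p0, chg):
    # hypothetical helper: returns (filled P0, P1) for the rule in the question
    n = len(p0)
    p0_filled = p0.copy()
    p1 = np.full(n, np.nan)  # P1 of the first row stays NaN
    for i in range(1, n):
        # new metric = previous (possibly filled) P0 times (1 + change)
        p1[i] = p0_filled[i - 1] * (1.0 + chg[i])
        # fill P0 only where it is missing
        if np.isnan(p0_filled[i]):
            p0_filled[i] = p1[i]
    return p0_filled, p1


df = pd.DataFrame({'P0': [50, np.nan, 60, np.nan],
                   'CHG': [0, 0.5, -0.1, 0.2]})
p0_filled, p1 = fill_recursive(df['P0'].to_numpy(dtype=np.float64),
                               df['CHG'].to_numpy(dtype=np.float64))
df['P0'], df['P1'] = p0_filled, p1
```

On the example data this should reproduce the table above. Because the loop reads the previous row's filled P0 directly, it does not depend on CHG being 0 in the first row.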