python, pandas, dataframe, conditional-statements, vectorization

Vectorizing a sequential operation that has a conditional in Pandas


I have a Pandas dataframe with 3 columns: a column of booleans (A), a column of values (B), and a column that I want to fill (C). I also have an initial value for C.

A         B         C
----------------------
True     10        100
False    20        NaN
True     25        NaN
True     28        NaN
...

I want the values of column C (for C[1:]) to obey the following rule.

if A[i - 1]:
    C[i] = C[i - 1] * B[i] / B[i - 1]
else:
    C[i] = C[i - 1]

Of course this formula cannot determine C[0], but C[0] is provided.

Is there a way to do this efficiently using vectorized operations?

What I've tried:

The following command doesn't account for the sequential nature of the operation.

df.loc[df.A , 'C'] = df.C.shift(1) * df.B / df.B.shift(1)
df.loc[df.A == 0, 'C'] = df.C.shift(1)

If I were to use an apply function to compute this, I would probably have to create new shifted columns like the following, and then run the apply only for rows [1:]. But how do I get the updated previous value of C?

df["s_A"] = df.A.shift(1)
df["s_B"] = df.B.shift(1)
df["s_C"] = df.C.shift(1)
df.loc[0, "s_A"] = False  # this assumption is okay for my purposes

Should this work and is there a faster way? There may be up to 400,000 rows in total across multiple dataframes but it is not super time sensitive.

For clarity's sake I will mention that there are around 12 columns total, but only these three are relevant to this operation.

Is it possible to vectorize this operation? Is there another way it can be solved?

Thanks.


Solution

  • I think a recurrence like this, where each value depends on the previously computed one, is difficult to vectorize directly.

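    The general way is to compute it iteratively, with a plain loop over NumPy arrays:

    import numpy as np

    A = df['A'].to_numpy()
    B = df['B'].to_numpy()
    # Copy C so the in-place writes below never hit a read-only view of the dataframe.
    C = df['C'].to_numpy(copy=True)

    for i in range(1, len(A)):
        # Scale the previous value by B[i] / B[i-1] only if the previous flag was True.
        C[i] = C[i-1] * B[i] / B[i-1] if A[i-1] else C[i-1]

    # Only C changes, so only C needs to be written back.
    df['C'] = C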
    

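    Alternatively, since each step only multiplies the previous value by a factor (B[i] / B[i-1] when A[i-1] is True, and 1 otherwise), it can be worked out as a cumulative product problem. Assuming every row of C after the first is NaN, it can be solved by:

    prev_A = df['A'].shift(1, fill_value=False).astype(float)  # 1.0 where A[i-1] is True
    factor = prev_A * df['B'] / df['B'].shift(1) + (1 - prev_A)
    df['C'] = df['C'].fillna(factor).cumprod()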
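
    To see why a cumulative product applies here, it may help to write the rule as a per-row factor. The symbol f_i below is my own notation, not something from the question, and A_{i-1} is treated as 0 or 1:

    ```latex
    f_i =
    \begin{cases}
      B_i / B_{i-1} & \text{if } A_{i-1} = 1 \\
      1             & \text{if } A_{i-1} = 0
    \end{cases}
    \;=\; A_{i-1}\,\frac{B_i}{B_{i-1}} + (1 - A_{i-1}),
    \qquad
    C_i = C_{i-1}\, f_i
    \;\Longrightarrow\;
    C_i = C_0 \prod_{k=1}^{i} f_k .
    ```

    So filling row 0 with C_0 and every later row with f_i, then taking the running product, reproduces the recurrence. This assumes B contains no zeros.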
    

    Both ways yield the same result for the sample data:

           A   B      C
    0   True  10  100.0
    1  False  20  200.0
    2   True  25  200.0
    3   True  28  224.0
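
As a rough sanity check of the claim that the loop and the cumulative product agree, here is a minimal sketch that runs both on random data. The function names `fill_c_loop` and `fill_c_cumprod`, the `check` dataframe, and the random inputs are hypothetical illustrations, not part of the original answer. It assumes a non-empty dataframe, no zeros in B, and that the first entry of C holds the initial value. This variant uses `np.where` for the factor, so it does not depend on the remaining rows of C being NaN.

```python
import numpy as np
import pandas as pd


def fill_c_loop(df: pd.DataFrame) -> np.ndarray:
    """Reference implementation: apply the rule row by row."""
    a = df["A"].to_numpy(dtype=bool)
    b = df["B"].to_numpy(dtype=float)
    c = np.empty(len(df), dtype=float)
    c[0] = df["C"].iloc[0]  # the provided initial value
    for i in range(1, len(df)):
        c[i] = c[i - 1] * b[i] / b[i - 1] if a[i - 1] else c[i - 1]
    return c


def fill_c_cumprod(df: pd.DataFrame) -> np.ndarray:
    """Vectorized version: cumulative product of per-row factors."""
    a_prev = df["A"].shift(1, fill_value=False).to_numpy(dtype=bool)
    ratio = (df["B"] / df["B"].shift(1)).to_numpy(dtype=float)
    # Factor is B[i] / B[i-1] where A[i-1] is True, otherwise 1.
    factor = np.where(a_prev, ratio, 1.0)
    factor[0] = df["C"].iloc[0]  # seed the product with C[0]
    return np.cumprod(factor)


# Hypothetical random test data; B is kept strictly positive.
rng = np.random.default_rng(0)
n = 200
check = pd.DataFrame(
    {
        "A": rng.random(n) < 0.5,
        "B": rng.uniform(1.0, 100.0, n),
        "C": [100.0] + [np.nan] * (n - 1),
    }
)
same = np.allclose(fill_c_loop(check), fill_c_cumprod(check))
print(same)
```

The two versions can differ in the last few floating-point digits, because the loop multiplies before dividing while the cumulative product multiplies by a precomputed ratio, hence the comparison with `np.allclose` rather than exact equality. With several dataframes, either function could be applied to each one in turn, e.g. `df["C"] = fill_c_cumprod(df)`.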