Tags: python, pandas, sequential

Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)


I have a dataframe with one column containing numbers (quantity). Every row represents one day, so the whole dataframe should be treated as sequential data. I want to add a second column that calculates the cumulative sum of the quantity column, but whenever the cumulative sum is greater than 0, the next row should start counting the cumulative sum from 0 again.

I solved this problem using iterrows(), but I've read that this function is very inefficient, and with millions of rows the calculation takes over 20 minutes. My solution is below:

import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])


for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome'] 
        if previous_outcome > 0:
            previous_outcome = 0

        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']

print(df)

#   quantity    outcome
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   15          11.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   5            1.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   15          14.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0

Is there faster (more optimized way) to calculate this?

I'm also not sure whether the "if index == 0" block is the best approach and whether this can be solved in a more elegant way. Without this block there is an error, since the first row has no "previous row" to use in the calculation.


Solution

  • Iterating over DataFrame rows is very slow and should be avoided. Working on whole columns (large chunks of data) at once is the way to go with pandas.

    For your case, treating your DataFrame column quantity as a numpy array, the code below should speed up the process quite a lot compared to your approach:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])
    
    x = np.array(df.quantity)
    y = np.zeros(x.size)
    
    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        # reset the running total once it is no longer negative
        total = total if total < 0 else 0
    
    df['outcome'] = y
    
    print(df)
    

    Out :

        quantity  outcome
    0         -1     -1.0
    1         -1     -2.0
    2         -1     -3.0
    3         -1     -4.0
    4         15     11.0
    5         -1     -1.0
    6         -1     -2.0
    7         -1     -3.0
    8         -1     -4.0
    9          5      1.0
    10        -1     -1.0
    11        15     14.0
    12        -1     -1.0
    13        -1     -2.0
    14        -1     -3.0
    

    Note that this also removes the need for the special "if index == 0" handling, since total simply starts at 0. If you still need more speed, I suggest having a look at numba, as per jezrael's answer.
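
    In case it helps, here is a minimal standalone sketch of that idea, assuming numba is installed: it is essentially the same loop as above, just JIT-compiled (the function name running_sum_reset is only for illustration):

    from numba import njit
    import numpy as np

    @njit
    def running_sum_reset(x):
        # cumulative sum that restarts from 0 whenever the
        # running total has just been greater than 0
        out = np.empty(x.size)
        total = 0.0
        for i in range(x.size):
            total += x[i]
            out[i] = total
            if total > 0:
                total = 0.0
        return out

    df['outcome'] = running_sum_reset(df['quantity'].to_numpy(dtype=np.float64))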

    Edit - Performance test

    I got curious about performance and wrote the module below with all 3 approaches.

    I haven't optimised the individual functions, just copied the code from the OP and jezrael's answer with minor changes.

    """
    bench_dataframe.py
    Performance test of iteration over DataFrame rows.
    
    Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
    and same using `numba`.
    """
    from numba import njit
    import pandas as pd
    import numpy as np
    
    
    def pditerrows(df):
        """Iterate over DataFrame using `iterrows`"""
    
        for index, row in df.iterrows():
            if index == 0:
                df.loc[index, 'outcome'] = df.loc[index, 'quantity']
            else:
                previous_outcome = df.loc[index-1, 'outcome'] 
                if previous_outcome > 0:
                    previous_outcome = 0
    
                df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
                
        return df
    
    
    def nparray(df):
        """Convert DataFrame column to `numpy` arrays."""
    
        x = np.array(df.quantity)
        y = np.zeros(x.size)
    
        total = 0
        for i, xi in enumerate(x):
            total += xi
            y[i] = total
            total = total if total < 0 else 0
        
        df['outcome'] = y
        
        return df
    
    
    @njit
    def f(x, lim):
        result = np.empty(len(x))
        result[0] = x[0]
    
        for i, j in enumerate(x[1:], 1):
            previous_outcome = result[i-1]
            if previous_outcome > lim:
                previous_outcome = 0
            result[i] = previous_outcome + x[i]
        return result
    
    def numbaloop(df):
        """Convert DataFrame to `numpy` arrays and loop using `numba`.
        See [https://stackoverflow.com/a/69750009/5069105]
        """
        df['outcome'] = f(df.quantity.to_numpy(), 0)
        return df
    
    def create_df(size):
        """Create a DataFrame filed with -1's and 15's, with 90% of 
        the entries equal to -1 and 10% equal to 15, randomly 
        placed in the array.
        """
        df = pd.DataFrame(
                np.random.choice(
                    (-1, 15), 
                    size=size, 
                    p=[0.9, 0.1]
                ),
                columns=['quantity'])
        return df
    
    
    # Make sure all tests lead to the same result
    df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
                      columns=['quantity'])
    assert nparray(df.copy()).equals(pditerrows(df.copy()))
    assert nparray(df.copy()).equals(numbaloop(df.copy()))
    

    Running for a somewhat small array, size = 20_000, leads to:

    In: import bench_dataframe as bd
     .. df = bd.create_df(size=20_000)
    
    In: %timeit bd.pditerrows(df.copy())
    7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In: %timeit bd.nparray(df.copy())
    9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In: %timeit bd.numbaloop(df.copy())
    437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Here numpy arrays were 700+ times faster than iterrows(), and numba was still 22 times faster than numpy.

    And for larger arrays, size = 200_000, we get:

    In: import bench_dataframe as bd
     .. df = bd.create_df(size=200_000)
    
    In: %timeit bd.pditerrows(df.copy())
    I gave up and hit Ctrl+C after 10 minutes or so... =P
    
    In: %timeit bd.nparray(df.copy())
    86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In: %timeit bd.numbaloop(df.copy())
    3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    That makes numba again 25+ times faster than numpy arrays for this example, and confirms that you should avoid using iterrows() at all costs for anything more than a couple of hundred rows.