Search code examples
pandasdataframemaxcumsum

Pandas dataframe, a cumsum calculation including max function


I'm sitting with a pandas dataframe and I have a time series problem where I have some values called diff. I need to calculate a value, here called sum, according to the below formula for each category separately:

sumn = max(0, diffn + sumn-1 - factor)

factor = 2 (factor is a parameter and in this example set to 2)

The dataframe looks something like this and the value of sum is set to 0 for hour = 0:

category hour diff sum
a 0 0 0
a 1 4 NaN
a 2 3 NaN
a 3 1 NaN
b 0 0 0
b 1 1 NaN
b 2 -5 NaN
b 3 4 NaN

My expected output is the following:

category hour diff sum
a 0 0 0
a 1 4 2
a 2 3 3
a 3 1 2
b 0 0 0
b 1 1 0
b 2 -5 0
b 3 4 2

Any idea how to solve this? Preferably without iterrows or any for loops since there are a lot of rows.

Would be happy for any help here.

If it would have been without the max function I could have used something like this:

df['sum'] = df.groupby(['category'])['diff'].cumsum() - factor

But the max function messes things up for me.


Solution

  • You can use the following lambda function:

    sumn = 0
    def calc_sum(df):
        global sumn
        if not df['hour']: # Reset when hour=0
            sumn = 0
        sumn = max(0, df['diff'] + sumn - 2)
        return sumn
    
    df['sum'] = df.groupby(['category']).apply(lambda df: df.apply(calc_sum, axis=1)).values
    

    Output:

    enter image description here