I'm sitting with a pandas dataframe and I have a time series problem where I have some values called diff. I need to calculate a value, here called sum, according to the below formula for each category separately:
sumn = max(0, diffn + sumn-1 - factor)
factor = 2 (factor is a parameter and in this example set to 2)
The dataframe looks something like this and the value of sum is set to 0 for hour = 0:
category | hour | diff | sum |
---|---|---|---|
a | 0 | 0 | 0 |
a | 1 | 4 | NaN |
a | 2 | 3 | NaN |
a | 3 | 1 | NaN |
b | 0 | 0 | 0 |
b | 1 | 1 | NaN |
b | 2 | -5 | NaN |
b | 3 | 4 | NaN |
My expected output is the following:
category | hour | diff | sum |
---|---|---|---|
a | 0 | 0 | 0 |
a | 1 | 4 | 2 |
a | 2 | 3 | 3 |
a | 3 | 1 | 2 |
b | 0 | 0 | 0 |
b | 1 | 1 | 0 |
b | 2 | -5 | 0 |
b | 3 | 4 | 2 |
Any idea how to solve this? Preferably without iterrows or any for loops since there are a lot of rows.
Would be happy for any help here.
If it would have been without the max function I could have used something like this:
df['sum'] = df.groupby(['category'])['diff'].cumsum() - factor
But the max function messes things up for me.
You can use the following lambda function:
sumn = 0
def calc_sum(df):
global sumn
if not df['hour']: # Reset when hour=0
sumn = 0
sumn = max(0, df['diff'] + sumn - 2)
return sumn
df['sum'] = df.groupby(['category']).apply(lambda df: df.apply(calc_sum, axis=1)).values
Output: