Search code examples
pythonpandascumsum

How to stop and restart cumsum using a marker in another column


I have a pandas dataframe with values that needs to be totalized inside a period for each device, but the periods ends are marked in another column (an easy identifiable event line). The cumsum must go until it finds this end of period marker and then start again from zero(the first value on the next line).

  device_name value end
0   A5         1    False
1   A5         7    False
2   A5         2    True
3   A5         1    False
4   A5         1    False
5   A5         1    False
6   A5         1    True
7   A6         2    False
8   A6         4    False
9   A6         2    False
10  A6         2    True
11  A6         2    False
12  A6         2    False

the result should be something like

  device_name value end     total
0   A5         1    False    1  
1   A5         7    False    8
2   A5         2    True     10
3   A5         1    False    1
4   A5         1    False    2
5   A5         1    False    3
6   A5         1    True     4
7   A6         2    False    2
8   A6         4    False    6
9   A6         2    False    8
10  A6         2    True     10
11  A6         2    False    2
12  A6         2    False    4

I tried cumsum but I can't separate the end of a period from the other, I can iterate over rows and if i find a True on df.end.shift(1) i ignore the last value, but there are at least 60000 and probably there are faster methods, Any suggestion?


Solution

  • IIUC, you want to cumsum per group until you reach a True. Then, after this row, restart the count.

    You can use an extra group based on the "end" value (also using a cumsum):

    df['total'] = (df.groupby(['device_name',
                               df['end'].shift(1, fill_value=0).cumsum()])
                   ['value'].cumsum())
    

    output:

       device_name  value    end  total
    0           A5      1  False      1
    1           A5      7  False      8
    2           A5      2   True     10
    3           A5      1  False      1
    4           A5      1  False      2
    5           A5      1  False      3
    6           A5      1   True      4
    7           A6      2  False      2
    8           A6      4  False      6
    9           A6      2  False      8
    10          A6      2   True     10
    11          A6      2  False      2
    12          A6      2  False      4
    

    NB. note that I get a different value for row #2

    NB.2. for purists, the extra group could also be computed using a groupby. It doesn't really matter in this case. The internal groups will just not start from zero after the first group, but their name will not be used anywhere in the output.