Search code examples
pythonpandascumsum

Pandas: Cumulative sum within group with two conditions


I have a DataFrame that looks like this table:

index x y value_1 cumsum_1 cumsum_2
0 0.1 1 12 12 0
1 1.2 1 10 12 10
2 0.25 1 7 19 10
3 1.0 2 3 0 3
4 0.72 2 5 5 10
5 1.5 2 10 5 13

So my aim is to calculate the cumulative sum of value_1. But there are two conditions that must be taken into account.

  • First: If value x is less than 1 the cumsum() is written in column cumsum_1 and if x is greater in column cumsum_2.
  • Second: column y indicates groups (1,2,3,...). When the value in y changes, the cumsum()-operation start all over again. I think the grouby() method would help.

Does somebody have any idea?


Solution

  • You can use .where() on conditions x < 1 or x >= 1 to temporarily modify the values of value_1 to 0 according to the condition and then groupby cumsum, as follows:

    The second condition is catered by the .groupby function while the first condition is catered by the .where() function, detailed below:

    .where() keeps the column values when the condition is true and change the values (to 0 in this case) when the condition is false. Thus, for the first condition where column x < 1, value_1 will keep its values for feeding to the subsequent cumsum step to accumulate the filtered values of value_1. For rows where the condition x < 1 is False, value_1 has its values masked to 0. These 0 passed to cumsum for accumulation is effectively the same effect as taking out the original values of value_1 for the accumulation into column cumsum_1.

    The second line of codes accumulates value_1 values to column cumsum_2 with the opposite condition of x >= 1. These 2 lines of codes, in effect, allocate value_1 to cumsum_1 and cumsum_2 according to x < 1 and x >= 1, respectively.

    (Thanks for the suggestion of @tdy to simplify the codes)

    df['cumsum_1'] = df['value_1'].where(df['x'] < 1, 0).groupby(df['y']).cumsum()
    df['cumsum_2'] = df['value_1'].where(df['x'] >= 1, 0).groupby(df['y']).cumsum()
    

    Result:

    print(df)
    
          x  y  value_1  cumsum_1  cumsum_2
    0  0.10  1       12        12         0
    1  1.20  1       10        12        10
    2  0.25  1        7        19        10
    3  1.00  2        3         0         3
    4  0.72  2        5         5         3
    5  1.50  2       10         5        13