Search code examples
pythonpandasdataframegroup-by

Cumulative subtraction in Pandas Dataframe?


I am looking for a way to achieve column total_diff as shown in this table.

If you take John for example, total trainging_hours is 250. No hours were recorded in month1 to month4, so total_diff remained 250.

In month5 50 hours was recorded so total_diff for month5 is (250-50) = 200, month6(200-60) = 140 and so on. Same approach for the other names. enter image description here

I tried below code but did not work as expected.

mask = ~df['month'].isin(['John_Total', 'Amber_Total', 'Barron_Total'])
df.loc[mask, 'total_diff'] = df[mask].groupby("name")['training_hours'].cumsum()
df['total_diff'] = df.groupby('name')['total_diff'].fillna(df['training_hours'])
df

Any help on how to achieve this in python ?


Solution

  • diff = df["training_hours"].where(
        df["month"].str.contains("Total"), -df["training_hours"]
    )
    df["total_diff"] = diff.groupby(df["name"]).cumsum()
    
          name         month  training_hours  total_diff
    0     John    John_Total             250         250
    1     John        month1               0         250
    2     John        month2               0         250
    3     John        month3               0         250
    4     John        month4               0         250
    5     John        month5              50         200
    6     John        month6              60         140
    7     John        month7               0         140
    8     John        month8              40         100
    9     John        month9              70          30
    10    John       month10               0          30
    11    John       month11              30           0
    12   Amber   Amber_Total             300         300
    13   Amber        month1              60         240
    14   Amber        month2              70         170
    15   Amber        month3               0         170
    16   Amber        month4              50         120
    17   Amber        month5              20         100
    18   Amber        month6              40          60
    19   Amber        month7               0          60
    20   Amber        month8              50          10
    21   Amber        month9              10           0
    22   Amber       month10               0           0
    23   Amber       month11               0           0
    24  Barron  Barron_Total             270         270
    25  Barron        month1               0         270
    26  Barron        month2              70         200
    27  Barron        month3              50         150
    28  Barron        month4              50         100
    29  Barron        month5               0         100
    30  Barron        month6              30          70
    31  Barron        month7               0          70
    32  Barron        month8              40          30
    33  Barron        month9              10          20
    34  Barron       month10               5          15
    35  Barron       month11              15           0
    

    Example data:

    data = {
        'name': ['John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 'John', 
                 'Amber', 'Amber', 'Amber', 'Amber', 'Amber', 'Amber', 'Amber', 'Amber', 'Amber', 'Amber', 'Amber', 'Amber', 
                 'Barron', 'Barron', 'Barron', 'Barron', 'Barron', 'Barron', 'Barron', 'Barron', 'Barron', 'Barron', 'Barron', 'Barron'],
        'month': ['John_Total', 'month1', 'month2', 'month3', 'month4', 'month5', 'month6', 'month7', 'month8', 'month9', 'month10', 'month11', 
                  'Amber_Total', 'month1', 'month2', 'month3', 'month4', 'month5', 'month6', 'month7', 'month8', 'month9', 'month10', 'month11', 
                  'Barron_Total', 'month1', 'month2', 'month3', 'month4', 'month5', 'month6', 'month7', 'month8', 'month9', 'month10', 'month11'],
        'training_hours': [250, 0, 0, 0, 0, 50, 60, 0, 40, 70, 0, 30, 
                           300, 60, 70, 0, 50, 20, 40, 0, 50, 10, 0, 0, 
                           270, 0, 70, 50, 50, 0, 30, 0, 40, 10, 5, 15]
    }
    
    df = pd.DataFrame(data)