Search code examples
pythonpandasdataframedata-science

Sum of column that resets to zero through out a process


Is there an easy way to go about a total for a column that increments but can reset back to zero through out the dataset? I have started to go down the path of a for loop and keeping track of previous value if it isn't a zero and using multiple variables, but wanted to check if there is a better way to go about it. Here is an example of the data I'm trying to work with:

id Time Extruder1 LineSpeed
157 5/22/2023 10:14:09.229 PM 1560.0 0.0
158 5/22/2023 10:16:28.582 PM 1563.0 0.0
159 5/23/2023 7:17:37.831 AM 1563.0 0.0
160 5/23/2023 7:19:57.184 AM 0.0 0.0
161 5/23/2023 7:33:53.302 AM 2.0 0.0
162 5/23/2023 7:36:12.655 AM 4.0 0.0
163 5/23/2023 7:38:32.008 AM 6.0 0.0
334 5/23/2023 2:15:41.371 PM 789.0 0.0
335 5/23/2023 2:18:00.724 PM 792.0 0.0
336 5/23/2023 2:20:20.077 PM 794.0 0.0
349 5/23/2023 2:50:31.666 PM 2.0 0.0

I would need to get a total of 3 from the first three rows and then get 6 from the next four, and so on. Is there a method in Pandas that will work for this type of column?


Solution

  • Following our conversation, if there is a gap greater than 10, we can consider we have a new group:

    THRESHOLD = 10
    df['Group'] = df['Extruder1'].diff().abs().gt(THRESHOLD).cumsum()
    total = df.groupby('Group')['Extruder1'].agg(lambda x: x.iloc[-1] - x.iloc[0]).sum()
    

    Output:

    >>> total
    14.0
    
    >>> df
         id                       Time  Extruder1  LineSpeed  Group
    0   157  5/22/2023 10:14:09.229 PM     1560.0        0.0      0
    1   158  5/22/2023 10:16:28.582 PM     1563.0        0.0      0
    2   159   5/23/2023 7:17:37.831 AM     1563.0        0.0      0  # 1563 - 1560 = 3
    
    3   160   5/23/2023 7:19:57.184 AM        0.0        0.0      1
    4   161   5/23/2023 7:33:53.302 AM        2.0        0.0      1
    5   162   5/23/2023 7:36:12.655 AM        4.0        0.0      1
    6   163   5/23/2023 7:38:32.008 AM        6.0        0.0      1  # 6 - 0 = 6
    
    7   334   5/23/2023 2:15:41.371 PM      789.0        0.0      2
    8   335   5/23/2023 2:18:00.724 PM      792.0        0.0      2
    9   336   5/23/2023 2:20:20.077 PM      794.0        0.0      2  # 794 - 789 = 5
    
    10  349   5/23/2023 2:50:31.666 PM        2.0        0.0      3  # 2 - 2 = 0
    

    IIUC, you can use:

    df['Group'] = df['Extruder1'].lt(df['Extruder1'].shift()).cumsum()
    # OR
    df['Group'] = df['Extruder1'].diff().lt(0).cumsum()
    
    print(df)
    
    # Output
         id                       Time  Extruder1  LineSpeed  Group
    0   157  5/22/2023 10:14:09.229 PM     1560.0        0.0      0
    1   158  5/22/2023 10:16:28.582 PM     1563.0        0.0      0
    2   159   5/23/2023 7:17:37.831 AM     1563.0        0.0      0
    3   160   5/23/2023 7:19:57.184 AM        0.0        0.0      1
    4   161   5/23/2023 7:33:53.302 AM        2.0        0.0      1
    5   162   5/23/2023 7:36:12.655 AM        4.0        0.0      1
    6   163   5/23/2023 7:38:32.008 AM        6.0        0.0      1
    7   334   5/23/2023 2:15:41.371 PM      789.0        0.0      1
    8   335   5/23/2023 2:18:00.724 PM      792.0        0.0      1
    9   336   5/23/2023 2:20:20.077 PM      794.0        0.0      1
    10  349   5/23/2023 2:50:31.666 PM        2.0        0.0      2
    

    Step by step:

    >>> out = df['Extruder1'].diff()
    0        NaN
    1        3.0
    2        0.0
    3    -1563.0
    4        2.0
    5        2.0
    6        2.0
    7      783.0
    8        3.0
    9        2.0
    10    -792.0
    Name: Extruder1, dtype: float64
    
    >>> out = out.lt(0)
    0     False
    1     False
    2     False
    3      True
    4     False
    5     False
    6     False
    7     False
    8     False
    9     False
    10     True
    Name: Extruder1, dtype: bool
    
    >>> out = out.cumsum()
    0     0
    1     0
    2     0
    3     1
    4     1
    5     1
    6     1
    7     1
    8     1
    9     1
    10    2
    Name: Extruder1, dtype: int64
    

    Another representation (Python>=3.8):

    >>> pd.concat([out := df['Extruder1'],
                   out := out.diff(),
                   out := out.lt(0),
                   out.cumsum()], axis=1,
                  keys=['Extruder1', 'diff', 'lt', 'cumsum'])
    
        Extruder1    diff     lt  cumsum
    0      1560.0     NaN  False       0
    1      1563.0     3.0  False       0
    2      1563.0     0.0  False       0
    3         0.0 -1563.0   True       1
    4         2.0     2.0  False       1
    5         4.0     2.0  False       1
    6         6.0     2.0  False       1
    7       789.0   783.0  False       1
    8       792.0     3.0  False       1
    9       794.0     2.0  False       1
    10        2.0  -792.0   True       2