Is there an easy way to go about a total for a column that increments but can reset back to zero through out the dataset? I have started to go down the path of a for loop and keeping track of previous value if it isn't a zero and using multiple variables, but wanted to check if there is a better way to go about it. Here is an example of the data I'm trying to work with:
id | Time | Extruder1 | LineSpeed |
---|---|---|---|
157 | 5/22/2023 10:14:09.229 PM | 1560.0 | 0.0 |
158 | 5/22/2023 10:16:28.582 PM | 1563.0 | 0.0 |
159 | 5/23/2023 7:17:37.831 AM | 1563.0 | 0.0 |
160 | 5/23/2023 7:19:57.184 AM | 0.0 | 0.0 |
161 | 5/23/2023 7:33:53.302 AM | 2.0 | 0.0 |
162 | 5/23/2023 7:36:12.655 AM | 4.0 | 0.0 |
163 | 5/23/2023 7:38:32.008 AM | 6.0 | 0.0 |
334 | 5/23/2023 2:15:41.371 PM | 789.0 | 0.0 |
335 | 5/23/2023 2:18:00.724 PM | 792.0 | 0.0 |
336 | 5/23/2023 2:20:20.077 PM | 794.0 | 0.0 |
349 | 5/23/2023 2:50:31.666 PM | 2.0 | 0.0 |
I would need to get a total of 3 from the first three rows and then get 6 from the next four, and so on. Is there a method in Pandas that will work for this type of column?
Following our conversation, if there is a gap greater than 10, we can consider we have a new group:
THRESHOLD = 10
df['Group'] = df['Extruder1'].diff().abs().gt(THRESHOLD).cumsum()
total = df.groupby('Group')['Extruder1'].agg(lambda x: x.iloc[-1] - x.iloc[0]).sum()
Output:
>>> total
14.0
>>> df
id Time Extruder1 LineSpeed Group
0 157 5/22/2023 10:14:09.229 PM 1560.0 0.0 0
1 158 5/22/2023 10:16:28.582 PM 1563.0 0.0 0
2 159 5/23/2023 7:17:37.831 AM 1563.0 0.0 0 # 1563 - 1560 = 3
3 160 5/23/2023 7:19:57.184 AM 0.0 0.0 1
4 161 5/23/2023 7:33:53.302 AM 2.0 0.0 1
5 162 5/23/2023 7:36:12.655 AM 4.0 0.0 1
6 163 5/23/2023 7:38:32.008 AM 6.0 0.0 1 # 6 - 0 = 6
7 334 5/23/2023 2:15:41.371 PM 789.0 0.0 2
8 335 5/23/2023 2:18:00.724 PM 792.0 0.0 2
9 336 5/23/2023 2:20:20.077 PM 794.0 0.0 2 # 794 - 789 = 5
10 349 5/23/2023 2:50:31.666 PM 2.0 0.0 3 # 2 - 2 = 0
IIUC, you can use:
df['Group'] = df['Extruder1'].lt(df['Extruder1'].shift()).cumsum()
# OR
df['Group'] = df['Extruder1'].diff().lt(0).cumsum()
print(df)
# Output
id Time Extruder1 LineSpeed Group
0 157 5/22/2023 10:14:09.229 PM 1560.0 0.0 0
1 158 5/22/2023 10:16:28.582 PM 1563.0 0.0 0
2 159 5/23/2023 7:17:37.831 AM 1563.0 0.0 0
3 160 5/23/2023 7:19:57.184 AM 0.0 0.0 1
4 161 5/23/2023 7:33:53.302 AM 2.0 0.0 1
5 162 5/23/2023 7:36:12.655 AM 4.0 0.0 1
6 163 5/23/2023 7:38:32.008 AM 6.0 0.0 1
7 334 5/23/2023 2:15:41.371 PM 789.0 0.0 1
8 335 5/23/2023 2:18:00.724 PM 792.0 0.0 1
9 336 5/23/2023 2:20:20.077 PM 794.0 0.0 1
10 349 5/23/2023 2:50:31.666 PM 2.0 0.0 2
Step by step:
>>> out = df['Extruder1'].diff()
0 NaN
1 3.0
2 0.0
3 -1563.0
4 2.0
5 2.0
6 2.0
7 783.0
8 3.0
9 2.0
10 -792.0
Name: Extruder1, dtype: float64
>>> out = out.lt(0)
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 True
Name: Extruder1, dtype: bool
>>> out = out.cumsum()
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 2
Name: Extruder1, dtype: int64
Another representation (Python>=3.8):
>>> pd.concat([out := df['Extruder1'],
out := out.diff(),
out := out.lt(0),
out.cumsum()], axis=1,
keys=['Extruder1', 'diff', 'lt', 'cumsum'])
Extruder1 diff lt cumsum
0 1560.0 NaN False 0
1 1563.0 3.0 False 0
2 1563.0 0.0 False 0
3 0.0 -1563.0 True 1
4 2.0 2.0 False 1
5 4.0 2.0 False 1
6 6.0 2.0 False 1
7 789.0 783.0 False 1
8 792.0 3.0 False 1
9 794.0 2.0 False 1
10 2.0 -792.0 True 2