Say I have the following data. For each user_id I want to get a cumulative count every time the difference score is <= -2 until it reaches a positive value. The count should then reset to zero and stay at that value until the next <= -2 is encountered for that user, where the count then starts again from 1 until the next positive value is reached. The result should be equivalent to what I have in cum_count.
df = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
'difference_score': [0.0, 1.0, -2.5, -2.5, -0.5, -1.5, 2.25, 2.25, -1.86, 0.0, -0.33, -1.33, -2.33, 0.0, 1.0, -2.67, -0.67, -0.67, 0.67, -0.33, -0.33, 1.0],
'cum_count': [0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 3, 0, 0, 0, 0]})
user_id difference_score cum_count
0 1 0.0000 0
1 1 1.0000 0
2 1 -2.5000 1
3 1 -2.5000 2
4 1 -0.5000 3
5 1 -1.5000 4
6 1 2.2500 0
7 1 2.2500 0
8 2 -1.8571 0
9 2 0.0000 0
10 2 -0.3333 0
11 2 -1.3333 0
12 2 -2.3333 1
13 2 0.0000 0
14 3 1.0000 0
15 3 -2.6667 1
16 3 -0.6667 2
17 3 -0.6667 3
18 3 0.6667 0
19 3 -0.3333 0
20 3 -0.3333 0
21 3 1.0000 0
EDIT: Based on your solution mozway I have found a couple of instances where the proposed solution does not produce the intended results. My apologies, my example data was not extensive enough. I have now extended the above example data to highlight this.
For example, in the updated dataframe below, row 14 cum_count should be 0 as it is a new user and the difference score is > -2. However, cum_count2 produces a 1.
In addition, once a positive value is reached and the count resets to 0, the next time a value <= -2 is encountered for that user, the count should start again from 1. However, in row 19, cum_count2 continues counting from the previous count.
The updated cum_count column has the values I'm expected to be returned as compared to what the current solution is returning in cum_count2...
df = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
'difference_score': [0.0, 1.0, -2.5, -2.5, -0.5, -1.5, 2.25, 2.25, -1.86, 0.0, -0.33, -1.33, -2.33, -2.0, -1.0, -2.67, -0.67, -0.67, 0.67, -2.5, -0.33, 1.0],
'cum_count': [0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 0]})
user_id difference_score cum_count cum_count2
0 1 0.00 0 0
1 1 1.00 0 0
2 1 -2.50 1 1
3 1 -2.50 2 2
4 1 -0.50 3 3
5 1 -1.50 4 4
6 1 2.25 0 0
7 1 2.25 0 0
8 2 -1.86 0 0
9 2 0.00 0 0
10 2 -0.33 0 0
11 2 -1.33 0 0
12 2 -2.33 1 1
13 2 -2.00 2 2
14 3 -1.00 0 1
15 3 -2.67 1 2
16 3 -0.67 2 3
17 3 -0.67 3 4
18 3 0.67 0 0
19 3 -2.50 1 5
20 3 -0.33 2 6
21 3 1.00 0 0
You can use masks to define the start and stop, identify the consecutive matching values and use if to generate a group to perform a cumcount:
m1 = df['difference_score'].ge(0)
m2 = df['difference_score'].le(-2)
m3 = m2.groupby([df['user_id'], m1.cumsum()]).cummax()
group = (m3&~m3.groupby(df['user_id']).shift(fill_value=False)).cumsum()
df['cum_count'] = df.groupby(['user_id', group]).cumcount().add(1).where(m3, 0)
output:
user_id difference_score cum_count m1 m2 m3 group
0 1 0.00 0 True False False 0
1 1 1.00 0 True False False 0
2 1 -2.50 1 False True True 1
3 1 -2.50 2 False True True 1
4 1 -0.50 3 False False True 1
5 1 -1.50 4 False False True 1
6 1 2.25 0 True False False 1
7 1 2.25 0 True False False 1
8 2 -1.86 0 False False False 1
9 2 0.00 0 True False False 1
10 2 -0.33 0 False False False 1
11 2 -1.33 0 False False False 1
12 2 -2.33 1 False True True 2
13 2 -2.00 2 False True True 2
14 3 -1.00 0 False False False 2
15 3 -2.67 1 False True True 3
16 3 -0.67 2 False False True 3
17 3 -0.67 3 False False True 3
18 3 0.67 0 True False False 3
19 3 -2.50 1 False True True 4
20 3 -0.33 2 False False True 4
21 3 1.00 0 True False False 4