Consider the following pandas dataframe which contains info about the usage of the public pool by a single user.
He can only use the service twice per week, so when he reaches 2 usages, the rest of the days within the same week are automatically 0. How can I delete this irrelevant rows, i.e., all the rows that fulfil df['Attended']==0 because the threshold has been reached?
Week Day Attended?
1 1 0
1 2 1
1 3 1
1 4 0
1 5 0
2 1 1
2 2 1
2 3 0
2 4 0
2 5 0
3 1 0
3 2 0
3 3 0
3 4 0
3 5 1
As an example, the expected output would delete days 4,5 from week1 as the threshold is reached on day 3. In the same manner, the threshold is reached on day 2 for week 2 so day 3,4 and 5 are deleted.
Week Day Attended?
1 1 0
1 2 1
1 3 1
2 1 1
2 2 1
3 1 0
3 2 0
3 3 0
3 4 0
3 5 1
Form a Boolean mask with cumsum
to find the 0
s after the first 2 attendances. Subset with .loc
m = df.groupby('Week')['Attended?'].cumsum().ge(2) & df['Attended?'].eq(0)
df.loc[~m]
Week Day Attended?
0 1 1 0
1 1 2 1
2 1 3 1
5 2 1 1
6 2 2 1
10 3 1 0
11 3 2 0
12 3 3 0
13 3 4 0
14 3 5 1
To do this with a variable threshold per week, as in your comments, we map
then can use the same logic, checking against the created column, not a scalar value.
thresh = [1, 2, 2] # Usages per week to drop after
week = [1, 2, 3]
df['threshold'] = df['Week'].map(dict(zip(week, thresh)))
m = df.groupby('Week')['Attended?'].cumsum().ge(df['threshold']) & df['Attended?'].eq(0)
df.loc[~m]
Week Day Attended? threshold
0 1 1 0 1
1 1 2 1 1
2 1 3 1 1
5 2 1 1 2
6 2 2 1 2
10 3 1 0 2
11 3 2 0 2
12 3 3 0 2
13 3 4 0 2
14 3 5 1 2