Search code examples
pythonpandasdataframeloopsdrop

(Python) Iterating over dataframe doesn't allow me to drop rows


My goal is to cut off (not delete entirely) Eventlogs if the count of activities within a single Case exceeds a certain threshold.

This is the initial df:

  CaseID  ActivityID
0   Case1          11
1   Case1           5
2   Case1           2
3   Case1          23
4   Case1          86
5   Case1          27
7   Case2           0
8   Case2         256
9   Case2           5
10  Case2           7
11  Case3          23
12  Case3         556
13  Case3           4
14  Case3           2
15  Case3          33
16  Case3           5
17  Case3          67
18  Case3          32
19  Case3           5
20  Case3          66

And this my desired outcome if the threshold was set to be 5:

  CaseID  ActivityID
0   Case1          11
1   Case1           5
2   Case1           2
3   Case1          23
4   Case1          86
7   Case2           0
8   Case2         256
9   Case2           5
10  Case2           7
11  Case3          23
12  Case3         556
13  Case3           4
14  Case3           2
15  Case3          33

I came up with this code:

threshold = 5
counter = 0

for i in range(1, len(df)):
    if (df.loc[i, 'CaseID'] == df.loc[i-1, 'CaseID']) & (counter < threshold):
        counter += 1
    elif (df.loc[i, 'CaseID'] == df.loc[i-1, 'CaseID']) & (counter >= threshold):
        df.drop(index=i, inplace=True) # <- that's the problematic line
    elif (df.loc[i, 'CaseID'] != df.loc[i-1, 'CaseID']) & (counter >= threshold):
        counter = 0

If I insert print statements in the df.drop line, the code seems to work exactly as it should. But somehow it doesn't work as soon as soon as I drop rows and either keeps the dataframe as it is or throws KeyError:6.

Looking forward to your help, thank you in advance!


Solution

  • I think groupby.cumcount (cumulative count, indexed from zero) does what you need.

    1. Number each row by its cumulative position within its CaseID group, starting from 0
    2. Filter the DataFrame to keep only rows where cumulative count is between 0 and 4 inclusive
    res = df[df.groupby('CaseID').cumcount() <= 4].copy()
    
    print(res)
    
       CaseID  ActivityID
    0   Case1  11        
    1   Case1  5         
    2   Case1  2         
    3   Case1  23        
    4   Case1  86        
    7   Case2  0         
    8   Case2  256       
    9   Case2  5         
    10  Case2  7         
    11  Case3  23        
    12  Case3  556       
    13  Case3  4         
    14  Case3  2         
    15  Case3  33