Search code examples
pandasdataframedelete-row

Remove rows that are not relevant for my data set


Consider the following pandas dataframe which contains info about the usage of the public pool by a single user.

He can only use the service twice per week, so when he reaches 2 usages, the rest of the days within the same week are automatically 0. How can I delete this irrelevant rows, i.e., all the rows that fulfil df['Attended']==0 because the threshold has been reached?

Week Day Attended?
1     1      0
1     2      1
1     3      1
1     4      0
1     5      0
2     1      1
2     2      1
2     3      0
2     4      0
2     5      0
3     1      0
3     2      0
3     3      0
3     4      0
3     5      1

As an example, the expected output would delete days 4,5 from week1 as the threshold is reached on day 3. In the same manner, the threshold is reached on day 2 for week 2 so day 3,4 and 5 are deleted.

Week Day Attended?
1     1      0
1     2      1
1     3      1
2     1      1
2     2      1
3     1      0
3     2      0
3     3      0
3     4      0
3     5      1

Solution

  • Form a Boolean mask with cumsum to find the 0s after the first 2 attendances. Subset with .loc

    m = df.groupby('Week')['Attended?'].cumsum().ge(2) & df['Attended?'].eq(0)
    df.loc[~m]
    

        Week  Day  Attended?
    0      1    1          0
    1      1    2          1
    2      1    3          1
    5      2    1          1
    6      2    2          1
    10     3    1          0
    11     3    2          0
    12     3    3          0
    13     3    4          0
    14     3    5          1
    

    To do this with a variable threshold per week, as in your comments, we map then can use the same logic, checking against the created column, not a scalar value.

    thresh = [1, 2, 2]  # Usages per week to drop after
    week = [1, 2, 3]
    
    df['threshold'] = df['Week'].map(dict(zip(week, thresh)))
    
    m = df.groupby('Week')['Attended?'].cumsum().ge(df['threshold']) & df['Attended?'].eq(0)
    df.loc[~m]
    
        Week  Day  Attended?  threshold
    0      1    1          0          1
    1      1    2          1          1
    2      1    3          1          1
    5      2    1          1          2
    6      2    2          1          2
    10     3    1          0          2
    11     3    2          0          2
    12     3    3          0          2
    13     3    4          0          2
    14     3    5          1          2