python pandas dataframe group-by

How To Remove Specific Rows With Consecutive Values


I have a Pandas dataframe, df_next, that is a monthly aggregation of crime type incidents for specific jurisdictions. For example, something like:

ID Year_Month Total
AL0010000 1991-01 2024
AL0010000 1991-02 3017
... ... ...
WV0550300 2018-11 30147
WV0550300 2018-12 32148

I want to reduce the size of my dataframe by removing rows that are part of 4 consecutive months of 0 values in the 'Total' column. In other words, if an ID has reported 0 total crime for four consecutive months, I want to remove that chunk of 4 months. I want to do this for all IDs.

I've tried:

# Define a window size of 4
window_size = 4

# Apply a rolling window to the Total column for each ID
df_next['Total_rolling'] = df_next.groupby('ID')['Total'].rolling(window=window_size).sum().reset_index(0, drop=True)

df_next['Remove'] = ((df_next['Total_rolling'].shift(window_size - 1) == 0) & (df_next['Total_rolling'] == 0))

# Filter out the rows where there are four consecutive 0's in the Total value for each ID
df_filtered = df_next[~df_next['Remove']]

However, when I check df_filtered, I still have multiple examples of IDs with four consecutive months of 0 crime totals. Any help would be greatly appreciated.


Solution

  • Annotated Code

    # `df` stands for the question's `df_next`, assumed sorted by ID and Year_Month
    # flag the rows where Total is zero
    m = df['Total'] == 0
    
    # create a counter that increments on every non-zero row, so each
    # block of consecutive zeros shares one counter value
    b = (~m).cumsum()
    
    # Group the rows where total is zero by `ID` and above blocks
    # and transform with size to calculate the number of consecutive zeros
    s = df[m].groupby(['ID', b]).transform('size')
    
    # Drop the rows from the original dataframe where
    # there are 4 or more consecutive zeros
    df = df.drop(s.index[s >= 4])
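
  • Worked Example

The steps above can be tried on a small made-up frame. This is only a sketch: the helper name `drop_zero_runs`, the `min_run` parameter and the toy data are illustrative and not part of the question. It assumes the rows are sorted by ID and Year_Month and that the index is unique. It groups the 'Total' column as a Series, a small variation on the frame-level `transform('size')` above that should give the same run lengths.

```python
import pandas as pd


def drop_zero_runs(df: pd.DataFrame, min_run: int = 4) -> pd.DataFrame:
    """Drop rows that sit in a run of at least `min_run` consecutive zero totals.

    Hypothetical helper for illustration. Assumes `df` is sorted by ID and
    Year_Month and has a unique index.
    """
    # True where the monthly total is zero
    is_zero = df['Total'].eq(0)

    # Counter that increments on every non-zero row; all rows of one
    # zero run therefore share the same counter value
    block = (~is_zero).cumsum().rename('block')

    # Length of each zero run, computed per ID so that a run cannot
    # spill over from one ID into the next
    run_len = (
        df.loc[is_zero, 'Total']
        .groupby([df.loc[is_zero, 'ID'], block[is_zero]])
        .transform('size')
    )

    # Remove the rows whose run is long enough
    return df.drop(run_len.index[run_len >= min_run])


# Toy data (made up): ID "A" ends with two zeros and ID "B" starts with three,
# which together would look like a run of five if ID were ignored
toy = pd.DataFrame({
    'ID': ['A'] * 8 + ['B'] * 9,
    'Year_Month': [f'2000-{i:02d}' for i in range(1, 9)]
                  + [f'2000-{i:02d}' for i in range(1, 10)],
    'Total': [5, 0, 0, 0, 0, 3, 0, 0,
              0, 0, 0, 7, 0, 0, 0, 0, 0],
})

result = drop_zero_runs(toy, min_run=4)
print(result)
# Remaining totals: A -> 5, 3, 0, 0 and B -> 0, 0, 0, 7
```

Only the run of four zeros in "A" and the run of five zeros in "B" are removed. The shorter runs stay, including the two that touch across the ID boundary, which is why 'ID' is part of the grouping keys.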