Search code examples
pythonpandasdataframeshift

Flanking data by using shift


How precisely works shift() in this code. I'm trying to get some True values in my dataframe and then to extend my selection to the next up and down 4 False values. An example of my DataFrame:

df Out[89]: TRACK_ID FRAME match 290 1667.0 350.0 False 291 1667.0 352.0 False 292 1667.0 353.0 False 293 1667.0 354.0 False 294 1668.0 348.0 False 295 1668.0 349.0 False 296 1668.0 350.0 False 297 1668.0 351.0 True 298 1668.0 352.0 True 299 1668.0 353.0 True 300 449.0 87.0 False 301 449.0 88.0 False 302 449.0 89.0 False 303 449.0 90.0 False 304 449.0 91.0 False 305 449.0 92.0 False

I'm using this line of code to extract True rows and immediately 4 up and down rows:

df1 = df[df.match | df.match.shift(np.round(-4,0)) | df.match.shift(np.round(4,0))] 

However my output is skipping (deleting) the first index up and down (index 296 and 300):

df1 Out[97]: TRACK_ID FRAME match 293 1667.0 354.0 False 294 1668.0 348.0 False 295 1668.0 349.0 False 297 1668.0 351.0 True 298 1668.0 352.0 True 299 1668.0 353.0 True 301 449.0 88.0 False 302 449.0 89.0 False 303 449.0 90.0 False

I can not figure it out why is this happening, any suggestions welcome!


Solution

  • Your condition is a mask that "shifts" values by 4 rows, nothing more.

    df.match | df.match.shift(np.round(-4,0)) | df.match.shift(np.round(4,0))
    #290    False
    #291    False
    #292    False
    #293     True
    #294     True
    #295     True
    #296    False
    #297     True
    #298     True
    #299     True
    #300    False
    #301     True
    #302     True
    #303     True
    #304    False
    #305    False
    

    You are filtering the dataframe with it, so it is "deleting" the rows where your condition is not true. It sounds like you'd rather mark these rows as false, in which case you don't want to filter, you want to update the dataframe

    df['updated_match'] = df.match | df.match.shift(np.round(-4,0)) | df.match.shift(np.round(4,0))
    

    Then df looks like:

       TRACK_ID FRAME   match   updated_match
    290 1667.0  350.0   False   False
    291 1667.0  352.0   False   False
    292 1667.0  353.0   False   False
    293 1667.0  354.0   False   True
    294 1668.0  348.0   False   True
    295 1668.0  349.0   False   True
    296 1668.0  350.0   False   False
    297 1668.0  351.0   True    True
    298 1668.0  352.0   True    True
    299 1668.0  353.0   True    True
    300 449.0   87.0    False   False
    301 449.0   88.0    False   True
    302 449.0   89.0    False   True
    303 449.0   90.0    False   True
    304 449.0   91.0    False   False
    305 449.0   92.0    False   False
    

    EDIT: Re-read the question and realized your issue.

    I think instead of using shift(), you want a rolling maximum over a 4-row window. Use .rolling() in both directions (forward and backward).

    df1 = 
    df[df.match |
       df['match'].iloc[::-1].rolling(window=4).max().fillna(0).astype(bool) |
       df['match'].rolling(window=4).max().fillna(0).astype(bool)
      ]
    

    Output:

    #   TRACK_ID    FRAME   match
    #294    1668.0  348.0   False
    #295    1668.0  349.0   False
    #296    1668.0  350.0   False
    #297    1668.0  351.0   True
    #298    1668.0  352.0   True
    #299    1668.0  353.0   True
    #300    449.0   87.0    False
    #301    449.0   88.0    False
    #302    449.0   89.0    False
    

    This keeps 296 and 300 which you called out.