Search code examples
pythonpandasnumpyrows

Np.where based on the values in the next rows


I have sample dataframe as follows

I just want to mark all the rows where id == 11 preeceded by 2. If the id ==11 and does not preeced by 2 in the immediate rows I want to mark it as 0

Home"   Date       Time      id  Appliance   Banana expected_output output_from the code
1      1/21/2017    1:30:00  11   Apple       0       1                 1
2      1/21/2017    1:45:00  11  Apple        0       1                 1
3      1/21/2017    2:00:00  11  Apple        0       1                 1
4      1/21/2017    2:15:00  2   Banana       1       1                 0
5      1/21/2017    2:30:00  2   Banana       0       0                 0
6      1/21/2017    2:45:00  0   Orange       0       0                 0
7      1/21/2017    3:00:00  1   Peach        0       0                 0
8      1/21/2017    3:15:00  1   Peach        0       0                 0
9      1/21/2017    3:30:00  3   Pineapple    0       0                 0
10     1/21/2017    3:45:00  3   Pineapple    0       0                 0
11     1/21/2017    4:00:00  11  Apple        0       0                 1
12     1/21/2017    4:15:00  11  Apple        0       0                 1
13     1/21/2017    4:30:00  11  Apple        0       0                 1
14     1/21/2017    4:45:00  0   Orange       0       0                 0
15     1/22/2017    3:30:00  1   Peach        0       0                 0
16     1/22/2017    3:45:00  1   Peach        0       0                 0
17     1/22/2017    4:00:00  3   Pineapple    0       0                 0
18     1/22/2017    4:15:00  3   Pineapple    0       0                 0
19     1/22/2017    4:30:00  11  Apple        0       1                 1
20     1/22/2017    4:45:00  11  Apple        0       1                 1
21     1/22/2017    5:00:00  11  Apple        0       1                 1
22     1/22/2017    5:15:00  2   Banana       1       1                 0
23     1/22/2017    5:30:00  2   Banana       1       0                 0

what I have achieved so far

df['Banana'] = np.where((df['id']==2) & (df['id'].shift(+1)==11), 1,
                                   0)

formatted_df['output_from the code'] = np.where((df['id']==11) & (df['id'].shift(-1)==2), 1,
                                    np.where((df['id']==11) & (df['id'].shift(-1)==11), 1,
                                   0))

is there a way to write np.where based on previous row


Solution

  • I have a utility library I maintain called haggis. It has two functions that convert an array of booleans to an array of run markers and vice versa called haggis.math.mask2runs and haggis.math.runs2mask, respectively. The functions are inverses of each other. mask2runs works like this:

    >>> mask2runs([0, 1, 0, 1, 1, 1, 0, 1, 1])
    array([[1, 2],
           [3, 6],
           [7, 9]], dtype=int64)
    

    The first column is the start indices of each run of True elements, and the second column is the exclusive end. You can use this directly:

    # Grab locations of 11s
    runs = mask2runs(df.id == 11)
    # remove the ones that aren't followed by 2
    runs = runs[df.id.iloc[np.clip(runs[:, 1], 0, len(df) - 1)] == 2]
    # Convert remaining runs back to mask
    df.output = runs2mask(runs, n=len(df))
    

    You don't need to import haggis to use these functions. They are actually pretty simple, especially in this limited usecase. The source is here, but you can also summarize them as follows specifically for this problem:

    def mask2runs(mask):
        return np.flatnonzero(np.diff(np.r_[np.int8(0), mask.view(np.int8) np.int8(0)])).reshape(-1, 2)
    
    def runs2mask(runs, n):
        mask = np.zeros(n, dtype=bool)
        view = mask.view(np.int8)
        view[runs[:, 0]] = 1
        view[runs[:-1, 1] if runs[-1, 1] == n else runs[:, 1]] = -1
        np.cumsum(view, out=view)
        return mask