Search code examples
pythonpandasdataframegroup-by

How can I change the groupby scope to find the first value that meets the conditions of a mask?


This is an extension to this post.

My DataFrame is:

import pandas as pd
df = pd.DataFrame(
    {
        'main': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y', 'y', 'y'],
        'sub': ['c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'e', 'f', 'f', 'f', 'f', 'g', 'g', 'g'],
        'num_1': [97, 90, 105, 2100, 1000, 101, 110, 222, 90, 100, 99, 90, 2, 92, 95, 93],
        'num_2': [100, 100, 100, 102, 102, 209, 209, 209, 209, 100, 100, 100, 100, 90, 90, 90],
        'num_3': [99, 110, 110, 110, 110, 222, 222, 222, 222, 150, 101, 200, 5, 95, 95, 100],
        'label': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p']
    }
)

And this is the expected output. I want to create column result:

   main sub  num_1  num_2  num_3 label result
0     x   c     97    100     99     a      b
1     x   c     90    100    110     b      b
2     x   c    105    100    110     c      b
3     x   d   2100    102    110     d      f
4     x   d   1000    102    110     e      f
5     x   e    101    209    222     f      f
6     x   e    110    209    222     g      f
7     x   e    222    209    222     h      f
8     x   e     90    209    222     i      f
9     y   f    100    100    150     j      k
10    y   f     99    100    101     k      k
11    y   f     90    100    200     l      k
12    y   f      2    100      5     m      k
13    y   g     92     90     95     n    NaN
14    y   g     95     90     95     o    NaN
15    y   g     93     90    100     p    NaN

The mask is:

mask = (
    (df.num_1 < df.num_2) & 
    (df.num_2 < df.num_3)
)

The process starts like this:

a) The groupby column is sub

b) Finding the first row that meets the condition of the mask for each group.

c) Put the value of label in the result

If there are no rows that meets the condition of the mask, then the groupby column is changed to main to find the first row of mask. There is condition for this phase:

The previous subs should not be considered when using main as the groupby column.

An example of the above steps for group d in the sub column:

a) sub is the groupby column.

b) There are no rows in the d group that df.num_2 is between df.num_1 and df.num_3 (the condition of the mask)

So now for group d, its main group is searched. However group c is also in this main group. Since it is before group d, group c should not count for this step. So in x group the first row of the mask has f label (101 < 102 < 222).

One thing to note is that for each sub group num_2 does not change throughout the group. For example for entire group c num_2 is 100.

This is my attempt based on this answer but it does not work:

def find(g):
    # get sub as 0,1,2…
    sub = pd.factorize(g['sub'])[0]
    # convert inputs to numpy
    a = g['num_1'].to_numpy()
    b = g.loc[~g['sub'].duplicated(), 'num_2'].to_numpy()
    c = g['num_3'].to_numpy()
    # form mask
    # (a[:, None] > b) -> num_1 > num_2
    # (sub[:, None] >= np.arange(len(b))) -> exclude previous groups
    m = (a[:, None] < b) &  (a[:, None] > c) & (sub[:, None] >= np.arange(len(b)))
    # find first True per column
    return pd.Series(np.where(m.any(0), a[m.argmax(0)], np.nan)[sub],
                     index=g.index)

df['result'] = df.groupby('main', group_keys=False).apply(find)

Solution

  • You can update my previous code to use two comparisons, you have to pay attention to use num_2 as columns. Also you need to change the output reference column to "label":

    def find(g):
        # get sub as 0,1,2…
        sub = pd.factorize(g['sub'])[0]
        # convert inputs to numpy
        n1 = g['num_1'].to_numpy()
        n2 = g.loc[~g['sub'].duplicated(), 'num_2'].to_numpy()
        n3 = g['num_3'].to_numpy()
        # form mask
        # (n1[:, None] > n1) -> num_1 > num_2
        # (n3[:, None] > n2) -> num_3 > num_2
        # (sub[:, None] >= np.arange(len(b))) -> exclude previous groups
        m = ((n1[:, None] < n2) & (n3[:, None] > n2)
             & (sub[:, None] >= np.arange(len(n2)))
            )
        # find first True per column
        return pd.Series(np.where(m.any(0), g['label'].to_numpy()
                                  [m.argmax(0)], np.nan)[sub],
                         index=g.index)
    
    df['result'] = df.groupby('main', group_keys=False).apply(find)
    

    Output:

       main sub  num_1  num_2  num_3 label result
    0     x   c     97    100     99     a      b
    1     x   c     90    100    110     b      b
    2     x   c    105    100    110     c      b
    3     x   d   2100    102    110     d      f
    4     x   d   1000    102    110     e      f
    5     x   e    101    209    222     f      f
    6     x   e    110    209    222     g      f
    7     x   e    222    209    222     h      f
    8     x   e     90    209    222     i      f
    9     y   f    100    100    150     j      k
    10    y   f     99    100    101     k      k
    11    y   f     90    100    200     l      k
    12    y   f      2    100      5     m      k
    13    y   g     92     90     95     n    NaN
    14    y   g     95     90     95     o    NaN
    15    y   g     93     90    100     p    NaN