This is an extension to this post.
My DataFrame is:
import pandas as pd
df = pd.DataFrame(
{
'main': ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y', 'y', 'y'],
'sub': ['c', 'c', 'c', 'd', 'd', 'e', 'e', 'e', 'e', 'f', 'f', 'f', 'f', 'g', 'g', 'g'],
'num_1': [97, 90, 105, 2100, 1000, 101, 110, 222, 90, 100, 99, 90, 2, 92, 95, 93],
'num_2': [100, 100, 100, 102, 102, 209, 209, 209, 209, 100, 100, 100, 100, 90, 90, 90],
'num_3': [99, 110, 110, 110, 110, 222, 222, 222, 222, 150, 101, 200, 5, 95, 95, 100],
'label': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p']
}
)
And this is the expected output. I want to create column result
:
main sub num_1 num_2 num_3 label result
0 x c 97 100 99 a b
1 x c 90 100 110 b b
2 x c 105 100 110 c b
3 x d 2100 102 110 d f
4 x d 1000 102 110 e f
5 x e 101 209 222 f f
6 x e 110 209 222 g f
7 x e 222 209 222 h f
8 x e 90 209 222 i f
9 y f 100 100 150 j k
10 y f 99 100 101 k k
11 y f 90 100 200 l k
12 y f 2 100 5 m k
13 y g 92 90 95 n NaN
14 y g 95 90 95 o NaN
15 y g 93 90 100 p NaN
The mask is:
mask = (
(df.num_1 < df.num_2) &
(df.num_2 < df.num_3)
)
The process starts like this:
a) The groupby column is sub
b) Finding the first row that meets the condition of the mask for each group.
c) Put the value of label
in the result
If there are no rows that meets the condition of the mask, then the groupby column is changed to main
to find the first row of mask. There is condition for this phase:
The previous sub
s should not be considered when using main
as the groupby
column.
An example of the above steps for group d
in the sub column:
a) sub
is the groupby column.
b) There are no rows in the d
group that df.num_2
is between df.num_1
and df.num_3
(the condition of the mask
)
So now for group d
, its main group is searched. However group c
is also in this main group. Since it is before group d
, group c
should not count for this step. So in x
group the first row of the mask
has f
label (101 < 102 < 222).
One thing to note is that for each sub
group num_2
does not change throughout the group. For example for entire group c
num_2
is 100.
This is my attempt based on this answer but it does not work:
def find(g):
# get sub as 0,1,2…
sub = pd.factorize(g['sub'])[0]
# convert inputs to numpy
a = g['num_1'].to_numpy()
b = g.loc[~g['sub'].duplicated(), 'num_2'].to_numpy()
c = g['num_3'].to_numpy()
# form mask
# (a[:, None] > b) -> num_1 > num_2
# (sub[:, None] >= np.arange(len(b))) -> exclude previous groups
m = (a[:, None] < b) & (a[:, None] > c) & (sub[:, None] >= np.arange(len(b)))
# find first True per column
return pd.Series(np.where(m.any(0), a[m.argmax(0)], np.nan)[sub],
index=g.index)
df['result'] = df.groupby('main', group_keys=False).apply(find)
You can update my previous code to use two comparisons, you have to pay attention to use num_2 as columns. Also you need to change the output reference column to "label":
def find(g):
# get sub as 0,1,2…
sub = pd.factorize(g['sub'])[0]
# convert inputs to numpy
n1 = g['num_1'].to_numpy()
n2 = g.loc[~g['sub'].duplicated(), 'num_2'].to_numpy()
n3 = g['num_3'].to_numpy()
# form mask
# (n1[:, None] > n1) -> num_1 > num_2
# (n3[:, None] > n2) -> num_3 > num_2
# (sub[:, None] >= np.arange(len(b))) -> exclude previous groups
m = ((n1[:, None] < n2) & (n3[:, None] > n2)
& (sub[:, None] >= np.arange(len(n2)))
)
# find first True per column
return pd.Series(np.where(m.any(0), g['label'].to_numpy()
[m.argmax(0)], np.nan)[sub],
index=g.index)
df['result'] = df.groupby('main', group_keys=False).apply(find)
Output:
main sub num_1 num_2 num_3 label result
0 x c 97 100 99 a b
1 x c 90 100 110 b b
2 x c 105 100 110 c b
3 x d 2100 102 110 d f
4 x d 1000 102 110 e f
5 x e 101 209 222 f f
6 x e 110 209 222 g f
7 x e 222 209 222 h f
8 x e 90 209 222 i f
9 y f 100 100 150 j k
10 y f 99 100 101 k k
11 y f 90 100 200 l k
12 y f 2 100 5 m k
13 y g 92 90 95 n NaN
14 y g 95 90 95 o NaN
15 y g 93 90 100 p NaN