
`groupby` - `qcut` but with condition


I have a dataframe as follows:

   key1 key2  val
0     a    x    8
1     a    x    6
2     a    x    7
3     a    x    4
4     a    x    9
5     a    x    1
6     a    x    2
7     a    x    3
8     a    x   10
9     a    x    5
10    a    y    4
11    a    y    9
12    a    y    1
13    a    y    2
14    b    x   17
15    b    x   15
16    b    x   18
17    b    x   19
18    b    x   12
19    b    x   20
20    b    x   14
21    b    x   13
22    b    x   16
23    b    x   11
24    b    y    2
25    b    y    3
26    b    y   10
27    b    y    5
28    b    y    4
29    b    y   24
30    b    y   22

What I need to do is:

  1. Access each group by key1
  2. Within each key1 group, apply qcut only to the observations where key2 == x
  3. Assign observations that fall outside the bin range to the lowest or highest bin

In the dataframe above, the first group (key1 = a) spans indx 0-13, but only indx 0-9 are used to create the bins (thresholds). Those bins are then applied to indx 0-13.

Likewise, the second group (key1 = b) spans indx 14-30, but only indx 14-23 are used to create the bins. Those bins are then applied to indx 14-30.

However, indx 24-28 and indx 29-30 fall outside the bin range: indx 24-28 should be assigned to the lowest bin and indx 29-30 to the highest bin.
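
For context on step 3: plain `pd.cut` maps values outside the bin edges to NaN rather than clamping them to the end bins, which is exactly the problem with the out-of-range rows. A minimal illustration, using edges similar to what qcut would produce for group b:

```python
import numpy as np
import pandas as pd

# with fixed edges, pd.cut maps out-of-range values (10 and 24 here)
# to NaN instead of clamping them into the first/last bin
codes = pd.cut([10, 13, 24], bins=[11, 15.5, 20], labels=False)
# codes -> [nan, 0.0, nan]
```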

The output looks like this:

   key1 key2  val  labels
0     a    x    8          1
1     a    x    6          1
2     a    x    7          1
3     a    x    4          0
4     a    x    9          1
5     a    x    1          0
6     a    x    2          0
7     a    x    3          0
8     a    x   10          1
9     a    x    5          0
10    a    y    4          0
11    a    y    9          1
12    a    y    1          0
13    a    y    2          0
14    b    x   17          1
15    b    x   15          0
16    b    x   18          1
17    b    x   19          1
18    b    x   12          0
19    b    x   20          1
20    b    x   14          0
21    b    x   13          0
22    b    x   16          1
23    b    x   11          0
24    b    y    2          0
25    b    y    3          0
26    b    y   10          0
27    b    y    5          0
28    b    y    4          0
29    b    y   24          1
30    b    y   22          1

My attempt: I create a dict to hold the bins (for simplicity, use q=2):

dict_bins = {}
key_unique = data['key1'].unique()
for k in key_unique:
    sub = data[(data['key1'] == k) & (data['key2'] == 'x')].copy()
    # retbins=True returns (codes, bin_edges); keep only the edges
    dict_bins[k] = pd.qcut(sub['val'], 2, labels=False, retbins=True)[1]

Then I intend to use groupby with apply, but I'm stuck on how to access dict_bins inside the lambda:

data['sort_key1'] = data.groupby(['key1'])['val'].apply(lambda g: --- stuck---)

Any other solution, or a modification to my solution, is appreciated.

Thank you


Solution

  • A first approach is to create a custom function:

    import numpy as np
    import pandas as pd

    def discretize(df):
        # build quantile edges from the key2 == 'x' rows only
        bins = pd.qcut(df.loc[df['key2'] == 'x', 'val'], 2, labels=False, retbins=True)[1]
        # widen the outer edges so out-of-range values fall into the end bins
        bins = [-np.inf] + bins[1:-1].tolist() + [np.inf]
        return pd.cut(df['val'], bins, labels=False)
    
    df['label'] = df.groupby('key1').apply(discretize).droplevel(0)
    

    Output:

    >>> df
       key1 key2  val  label
    0     a    x    8      1
    1     a    x    6      1
    2     a    x    7      1
    3     a    x    4      0
    4     a    x    9      1
    5     a    x    1      0
    6     a    x    2      0
    7     a    x    3      0
    8     a    x   10      1
    9     a    x    5      0
    10    a    y    4      0
    11    a    y    9      1
    12    a    y    1      0
    13    a    y    2      0
    14    b    x   17      1
    15    b    x   15      0
    16    b    x   18      1
    17    b    x   19      1
    18    b    x   12      0
    19    b    x   20      1
    20    b    x   14      0
    21    b    x   13      0
    22    b    x   16      1
    23    b    x   11      0
    24    b    y    2      0
    25    b    y    3      0
    26    b    y   10      0
    27    b    y    5      0
    28    b    y    4      0
    29    b    y   24      1
    30    b    y   22      1
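
    To see what edges `discretize` builds, here is the computation in isolation for group b's key2 == 'x' values (taken from the sample data above):

```python
import numpy as np
import pandas as pd

vals = pd.Series([17, 15, 18, 19, 12, 20, 14, 13, 16, 11])  # key1 == 'b', key2 == 'x'
edges = pd.qcut(vals, 2, labels=False, retbins=True)[1]  # quantile edges; the middle one is the median
bins = [-np.inf] + edges[1:-1].tolist() + [np.inf]       # replace the outer edges with +/- inf
# bins is now [-inf, 15.5, inf]: values below/above the original range
# still fall into the first/last bin instead of becoming NaN
```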
    

    You need to drop the first level of the index to align the indexes:

    >>> df.groupby('key1').apply(discretize)
    key1  # <- you have to drop this index level
    a     0     1
          1     1
          2     1
          3     0
          4     1
          5     0
          6     0
          7     0
          8     1
          9     0
          10    0
          11    1
          12    0
          13    0
    b     14    1
          15    0
          16    1
          17    1
          18    0
          19    1
          20    0
          21    0
          22    1
          23    0
          24    0
          25    0
          26    0
          27    0
          28    0
          29    1
          30    1
    Name: val, dtype: int64
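
  • For completeness, the asker's dict_bins idea can also be finished without a lambda: widen the outer edges (the same ±inf trick as above) and apply each group's edges by iterating over the groups. A sketch, built on the sample data from the question:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'key1': ['a'] * 14 + ['b'] * 17,
    'key2': ['x'] * 10 + ['y'] * 4 + ['x'] * 10 + ['y'] * 7,
    'val': [8, 6, 7, 4, 9, 1, 2, 3, 10, 5, 4, 9, 1, 2,
            17, 15, 18, 19, 12, 20, 14, 13, 16, 11, 2, 3, 10, 5, 4, 24, 22],
})

# build the per-group edges from the key2 == 'x' rows only
dict_bins = {}
for k in data['key1'].unique():
    sub = data[(data['key1'] == k) & (data['key2'] == 'x')]
    edges = pd.qcut(sub['val'], 2, labels=False, retbins=True)[1]
    # widen the outer edges so out-of-range values land in the end bins
    dict_bins[k] = [-np.inf] + edges[1:-1].tolist() + [np.inf]

# apply each group's edges to all rows of that group
for k, idx in data.groupby('key1').groups.items():
    data.loc[idx, 'labels'] = pd.cut(data.loc[idx, 'val'], dict_bins[k], labels=False)
data['labels'] = data['labels'].astype(int)
```

This produces the same labels column as the `discretize` approach, at the cost of an explicit loop over the groups.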