
`groupby` - `qcut` but with condition


I have a dataframe as follows:

   key1 key2  val
0     a    x    8
1     a    x    6
2     a    x    7
3     a    x    4
4     a    x    9
5     a    x    1
6     a    x    2
7     a    x    3
8     a    x   10
9     a    x    5
10    a    y    4
11    a    y    9
12    a    y    1
13    a    y    2
14    b    x   17
15    b    x   15
16    b    x   18
17    b    x   19
18    b    x   12
19    b    x   20
20    b    x   14
21    b    x   13
22    b    x   16
23    b    x   11
24    b    y    2
25    b    y    3
26    b    y   10
27    b    y    5
28    b    y    4
29    b    y   24
30    b    y   22

What I need to do is:

  1. Access each group by key1
  2. Within each key1 group, apply qcut only to the observations where key2 == x
  3. Assign observations that fall outside the bin range to the lowest or highest bin

In the dataframe above, the first group (key1 = a) spans indx 0-13, but only indx 0-9 are used to create the bins (thresholds). Those bins are then applied to indx 0-13.

Likewise, the second group (key1 = b) spans indx 14-30, but only indx 14-23 are used to create the bins. Those bins are then applied to indx 14-30.

However, indx 24-28 and indx 29-30 fall outside the bin range: indx 24-28 should be assigned to the lowest bin and indx 29-30 to the highest bin.
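
For context on step 3: plain `pd.cut` maps values outside the bin edges to NaN rather than clamping them to the end bins, which is exactly the problem with the out-of-range rows. A minimal illustration, using edges similar to what qcut would produce for group b:

```python
import numpy as np
import pandas as pd

# with fixed edges, pd.cut maps out-of-range values (10 and 24 here)
# to NaN instead of clamping them into the first/last bin
codes = pd.cut([10, 13, 24], bins=[11, 15.5, 20], labels=False)
# codes -> [nan, 0.0, nan]
```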

The output looks like this:

   key1 key2  val  labels
0     a    x    8          1
1     a    x    6          1
2     a    x    7          1
3     a    x    4          0
4     a    x    9          1
5     a    x    1          0
6     a    x    2          0
7     a    x    3          0
8     a    x   10          1
9     a    x    5          0
10    a    y    4          0
11    a    y    9          1
12    a    y    1          0
13    a    y    2          0
14    b    x   17          1
15    b    x   15          0
16    b    x   18          1
17    b    x   19          1
18    b    x   12          0
19    b    x   20          1
20    b    x   14          0
21    b    x   13          0
22    b    x   16          1
23    b    x   11          0
24    b    y    2          0
25    b    y    3          0
26    b    y   10          0
27    b    y    5          0
28    b    y    4          0
29    b    y   24          1
30    b    y   22          1

My attempt: I create a dict to hold the bins (for simplicity, use q=2):

dict_bins = {}
key_unique = data['key1'].unique()
for k in key_unique:
    sub = data[(data['key1'] == k) & (data['key2'] == 'x')].copy()
    # retbins=True returns (codes, bin_edges); keep only the edges
    dict_bins[k] = pd.qcut(sub['val'], 2, labels=False, retbins=True)[1]

Then I intend to use groupby with apply, but I'm stuck on how to access dict_bins inside the lambda:

data['sort_key1'] = data.groupby(['key1'])['val'].apply(lambda g: --- stuck---)

Any other solution, or a modification to my solution, is appreciated.

Thank you


Solution

  • A first approach is to create a custom function:

    import numpy as np
    import pandas as pd

    def discretize(df):
        # build quantile edges from the key2 == 'x' rows only
        bins = pd.qcut(df.loc[df['key2'] == 'x', 'val'], 2, labels=False, retbins=True)[1]
        # widen the outer edges so out-of-range values fall into the end bins
        bins = [-np.inf] + bins[1:-1].tolist() + [np.inf]
        return pd.cut(df['val'], bins, labels=False)
    
    df['label'] = df.groupby('key1').apply(discretize).droplevel(0)
    

    Output:

    >>> df
       key1 key2  val  label
    0     a    x    8      1
    1     a    x    6      1
    2     a    x    7      1
    3     a    x    4      0
    4     a    x    9      1
    5     a    x    1      0
    6     a    x    2      0
    7     a    x    3      0
    8     a    x   10      1
    9     a    x    5      0
    10    a    y    4      0
    11    a    y    9      1
    12    a    y    1      0
    13    a    y    2      0
    14    b    x   17      1
    15    b    x   15      0
    16    b    x   18      1
    17    b    x   19      1
    18    b    x   12      0
    19    b    x   20      1
    20    b    x   14      0
    21    b    x   13      0
    22    b    x   16      1
    23    b    x   11      0
    24    b    y    2      0
    25    b    y    3      0
    26    b    y   10      0
    27    b    y    5      0
    28    b    y    4      0
    29    b    y   24      1
    30    b    y   22      1
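
    To see what edges `discretize` builds, here is the computation in isolation for group b's key2 == 'x' values (taken from the sample data above):

```python
import numpy as np
import pandas as pd

vals = pd.Series([17, 15, 18, 19, 12, 20, 14, 13, 16, 11])  # key1 == 'b', key2 == 'x'
edges = pd.qcut(vals, 2, labels=False, retbins=True)[1]  # quantile edges; the middle one is the median
bins = [-np.inf] + edges[1:-1].tolist() + [np.inf]       # replace the outer edges with +/- inf
# bins is now [-inf, 15.5, inf]: values below/above the original range
# still fall into the first/last bin instead of becoming NaN
```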
    

    You need to drop the first level of the index to align the indexes:

    >>> df.groupby('key1').apply(discretize)
    key1  # <- you have to drop this index level
    a     0     1
          1     1
          2     1
          3     0
          4     1
          5     0
          6     0
          7     0
          8     1
          9     0
          10    0
          11    1
          12    0
          13    0
    b     14    1
          15    0
          16    1
          17    1
          18    0
          19    1
          20    0
          21    0
          22    1
          23    0
          24    0
          25    0
          26    0
          27    0
          28    0
          29    1
          30    1
    Name: val, dtype: int64
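
  • For completeness, the asker's dict_bins idea can also be finished without a lambda: widen the outer edges (the same ±inf trick as above) and apply each group's edges by iterating over the groups. A sketch, built on the sample data from the question:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'key1': ['a'] * 14 + ['b'] * 17,
    'key2': ['x'] * 10 + ['y'] * 4 + ['x'] * 10 + ['y'] * 7,
    'val': [8, 6, 7, 4, 9, 1, 2, 3, 10, 5, 4, 9, 1, 2,
            17, 15, 18, 19, 12, 20, 14, 13, 16, 11, 2, 3, 10, 5, 4, 24, 22],
})

# build the per-group edges from the key2 == 'x' rows only
dict_bins = {}
for k in data['key1'].unique():
    sub = data[(data['key1'] == k) & (data['key2'] == 'x')]
    edges = pd.qcut(sub['val'], 2, labels=False, retbins=True)[1]
    # widen the outer edges so out-of-range values land in the end bins
    dict_bins[k] = [-np.inf] + edges[1:-1].tolist() + [np.inf]

# apply each group's edges to all rows of that group
for k, idx in data.groupby('key1').groups.items():
    data.loc[idx, 'labels'] = pd.cut(data.loc[idx, 'val'], dict_bins[k], labels=False)
data['labels'] = data['labels'].astype(int)
```

This produces the same labels column as the `discretize` approach, at the cost of an explicit loop over the groups.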