Search code examples
pythonpandasbinning

Conditionally binning


Is it possible to create a new column in a dataframe where the bins for 'X' are based on a value of another column(s). Example below.

The bins for AR1, PO1 and RU1 are different from one another.

Until now I can only get bins for all values in 'X'.

import pandas as pd
import numpy as np
import string
import random

N = 100
J = [2012,2013,2014]
K = ['A','B','C','D','E','F','G','H']
L = ['h','d','a']
S = ['AR1','PO1','RU1']

np.random.seed(0)

df = pd.DataFrame(
    {'X': np.random.uniform(1,10,N),
     'Y': np.random.uniform(1,10,N),
     'J':np.random.choice(J, N),
     'R':np.random.choice(L, N),
     'S':np.random.choice(S,N)
    })

df['bins_X'] = pd.qcut(df['X'], 10)

print(df.head())

enter image description here

The output I would like to have:

enter image description here

EDIT;

On my real data I get a ValueError: edges being not unique. Can I solve this with i.e. rank? How would I add this to the solution proposed?


Solution

  • Simple use pd.qcut within a groupby on S

    df['bins_X'] = df.groupby('S').X.apply(pd.qcut, q=10, labels=np.arange(10))
    
    df.groupby(['bins_X', 'S']).size().unstack()
    
    S       AR1  PO1  RU1
    bins_X               
    0         3    4    4
    1         3    3    4
    2         3    3    4
    3         2    3    4
    4         3    4    4
    5         3    3    3
    6         2    3    4
    7         3    3    4
    8         3    3    4
    9         3    4    4
    

    Leave of the labels parameter if you want them to have their own unique edges

    df['bins_X'] = df.groupby('S').X.apply(pd.qcut, q=10)