This is the dataset I have, intuitively, Firm represents the unique firm identifier. y1, y2, y3 represents the number of employees they have. prob_y1, prob_y2, prob_y3 are the probabilities of number of employees. I need to classify prob_y1, prob_y2 and prob_y3 based on the values of y1, y2 and y3. I have attached function to classify those.
Firm y1 y2 y3 prob_y1 prob_y2 prob_y3
0 A 1 2 7 0.006897 0.000421 0.002729
1 B 2 3 45 0.013793 0.000632 0.017544
2 C 3 4 40 0.020690 0.000842 0.015595
3 D 4 7 3 0.027586 0.001474 0.001170
4 E 5 9 4 0.034483 0.001895 0.001559
5 F 6 400 12 0.041379 0.084211 0.004678
6 G 7 50 32 0.048276 0.010526 0.012476
7 H 8 70 0 0.055172 0.014737 0.000000
8 I 9 95 76 0.062069 0.020000 0.029630
9 J 10 98 1 0.068966 0.020632 0.000390
10 K 20 2 45 0.137931 0.000421 0.017544
11 L 30 10 2000 0.206897 0.002105 0.779727
12 M 40 4000 300 0.275862 0.842105 0.116959
def func(x):
"""this function is used to compute the bin"""
if x < 5:
return "binA"
elif x >= 5 and x<10:
return "binB"
elif x >=10 and x<20:
return "binC"
elif x>=20 and x<30:
return "binD"
elif x>=30 and x<50:
return "binE"
elif x >=50 and x<100:
return "binF"
elif x>= 100 and x<200:
return "binG"
elif x>=200:
return "binH"
else:
return 'binUnc'
when I run the following code, I get a result set what I am looking forward to.
for i in df:
if i !='Firm':
df[i] = df[i].apply(func)
print(df)
Firm y1 y2 y3 prob_y1 prob_y2 prob_y3
0 A binA binA binB binA binA binA
1 B binA binA binE binA binA binA
2 C binA binA binE binA binA binA
3 D binA binB binA binA binA binA
4 E binB binB binA binA binA binA
5 F binB binH binC binA binA binA
6 G binB binF binE binA binA binA
7 H binB binF binA binA binA binA
8 I binB binF binF binA binA binA
9 J binC binF binA binA binA binA
10 K binD binA binE binA binA binA
11 L binE binC binH binA binA binA
12 M binE binH binH binA binA binA
My desired results is as follows:
Firm y1 biny1 y2 binY2 y3 biny3 prob_y1 biny1 prob_y2 biny2 prob_y3 biny3
A 1 binA 2 binA 7 binB 0.0069 binA 0.0004 binA 0.0027 binB
B 2 binA 3 binA 45 binE 0.0138 binA 0.0006 binA 0.0175 binE
C 3 binA 4 binA 40 binE 0.0207 binA 0.0008 binA 0.0156 binE
D 4 … 7 …. 3 binA 0.0276 … 0.0015 … 0.0012 binA
E 5 …. 9 …. 4 …. 0.0345 … 0.0019 .. 0.0016 …
In crux, what I am doing is classifying those probabilities values (prob_y1, prob_y2 and prob_y3) based on values of y1, y2 and y3. the new columns could be appended at the end, I added for demonstrative purpose only.
Solution provided by user3483203 worked for me.
You can use numpy.digitize
here.
Setup
b = np.array([5, 10, 20, 30, 50, 100, 200, np.inf])
l = np.array([f'bin{i}' for i in 'ABCDEFGH'])
# l = np.array(['bin{}'.format(i) for i in 'ABCDEFGH'])
cols = ('y1', 'y2', 'y3')
Create a helper function using digitize
:
def foo(cols, bins, labels):
for col in cols:
yield (f'{col}_bin', labels[np.digitize(df[col], bins)])
# yield ('{}_bin'.format(col), labels[np.digitize(df[col], bins)])
Use assign
to append to the DataFrame:
df.assign(**dict(foo(cols, b, l)))
Firm y1 y2 y3 prob_y1 prob_y2 prob_y3 y1_bin y2_bin y3_bin
0 A 1 2 7 0.006897 0.000421 0.002729 binA binA binB
1 B 2 3 45 0.013793 0.000632 0.017544 binA binA binE
2 C 3 4 40 0.020690 0.000842 0.015595 binA binA binE
3 D 4 7 3 0.027586 0.001474 0.001170 binA binB binA
4 E 5 9 4 0.034483 0.001895 0.001559 binB binB binA
5 F 6 400 12 0.041379 0.084211 0.004678 binB binH binC
6 G 7 50 32 0.048276 0.010526 0.012476 binB binF binE
7 H 8 70 0 0.055172 0.014737 0.000000 binB binF binA
8 I 9 95 76 0.062069 0.020000 0.029630 binB binF binF
9 J 10 98 1 0.068966 0.020632 0.000390 binC binF binA
10 K 20 2 45 0.137931 0.000421 0.017544 binD binA binE
11 L 30 10 2000 0.206897 0.002105 0.779727 binE binC binH
12 M 40 4000 300 0.275862 0.842105 0.116959 binE binH binH