Search code examples
pythonpandasbinning

classifying a column based on other column and append to new column


This is the dataset I have, intuitively, Firm represents the unique firm identifier. y1, y2, y3 represents the number of employees they have. prob_y1, prob_y2, prob_y3 are the probabilities of number of employees. I need to classify prob_y1, prob_y2 and prob_y3 based on the values of y1, y2 and y3. I have attached function to classify those.

   Firm  y1    y2    y3   prob_y1   prob_y2   prob_y3
0     A   1     2     7  0.006897  0.000421  0.002729
1     B   2     3    45  0.013793  0.000632  0.017544
2     C   3     4    40  0.020690  0.000842  0.015595
3     D   4     7     3  0.027586  0.001474  0.001170
4     E   5     9     4  0.034483  0.001895  0.001559
5     F   6   400    12  0.041379  0.084211  0.004678
6     G   7    50    32  0.048276  0.010526  0.012476
7     H   8    70     0  0.055172  0.014737  0.000000
8     I   9    95    76  0.062069  0.020000  0.029630
9     J  10    98     1  0.068966  0.020632  0.000390
10    K  20     2    45  0.137931  0.000421  0.017544
11    L  30    10  2000  0.206897  0.002105  0.779727
12    M  40  4000   300  0.275862  0.842105  0.116959


def func(x):
    """this function is used to compute the bin"""
    if x < 5:
        return "binA"
    elif x >= 5 and x<10:
        return "binB"
    elif x >=10 and x<20:
        return "binC"
    elif x>=20 and x<30:
        return "binD"
    elif x>=30 and x<50:
        return "binE"
    elif x >=50 and x<100:

        return "binF"
    elif x>= 100 and x<200:
        return "binG"

    elif x>=200:
        return "binH"
    else:
        return 'binUnc'

when I run the following code, I get a result set what I am looking forward to.

for i in df:
    if i !='Firm':

        df[i] = df[i].apply(func)

    print(df)

   Firm    y1    y2    y3 prob_y1 prob_y2 prob_y3
0     A  binA  binA  binB    binA    binA    binA
1     B  binA  binA  binE    binA    binA    binA
2     C  binA  binA  binE    binA    binA    binA
3     D  binA  binB  binA    binA    binA    binA
4     E  binB  binB  binA    binA    binA    binA
5     F  binB  binH  binC    binA    binA    binA
6     G  binB  binF  binE    binA    binA    binA
7     H  binB  binF  binA    binA    binA    binA
8     I  binB  binF  binF    binA    binA    binA
9     J  binC  binF  binA    binA    binA    binA
10    K  binD  binA  binE    binA    binA    binA
11    L  binE  binC  binH    binA    binA    binA
12    M  binE  binH  binH    binA    binA    binA

My desired results is as follows:

Firm y1 biny1   y2  binY2   y3  biny3   prob_y1 biny1   prob_y2 biny2   prob_y3 biny3
A   1   binA    2   binA    7   binB    0.0069  binA    0.0004  binA    0.0027  binB
B   2   binA    3   binA    45  binE    0.0138  binA    0.0006  binA    0.0175  binE
C   3   binA    4   binA    40  binE    0.0207  binA    0.0008  binA    0.0156  binE
D   4   …   7   ….  3   binA    0.0276  …   0.0015  …   0.0012  binA
E   5   ….  9   ….  4   ….  0.0345  …   0.0019  ..  0.0016  …

In crux, what I am doing is classifying those probabilities values (prob_y1, prob_y2 and prob_y3) based on values of y1, y2 and y3. the new columns could be appended at the end, I added for demonstrative purpose only.

Solution provided by user3483203 worked for me.


Solution

  • You can use numpy.digitize here.

    Setup

    b = np.array([5, 10, 20, 30, 50, 100, 200, np.inf])
    l = np.array([f'bin{i}' for i in 'ABCDEFGH'])
    # l = np.array(['bin{}'.format(i) for i in 'ABCDEFGH'])
    cols = ('y1', 'y2', 'y3')
    

    Create a helper function using digitize:

    def foo(cols, bins, labels):
        for col in cols:
            yield (f'{col}_bin', labels[np.digitize(df[col], bins)])
            # yield ('{}_bin'.format(col), labels[np.digitize(df[col], bins)])
    

    Use assign to append to the DataFrame:

    df.assign(**dict(foo(cols, b, l)))
    

       Firm  y1    y2    y3   prob_y1   prob_y2   prob_y3 y1_bin y2_bin y3_bin
    0     A   1     2     7  0.006897  0.000421  0.002729   binA   binA   binB
    1     B   2     3    45  0.013793  0.000632  0.017544   binA   binA   binE
    2     C   3     4    40  0.020690  0.000842  0.015595   binA   binA   binE
    3     D   4     7     3  0.027586  0.001474  0.001170   binA   binB   binA
    4     E   5     9     4  0.034483  0.001895  0.001559   binB   binB   binA
    5     F   6   400    12  0.041379  0.084211  0.004678   binB   binH   binC
    6     G   7    50    32  0.048276  0.010526  0.012476   binB   binF   binE
    7     H   8    70     0  0.055172  0.014737  0.000000   binB   binF   binA
    8     I   9    95    76  0.062069  0.020000  0.029630   binB   binF   binF
    9     J  10    98     1  0.068966  0.020632  0.000390   binC   binF   binA
    10    K  20     2    45  0.137931  0.000421  0.017544   binD   binA   binE
    11    L  30    10  2000  0.206897  0.002105  0.779727   binE   binC   binH
    12    M  40  4000   300  0.275862  0.842105  0.116959   binE   binH   binH