python, pandas, scikit-learn, one-hot-encoding

Pandas One hot encoding: Bundling together less frequent categories


I'm one-hot encoding a categorical column that has some 18 distinct values. I want to create new columns only for the values that appear more often than some threshold (say 1%), plus one more column, named other, that is 1 whenever the value is not one of those frequent values.

I'm using pandas with scikit-learn. I've explored pandas' get_dummies and scikit-learn's OneHotEncoder, but can't figure out how to bundle the less frequent values into one column.


Solution

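  • Plan

    • Use pd.get_dummies to one-hot encode as normal.
    • Compare each category's relative frequency to the threshold to identify the columns that get aggregated.
      • I use value_counts with the parameter normalize=True to get each value's fraction of occurrences.
    • Sum the rare columns into a single 'other' column and join it to the frequent ones.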
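
To make the frequency step of the plan concrete, here is a minimal sketch of just the mask computation. The names freq, rare and thresh are illustrative only, and the example series is assumed to have one 'a', two 'b's, and so on up to six 'f's.

```python
import numpy as np
import pandas as pd

# Example series: 'a' once, 'b' twice, ..., 'f' six times (21 values in total).
s = pd.Series(np.repeat(list("abcdef"), range(1, 7)))

# Relative frequency of each category; the fractions sum to 1.
freq = s.value_counts(normalize=True, sort=False)

# Boolean mask: True for categories rarer than the threshold.
thresh = 0.1
rare = freq < thresh

print(freq.round(3).to_dict())
print(rare[rare].index.tolist())
```

With a threshold of 0.1, 'a' (1/21) and 'b' (2/21) fall below the cut-off, so those are the columns that would be bundled.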

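    import numpy as np  # used to build the example series below
    import pandas as pd

    def hot_mess(s, thresh):
        # One-hot encode every category; dtype=int gives 0/1 rather than booleans
        d = pd.get_dummies(s, dtype=int)
        # True for categories whose relative frequency is below thresh
        f = s.value_counts(sort=False, normalize=True) < thresh
        if not f.any():
            return d
        # Keep the frequent columns and sum the rare ones into 'other'
        return d.loc[:, ~f].join(d.loc[:, f].sum(axis=1).rename('other'))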
    

    Consider the pd.Series s

    s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))
    
    s
    
    0     a
    1     b
    2     b
    3     c
    4     c
    5     c
    6     d
    7     d
    8     d
    9     d
    10    e
    11    e
    12    e
    13    e
    14    e
    15    f
    16    f
    17    f
    18    f
    19    f
    20    f
    dtype: object
    

    hot_mess(s, 0)

        a  b  c  d  e  f
    0   1  0  0  0  0  0
    1   0  1  0  0  0  0
    2   0  1  0  0  0  0
    3   0  0  1  0  0  0
    4   0  0  1  0  0  0
    5   0  0  1  0  0  0
    6   0  0  0  1  0  0
    7   0  0  0  1  0  0
    8   0  0  0  1  0  0
    9   0  0  0  1  0  0
    10  0  0  0  0  1  0
    11  0  0  0  0  1  0
    12  0  0  0  0  1  0
    13  0  0  0  0  1  0
    14  0  0  0  0  1  0
    15  0  0  0  0  0  1
    16  0  0  0  0  0  1
    17  0  0  0  0  0  1
    18  0  0  0  0  0  1
    19  0  0  0  0  0  1
    20  0  0  0  0  0  1
    

    hot_mess(s, .1)

        c  d  e  f  other
    0   0  0  0  0      1
    1   0  0  0  0      1
    2   0  0  0  0      1
    3   1  0  0  0      0
    4   1  0  0  0      0
    5   1  0  0  0      0
    6   0  1  0  0      0
    7   0  1  0  0      0
    8   0  1  0  0      0
    9   0  1  0  0      0
    10  0  0  1  0      0
    11  0  0  1  0      0
    12  0  0  1  0      0
    13  0  0  1  0      0
    14  0  0  1  0      0
    15  0  0  0  1      0
    16  0  0  0  1      0
    17  0  0  0  1      0
    18  0  0  0  1      0
    19  0  0  0  1      0
    20  0  0  0  1      0
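
As a possible variation on the same idea, the rare values can be relabelled before encoding, so that get_dummies only ever sees the frequent categories plus one catch-all label. This is a sketch under assumptions, not the method above: bundle_rare and its other parameter are hypothetical names, and it assumes no real category is already called 'other' (if one is, the two would be merged).

```python
import numpy as np
import pandas as pd

def bundle_rare(s, thresh, other="other"):
    """Replace values whose relative frequency is below thresh with `other`."""
    # Map every row to the relative frequency of its own category.
    freq = s.map(s.value_counts(normalize=True))
    # Keep frequent values as they are; everything else becomes `other`.
    return s.where(freq >= thresh, other)

s = pd.Series(np.repeat(list("abcdef"), range(1, 7)))
encoded = pd.get_dummies(bundle_rare(s, 0.1), dtype=int)
print(encoded.columns.tolist())
```

One difference to keep in mind: here the catch-all column is ordered together with the other category columns instead of being appended last, and missing values are also relabelled as `other` because their frequency lookup is NaN.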