pandas, one-hot-encoding

How to one-hot-encode values in a column while treating some values as the same category


I want to one-hot-encode a column in a Pandas dataframe. Some values in that column occur rarely, so I would like to treat them as a single category. Is there a way to do this with the one-hot-encoder or get_dummies methods? One approach I came up with is to replace those values using a dict before encoding. Any suggestions would be highly appreciated.


Solution

  • You can use:

    import pandas as pd

    # Sample data, cast to strings so the values act as category labels
    df = pd.DataFrame({'A':[1,2,3,4,5,6,6,5,4]}).astype(str)
    print(df)
       A
    0  1
    1  2
    2  3
    3  4
    4  5
    5  6
    6  6
    7  5
    8  4
    

    First, use value_counts with boolean indexing to find every value that occurs fewer times than the threshold, then use a dict comprehension to map each of those values to the same scalar, such as 0. Finally, pass that dict to replace:

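    thresh = 2
    # Number of occurrences of each distinct value
    s = df['A'].value_counts()
    # Map every value seen fewer than `thresh` times to the shared label 0
    d = {x:0 for x in s[s < thresh].index}
    print(d)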
    {'1': 0, '3': 0, '2': 0}
    
    # Collapse the rare values into the single category 0
    df = df.replace(d)
    print(df)
       A
    0  0
    1  0
    2  0
    3  4
    4  5
    5  6
    6  6
    7  5
    8  4
    
    # dtype=int keeps 0/1 output; pandas 2.0+ returns booleans by default
    print(pd.get_dummies(df, prefix='', prefix_sep='', dtype=int))
       0  4  5  6
    0  1  0  0  0
    1  1  0  0  0
    2  1  0  0  0
    3  0  1  0  0
    4  0  0  1  0
    5  0  0  0  1
    6  0  0  0  1
    7  0  0  1  0
    8  0  1  0  0
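
The same idea can be wrapped in a small reusable helper. What follows is only a minimal sketch, not part of the original answer: the function name `one_hot_with_rare_bucket`, its `min_count` and `other_label` parameters, and the label `'other'` are all hypothetical choices made for illustration. It uses `Series.where` instead of `replace`, so the rare values are merged under an explicit string label rather than the integer 0.

```python
import pandas as pd


def one_hot_with_rare_bucket(series, min_count=2, other_label="other"):
    """One-hot-encode `series`, merging values seen fewer than `min_count` times.

    Hypothetical helper for illustration; the names are not from any library.
    """
    # Per-row frequency of each value, aligned with the original index.
    counts = series.map(series.value_counts())
    # Keep frequent values as they are; replace rare ones with one shared label.
    collapsed = series.where(counts >= min_count, other_label)
    # dtype=int yields 0/1 columns instead of booleans.
    return pd.get_dummies(collapsed, dtype=int)


df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 6, 5, 4]}).astype(str)
encoded = one_hot_with_rare_bucket(df["A"], min_count=2)
print(encoded)
```

With the sample data above, the values 1, 2 and 3 each occur once, so rows 0 to 2 are flagged in the `other` column, while 4, 5 and 6 keep their own columns. A string label avoids mixing integer and string categories in the same column, but it assumes `'other'` does not already appear as a real value in the data.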