Tags: python, pandas, categorical-data, dummy-variable

Pandas one-hot-encode columns to dummies, including an 'other' encoding


My ultimate goal is one-hot-encoding on a Pandas column. In this case, I want to one-hot-encode column "b" as follows: keep apples, bananas and oranges, and encode any other fruit as "other".

Example: in the code below, "grapefruit" is rewritten as "other", as would "kiwi" and "avocado" if they appeared in my data.

The code below works:

import pandas as pd

df = pd.DataFrame({
    "a": [1,2,3,4,5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})
print(df)

def analyze_fruit(s):
    if s in ("apple", "banana", "orange"):
        return s
    else:
        return "other"

df['b'] = df['b'].apply(analyze_fruit)

df2 = pd.get_dummies(df['b'], prefix='b')
print(df2)

My question: is there a shorter way to do the analyze_fruit() business? I tried DataFrame.replace() with a negative lookahead assertion without success.


Solution

  • You can set up the Categorical before calling get_dummies. Anything that does not match the set categories becomes NaN, which can then be filled with fillna. Another benefit of the Categorical is that ordering can be defined here as well by adding ordered=True:

    df['b'] = pd.Categorical(
        df['b'],
        categories=['apple', 'banana', 'orange', 'other']
    ).fillna('other')
    
    df2 = pd.get_dummies(df['b'], prefix='b')
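
To see why the fillna step is needed, here is a minimal sketch of the intermediate state. It assumes pandas is imported as pd; the names `keep`, `cat`, `filled` and `dummies` are illustrative and not from the answer itself.

```python
import pandas as pd

# Same fruit column as in the question.
b = pd.Series(["apple", "banana", "banana", "orange", "grapefruit"])

# Illustrative name: the categories to keep, plus the catch-all "other".
keep = ["apple", "banana", "orange", "other"]

# Values outside `keep` ("grapefruit") become NaN at this point.
cat = pd.Categorical(b, categories=keep)
n_missing = int(pd.isna(cat).sum())

# Filling NaN works because "other" is already a declared category.
filled = pd.Series(cat).fillna("other")

# One dummy column per declared category, in the declared order.
dummies = pd.get_dummies(filled, prefix="b")

print(n_missing)
print(list(dummies.columns))
```

Note that "other" has to be listed in categories up front. Filling a Categorical with a value that is not one of its categories raises an error.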
    

    Dummies are typically used with categorical data, so being able to fix the order in which the dummy columns appear can be helpful. That said, a standard replacement with something like np.where would also work here:

    import numpy as np
    
    df['b'] = np.where(df['b'].isin(['apple', 'banana', 'orange']),
                       df['b'],
                       'other')
    
    df2 = pd.get_dummies(df['b'], prefix='b')
    

    Both produce df2 (pandas 2.0 and later print True/False instead of 1/0 unless dtype=int is passed to get_dummies):

       b_apple  b_banana  b_orange  b_other
    0        1         0         0        0
    1        0         1         0        0
    2        0         1         0        0
    3        0         0         1        0
    4        0         0         0        1
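
A pandas-only variant of the np.where approach is Series.where, which keeps values where the condition holds and substitutes the second argument elsewhere. This is a sketch rather than part of the answer above. The `keep` list is an illustrative name, and dtype=int is passed so the result shows 0/1 like the table above even on pandas versions where get_dummies returns booleans by default.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})

keep = ["apple", "banana", "orange"]  # illustrative name

# Keep values found in `keep`; replace everything else with "other".
df["b"] = df["b"].where(df["b"].isin(keep), "other")

# dtype=int yields 0/1 columns instead of booleans.
df2 = pd.get_dummies(df["b"], prefix="b", dtype=int)
print(df2)
```

With plain strings the dummy columns come out in sorted order, which happens to match the Categorical order in this example. If the column order matters, the Categorical approach is the safer choice.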