My ultimate goal is to one-hot encode a pandas column. In this case, I want to one-hot encode column "b" as follows: keep apples, bananas and oranges, and encode any other fruit as "other".
Example: in the code below, "grapefruit" will be rewritten as "other", as would "kiwi" and "avocado" if they appeared in my data.
The code below works:
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})
print(df)

def analyze_fruit(s):
    # Keep the three fruits of interest; lump everything else into "other".
    if s in ("apple", "banana", "orange"):
        return s
    else:
        return "other"

df['b'] = df['b'].apply(analyze_fruit)
df2 = pd.get_dummies(df['b'], prefix='b')
print(df2)
My question: is there a shorter way to do the analyze_fruit() business? I tried DataFrame.replace() with a negative lookahead assertion, without success.
You can set up a Categorical before get_dummies and then call fillna: anything that does not match the set categories becomes NaN, which can easily be filled with fillna. Another benefit of the Categorical is that an ordering can be defined here as well, by adding ordered=True:
df['b'] = pd.Categorical(
    df['b'],
    categories=['apple', 'banana', 'orange', 'other']
).fillna('other')
df2 = pd.get_dummies(df['b'], prefix='b')
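To make the intermediate steps visible, here is a small self-contained sketch of the same idea. The sample values (including "kiwi" and "avocado") and the variable names are made up for illustration, and dtype=int is passed only so that the dummies are 0/1 regardless of pandas version:

```python
import pandas as pd

# Hypothetical sample data: "kiwi" and "avocado" are not in the keep-list.
fruit = pd.Series(["apple", "kiwi", "banana", "avocado"], name="b")

# Values outside `categories` become NaN when the Categorical is built.
cat = pd.Categorical(fruit, categories=["apple", "banana", "orange", "other"])
print(pd.isna(cat).tolist())

# "other" is one of the declared categories, so it is a valid fill value.
filled = cat.fillna("other")
print(list(filled))

# One dummy column per declared category, in the declared order.
dummies = pd.get_dummies(pd.Series(filled), prefix="b", dtype=int)
print(dummies.columns.tolist())
```

Note that "orange" still gets a column (all zeros) even though it never occurs in this sample, because get_dummies emits one column per declared category.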
Standard replacement with something like np.where would also work here, but dummies are typically used with Categorical data, so being able to fix the order in which the dummy columns appear can be helpful:
import numpy as np

df['b'] = np.where(df['b'].isin(['apple', 'banana', 'orange']),
                   df['b'],
                   'other')
df2 = pd.get_dummies(df['b'], prefix='b')
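Both produce the same df2 (shown with 0/1 values; in pandas 2.0 and later get_dummies returns boolean columns by default, so pass dtype=int to get this exact output):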
   b_apple  b_banana  b_orange  b_other
0        1         0         0        0
1        0         1         0        0
2        0         1         0        0
3        0         0         1        0
4        0         0         0        1
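As a quick sanity check, the two approaches can be run side by side on the question's data and compared. This is only a sketch: the names keep, via_categorical, via_where, d1 and d2 are introduced here for illustration, and dtype=int is again used to get 0/1 output on any pandas version:

```python
import numpy as np
import pandas as pd

keep = ["apple", "banana", "orange"]
b = pd.Series(["apple", "banana", "banana", "orange", "grapefruit"])

# Approach 1: Categorical with an explicit category list, then fill the NaNs.
via_categorical = pd.Series(
    pd.Categorical(b, categories=keep + ["other"]).fillna("other")
)

# Approach 2: plain value replacement with np.where.
via_where = pd.Series(np.where(b.isin(keep), b, "other"))

d1 = pd.get_dummies(via_categorical, prefix="b", dtype=int)
d2 = pd.get_dummies(via_where, prefix="b", dtype=int)

# Compare both the column labels and the 0/1 values.
same = bool(
    d1.columns.tolist() == d2.columns.tolist()
    and (d1.to_numpy() == d2.to_numpy()).all()
)
print(same)
```

The column order matches here partly by coincidence: for plain strings get_dummies sorts the labels alphabetically, which happens to agree with the category list used above. With a different category order, only the Categorical version would follow it.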