I have a df
that includes two categorical variables that have many (>2000) categories. One of the categories of the variable 'installer', makes up a large part of the data:
DWE 35.397980
Government 3.073401
RWE 2.017508
Commu 1.814141
DANIDA 1.757576
...
How can I keep let's say the first three categories and collapse the remaining ones in a category called 'other' without having to list them all?
NOTE: I have used replace before for variables with lower cardinality, but had to list all the values I want to replace.
I've tried:
unique_values= df['installer'].unique()
for unique in unique_values:
if unique != 'DWE' or unique != 'Government' or unique != 'RWE':
df['installer'] = df['installer'].replace(unique, 'other')
What about first grabbing the three that repeat the most using value_counts
, then filtering and assing the 'Other'
categories_to_keep = df['installer'].value_counts().index[:3]
df.loc[
(~df.installer.isin(categories_to_keep)), 'installer'
] = 'Other'