Search code examples
pythonpandasencodingcategorical-data

Collapsing many categories of variable


I have a df that includes two categorical variables that have many (>2000) categories. One of the categories of the variable 'installer', makes up a large part of the data:

DWE                      35.397980
Government                3.073401
RWE                       2.017508
Commu                     1.814141
DANIDA                    1.757576
                           ...    

How can I keep let's say the first three categories and collapse the remaining ones in a category called 'other' without having to list them all?

NOTE: I have used replace before for variables with lower cardinality, but had to list all the values I want to replace.

I've tried:

unique_values= df['installer'].unique()
for unique in unique_values: 
    if unique != 'DWE' or unique != 'Government' or unique != 'RWE': 
        df['installer'] = df['installer'].replace(unique, 'other')
    

Solution

  • What about first grabbing the three that repeat the most using value_counts, then filtering and assing the 'Other'

    categories_to_keep = df['installer'].value_counts().index[:3]
    
    df.loc[
        (~df.installer.isin(categories_to_keep)), 'installer'
    ] = 'Other'