Search code examples
pythonpandasdata-cleaning

How to replace a value in a column based on the its value count


I am working on the following dataset: https://drive.google.com/file/d/1UVgSfIO-46aLKHeyk2LuKV6nVyFjBdWX/view?usp=sharing

I am trying to replace the countries in the "Nationality" column whose value_counts() are less than 450 with the value of "Others".

def collapse_category(df):
   df.loc[df['Nationality'].map(df['Nationality'].value_counts(normalize=True)
                                  .lt(450)), 'Nationality'] = 'Others'
   print(df['Nationality'].unique())

This is the code I used but it returns the result as this: ['Others'] Here is the link to my notebook for reference: https://colab.research.google.com/drive/1MfwwBfi9_4E1BaZcPnS7KJjTy8xVsgZO?usp=sharing


Solution

  • Use boolean indexing:

    s = df['Nationality'].value_counts()
    df.loc[df['Nationality'].isin(s[s<450].index), 'Nationality'] = 'Others'
    

    New value_counts after the change:

    FRA       12307
    PRT       11382
    DEU       10164
    GBR        8610
    Others     5354
    ESP        4864
    USA        3398
    ...         ...
    FIN         632
    RUS         578
    ROU         475
    Name: Nationality, dtype: int64