Search code examples
pythonpandasdataframecategorical-data

How do I assign 'other' to low frequency categories? (pandas)


I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)

I tried df['city'].value_counts()/len(df) to get their frequences. It returned a table. The fist few values were 0.12,.4,.4,.3.....

I'm a complete beginner so I'm not sure how to use this information to assign everything in, say, the last 10 percentile to 'other'.

I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.


Solution

  • Let's go through the logic of expected actions:

    1. Count frequencies for every city
    2. Calculate the bottom 10% percentage
    3. Find the cities with frequencies less then 10%
    4. Change them to other

    You started in the right direction. To get frequencies for every city:

    city_freq = (df['city'].value_counts())/df.shape[0]
    

    We want to find the bottom 10%. We use pandas' quantile to do it:

    bottom_decile = city_freq.quantile(q=0.1)
    

    Now bottom_decile is a float which represents the number that differs bottom 10% from the rest. Cities with frequency less then 10%:

    less_freq_cities = city_freq[city_freq<=botton_decile]
    

    less_freq_cities will hold enteries of cities. If you want to change the value of them in 'df' to "other":

    df.loc[df["city"].isin(less_freq_cities.index.tolist())] = "other"
    

    complete code:

    city_freq = (df['city'].value_counts())/df.shape[0]
    botton_decile = city_freq.quantile(q=0.1)
    less_freq_cities = city_freq[city_freq<=botton_decile]
    df.loc[df["city"].isin(less_freq_cities.index.tolist())] = "other"
    

    This is how you replace 10% (or whatever you want, just change q param in quantile) to a value of your choice.

    EDIT: As suggested in comment, to get normalized frequency it's better use city_freq = df['city'].value_counts(normalize=True) instead of dividing it by shape. But actually, we don't need normalized frequencies. pandas' qunatile will work even if they are not normalize. We can use: city_freq = df['city'].value_counts() and it will still work.