python pandas dataframe categorical-data

How do I assign 'other' to low frequency categories? (pandas)

I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)

I tried df['city'].value_counts()/len(df) to get their frequences. It returned a table. The fist few values were 0.12,.4,.4,.3.....

I'm a complete beginner so I'm not sure how to use this information to assign everything in, say, the last 10 percentile to 'other'.

I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.

Solution

Let's go through the logic of expected actions:

Count frequencies for every city
Calculate the bottom 10% percentage
Find the cities with frequencies less then 10%
Change them to other

You started in the right direction. To get frequencies for every city:

city_freq = (df['city'].value_counts())/df.shape[0]

We want to find the bottom 10%. We use pandas' quantile to do it:

bottom_decile = city_freq.quantile(q=0.1)

Now bottom_decile is a float which represents the number that differs bottom 10% from the rest. Cities with frequency less then 10%:

less_freq_cities = city_freq[city_freq<=botton_decile]

less_freq_cities will hold enteries of cities. If you want to change the value of them in 'df' to "other":

df.loc[df["city"].isin(less_freq_cities.index.tolist())] = "other"

complete code:

city_freq = (df['city'].value_counts())/df.shape[0]
botton_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq<=botton_decile]
df.loc[df["city"].isin(less_freq_cities.index.tolist())] = "other"

This is how you replace 10% (or whatever you want, just change q param in quantile) to a value of your choice.

EDIT: As suggested in comment, to get normalized frequency it's better use city_freq = df['city'].value_counts(normalize=True) instead of dividing it by shape. But actually, we don't need normalized frequencies. pandas' qunatile will work even if they are not normalize. We can use: city_freq = df['city'].value_counts() and it will still work.