I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)
I tried df['city'].value_counts()/len(df)
to get their frequencies. It returned a table. The first few values were 0.12, .4, .4, .3, ...
I'm a complete beginner, so I'm not sure how to use this information to assign everything in, say, the bottom 10th percentile to 'other'.
I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies
on this.
Let's go through the logic step by step:
You started in the right direction. To get frequencies for every city:
city_freq = (df['city'].value_counts())/df.shape[0]
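To see what this produces, here is a minimal sketch with made-up data (the float city codes below are purely hypothetical):
import pandas as pd

# toy frame standing in for your data: city codes stored as floats
demo = pd.DataFrame({"city": [101.0, 101.0, 101.0, 205.0, 205.0,
                              307.0, 307.0, 412.0, 555.0, 678.0]})
demo_freq = demo['city'].value_counts() / demo.shape[0]
print(demo_freq)
# 101.0    0.3
# 205.0    0.2
# 307.0    0.2
# 412.0    0.1
# 555.0    0.1
# 678.0    0.1
Each index entry is a city code and each value is the fraction of rows with that code.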
We want to find the cutoff for the bottom 10%. We use pandas' quantile to do it:
bottom_decile = city_freq.quantile(q=0.1)
Now bottom_decile
is a float which represents the cutoff that separates the bottom 10% from the rest. Cities with frequency less than or equal to that cutoff:
less_freq_cities = city_freq[city_freq <= bottom_decile]
less_freq_cities
will hold the entries of those cities. If you want to change their value in 'df' to "other":
df.loc[df["city"].isin(less_freq_cities.index), "city"] = "other"
Complete code:
city_freq = df['city'].value_counts() / df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index), "city"] = "other"
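If it helps to see the whole thing run end to end, here is a small self-contained sketch on made-up data (the city codes are hypothetical):
import pandas as pd

demo = pd.DataFrame({"city": [101.0, 101.0, 101.0, 205.0, 205.0,
                              307.0, 307.0, 412.0, 555.0, 678.0]})
# cast to object so the float codes and the string "other" can coexist without dtype warnings
demo['city'] = demo['city'].astype(object)
city_freq = demo['city'].value_counts() / demo.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
demo.loc[demo["city"].isin(less_freq_cities.index), "city"] = "other"
print(demo['city'].unique())   # the three rarest codes are now 'other'; 101.0, 205.0, 307.0 remain
Note the ", "city"" in the .loc call: it restricts the assignment to the city column, so the rest of each row is left untouched.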
This is how you replace the bottom 10% (or whatever fraction you want, just change the q param in quantile) with a value of your choice.
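Since the end goal in the question is get_dummies, once the rare cities have been collapsed the encoding step could look like this (a sketch continuing from the df above; pd.get_dummies with the columns argument is standard pandas):
df = pd.get_dummies(df, columns=['city'])
This produces one indicator column per remaining city value, plus one for 'other'.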
EDIT:
As suggested in the comments, to get normalized frequencies it's better to use
city_freq = df['city'].value_counts(normalize=True)
instead of dividing by the shape. But actually, we don't need normalized frequencies here: pandas' quantile will work even if they are not normalized. We can use:
city_freq = df['city'].value_counts()
and it will still work.
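For instance, both versions pick out the same set of rare cities; only the cutoff number itself differs (a quick sketch on made-up data):
import pandas as pd

s = pd.Series([101.0, 101.0, 101.0, 205.0, 205.0, 307.0, 307.0, 412.0, 555.0, 678.0])
raw = s.value_counts()                  # absolute counts
norm = s.value_counts(normalize=True)   # fractions of the total
rare_raw = raw[raw <= raw.quantile(q=0.1)].index
rare_norm = norm[norm <= norm.quantile(q=0.1)].index
print(sorted(rare_raw) == sorted(rare_norm))   # True: same cities either way
Dividing every count by the same total rescales both the frequencies and their quantile by the same factor, so the comparison selects the same cities.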