I have a dataframe df
which contains a column named object_name
. The frequency distribution of the values contained in object_name
is as follows:
What I want is to rename as misc
or other
all the values in the column whose presence is less than, say, 5% of the whole column. So, for example, if the length of the dataframe is 50,000 rows, then I want all values in object_name
to be renamed to other
or misc
which have under 2500 occurrences. In the above example, that would mean the first 5 values would remain as is, and the rest of the values are renamed to other
or misc
.
I can probably do that in a convoluted way with one or more for loops. But I wonder if there can be a one-liner to do this task.
A bit long for a one-liner, but it should do the job.
It gets the normalized value_counts
of each item, which is its frequency. Filters to only keep ones with a frequency <0.05
and then creates a mask for the original dataframe of values from that list. We select those items, and change them to equal other
.
df.loc[df.object_name.isin(df.object_name.value_counts(normalize=True)[lambda x: x<0.05].index), 'object_name'] = 'other'