Search code examples
pythonpandascounterfrequencyfrequency-analysis

Pandas one-liner to group all values together that constitute under a threshold of the total volume


I have a dataframe df which contains a column named object_name. The frequency distribution of the values contained in object_name is as follows:

enter image description here

What I want is to rename as misc or other all the values in the column whose presence is less than, say, 5% of the whole column. So, for example, if the length of the dataframe is 50,000 rows, then I want all values in object_name to be renamed to other or misc which have under 2500 occurrences. In the above example, that would mean the first 5 values would remain as is, and the rest of the values are renamed to other or misc.

I can probably do that in a convoluted way with one or more for loops. But I wonder if there can be a one-liner to do this task.


Solution

  • A bit long for a one-liner, but it should do the job.

    It gets the normalized value_counts of each item, which is its frequency. Filters to only keep ones with a frequency <0.05 and then creates a mask for the original dataframe of values from that list. We select those items, and change them to equal other.

    df.loc[df.object_name.isin(df.object_name.value_counts(normalize=True)[lambda x: x<0.05].index), 'object_name'] = 'other'