How to group categories according to the number of times they appear in a column?

I'm trying to run a regression in Python using a few categorical variables, so I thought about converting them to dummies. However, I have a problem. The column I'm working with has a few categories that repeat a lot and others that don't, so if I apply dummy conversion I will get around 5000 dummies. I therefore thought about grouping the categories that appear only a few times. How can I do that? For instance, suppose category 1 appears 2000 times and category 2 appears 3000 times, but categories 4, 5 and 6 appear 30 times each, category 7 appears 10 times and category 8 appears 5 times. How can I group categories 4, 5, 6, 7 and 8? Something like: if a category appears fewer than X times, put it in a shared group. The result should be group 1, group 2 and group_x, where group_x contains the categories that appear fewer than a specified number of times.

Solution

This is relatively simple to do with the pandas module. Suppose df contains your categorical column:

import pandas as pd
df = pd.DataFrame({'x': ['a', 'b', 'c', 'a', 'b', 'a', 'd', 'e', 'f']})

You can count how often each value occurs in the column:

cnt = df.x.value_counts()

Then create an additional column that says which group each row belongs to. Notice that all low-count categorical values end up in the same group:

# Rows whose category occurs fewer than 2 times get group 0; the rest get their category's count.
df['group'] = df['x'].apply(lambda x: cnt[x] if cnt[x] >= 2 else 0)
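To make the effect of this step concrete, here is a small sketch that runs it on the sample data above and prints the resulting group values. The variable name groups is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c', 'a', 'b', 'a', 'd', 'e', 'f']})

# Counts per category: a -> 3, b -> 2, and c, d, e, f -> 1 each.
cnt = df['x'].value_counts()

# Group value is the category's count, or 0 when the count is below 2.
df['group'] = df['x'].apply(lambda x: cnt[x] if cnt[x] >= 2 else 0)

groups = df['group'].tolist()
print(groups)  # [3, 2, 0, 3, 2, 3, 0, 0, 0]
```

The group value is a count rather than a category label: 'a' maps to 3, 'b' maps to 2, and every category seen only once maps to 0.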

You can change the value 2 to any threshold; all categories with counts below it are placed in the same group. Finally, create the dummy variables like this:

dummies = pd.get_dummies(df['group'])

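This assumes you want categories with the same count (especially the low-count ones) to go into the same dummy: because the group value is the count itself, two categories that occur equally often share a column.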
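If frequent categories should instead keep their own dummy even when two of them happen to have equal counts, one possible variant is to keep the original label and replace only the rare labels. This is a sketch and not part of the approach above; the names threshold and rare and the label 'other' are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c', 'a', 'b', 'a', 'd', 'e', 'f']})

threshold = 2  # hypothetical cut-off: categories seen fewer times are merged

cnt = df['x'].value_counts()

# Labels of the categories whose count falls below the threshold.
rare = cnt[cnt < threshold].index

# Keep the original label for frequent categories, use 'other' for rare ones.
df['group'] = df['x'].where(~df['x'].isin(rare), 'other')

# One dummy column per frequent category plus one shared 'other' column.
dummies = pd.get_dummies(df['group'])
print(dummies.columns.tolist())  # ['a', 'b', 'other']
```

With this variant the number of dummy columns equals the number of frequent categories plus one, regardless of whether any counts coincide.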