python pandas dummy-variable

Filter categories in data frame before generating dummy columns for them


I have a dataset with categorical values in some columns (one row may contain multiple categories, separated by commas). Example:

  user hashtags
0   u1      a,b
1   u2      a,c
2   u3        c

I want to make dummy columns for these categories. I'm also not interested in categories that have very few occurrences in the dataset. Currently, I'm generating the dummy columns and then dropping the ones with few occurrences, like this (chunk is the original data frame):

dummies_hashtags = chunk['hashtags'].str.get_dummies(sep=',')
dummies_hashtags.columns = dummies_hashtags.columns.map(lambda c: 'hashtag_' + c)

# get rid of dummy columns with usage below 10
usage = dummies_hashtags.sum(axis=0)
# select columns with a boolean mask rather than integer positions
high_usage = dummies_hashtags.loc[:, usage >= 10]
low_usage = dummies_hashtags.loc[:, usage < 10]
# copy so that adding a column does not write to a slice
dummies_hashtags = high_usage.copy()
dummies_hashtags['other_hashtags'] = low_usage.sum(axis=1)

Note that I'm also adding a column that counts, for each row, the categories with a low occurrence.

This approach works but is very slow. My idea about how to improve it is to first get all unique categories and their counts, then delete categories with low counts, before generating the dummy columns.

Would this approach actually improve anything? How would it be implemented? (np.unique with return_counts=True comes to mind.) Also, is there a better approach to this problem?

(Note: The dataset is a SparseDataFrame already).


Solution

  • Using NumPy and boolean slicing should speed things up. Let me know if this works for you.

    # one 0/1 column per hashtag
    duh = df.hashtags.str.get_dummies(',')
    v = duh.values
    # boolean mask over columns: keep hashtags that occur more than once
    # (change the threshold for your needs)
    m = v.sum(0) > 1
    d2 = pd.DataFrame(v[:, m], duh.index, duh.columns[m])
    
    df.join(d2)
    
      user hashtags  a  c
    0   u1      a,b  1  0
    1   u2      a,c  1  1
    2   u3        c  0  1
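
The question's own idea (count the categories first, then build dummy columns only for the frequent ones) could be sketched roughly as below. This is not part of the answer above, just one possible way to write it. The function name filtered_dummies and its parameters are made up for illustration, and it assumes a plain DataFrame with a unique index, pandas 0.25 or later (for Series.explode), and that a hashtag appears at most once per row.

```python
import pandas as pd


def filtered_dummies(hashtags, min_count=10, sep=",", prefix="hashtag_"):
    """Dummy columns for tags seen at least `min_count` times (hypothetical helper)."""
    # One entry per (row, tag) pair; the original index label is repeated.
    tags = hashtags.str.split(sep).explode().str.strip()
    tags = tags[tags.notna() & (tags != "")]

    # Count every tag once, before any dummy column exists.
    # Assumes a tag appears at most once per row, so this equals the row count.
    counts = tags.value_counts()
    frequent = counts.index[counts >= min_count]
    is_frequent = tags.isin(frequent)

    # Dummies for the frequent tags only, collapsed back to one row per input row.
    dummies = (
        pd.get_dummies(tags[is_frequent])
        .groupby(level=0).max()
        .reindex(hashtags.index, fill_value=0)
        .astype(int)
        .add_prefix(prefix)
    )

    # Number of rare tags per row, like other_hashtags in the question.
    rare_per_row = (~is_frequent).groupby(level=0).sum()
    dummies["other_hashtags"] = (
        rare_per_row.reindex(hashtags.index, fill_value=0).astype(int)
    )
    return dummies
```

With the example frame from the question, filtered_dummies(df['hashtags'], min_count=2) keeps a and c and counts b under other_hashtags. Whether this is actually faster than filtering after get_dummies depends on the data (mainly on how many rare categories there are), so it is worth timing both on a real chunk.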