I have a dataset with categorical values in some columns (one row may contain multiple categories, separated by a comma). Example:
user hashtags
0 u1 a,b
1 u2 a,c
2 u3 c
I want to make dummy columns for these categories. I'm also not interested in categories that have very few occurrences in the dataset. Currently, I'm generating the dummy columns and then dropping the ones with few occurrences, like this (chunk is the original data frame):
import numpy as np

dummies_hashtags = chunk['hashtags'].str.get_dummies(sep=',')
dummies_hashtags.columns = dummies_hashtags.columns.map(lambda c: 'hashtag_' + c)
# get rid of dummy columns with usage below 10
usage = dummies_hashtags.sum(0)
high_usage = dummies_hashtags.iloc[:, np.where(usage >= 10)[0]]
low_usage = dummies_hashtags.iloc[:, np.where(usage < 10)[0]]
dummies_hashtags = high_usage
# count of rare hashtags per row
dummies_hashtags['other_hashtags'] = low_usage.sum(1)
Notice that I'm also adding a column with the number of low-occurrence categories in each row.
This approach works but is very slow. My idea for improving it is to first get all the unique categories and their counts, drop the categories with low counts, and only then generate the dummy columns.
What I would like to ask is: would this approach actually improve anything, and how would it be implemented? (np.unique with return_counts=True comes to mind; a rough sketch follows after the note below.) Also, is there a better approach to this problem?
(Note: the dataset is already a SparseDataFrame.)
Using numpy and boolean slicing should speed things up. Let me know if this works for you.
duh = df.hashtags.str.get_dummies(',')
v = duh.values
# keep only tags that occur more than once; adjust the threshold for your needs
m = v.sum(0) > 1
# rebuild a frame from the filtered dummy matrix
d2 = pd.DataFrame(v[:, m], index=duh.index, columns=duh.columns[m])
df.join(d2)
user hashtags a c
0 u1 a,b 1 0
1 u2 a,c 1 1
2 u3 c 0 1
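If you also want the other_hashtags count from the question, the inverted mask can be reused on the columns that were filtered out (a small extension sketched here, using the same names as above):
# sum the dummies of the dropped (rare) tags per row
d2['other_hashtags'] = v[:, ~m].sum(1)
df.join(d2)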