I have a dataframe with several categorical columns, each with 10+ factors/levels. In every column, 3-4 levels account for 85-90% of the values. Going through each column and building dummy variables for the top 3-4 levels by hand would take a lot of time, and simply calling get_dummies on everything would add a very large number of columns. Is there a way to automatically keep the top 3-4 levels of each column as dummy variables and push the rest into an 'Others' category? I am using Python.
You could find the most frequent values in each column with value_counts().nlargest(3), and replace every value outside that top 3 with 'other' as you create your dummies.
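import pandas as pd

df = pd.DataFrame({'type': ['a','a','a','b','b','b','c','d','e'],
                   'size': ['s','s','s','m','m','s','l','l','xl']})

for col in ['type', 'size']:
    # the three most frequent values in this column
    top = df[col].value_counts().nlargest(3).index
    # keep the top values, turn everything else into 'other'
    lumped = df[col].where(df[col].isin(top), 'other')
    # dtype=int gives 0/1 columns; newer pandas versions return booleans by default
    df = pd.concat([df, pd.get_dummies(lumped, prefix=col, dtype=int)], axis=1)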
Output
   type  size  type_a  type_b  type_c  type_other  size_l  size_m  size_other  size_s
0     a     s       1       0       0           0       0       0           0       1
1     a     s       1       0       0           0       0       0           0       1
2     a     s       1       0       0           0       0       0           0       1
3     b     m       0       1       0           0       0       1           0       0
4     b     m       0       1       0           0       0       1           0       0
5     b     s       0       1       0           0       0       0           0       1
6     c     l       0       0       1           0       1       0           0       0
7     d     l       0       0       0           1       1       0           0       0
8     e    xl       0       0       0           1       0       0           1       0
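If you have many columns, the same idea can be wrapped in a small helper. This is only a sketch: the function name top_n_dummies and its parameters n and other_label are made up for illustration and are not part of pandas. It uses Series.where instead of replace to lump the rare values together, which also sends missing values to the 'other' bucket. When several values are tied for the last top-n slot, which one is kept depends on the order value_counts returns them in.

```python
import pandas as pd


def top_n_dummies(df, columns, n=3, other_label="other"):
    """Return df plus dummy columns for the n most frequent values of each column.

    Hypothetical helper for illustration; not a pandas function.
    """
    pieces = [df]
    for col in columns:
        # Labels of the n most frequent values in this column.
        top = df[col].value_counts().nlargest(n).index
        # Keep top values as-is, replace everything else with the catch-all label.
        lumped = df[col].where(df[col].isin(top), other_label)
        # dtype=int forces 0/1 output regardless of the pandas default.
        pieces.append(pd.get_dummies(lumped, prefix=col, dtype=int))
    return pd.concat(pieces, axis=1)


df = pd.DataFrame({
    "type": ["a", "a", "a", "b", "b", "b", "c", "d", "e"],
    "size": ["s", "s", "s", "m", "m", "s", "l", "l", "xl"],
})
result = top_n_dummies(df, ["type", "size"], n=3)
print(result.columns.tolist())
```

Passing a larger n (for example n=4) keeps one more level per column before the rest falls into 'other'.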