I wrote the following function to convert a variable to dummies:
def convert_to_dummies(df, column):
    dummies = pd.get_dummies(df[column])
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)  # when dropping a column, don't forget "axis=1"
    return df
But when I apply this to categorical variables in a df:
for col in ['col1', 'col2', ....]:
    convert_to_dummies(df, col)
* 'col1', 'col2', ... are categorical columns in df.
I get the original df back, and none of the categorical variables are converted to dummies. What did I do wrong?
You need to assign the output back, because the function returns a new DataFrame instead of modifying df in place:
for col in ['col1', 'col2', ....]:
    df = convert_to_dummies(df, col)
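To see why the assignment matters, here is a minimal sketch. The small two-column frame is made up for illustration and is not the data from the question. Inside the function, `df = pd.concat(...)` only rebinds the local name `df`, so the caller's DataFrame stays untouched unless the return value is stored.

```python
import pandas as pd

def convert_to_dummies(df, column):
    dummies = pd.get_dummies(df[column])
    df = pd.concat([df, dummies], axis=1)  # rebinds the local name only
    df = df.drop(column, axis=1)
    return df

# Hypothetical toy data, just for this demonstration
df = pd.DataFrame({'col1': list('aba'), 'col2': list('xyx')})

convert_to_dummies(df, 'col1')           # return value is discarded
unchanged_columns = list(df.columns)     # caller's df is still the original

df = convert_to_dummies(df, 'col1')      # return value is kept
changed_columns = list(df.columns)

print(unchanged_columns)
print(changed_columns)
```

The first call leaves `df` with its original columns; only the second call, which assigns the result, replaces `col1` with its dummy columns.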
Sample:
df = pd.DataFrame({'col1': list('abcdef'),
                   'col2': list('abadec'),
                   'col3': list('aaadee'),
                   'col4': list('aabbcc')})
print (df)
  col1 col2 col3 col4
0    a    a    a    a
1    b    b    a    a
2    c    a    a    b
3    d    d    d    b
4    e    e    e    c
5    f    c    e    c
for col in ['col1', 'col2']:
    df = convert_to_dummies(df, col)
print (df)
  col3 col4  a  b  c  d  e  f  a  b  c  d  e
0    a    a  1  0  0  0  0  0  1  0  0  0  0
1    a    a  0  1  0  0  0  0  0  1  0  0  0
2    a    b  0  0  1  0  0  0  1  0  0  0  0
3    d    b  0  0  0  1  0  0  0  0  0  1  0
4    e    c  0  0  0  0  1  0  0  0  0  0  1
5    e    c  0  0  0  0  0  1  0  0  1  0  0
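Note that the loop produces repeated column names (two columns called a, two called b, and so on). If the goal is simply to keep the dummies from different source columns apart, one option is to let `pd.get_dummies` handle the whole frame through its `columns` parameter, which prefixes each dummy with the source column name. This is a sketch of that alternative, not part of the original function; `dtype=int` is added here only to get 0/1 values instead of booleans.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcdef'),
                   'col2': list('abadec'),
                   'col3': list('aaadee'),
                   'col4': list('aabbcc')})

# Encode only the listed columns; the others are kept as they are.
# Dummy columns are named <source column>_<category>, e.g. col1_a.
result = pd.get_dummies(df, columns=['col1', 'col2'], dtype=int)
print(result.columns.tolist())
```

With this approach no loop and no manual concat/drop is needed, and every resulting column name is distinct.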
If you need one dummy column per category value, shared across the source columns, it is better to remove the loop and create all dummies at once:
def convert_to_dummies_cols(df, cols):
    # create all dummies at once for the selected subset of columns
    dummies = pd.get_dummies(df[cols], prefix='', prefix_sep='')
    # aggregate max over columns that share the same name
    dummies = dummies.groupby(level=0, axis=1).max()
    # add to the original df and drop the source columns
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(cols, axis=1)
    return df
#parameter is list of columns for dummies
df = convert_to_dummies_cols(df, ['col1', 'col2'])
print (df)
  col3 col4  a  b  c  d  e  f
0    a    a  1  0  0  0  0  0
1    a    a  0  1  0  0  0  0
2    a    b  1  0  1  0  0  0
3    d    b  0  0  0  1  0  0
4    e    c  0  0  0  0  1  0
5    e    c  0  0  1  0  0  1
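Recent pandas releases have deprecated the `axis=1` argument of `groupby`, so the column-wise aggregation may warn or fail depending on the installed version. A possible workaround, sketched here under that assumption, is to transpose, group by the index, and transpose back. The function name `convert_to_dummies_cols_t` is hypothetical, and `dtype=int` is added only so the result contains 0/1 rather than booleans.

```python
import pandas as pd

def convert_to_dummies_cols_t(df, cols):
    # create all dummies at once, without prefixes, as integers
    dummies = pd.get_dummies(df[cols], prefix='', prefix_sep='', dtype=int)
    # merge columns that share a name: transpose, group rows by label, take max
    dummies = dummies.T.groupby(level=0).max().T
    # add to the original df and drop the source columns
    df = pd.concat([df, dummies], axis=1)
    return df.drop(cols, axis=1)

df = pd.DataFrame({'col1': list('abcdef'),
                   'col2': list('abadec'),
                   'col3': list('aaadee'),
                   'col4': list('aabbcc')})

result = convert_to_dummies_cols_t(df, ['col1', 'col2'])
print(result)
```

The transpose round trip is fine for a handful of dummy columns; for very wide frames it may be worth measuring before relying on it.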