I have data like this:
|--------|---------|
| Col1 | Col2 |
|--------|---------|
| X | a,b,c |
|--------|---------|
| Y | a,b |
|--------|---------|
| X | b,d |
|--------|---------|
I want to convert this categorical data to dummy variables. Because my data is large, I get a MemoryError when I use get_dummies()
from pandas. I want my result to look like this:
|------|------|------|------|------|------|
|Col_X |Col_Y |Col2_a|Col2_b|Col2_c|Col2_d|
|------|------|------|------|------|------|
| 1 | 0 | 1 | 1 | 1 | 0 |
|------|------|------|------|------|------|
| 0 | 1 | 1 | 1 | 0 | 0 |
|------|------|------|------|------|------|
| 1 | 0 | 0 | 1 | 0 | 1 |
|------|------|------|------|------|------|
I have tried to convert Col2 using this, but I get a MemoryError because the data is large and Col2 also has a lot of variability.
So,
1) How can I convert multiple categorical columns into dummy variables?
2) pandas get_dummies() gives a memory error, so how can I handle that?
I would like to share my own solution as well, and to thank @James-dellinger for the answer. Here is my approach:
import pandas as pd
df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})
df
Col1 Col2
0 X a,b,c
1 Y a,b
2 X b,d
I first split the Col2 values and turn the pieces into separate columns.
df = pd.DataFrame(df['Col2'].str.split(',', n=3).tolist(), columns=['Col1', 'Col2', 'Col3'])
df
Col1 Col2 Col3
0 a b c
1 a b None
2 b d None
Then I created dummy variables from this dataframe with an empty prefix.
df=pd.get_dummies(df, prefix="")
df
_a _b _b _d _c
0 1 0 1 0 1
1 1 0 1 0 0
2 0 1 0 1 0
Now, to get the desired result, we can sum the duplicate columns and assign the result back to df.
df = df.groupby(level=0, axis=1).sum()
df
_a _b _c _d
0 1 1 1 0
1 1 1 0 0
2 0 1 0 1
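The three Col2 steps above (split, get_dummies, sum the duplicates) can be sketched as one helper. This is only a sketch: the function name multi_value_dummies, the 'Col2' prefix and the uint8 dtype are my own choices and not part of the original answer. It uses expand=True and groups on the transposed frame, because newer pandas releases deprecate groupby(axis=1).

```python
import pandas as pd


def multi_value_dummies(series, prefix, sep=","):
    """One-hot encode a column whose cells hold sep-separated values.

    Hypothetical helper: mirrors the split / get_dummies / sum steps.
    """
    # One column per position in the list; shorter rows are padded with missing values.
    parts = series.str.split(sep, expand=True)
    # The same string prefix is applied to every positional column, so a value
    # that appears at several positions produces duplicate column names.
    dummies = pd.get_dummies(parts, prefix=prefix, dtype="uint8")
    # Collapse the duplicate names by summing them: transpose, group the rows
    # by label, sum, and transpose back.
    return dummies.T.groupby(level=0).sum().T


df = pd.DataFrame({"Col1": ["X", "Y", "X"], "Col2": ["a,b,c", "a,b", "b,d"]})
col2_df = multi_value_dummies(df["Col2"], prefix="Col2")
print(col2_df)
```

Note that the transpose makes a copy of the dummy frame, so on very large data this step may itself be expensive.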
For Col1 we can create the dummy variables directly with pd.get_dummies() and store them in a separate dataframe, say col1_df. We can then join both sets of columns using pd.concat([df, col1_df], axis=1, sort=False).
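A minimal sketch of this last step, for illustration only. The names raw, col2_df and result are my own. To keep the snippet short, Series.str.get_dummies(sep=',') stands in for the split / get_dummies / sum steps above; it is a different pandas call, not the method described in this answer. col1_df is placed first in the concat so the column order matches the desired output in the question.

```python
import pandas as pd

raw = pd.DataFrame({"Col1": ["X", "Y", "X"], "Col2": ["a,b,c", "a,b", "b,d"]})

# Single-valued column: plain get_dummies is enough.
col1_df = pd.get_dummies(raw["Col1"], prefix="Col1", dtype="uint8")

# Multi-valued column: str.get_dummies is a stand-in here for the
# split / get_dummies / sum steps described above.
col2_df = raw["Col2"].str.get_dummies(sep=",").add_prefix("Col2_")

# Put both blocks side by side; rows are aligned on the shared index.
result = pd.concat([col1_df, col2_df], axis=1, sort=False)
print(result)
```

If memory is still the bottleneck, pd.get_dummies also accepts sparse=True, which stores the indicator columns as sparse arrays. I have not checked whether that is enough for data of the size described in the question.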