python pandas scikit-learn dummy-variable

Convert categorical data into dummy set


I have data like this:

|--------|---------|
| Col1   | Col2    |
|--------|---------|
| X      | a,b,c   |
|--------|---------|
| Y      | a,b     |
|--------|---------|
| X      | b,d     |
|--------|---------|

I want to convert this categorical data into dummy variables. Because my data is large, get_dummies() from pandas raises a MemoryError. I want the result to look like this:

|------|------|------|------|------|------|
|Col_X |Col_Y |Col2_a|Col2_b|Col2_c|Col2_d|
|------|------|------|------|------|------|
|  1   |  0   |  1   |  1   |  1   |  0   |
|------|------|------|------|------|------|
|  0   |  1   |  1   |  1   |  0   |  0   |
|------|------|------|------|------|------|
|  1   |  0   |  0   |  1   |  0   |  1   |
|------|------|------|------|------|------|

I have tried to convert Col2 using this, but I get a MemoryError because the data is large and Col2 has many distinct values.

So,

1) How can I convert multiple categorical columns into dummy variables?

2) pandas get_dummies() raises a MemoryError, so how can I handle that?


Solution

  • I would like to share my solution as well, and to thank @James-dellinger for the answer. Here is my approach:

    import pandas as pd

    df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                       'Col2': ['a,b,c', 'a,b', 'b,d']})
    df
    
      Col1  Col2
    0   X   a,b,c
    1   Y   a,b
    2   X   b,d
    

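    First, I split the comma-separated Col2 values into separate columns.

    # Note: this overwrites df. The new column names are positional
    # and unrelated to the original Col1/Col2.
    df = pd.DataFrame(df['Col2'].str.split(',').tolist(), columns=['Col1', 'Col2', 'Col3'])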
    
    df
    
       Col1 Col2 Col3
    0   a   b    c
    1   a   b    None
    2   b   d    None
    

    Then I created dummy variables from this dataframe, using an empty prefix.

    # dtype=int keeps the 0/1 output shown below (newer pandas defaults to booleans)
    df = pd.get_dummies(df, prefix="", dtype=int)
    
    df
    
        _a  _b  _b  _d  _c
    0   1   0   1   0   1
    1   1   0   1   0   0
    2   0   1   0   1   0
    

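    Now, to get the desired result, we can sum the duplicate columns.

    # Group identically named columns and sum them. Transposing avoids
    # groupby(axis=1), which is deprecated in newer pandas.
    df = df.T.groupby(level=0).sum().T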
    
    df
    
        _a  _b  _c  _d
    0   1   1   1   0
    1   1   1   0   0
    2   0   1   0   1
    

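    For Col1, we can create dummy variables directly with pd.get_dummies() and store them in a separate dataframe, say col1_df. This has to be computed from the original Col1, before df is overwritten by the split step. We can then combine the two dataframes using pd.concat([df, col1_df], axis=1, sort=False).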
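
The steps above can also be sketched end to end in a shorter form. This is not the approach described in the answer but an alternative under stated assumptions: it uses Series.str.get_dummies(sep=','), which splits each string and builds the indicator columns in one call, and it assumes the prefixes Col1 and Col2_ for the output names (the question's table writes Col_X and Col_Y). The names col1_df, col2_df and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})

# One indicator column per distinct value of Col1 (prefix is an assumption).
col1_df = pd.get_dummies(df['Col1'], prefix='Col1', dtype=int)

# Series.str.get_dummies splits each string on the separator and returns
# one 0/1 column per distinct token, so no separate split or groupby step
# is needed for Col2.
col2_df = df['Col2'].str.get_dummies(sep=',').add_prefix('Col2_')

# Place the two blocks of indicator columns side by side.
result = pd.concat([col1_df, col2_df], axis=1)
print(result)
```

Note that this still builds a dense dataframe, so on its own it may not avoid the MemoryError on very large data with many distinct Col2 values. For the Col1 part, pd.get_dummies also accepts sparse=True, which may reduce memory use.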