python-2.7 pandas dataframe sklearn-pandas

Convert a word DataFrame into a zero-one matrix DataFrame in Python pandas


I want to convert a DataFrame with user_Id and skills columns into a zero-one matrix DataFrame, with one row per user and one column per skill.

Input DataFrame

     user_Id                        skills

 0     user1               "java, hdfs, hadoop"
 1     user2               "python, c++, c"
 2     user3               "hadoop, java, hdfs"
 3     user4               "html, java, php"
 4     user5               "hadoop, php, hdfs"

Desired Output DataFrame

 user_Id       java  c   c++     hadoop  hdfs    python  html    php     

 user1         1     0   0       1       1       0       0       0
 user2         0     1   1       0       0       1       0       0
 user3         1     0   0       1       1       0       0       0
 user4         1     0   0       0       0       0       1       1
 user5         0     0   0       1       1       0       0       1

Solution

  • str.get_dummies + concat works for me:

    df1 = df['skills'].str.get_dummies(', ')
    print (df1)
       c  c++  hadoop  hdfs  html  java  php  python
    0  0    0       1     1     0     1    0       0
    1  1    1       0     0     0     0    0       1
    2  0    0       1     1     0     1    0       0
    3  0    0       0     0     1     1    1       0
    4  0    0       1     1     0     0    1       0
    
    df = pd.concat([df['user_Id'], df1], axis=1)
    print (df)
      user_Id  c  c++  hadoop  hdfs  html  java  php  python
    0   user1  0    0       1     1     0     1    0       0
    1   user2  1    1       0     0     0     0    0       1
    2   user3  0    0       1     1     0     1    0       0
    3   user4  0    0       0     0     1     1    1       0
    4   user5  0    0       1     1     0     0    1       0
    

    EDIT:

    If the skills are separated by a comma with no space after it, use:

    df1 = df['skills'].str.get_dummies(',')
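
For reference, here is a self-contained sketch of the approach above that can be run as-is. It rests on two assumptions that are not confirmed by the question: that the double quotes shown in the input are literally part of each skills string, and that the separator may be either ", " or ",". Under those assumptions it strips the quotes and normalises the separator before calling str.get_dummies, so a single separator covers both cases.

```python
import pandas as pd

# Input frame from the question. Storing the double quotes inside the
# strings is an assumption about how the data actually looks.
df = pd.DataFrame({
    'user_Id': ['user1', 'user2', 'user3', 'user4', 'user5'],
    'skills': ['"java, hdfs, hadoop"', '"python, c++, c"',
               '"hadoop, java, hdfs"', '"html, java, php"',
               '"hadoop, php, hdfs"'],
})

# Drop the surrounding quotes, then collapse ", " and "," to a plain ","
# so that one separator works whether or not a space follows the comma.
skills = (df['skills']
          .str.strip('"')
          .str.replace(r'\s*,\s*', ',', regex=True))

# One 0/1 column per distinct skill, then put user_Id back in front.
dummies = skills.str.get_dummies(',')
result = pd.concat([df['user_Id'], dummies], axis=1)
print(result)
```

If the strings do not contain quotes, the str.strip('"') call is harmless and can be left in or removed.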