I want to convert a DataFrame of user_Id and skills into a zero-one (indicator) DataFrame, with one row per user and one column per skill.
Input DataFrame:
  user_Id                skills
0   user1  [java, hdfs, hadoop]
1   user2      [python, c++, c]
2   user3  [hadoop, java, hdfs]
3   user4     [html, java, php]
4   user5   [hadoop, php, hdfs]
Desired output DataFrame:
user_Id  java  c  c++  hadoop  hdfs  python  html  php
user1       1  0    0       1     1       0     0    0
user2       0  1    1       0     0       1     0    0
user3       1  0    0       1     1       0     0    0
user4       1  0    0       0     0       0     1    1
user5       0  0    0       1     1       0     0    1
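You can join the user_Id column with a new DataFrame of indicator columns. Use astype(str) if the skills column holds lists that need converting to strings (omit it otherwise), remove the [] with str.strip, and then call str.get_dummies: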
df = df[['user_Id']].join(df['skills'].astype(str).str.strip('[]').str.get_dummies(', '))
print(df)
  user_Id  c  c++  hadoop  hdfs  html  java  php  python
0   user1  0    0       1     1     0     1    0       0
1   user2  1    1       0     0     0     0    0       1
2   user3  0    0       1     1     0     1    0       0
3   user4  0    0       0     0     1     1    1       0
4   user5  0    0       1     1     0     0    1       0
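If the skills column holds real Python lists, astype(str) leaves quotes around each skill, so the new column names come out as 'java', 'hdfs' and so on. Starting again from the original df, strip the quotes from the column names before concatenating:
import pandas as pd

# one-hot encode the stringified lists
df1 = df['skills'].astype(str).str.strip('[]').str.get_dummies(', ')
# if necessary, remove ' from the column names
df1.columns = df1.columns.str.strip("'")
# put user_Id back in front of the indicator columns
df = pd.concat([df['user_Id'], df1], axis=1)
print(df)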
  user_Id  c  c++  hadoop  hdfs  html  java  php  python
0   user1  0    0       1     1     0     1    0       0
1   user2  1    1       0     0     0     0    0       1
2   user3  0    0       1     1     0     1    0       0
3   user4  0    0       0     0     1     1    1       0
4   user5  0    0       1     1     0     0    1       0
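
If the skills column already holds Python lists of strings, one possible alternative that avoids the round trip through astype(str) is to join each list with a separator and let str.get_dummies split it again. This is a minimal sketch, assuming that every skill is a plain string that does not itself contain the '|' character. The sample frame is rebuilt from the question, and the names dummies and result are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question; skills are real Python lists here.
df = pd.DataFrame({
    "user_Id": ["user1", "user2", "user3", "user4", "user5"],
    "skills": [
        ["java", "hdfs", "hadoop"],
        ["python", "c++", "c"],
        ["hadoop", "java", "hdfs"],
        ["html", "java", "php"],
        ["hadoop", "php", "hdfs"],
    ],
})

# Join each list into one '|'-separated string, then build 0/1 columns.
# '|' is the default separator of str.get_dummies.
dummies = df["skills"].str.join("|").str.get_dummies()

# Put user_Id back in front of the indicator columns.
result = df[["user_Id"]].join(dummies)
print(result)
```

Because the lists are never converted to their string representation, no brackets or quotes need to be stripped from the column names afterwards.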