I'm working on a dataset with a Tags column extracted from a Stack Overflow dataset. I need to encode these tags so I can predict tags from a question's title and body.
I'm stuck on this encoding and can't get the result I need.
Here's a preview of my column:
Tags |
---|
['python', 'authentication', 'login', 'flask', 'python-2.x'] |
['c++', 'vector', 'c++11', 'move', 'deque'] |
... |
And here is what I'm doing so far:
y_classes = pd.get_dummies(df.Tags)
y_classes
['.net', 'asp.net-mvc', 'visual-studio', 'asp.net-mvc-4', 'intellisense'] | ['.net', 'asp.net-mvc-3', 'linq', 'entity-framework', 'entity-framework-5'] | |
---|---|---|
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
As you can see, I get one column for each unique array of tags, but I need one column for each tag. I tried multiple solutions found on Stack Overflow, but none of them worked.
EDIT: I also tried MultiLabelBinarizer from sklearn.preprocessing, but I got one column for each unique character in the Tags column.
How can I make this work?
OK, so I figured out how to fix this problem myself. Here is my solution:
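import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Copy the Tags column into its own one-column DataFrame
tags_array = df['Tags'].to_numpy()
df2 = pd.DataFrame(tags_array, columns=['Tags'])
# Count how often each token occurs in each row of Tags
coun_vect = CountVectorizer()
count_matrix = coun_vect.fit_transform(df2["Tags"])
count_array = count_matrix.toarray()
# One column per token; get_feature_names() was removed in scikit-learn 1.2,
# so use get_feature_names_out() (available since 1.0)
df2 = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())
print(df2)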
Output:
ajax | algorithm | amazon | android | angular | ... |
---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | ... |
1 | 1 | 0 | 0 | 0 | ... |
0 | 0 | 1 | 0 | 1 | ... |
... | ... | ... | ... | ... | ... |
Edit:
As @OllieStanley pointed out in a comment, MultiLabelBinarizer could have worked as well. The problem was that the dataset was being treated as a list, and it could be solved by using a set or a nested list instead.
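For reference, here is a minimal sketch of how the MultiLabelBinarizer route could look. It assumes each Tags cell is a string that looks like a Python list, which would explain the one-column-per-character result, and parses it with ast.literal_eval first. The sample DataFrame is made up for illustration.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample: each cell is a *string* that looks like a list.
df = pd.DataFrame({
    "Tags": [
        "['python', 'authentication', 'login', 'flask', 'python-2.x']",
        "['c++', 'vector', 'c++11', 'move', 'deque']",
    ]
})

# Parse each string into a real list of tags, giving a nested structure
# (one list of labels per row), which is what MultiLabelBinarizer expects.
tag_lists = df["Tags"].apply(ast.literal_eval)

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(tag_lists)

# One 0/1 column per distinct tag, aligned with the original rows.
y_classes = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
print(y_classes)
```

If the cells already contain real lists, the ast.literal_eval step can be skipped and df["Tags"] passed to fit_transform directly.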