Tags: pandas, dataframe, scikit-learn, sklearn-pandas, multilabel-classification

Multilabel Encoder takes whole value instead of array


I'm working on a dataset with a Tags column extracted from a Stack Overflow dataset. I need to encode these tags so that I can predict tags from a question's title and body.

I'm stuck on this encoding and can't get the result I need.

Here's a preview of my column:

Tags
['python', 'authentication', 'login', 'flask', 'python-2.x']
['c++', 'vector', 'c++11', 'move', 'deque']
...

And here's what I'm doing so far:

    y_classes = pd.get_dummies(df.Tags)
    y_classes

Output:

    ['.net', 'asp.net-mvc', 'visual-studio', 'asp.net-mvc-4', 'intellisense'] ['.net', 'asp.net-mvc-3', 'linq', 'entity-framework', 'entity-framework-5']
    0 0 0
    0 0 0
    0 0 0

As you can see, I get one column for each unique array of tags, but I need one column for each individual tag. I tried several solutions found on Stack Overflow, but none of them worked.

EDIT: I also tried MultiLabelBinarizer from sklearn.preprocessing, and I got one column for each unique character in the Tags column.

How can I make this work?


Solution

  • OK, I figured out how to fix this problem myself, so here is my solution:

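        import pandas as pd
        from sklearn.feature_extraction.text import CountVectorizer

        # Copy the Tags column into its own DataFrame
        tags_array = df['Tags'].to_numpy()
        df2 = pd.DataFrame(tags_array, columns=['Tags'])

        # Count how often each token appears in each row of tags
        coun_vect = CountVectorizer()
        count_matrix = coun_vect.fit_transform(df2["Tags"])
        count_array = count_matrix.toarray()

        # One column per token; get_feature_names() was removed in
        # scikit-learn 1.2, so use get_feature_names_out() instead
        df2 = pd.DataFrame(data=count_array,
                           columns=coun_vect.get_feature_names_out())
        print(df2)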
    

    Output:

    ajax algorithm amazon android angular ...
    0 0 0 1 0 ...
    1 1 0 0 0 ...
    0 0 1 0 1 ...
    ... ... ... ... ... ...

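    Edit:

    As @OllieStanley said in a comment, MultiLabelBinarizer would also have worked. The problem was the form in which the tags were passed in: MultiLabelBinarizer expects an iterable of label collections, such as a nested list or a list of sets, and when each row is a single string it treats every character as a separate label.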
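The MultiLabelBinarizer route could look roughly like the sketch below. It assumes the Tags values are strings that look like Python lists (which would explain the one-column-per-character result); the sample DataFrame is hypothetical, and the `ast.literal_eval` step can be dropped if the column already holds real lists.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample mirroring the question: each cell is a *string*.
df = pd.DataFrame({
    "Tags": [
        "['python', 'authentication', 'login', 'flask', 'python-2.x']",
        "['c++', 'vector', 'c++11', 'move', 'deque']",
    ]
})

# Parse each string into a real Python list of tags.
tag_lists = df["Tags"].apply(ast.literal_eval).tolist()

# Fit on the nested list: one binary column per distinct tag.
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(tag_lists)

y_classes = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
print(y_classes)
```

Unlike the default CountVectorizer tokenizer, this keeps tags such as c++ and python-2.x as single labels, because the tags are never re-tokenized.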