I have created a small example that uses skmultilearn for multilabel text classification:
import skmultilearn
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from scipy.sparse import csr_matrix
from pandas.core.common import flatten
from sklearn.naive_bayes import MultinomialNB
from skmultilearn.problem_transform import BinaryRelevance
TRAIN_DATA = [
['How to connect to MySQL using PHP ?', ['development','database']],
['What are the best VPN clients these days?', ['networks']],
['What is the equivalent of the boolean type in Oracle?', ['database']],
['How to remove unwanted entity from Hibernate session?', ['development']],
['How to implement TCP connection pooling in java?', ['development','networks']],
['How can I connect to PostgreSQL database remotely from another network?', ['database','networks']],
['What is the python function to remove accents in a string?', ['development']],
['How to remove indexes in SQL Server?', ['database']],
['How to configure firewall with DMZ?', ['networks']]
]
data_frame = pd.DataFrame(TRAIN_DATA, columns=['text','labels'])
corpus = data_frame['text']
unique_labels = set(flatten(data_frame['labels']))
for u in unique_labels:
    data_frame[u] = 0
    data_frame[u] = pd.to_numeric(data_frame[u])
for i, row in data_frame.iterrows():
    for u in unique_labels:
        if u in row.labels:
            data_frame.at[i,u] = 1
tfidf = TfidfVectorizer()
Xfeatures = tfidf.fit_transform(corpus).toarray()
y = data_frame[list(unique_labels)]  # pass a list: recent pandas versions reject a set as an indexer
binary_rel_clf = BinaryRelevance(MultinomialNB())
binary_rel_clf.fit(Xfeatures,y)
predict_text = ['SQL Server and PHP?']
X_predict = tfidf.transform(predict_text)
br_prediction = binary_rel_clf.predict(X_predict)
print(br_prediction)
However, the result is something like:
(0, 1) 1.
Is there a way to convert this result to label names, something like ['development','database']?
The BinaryRelevance estimator returns its predictions as a scipy csc_matrix. What you could do is the following:
First, convert the csc_matrix to a dense numpy array of type bool:
br_prediction = br_prediction.toarray().astype(bool)
Then use each row of the converted predictions as a boolean mask over the label names, i.e. the columns of y:
predictions = [y.columns.values[prediction].tolist() for prediction in br_prediction]
This will map each prediction to the corresponding labels. For example:
print(y.columns.values)
# output: ['development' 'database' 'networks']
print(br_prediction)
# output: (0, 1) 1
br_prediction = br_prediction.toarray().astype(bool)
print(br_prediction)
# output: [[False True False]]
predictions = [y.columns.values[prediction].tolist() for prediction in br_prediction]
print(predictions)
# output: [['database']]
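If you need this in more than one place, the two steps can be wrapped in a small helper. The following is only a minimal sketch and not part of skmultilearn: the function name to_label_names is hypothetical, and the hand-built csc_matrix stands in for the output of binary_rel_clf.predict, assuming its columns follow the same order as y.columns.

```python
import numpy as np
from scipy.sparse import csc_matrix


def to_label_names(prediction, label_names):
    """Map each row of a sparse 0/1 indicator matrix to a list of label names."""
    # Densify and cast to bool so that each row can be used as a mask.
    mask = prediction.toarray().astype(bool)
    names = np.asarray(label_names)
    return [names[row].tolist() for row in mask]


# Stand-in for binary_rel_clf.predict(X_predict): two samples, three labels.
label_names = ['development', 'database', 'networks']
fake_prediction = csc_matrix(np.array([[0, 1, 0],
                                       [1, 0, 1]]))

result = to_label_names(fake_prediction, label_names)
print(result)
# [['database'], ['development', 'networks']]
```

With the real classifier you would call it as to_label_names(br_prediction, y.columns.values) on the sparse prediction, before any conversion with toarray(). A sample for which no label is predicted maps to an empty list.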