machine-learning, deep-learning, naive-bayes, multilabel-classification

How to apply MultiOutputClassifier to a dataset for the Naive Bayes algorithm


I have a dataset which is as follows (it's taken from an article online, and I have been trying to apply the Naive Bayes algorithm to it):

Original Dataset

y attribute

After having done some manipulations (following the article), these are my new datasets for training and testing,

X Train

y Train

Now, the target is multilabel, and I have been asked to look at multi-output classification for this problem. I have been trying to understand this kind of classification and tried to implement it myself, but I couldn't get it to work. First of all, I tried following this sample code given on the website:

from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
import numpy as np
X, y1 = make_classification(n_samples=10, n_features=100, n_informative=30, n_classes=3, random_state=1)
y2 = shuffle(y1, random_state=1)
y3 = shuffle(y1, random_state=2)
Y = np.vstack((y1, y2, y3)).T
n_samples, n_features = X.shape # 10,100
n_outputs = Y.shape[1] # 3
n_classes = 3
forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
multi_target_forest.fit(X, Y).predict(X)

But since I am new to all this, I didn't understand it at all. I don't understand why the example calls make_classification and then shuffles the data, etc. I tried to implement it with my y_train variable and then passed it into model.fit for the Naive Bayes algorithm:

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
Yt = np.vstack(y_train).T
n_samples, n_features = X_train.shape # 10,100
n_outputs = Yt.shape[1] # 3
n_classes = 3
forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
model.fit(X_train, multi_target_forest)

But it gave the same error which I was receiving previously, which meant that I didn't do the multi-output classification properly:

ValueError: y should be a 1d array, got an array of shape () instead.

Can anyone tell me how to actually implement this classification, so that the Y variable can be used with Naive Bayes?


Solution

  • from sklearn.naive_bayes import GaussianNB
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.utils import shuffle
    import numpy as np
    

    Then let's assume you end up with a training set X_train of shape, say, (600, 8) and a test set of shape (445, 8); you then have to fit your classifier to your training set and predict y for your test set. Your y_train should have shape (600, 5) and your y_test should have shape (445, 5), i.e. one column per label. (I randomly split the data into a train and a validation set for you; you can do that easily via https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
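
    As a rough sketch of that split (the array names and sizes here are just placeholders, assuming a feature matrix X of shape (1045, 8) and a label matrix Y of shape (1045, 5) with one column per label), it could look like this:

    from sklearn.model_selection import train_test_split
    import numpy as np

    # placeholder data: 1045 samples, 8 features, 5 binary labels
    rng = np.random.RandomState(0)
    X = rng.rand(1045, 8)
    Y = rng.randint(0, 2, size=(1045, 5))

    # hold out 445 samples -> X_train (600, 8), y_train (600, 5)
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=445, random_state=42)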

    The way you are supposed to fit your classifier is the following (note that it is y_train, not another estimator, that gets passed to fit)

    gauss = GaussianNB()
    # MultiOutputClassifier fits one copy of the estimator per column of y_train
    multi_target_gauss = MultiOutputClassifier(gauss, n_jobs=-1)
    multi_target_gauss.fit(X_train, y_train)
    multi_target_gauss.predict(X_test)
    

    to get your predictions.
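
    If it helps, here is a quick check you can run on the result (continuing from the snippet above, and assuming y_test is a NumPy array; if it is a DataFrame, take .values first): the prediction has one column per label, so you can score each output separately.

    from sklearn.metrics import accuracy_score

    y_pred = multi_target_gauss.predict(X_test)
    print(y_pred.shape)  # (445, 5): one predicted column per label

    # accuracy of each label's classifier on the held-out set
    for i in range(y_test.shape[1]):
        print(f"label {i}: {accuracy_score(y_test[:, i], y_pred[:, i]):.3f}")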