I have the following dataset (taken from an online article; I have been trying to apply the Naive Bayes algorithm to it):
After some manipulations (following the article), these are my new training and testing datasets:
Now, it contains multiple labels, and I have been asked to look at multioutput classification for the problem. I have been trying to understand this kind of classification and tried to implement it myself, but I couldn't get it to work. First, I tried following this sample code given on the website:
from sklearn.datasets import make_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
import numpy as np
X, y1 = make_classification(n_samples=10, n_features=100, n_informative=30, n_classes=3, random_state=1)
y2 = shuffle(y1, random_state=1)
y3 = shuffle(y1, random_state=2)
Y = np.vstack((y1, y2, y3)).T
n_samples, n_features = X.shape # 10,100
n_outputs = Y.shape[1] # 3
n_classes = 3
forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
multi_target_forest.fit(X, Y).predict(X)
But since I am new to all this, I didn't understand it at all. I didn't understand why the author made the make_classification call, then shuffled the data, and so on. I tried to apply it to my y_train variable and then used it in model.fit for the Naive Bayes algorithm:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
Yt = np.vstack(y_train).T
n_samples, n_features = X_train.shape # 10,100
n_outputs = Yt.shape[1] # 3
n_classes = 3
forest = RandomForestClassifier(n_estimators=100, random_state=1)
multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
model.fit(X_train, multi_target_forest)
But it gave the same error I was receiving previously, which means I didn't set up the multioutput classification properly:
ValueError: y should be a 1d array, got an array of shape () instead.
Can anyone help me understand how to actually implement this classification, so that the Y variable can be used with Naive Bayes?
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
import numpy as np
Then let's assume you end up with a training set called X_train of shape, say, (600, 8) and a test set X_test of shape (445, 8). You fit your classifier on the training set and predict y for the test set. Your y_train should have shape (600, 5) and your y_test should have shape (445, 5). (I randomly split the data into a train set and a validation set for you; you can do that easily via https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
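As a rough sketch of that split, here is how train_test_split could produce those shapes. The random arrays are only placeholders for your real features and labels, and the five binary label columns are my assumption about your data, not something taken from your post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1045 rows, 8 features, 5 binary label columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(1045, 8))
Y = rng.integers(0, 2, size=(1045, 5))

# An integer test_size is an absolute number of rows, so 445 rows go to the
# test set and the remaining 600 to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=445, random_state=0
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Note that X and Y are split together in one call, so each row of y_train still lines up with the same row of X_train.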
The way to fit your classifier is as follows:
gauss = GaussianNB()
multi_target_gauss = MultiOutputClassifier(gauss, n_jobs=-1)
multi_target_gauss.fit(X_train, y_train)
multi_target_gauss.predict(X_test)
The last line returns your predictions.
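To check the whole thing end to end, here is a minimal self-contained sketch. The random arrays are stand-ins for your data (an assumption on my part, in the same way make_classification and shuffle only generate toy data in the documentation sample), and I left out n_jobs to keep it simple.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB

# Placeholder data (assumption): 8 features and 5 binary label columns.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(600, 8))
y_train = rng.integers(0, 2, size=(600, 5))
X_test = rng.normal(size=(445, 8))

# The wrapper fits one independent GaussianNB per column of y_train.
multi_target_gauss = MultiOutputClassifier(GaussianNB())
multi_target_gauss.fit(X_train, y_train)

# One predicted label per test row and per output column.
y_pred = multi_target_gauss.predict(X_test)
print(y_pred.shape)
print(len(multi_target_gauss.estimators_))
```

The important point is that fit is called on the wrapper, and its second argument is the 2D label array. In your attempt, model.fit(X_train, multi_target_forest) passes the wrapper object itself as y to a plain GaussianNB, which as far as I can tell is what triggers the "y should be a 1d array" error.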