I have a dataset:
's002' to 's057' are the labels (Y)
I am reading the dataset using pandas:
data = pd.read_csv('data.csv').values
Then I prepare the inputs and outputs:
# preparing inputs
X = []
for i in range(0, len(data)):
# preparing outputs
y = []
for i in range(0, len(data)):
I am also using OneHotEncoder:
# one hot encoding
enc = OneHotEncoder()
y = enc.transform(y).toarray()
After all this, I split the data and convert it:
# splitting data -> train 70%, test 15%, validation 15% (total 20400)
X_train, X_test, y_train, y_test = train_test_split(X,
y, test_size=0.15,
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
# converting list to ndarray and converting datatypes
X_train = np.asarray(X_train, dtype=np.float64)
X_test = np.asarray(X_test, dtype=np.float64)
X_val = np.asarray(X_val, dtype=np.float64)
y_train = np.asarray(y_train, dtype=np.uint8)
y_test = np.asarray(y_test, dtype=np.uint8)
y_val = np.asarray(y_val, dtype=np.uint8)
I can use one-hot encoded labels with neural networks and KNN without any errors.
Here is my KNN classification code:
# create model
model = KNeighborsClassifier(metric="manhattan", n_neighbors=1)
# training
model.fit(X_train, y_train)
# testing
y_pred = model.predict(X_test)
print(">>> Accuracy Score (%)")
print(accuracy_score(y_test, y_pred, normalize=False) / len(y_test) * 100, '\n')
print(">>> Classification Report")
print(classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1)))
But when I use one-hot encoded labels with GaussianNB, I get ValueError: bad input shape ()
Here is the code:
# create model
model = GaussianNB()
# training
model.fit(X_train, y_train)
The output:
ValueError Traceback (most recent call last)
<ipython-input-39-e0823d0910ae> in <module>()
3 # training
----> 4 model.fit(X_train, y_train)
6 # testing
1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
795 return np.ravel(y)
--> 797 raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (14280, 51)
I couldn't figure out why I am getting this error.
GaussianNB does work if I invert the one-hot encoding before creating the model:
# inverse one hot encoding
y_train = enc.inverse_transform(y_train)
y_test = enc.inverse_transform(y_test)
However, I then get a data conversion warning and only 67% accuracy, even though the other models reach 80%:
>>> Accuracy Score (%)
>>> Classification Report
precision recall f1-score support
s002 0.22 0.34 0.27 71
s003 0.75 0.74 0.74 57
s004 0.61 0.74 0.67 54
s005 0.60 0.74 0.66 62
s007 0.53 0.79 0.64 63
s008 0.37 0.74 0.50 66
s010 0.87 0.93 0.90 56
s011 0.64 0.82 0.72 60
s012 0.62 0.76 0.68 62
s013 0.63 0.80 0.70 59
s015 0.67 0.62 0.65 56
s016 0.56 0.68 0.62 53
s017 0.83 0.80 0.81 54
s018 0.75 0.53 0.62 62
s019 0.90 0.83 0.87 66
s020 0.60 0.25 0.35 61
s021 0.58 0.50 0.54 50
s022 0.90 0.99 0.94 76
s024 0.86 0.75 0.80 51
s025 0.82 0.90 0.86 50
s026 0.93 0.76 0.84 68
s027 0.83 0.72 0.77 75
s028 0.84 0.88 0.86 49
s029 0.78 0.77 0.77 69
s030 0.79 0.77 0.78 62
s031 0.31 0.23 0.26 66
s032 0.26 0.08 0.12 63
s033 0.71 0.96 0.82 55
s034 0.72 0.34 0.46 67
s035 0.85 0.42 0.56 67
s036 1.00 0.98 0.99 61
s037 0.59 0.42 0.49 64
s038 0.64 0.45 0.53 64
s039 0.93 0.49 0.64 55
s040 0.80 0.71 0.75 62
s041 0.70 0.62 0.66 50
s042 0.97 0.91 0.94 64
s043 1.00 0.90 0.94 67
s044 0.71 0.80 0.75 50
s046 0.40 0.33 0.36 55
s047 0.40 0.56 0.47 54
s048 0.45 0.72 0.56 54
s049 0.65 0.46 0.53 68
s050 0.57 0.55 0.56 53
s051 0.52 0.76 0.62 54
s052 0.98 0.93 0.95 57
s053 0.98 0.89 0.93 55
s054 0.50 0.71 0.58 70
s055 0.98 0.85 0.91 62
s056 0.52 0.65 0.58 49
s057 0.74 0.60 0.66 62
accuracy 0.67 3060
macro avg 0.69 0.68 0.67 3060
weighted avg 0.69 0.67 0.67 3060
/usr/local/lib/python3.6/dist-packages/sklearn/naive_bayes.py:206: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Can I use one-hot encoded labels in sklearn GaussianNB? Where am I making a mistake? What is the solution?
Thank you for your help!
Because fit expects a 1-D array of class labels with shape (n_samples,), not one-hot encoded labels.
Just remove this part:
# one hot encoding
enc = OneHotEncoder()
y = enc.transform(y).toarray()
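If the one-hot matrix is still needed elsewhere (for example for the neural network), the labels can be converted back to a 1-D vector just for GaussianNB. Here is a minimal sketch of that idea. The synthetic data and the names labels, y_onehot and y_1d are my own assumptions, not taken from the question. Calling ravel() on the result of inverse_transform is what the DataConversionWarning suggests.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real dataset (assumption): 3 subjects, 4 features.
rng = np.random.default_rng(0)
labels = np.repeat(["s002", "s003", "s004"], 20)  # 1-D labels, shape (60,)
X = rng.normal(size=(60, 4)) + np.repeat([0.0, 3.0, 6.0], 20)[:, None]

# One-hot encode; OneHotEncoder expects a 2-D column as input.
enc = OneHotEncoder()
y_onehot = enc.fit_transform(labels.reshape(-1, 1)).toarray()  # shape (60, 3)

# inverse_transform returns a column vector of shape (60, 1);
# ravel() flattens it to (60,), the shape GaussianNB.fit expects.
y_1d = enc.inverse_transform(y_onehot).ravel()

model = GaussianNB()
model.fit(X, y_1d)
y_pred = model.predict(X)
print(y_1d.shape, y_pred[:3])
```

The predictions come back as the original string labels, so accuracy_score and classification_report can be called on them directly, without argmax.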
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
fit(self, X, y, sample_weight=None)
Fit Gaussian Naive Bayes according to X, y.
X : array-like, shape (n_samples, n_features)
    Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape (n_samples,)
    Target values.
sample_weight : array-like, shape (n_samples,), optional (default=None)
    Weights applied to individual samples (1. for unweighted).
    New in version 0.17: Gaussian Naive Bayes supports fitting with sample_weight.
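To illustrate the shape requirement on y, here is a small sketch on made-up data (the arrays and names below are assumptions for illustration only). It shows that a 2-D one-hot target is rejected with a ValueError, while a 1-D target recovered with argmax is accepted. The exact error message text may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                 # synthetic features (assumption)
y_idx = np.repeat([0, 1, 2], 10)             # 1-D class indices, shape (30,)
y_onehot = np.eye(3, dtype=np.uint8)[y_idx]  # one-hot matrix, shape (30, 3)

# A 2-D one-hot target does not match the expected shape (n_samples,).
try:
    GaussianNB().fit(X, y_onehot)
    rejected = False
except ValueError as err:
    rejected = True
    print("ValueError:", err)

# argmax over the class axis recovers a 1-D vector of class indices.
y_1d = y_onehot.argmax(axis=1)
model = GaussianNB().fit(X, y_1d)
print(y_1d.shape, model.classes_)
```

Note that argmax yields integer column indices rather than the original label names. With a fitted OneHotEncoder, enc.categories_[0] maps those indices back to names.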