I have trained a Naive Bayes MultinomialNB model to predict whether an SMS is spam or not.
I get 2 classes, as expected:
nb = MultinomialNB(alpha=0.0)
nb.fit(X_train, y_train)
print(nb.classes_)
#Output: ['ham' 'spam']
But when I output the coefficients, I get only one array:
print(nb.coef_)
#Output: [[ -7.33025958 -6.48296172 -32.55333508 ... -9.52748415 -32.55333508
-32.55333508]]
I have already done the same with another dataset that had 5 classes instead of 2. There it worked and I got a matrix with 5 arrays.
Here is the whole code:
sms = pd.read_csv("spam-sms.csv", header=0, encoding = "ISO-8859-1")
X = sms.iloc[:, 1].values
X_clean = X[pd.notnull(X)]
y = sms.iloc[:,0].values
y_clean = y[pd.notnull(y)]
vectorizer = CountVectorizer()
X_cnt = vectorizer.fit_transform(X_clean)
X_train, X_test, y_train, y_test = train_test_split(X_cnt, y_clean,
test_size=0.2, random_state=0)
nb = MultinomialNB(alpha=0.0)
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print(nb.coef_)
print(nb.classes_)
And here is the code where it works with 5 classes:
reviews = pd.read_csv("amazon-unlocked-mobile.csv", encoding='utf-8')
X = reviews.iloc[:,4].values
X_clean = X[pd.notnull(X)]
y = reviews.iloc[:,3].values
y_clean = y[pd.notnull(y)]
vectorizer = CountVectorizer()
X_cnt = vectorizer.fit_transform(X_clean)
X_train, X_test, y_train, y_test = train_test_split(X_cnt, y_clean,
test_size=0.2, random_state=0)
nb = MultinomialNB(alpha=0.0)
nb.fit(X_train, y_train)
y_predicted = nb.predict(X_test)
print(nb.coef_)
print(nb.classes_)
TL;DR: Access the feature_log_prob_ attribute to retrieve the log probabilities of the features for all classes. coef_ mirrors those values but, in the binary case, returns only the ones for class 1 (True).
The thing about MultinomialNB is that it is not a linear classifier and does not actually compute coefficients for a decision function. It works by computing the conditional probability of each class given a sample's feature vector; the class with the highest probability is then considered the most likely one.
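This decision rule can be verified directly: a minimal sketch (with made-up toy data) showing that the class MultinomialNB predicts is the argmax of the class log priors plus the dot product of the sample with the per-class feature log probabilities.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 4 samples, 3 count features, 2 classes.
X = np.array([[2, 1, 0], [0, 1, 3], [1, 0, 2], [3, 1, 0]])
y = np.array([0, 1, 1, 0])

clf = MultinomialNB()
clf.fit(X, y)

x_new = np.array([[2, 1, 1]])

# Joint log likelihood: log P(y) + sum_i x_i * log P(x_i|y)
joint = clf.class_log_prior_ + x_new @ clf.feature_log_prob_.T

# The predicted class is the one maximizing this quantity.
assert clf.classes_[np.argmax(joint)] == clf.predict(x_new)[0]
```

This is why the per-class log probabilities in feature_log_prob_, not a single coefficient vector, are the quantities the model actually uses.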
Linear models like LogisticRegression state the following for their coef_ attribute:

coef_: ndarray of shape (1, n_features) or (n_classes, n_features)
Coefficient of the features in the decision function.
coef_ is of shape (1, n_features) when the given problem is binary.

For compatibility reasons, MultinomialNB still has a coef_ attribute and apparently will also return an array of shape (1, n_features) in a binary case like yours.
But to understand what it actually returns, you should take a look at the documentation:

coef_: ndarray of shape (n_classes, n_features)
Mirrors feature_log_prob_ for interpreting MultinomialNB as a linear model.

This means that what you are actually seeing in coef_ of MultinomialNB are the logarithms of the probabilities associated with each feature given a certain class. Or, more precisely:

feature_log_prob_: ndarray of shape (n_classes, n_features)
Empirical log probability of features given a class, P(x_i|y).
Since coef_ is just a mirror of feature_log_prob_, you can get all these log probabilities by accessing feature_log_prob_ directly:
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import random

random.seed(10)

# 100 identical samples with random binary labels
X = np.array([1, 0, 0] * 100).reshape(-1, 3)
y = np.array([random.choice([0, 1]) for _ in range(100)])

clf = MultinomialNB()
clf.fit(X, y)

print(clf.coef_)
>>> [[-0.03571808 -4.04305127 -4.04305127]]
print(clf.feature_log_prob_)
>>> [[-0.0416727  -3.8918203  -3.8918203 ]
 [-0.03571808 -4.04305127 -4.04305127]]
In the example, you can see that coef_ only returned the log probabilities for class 1, while feature_log_prob_ returned them for both class 0 and class 1.
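The rows of feature_log_prob_ line up with classes_, which is also what resolves your original confusion with string labels. A minimal sketch with hypothetical 'ham'/'spam' data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data with string labels, mimicking the SMS case.
X = np.array([[3, 0], [2, 1], [0, 2], [1, 3]])
y = np.array(['ham', 'ham', 'spam', 'spam'])

clf = MultinomialNB()
clf.fit(X, y)

# classes_ is sorted, so row 0 of feature_log_prob_ belongs to 'ham'
# and row 1 to 'spam'.
for label, row in zip(clf.classes_, clf.feature_log_prob_):
    print(label, row)
```

So in your binary SMS model, coef_ was showing you only the row belonging to 'spam' (the second entry of classes_), while feature_log_prob_ holds both rows.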
It is, however, very important to understand what these values represent and that they are different from the coefficients of linear models. Depending on your particular use case, they might or might not be useful.
I think the documentation could have been clearer on this. But it won't be an issue in the future:
Deprecated since version 0.24: coef_ is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26).
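In other words, on scikit-learn 1.1 and later only feature_log_prob_ remains, so code should rely on it rather than coef_. A minimal sketch (toy data made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data.
X = np.array([[1, 0, 2], [0, 3, 1], [2, 1, 0], [0, 1, 2]])
y = np.array([0, 1, 0, 1])

clf = MultinomialNB().fit(X, y)

# feature_log_prob_ is available across versions and always has
# one row per class, in the order given by clf.classes_.
log_probs = clf.feature_log_prob_
assert log_probs.shape == (len(clf.classes_), X.shape[1])
```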