
sklearn naive bayes MultinomialNB: Why do I get only one array with coefficients although I have 2 classes?


I have trained a naive Bayes MultinomialNB model to predict whether an SMS is spam or not.

I get 2 classes as expected:

nb = MultinomialNB(alpha=0.0)
nb.fit(X_train, y_train)

print(nb.classes_)
#Output: ['ham' 'spam']

but when I output the coefficients, I get only one array:

print(nb.coef_)
#Output: [[ -7.33025958  -6.48296172 -32.55333508 ...  -9.52748415 -32.55333508
  -32.55333508]]

I have already done the same with another dataset. There were 5 instead of 2 classes, it worked and I got a matrix with 5 arrays.

Here is the whole code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

sms = pd.read_csv("spam-sms.csv", header=0, encoding="ISO-8859-1")

X = sms.iloc[:, 1].values
X_clean = X[pd.notnull(X)]
y = sms.iloc[:,0].values
y_clean = y[pd.notnull(y)]



vectorizer = CountVectorizer()
X_cnt = vectorizer.fit_transform(X_clean)

X_train, X_test, y_train, y_test = train_test_split(X_cnt, y_clean,
test_size=0.2, random_state=0)

nb = MultinomialNB(alpha=0.0)
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)

print(nb.coef_)
print(nb.classes_)

Link to dataset

And here is the code where it works with 5 classes:

reviews = pd.read_csv("amazon-unlocked-mobile.csv", encoding='utf-8')
X = reviews.iloc[:,4].values
X_clean = X[pd.notnull(X)]
y = reviews.iloc[:,3].values
y_clean = y[pd.notnull(X)]

vectorizer = CountVectorizer()
X_cnt = vectorizer.fit_transform(X_clean)

X_train, X_test, y_train, y_test = train_test_split(X_cnt, y_clean,
test_size=0.2, random_state=0)

nb = MultinomialNB(alpha=0.0)
nb.fit(X_train, y_train)

y_predicted = nb.predict(X_test)

print(nb.coef_)
print(nb.classes_)

Link to dataset


Solution

  • TL;DR:
    Access the feature_log_prob_ attribute to retrieve the log probabilities of the features for all classes. coef_ mirrors those values, but in the binary case it returns only the row for the second class (index 1).


    The thing with MultinomialNB is that it is not a linear classifier and does not actually compute coefficients for a decision function. It works by computing the conditional probability of each class given a sample's feature vector; the class with the highest probability is then predicted.
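To make this concrete, here is a small sketch (with made-up count data, not the SMS dataset) showing that MultinomialNB's prediction is the argmax over classes of log P(y) plus the per-feature sum of x_i · log P(x_i|y), which you can reconstruct yourself from class_log_prior_ and feature_log_prob_:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 3))   # toy count features
y = rng.integers(0, 2, size=100)        # two classes, 0 and 1

clf = MultinomialNB()
clf.fit(X, y)

# Joint log likelihood per class: log P(y) + sum_i x_i * log P(x_i | y)
joint = X @ clf.feature_log_prob_.T + clf.class_log_prior_
manual_pred = clf.classes_[joint.argmax(axis=1)]

# The manual argmax reproduces clf.predict exactly
assert (manual_pred == clf.predict(X)).all()
```

This also shows why the docs speak of "interpreting MultinomialNB as a linear model": in log space, the class score is a linear function of the counts.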

    Linear models, like LogisticRegression, state the following for their coef_ attribute:

    coef_: ndarray of shape (1, n_features) or (n_classes, n_features)
    Coefficient of the features in the decision function.

    coef_ is of shape (1, n_features) when the given problem is binary.

    For compatibility reasons, MultinomialNB still has a coef_ attribute, and in a binary case like yours it likewise returns an array of shape (1, n_features).

    But to understand what it actually returns, you should take a look at the documentation:

    coef_: ndarray of shape (n_classes, n_features)
    Mirrors feature_log_prob_ for interpreting MultinomialNB as a linear model.

    This means what you are actually seeing in coef_ of MultinomialNB are the logarithms of the probabilities associated with each feature given a certain class. Or, more precisely:

    feature_log_prob_: ndarray of shape (n_classes, n_features)
    Empirical log probability of features given a class, P(x_i|y).

    Since coef_ is just a mirror of feature_log_prob_, you can get all these log probabilities by accessing feature_log_prob_ instead:

    from sklearn.naive_bayes import MultinomialNB
    import numpy as np
    import random
    
    
    random.seed(10)
    
    X = np.array([1, 0, 0]*100).reshape(-1, 3)
    y = np.array([random.choice([0, 1]) for _ in range(100)])
    
    clf = MultinomialNB()
    clf.fit(X, y)
    
    print(clf.coef_)
    
    >>> [[-0.03571808 -4.04305127 -4.04305127]]
    
    print(clf.feature_log_prob_)
    
    >>> [[-0.0416727  -3.8918203  -3.8918203 ]
     [-0.03571808 -4.04305127 -4.04305127]]
    

    In the example, you can see that coef_ returned the log probabilities only for class 1, while feature_log_prob_ returned them for both classes 0 and 1.

    It is, however, very important to understand what these values represent and that they are different from the coefficients of linear models. Depending on your particular use case, they may or may not be useful.

    The documentation could be clearer on this point, but the issue will soon be moot anyway:

    Deprecated since version 0.24: coef_ is deprecated in 0.24 and will be removed in 1.1 (renaming of 0.26)